I. What are web crawlers?
A web crawler (also known as a web spider) is a program or script that automatically fetches web information according to certain rules. If the Internet is compared to a large spider web, then every node of the web holds data, and a crawler is like a small spider that finds a website through its web address and retrieves its information: HTML code, JSON data, or binary data (pictures, videos).
II. URLs - the addresses that tell the crawler where to crawl
The address shown in the browser's address bar is a URL, for example https://www.baidu.com/?tn=02049043_62_pg
The format of a URL (parts in square brackets [] are optional): protocol://hostname[:port]/path[;parameters][?query][#fragment]
(1) protocol: the transport protocol to be used
(2) hostname: the domain name (resolved via the Domain Name System, DNS) or IP address of the server that stores the resource. Sometimes the username and password needed to connect to the server can be placed in front of the hostname (format: username:password@hostname).
(3) port (port number): an integer, optional; when omitted, the default port of the protocol is used. Each transport protocol has a default port number, e.g. HTTP's default port is 80.
(4) path: a string of segments separated by "/" symbols, usually representing a directory or file address on the host.
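The components above can be pulled apart with urllib.parse.urlparse; a quick sketch (the URL below is made up purely for illustration):

```python
# Split an example URL into the components described above.
from urllib.parse import urlparse

url = "http://user:pass@www.example.com:8080/path/page;params?key=value#section"
parts = urlparse(url)

print(parts.scheme)    # protocol:   http
print(parts.hostname)  # hostname:   www.example.com
print(parts.port)      # port:       8080
print(parts.path)      # path:       /path/page
print(parts.params)    # parameters: params
print(parts.query)     # query:      key=value
print(parts.fragment)  # fragment:   section
```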
III. urllib
In Python, the urllib module is commonly used when learning to write crawlers. Here we introduce urllib with some examples.
urllib includes four modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser.
urllib.request: used to open and read URLs
urllib.error: contains the exceptions raised by urllib.request
urllib.parse: used to parse and process URLs
urllib.robotparser: used to parse robots.txt files
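The last of these, urllib.robotparser, can be shown in a short sketch. The robots.txt rules below are supplied inline purely for illustration; in real use you would point RobotFileParser at a site's robots.txt with set_url() and call read():

```python
# Check whether a crawler is allowed to fetch a URL under robots.txt rules.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://www.example.com/index.html"))  # True
print(rp.can_fetch("*", "https://www.example.com/private/x"))   # False
```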
(1) The urllib.request module - opening and reading websites
Simple Opening and Reading Websites
urllib.request.urlopen(url): used to open a website (it actually accepts other parameters as well; see help(urllib.request.urlopen) in Python).
It returns a response object whose body can be read with read().
import urllib.request

response = urllib.request.urlopen("https://www.baidu.com")
html = response.read()
print(html)
The result this time is unreadable binary data (bytes):
b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'
At this point, decode() can be added to decode the page source, but first you need to find the encoding format declared in the page source; here it is utf-8 (inspect element, open the head, and look for the charset).
import urllib.request

response = urllib.request.urlopen("https://www.baidu.com")
html = response.read()
html = html.decode("utf-8")
print(html)
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
The result is much neater.
Additional problem: Getting the encoding method
There is a problem here: finding the encoding by hand is a bit troublesome. Here are other ways to obtain it:
Using chardet.detect() from the third-party library chardet, you can get a dictionary containing the detected encoding.
import chardet
import urllib.request

url = "https://www.baidu.com"
content = urllib.request.urlopen(url).read()
result = chardet.detect(content)  # returns a dictionary
encoding = result['encoding']     # get the encoding from the dictionary
print(encoding)
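Another option that needs no third-party library: the headers object returned by urlopen (via response.info() or response.headers) supports get_content_charset(), which reads the charset declared in the Content-Type header when the server provides one. A network-free sketch using a hand-built headers object of the same type:

```python
# response.info() returns an email.message.Message; build one offline
# to demonstrate get_content_charset() without hitting the network.
from email.message import Message

headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
print(headers.get_content_charset())  # utf-8
```

With a live response this is simply `urllib.request.urlopen(url).headers.get_content_charset()`; note it returns None when the server does not declare a charset (as in the Baidu headers shown below).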
Other methods of the response object besides read()
geturl(): returns the URL as a string
info(): returns the meta-information of the response, i.e. the HTTP response headers (content type, date, server, cookies, etc.)
getcode(): returns the HTTP status code; 200 means the request succeeded. (HTTP response status codes are three-digit server codes defined by the RFC 2616 specification.)
import urllib.request

url = "https://www.baidu.com"
html = urllib.request.urlopen(url)
print("geturl:%s\n" % (html.geturl()))
print("info:%s\n" % (html.info()))
print("getcode:%s\n" % (html.getcode()))

(Note the parentheses on html.getcode(): without them, Python prints the bound method object instead of calling it.)

geturl:https://www.baidu.com

info:Accept-Ranges: bytes
Cache-Control: no-cache
Content-Length: 227
Content-Type: text/html
Date: Sun, 22 Jul 2018 03:01:02 GMT
Etag: "5b3c3650-e3"
Last-Modified: Wed, 04 Jul 2018 02:52:00 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=98CE4FF720D964752DFFF7F8757770E8; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1532228462; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Strict-Transport-Security: max-age=0
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close

getcode:200
Sending data to the server with urlopen's data parameter
data is an optional parameter of urlopen that sends data to the server (per the HTTP specification, GET is used to retrieve information, while POST submits data to the server). If you pass the data parameter, the request is no longer a GET but a POST. However, data must be in byte-stream encoding, i.e. the bytes type; a dictionary can be converted into that format with the urllib.parse.urlencode() function, which you will learn next.
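The conversion chain can be sketched as follows. The urlopen call is shown commented out because it requires network access; httpbin.org is just a public echo service used here for illustration, not part of urllib:

```python
# Build a POST body: urlencode() makes the query string,
# encode() turns it into the bytes that the data parameter requires.
import urllib.parse
import urllib.request

payload = urllib.parse.urlencode({"word": "hello"})
data = payload.encode("utf-8")

print(payload)              # word=hello
print(type(data).__name__)  # bytes

# Passing data switches the request from GET to POST:
# response = urllib.request.urlopen("http://httpbin.org/post", data=data)
# print(response.read().decode("utf-8"))
```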
The next article continues with the use of other modules of urllib...