The common basic libraries for crawlers in Python 3 are urllib and requests.
This article mainly covers urllib.
urllib consists of four modules:
request - simulates sending requests
error - exception handling module
parse - tool module for URL processing
robotparser - determines which parts of a site may be crawled by parsing its robots.txt
1. Send Request
The urllib library sends requests through two main components of the request module: the urlopen() method and the Request class, where the Request class is used in conjunction with the urlopen() method.
First, take a look at the API for the urlopen() method:
urllib.request.urlopen(url,data=None,[timeout,]*,cafile=None,capath=None,context=None)
Detailed description of the parameters:
url - required parameter
data - optional; passed when making a POST request. Note that it must be converted to byte-stream encoding (bytes type)
timeout - sets the timeout in seconds, e.g. timeout=5; if no response arrives within that time, a timeout exception is thrown
context - the value must be of type ssl.SSLContext
cafile, capath - specify the CA certificate and its path
Note on the bytes() method - bytes(string, encoding): the first parameter is the string, the second specifies the encoding format
urllib.parse.urlencode(dict)
Instance applications:
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
data = bytes(urllib.parse.urlencode({'name': 'value'}), encoding='utf8')

response = urllib.request.urlopen(url, data=data, timeout=10)

# Output the type of the response object
print(type(response))
# Output the page content
print(response.read().decode('utf8'))
Through type(response), we find that urlopen() returns an HTTPResponse object, which mainly provides the following methods and properties (a short usage sketch follows the list):
read() - returns the page content
getheaders() - returns the response headers
getheader(name) - returns the value of the response header whose name is name
msg, version, status (status code), reason, debuglevel, closed
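For instance, here is a minimal sketch (reusing the httpbin.org test service from the example above) of reading the status code and headers from the returned object:

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get')

print(response.status)               # HTTP status code, e.g. 200
print(response.getheaders())         # list of (name, value) tuples for all response headers
print(response.getheader('Server'))  # value of a single header, here the server software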
Next, take a look at the Request class, whose constructor handles request-building needs that the urlopen() method alone cannot easily solve, such as adding headers.
API of Request class:
urllib.request.Request(url,data=None,headers={},origin_req_host=None,unverifiable=False,method=None)
url - required parameter
data - must be of bytes type
headers - a dictionary of request headers; the User-Agent header is commonly set to disguise the request
origin_req_host - the requester's host name or IP address
unverifiable - indicates whether the request is unverifiable, i.e. the user does not have sufficient permission to receive the result of the request; defaults to False
method - a string specifying the method used by the request, such as GET, POST, etc.
Instance applications:
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')

req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf8'))
Additionally, some more advanced operations (cookie handling, proxy settings, etc.) require the help of Handlers.
In the urllib.request module, the BaseHandler class provides several of the most basic methods, such as default_open() and protocol_request(), and all other Handler subclasses inherit from it. Commonly used subclasses include:
HTTPDefaultErrorHandler: handles HTTP response errors; errors are raised as HTTPError exceptions
HTTPRedirectHandler: handles redirects
HTTPCookieProcessor: handles cookies
ProxyHandler: sets a proxy; the default proxy is empty
HTTPPasswordMgr: manages passwords; maintains a table of user names and passwords
HTTPBasicAuthHandler: manages authentication; use this subclass when opening a link that requires authentication
Using these subclasses requires the OpenerDirector class, known as the Opener. The urlopen() method described above is actually a default Opener provided by urllib; when we need the functionality of these subclasses, we build our own Opener with the help of the corresponding Handler.
Instance applications:
# 1. Authentication
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

# Build the password manager
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
# Build the authentication handler and the Opener
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    response = opener.open(url)
    html = response.read().decode('utf8')
    print(html)
except URLError as e:
    print(e.reason)

# 2. Proxy
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)

# 3. Cookies: obtain
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
for item in cookie:
    print(item.name + '=' + item.value)

# Cookies: output to a file
filename = 'cookies.txt'
# cookie = http.cookiejar.MozillaCookieJar(filename)  # alternative: save in Mozilla browser cookie format
cookie = http.cookiejar.LWPCookieJar(filename)         # save in LWP format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
cookie.save(ignore_discard=True, ignore_expires=True)

# Cookies: read from a file
cookie = http.cookiejar.LWPCookieJar()
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
2. Exception handling
By properly catching exceptions, more accurate exception judgments can be made, which makes the program more robust.
Common exception classes are URLError and HTTPError, where HTTPError is a subclass of URLError. Exceptions raised by the request module can be handled by catching URLError, whose reason attribute returns the cause of the error. HTTPError is designed to handle HTTP request errors and has three attributes: code, reason, and headers. code returns the HTTP status code, reason returns the cause of the error, and headers returns the response headers.
The reason property returns either a string or an object.
In practice, we can catch the subclass error first and then the parent class error.
Instance applications:
from urllib import request, error
'''
The above import is equivalent to:
import urllib.request
import urllib.error
'''

url = 'http://www.baidu.com'
try:
    response = request.urlopen(url)
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
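As noted above, the reason attribute is not always a string; it can also be an exception object. A minimal sketch (the tiny timeout is chosen deliberately so the request fails) of checking for a timeout by type:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    # When the failure is a timeout, reason is a socket.timeout instance rather than a string
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')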
3. URL Link Resolution
The urllib library also provides the parse module, which defines the standard interface for handling URLs. Its methods are described in the following sections.
urlparse() - implements URL recognition and segmentation, splitting a URL into six parts: scheme, netloc, path, params, query, and fragment, which together form a URL link.
scheme: //netloc/path;params?query#fragment
The API of urlparse() is as follows:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Detailed parameters:
urlstring: required, URL to resolve
scheme: Optional, default protocol, which will be used when the link has no protocol information
allow_fragments: optional, whether to parse the fragment part. When allow_fragments=False, the original fragment part is parsed as part of path, params, or query, and the fragment field becomes empty
In addition, the return value of urlparse() is a named tuple, so we can access the parts we need either by index order or by attribute name
Instance applications:
from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html#comment'
result = urlparse(url, allow_fragments=False)

print(result.scheme, result[0], sep='\n')
Next, let's look at other ways to resolve links:
urlunparse() - constructs a link from its parts, the reverse of the urlparse() method. It is worth noting that the iterable passed to this method must have a length of 6.
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
urlsplit() - similar to urlparse(), it splits a URL. The difference is that its return value has only 5 parts: params is no longer parsed separately but is kept inside path
urlunsplit() - similar to urlunparse(), it is the reverse of urlsplit(), taking an iterable of length 5; see the sketch below
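For illustration, a minimal sketch of both methods (the URL is just an example in the same style as above):

from urllib.parse import urlsplit, urlunsplit

result = urlsplit('http://www.baidu.com/index.html;user?a=6#comment')
# params is not split out: it stays inside path ('/index.html;user')
print(result)
print(result.netloc, result[1], sep='\n')

# urlunsplit() takes an iterable of length 5
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))  # http://www.baidu.com/index.html?a=6#comment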
urljoin() - another way to generate links. Its API is as follows:
urllib.parse.urljoin(base_url,url)
base_url: the base link. The method parses the scheme, netloc, and path of base_url
url: the new link to process. It can take a variety of forms, containing all six parts of a URL or only a few consecutive parts
When parts of the new link are missing, the method fills them in from the parsed base_url and returns the completed link; a link that needs no completion is returned as-is
Application examples:
from urllib.parse import urljoin

print(urljoin('http://baidu.com/index.html', 'FAQ.html'))
It is worth noting that the params, query, and fragment of base_url play no role at all; only its scheme, netloc, and path are used, as the sketch below shows.
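A minimal sketch (the URLs are just examples) of this behavior:

from urllib.parse import urljoin

# The query string of base_url is ignored; only scheme, netloc, and path take part in the join
print(urljoin('http://www.baidu.com?wd=abc', 'http://www.baidu.com/index.php'))
# -> http://www.baidu.com/index.php

# When the new link lacks scheme and netloc, they are taken from base_url
print(urljoin('http://www.baidu.com/about.html', '?category=2#comment'))
# -> http://www.baidu.com/about.html?category=2#comment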
urlencode() - serializes GET request parameters, i.e. converts a dictionary into the query-string parameters of a URL.
from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
parse_qs() - deserialization; converts a query string back into a dictionary
parse_qsl() - converts a query string into a list of parameter tuples
from urllib.parse import parse_qs, parse_qsl

query = 'name=germey&age=22'
print(parse_qs(query))
print(parse_qsl(query))
quote() - Convert Chinese characters into URL encoding
unquote() - decode URL
from urllib.parse import quote, unquote

keyword = 'wallpaper'
url = 'http://www.baidu.com/s?wd=' + quote(keyword)
print(url)
print(unquote(url))
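Since the keyword above is plain ASCII, it passes through quote() unchanged; a short sketch with a non-ASCII keyword (here the Chinese word for 'wallpaper', chosen purely for illustration) makes the encoding visible:

from urllib.parse import quote, unquote

keyword = '壁纸'  # Chinese for 'wallpaper'; used only to show percent-encoding of non-ASCII text
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)           # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
print(unquote(url))  # https://www.baidu.com/s?wd=壁纸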
4. Robots Protocol
After so much learning, we have finally reached the last module of the urllib library, robotparser, through which we can analyze a website's Robots protocol.
First, let's look at what the Robots protocol is.
The Robots protocol, also known as the crawler protocol, is formally called the Robots Exclusion Protocol. Its purpose is to tell crawlers and search engines which pages may be crawled and which may not. It usually takes the form of a text file named robots.txt placed in the root directory of a website. When a search crawler visits a site, it first checks whether this file exists and, if it does, crawls only within the scope the file defines.
Example of robots.txt
User-agent: *
Disallow: /
Allow: /public/
Here User-agent names a search crawler; its value can be BaiduSpider, Googlebot, etc. Disallow specifies a directory that may not be crawled; '/' means no page may be crawled. Allow is used to carve out exceptions and is usually combined with Disallow; /public/ means the public directory may be crawled, which amounts to a whitelist.
After learning about the Robots protocol, we can parse robots.txt with the robotparser module. Its API is as follows:
urllib.robotparser.RobotFileParser(url='')
When using this module, we can either pass in the URL directly or set it with the set_url() method. Let's look at the common methods of this module:
set_url() - sets the link to the robots.txt file
read() - reads robots.txt and analyzes it. Note that this method must be executed to actually read the file, even though it returns no value
parse() - parses the robots.txt file; the argument passed in is the content of some lines of robots.txt
can_fetch() - takes two parameters, the first a User-agent and the second the URL to crawl, and returns a Boolean indicating whether that User-agent may crawl the page
mtime() - returns the time robots.txt was last fetched and analyzed, useful for periodically re-checking robots.txt
modified() - sets the time robots.txt was last fetched and analyzed to the current time
Instance applications:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
# Alternatively, feed the file content to parse(), e.g. using urllib.request.urlopen:
# rp.parse(urllib.request.urlopen('http://www.jianshu.com/robots.txt').read().decode('utf8').split('\n'))
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))
Finally, the author has finished writing this part and is ready to take a rest.
Here the author also plugs his new WeChat official account and welcomes everyone to come and discuss questions.