Python3 Web Crawler System Learning: First Lecture on the Basic Library urllib

Keywords: Python, fragment, encoding, attribute, SSL

The common basic libraries for crawlers in Python3 are urllib and requests.

This article mainly covers urllib.

urllib consists of four modules:

request - simulates sending requests

error - exception handling module

parse - tool module for URL processing

robotparser - determines what may be crawled on a site by parsing its robots.txt

1. Send Request

The urllib library sends requests using two main components of the request module: the urlopen() method and the Request class, where the Request class is used in conjunction with the urlopen() method.

First, take a look at the API for the urlopen() method:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, context=None)

Detailed description of the parameters:

url - required; the URL to request

data - optional; passed when making a POST request. Note that the value must be converted to byte-stream format (bytes type)

timeout - sets the timeout in seconds; a timeout exception is raised when it is exceeded. Note that urlopen() only accepts a single number here; the tuple form timeout=(5, 30) with separate connect and read timeouts belongs to the requests library, not urllib

context - The parameter value must be of type ssl.SSLContext

cafile, capath - specify the CA certificate file and the directory of CA certificates

Note: usage of the bytes() method - bytes(string, encoding). The first parameter is the string to convert and the second specifies its encoding

urllib.parse.urlencode(dict)

Example:

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
# Convert the POST parameters to bytes
data = bytes(urllib.parse.urlencode({'name': 'value'}), encoding='utf8')
# A single number of seconds; urlopen() does not accept a (connect, read) tuple
timeout = 10

response = urllib.request.urlopen(url, data=data, timeout=timeout)

# Output the type of the response object
print(type(response))
# Output the page content
print(response.read().decode('utf8'))
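
The context parameter described above is not exercised by this example. Here is a minimal sketch of passing an ssl.SSLContext, assuming the default context and an HTTPS URL chosen only for illustration:

import ssl
import urllib.request

# Build a default SSL context and hand it to urlopen() via the context parameter
context = ssl.create_default_context()
response = urllib.request.urlopen('https://httpbin.org/get', context=context)
print(response.status)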

The type(response) call above shows that urlopen() returns an HTTPResponse object, which mainly provides the following methods and attributes (a short sketch follows this list):

read() - returns the web page content

getheaders() - returns all response headers

getheader(name) - returns the value of the response header whose name is name

Other attributes: msg, version, status (status code), reason, debuglevel, closed
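
A quick sketch of these, using an arbitrary httpbin.org URL:

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get')
print(response.status)               # the status code, e.g. 200
print(response.getheaders())         # all response headers as a list of (name, value) tuples
print(response.getheader('Server'))  # a single header value, e.g. the server software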

 

Next, take a look at the Request class, which handles request construction needs that urlopen() alone cannot easily meet, such as adding headers.

API of Request class:

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url - required parameter

data - must be of bytes type

headers - a dictionary of request headers; a User-Agent entry is commonly added to disguise the request

origin_req_host - the requester's host name or IP address

unverifiable - indicates whether the request is unverifiable, i.e. the user has no way to approve fetching the resource (for example an image embedded in a page); it defaults to False

method - a string specifying the HTTP method of the request, such as GET or POST

Example:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
params = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(params), encoding='utf8')

req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf8'))

 

Additionally, some more advanced operations (Cookies handling, proxy settings, etc.) require the help of the Handler tool

In the urllib.request module, the BaseHandler class provides several of the most basic methods: default_open(), protocol_request(), and so on, which are inherited by all other Handler subclasses.

HTTPDefaultErrorHandler: handles HTTP response errors by raising an HTTPError exception

HTTPRedirectHandler: handles redirects

HTTPCookieProcessor: handles Cookies

ProxyHandler: sets a proxy; the default proxy is empty

HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords

HTTPBasicAuthHandler: manages authentication; use this subclass when opening a link that requires authentication

Using these subclasses requires the OpenerDirector class, known as an Opener. The urlopen() method described above is essentially an Opener that urllib provides for us; to use the subclasses above, we need to build our own Opener from the appropriate Handlers.

Examples:

# Authentication
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

# Build the password manager
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
# Build the authentication handler and the Opener
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    response = opener.open(url)
    html = response.read().decode('utf8')
    print(html)
except URLError as e:
    print(e.reason)


# Proxy
from urllib.request import ProxyHandler, build_opener

# The proxy runs locally on port 9743 in this example
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
# Requests made through opener.open() will now go through the proxy


# Obtain Cookies
import http.cookiejar, urllib.request

url = 'http://www.baidu.com'   # target site
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
for item in cookie:
    print(item.name + '=' + item.value)

# Save Cookies to a file
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)   # save in Mozilla browser Cookie format
cookie = http.cookiejar.LWPCookieJar(filename)       # or save in LWP format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
cookie.save(ignore_discard=True, ignore_expires=True)

# Read Cookies from the file
cookie = http.cookiejar.LWPCookieJar()
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)

 

2. Exception handling

By properly catching exceptions, more accurate exception judgments can be made, which makes the program more robust. 

The common exception classes are URLError and HTTPError, where HTTPError is a subclass of URLError. Exceptions raised by the request module can be handled by catching URLError, which has a reason attribute that returns the cause of the error. HTTPError is designed for HTTP request errors and has three attributes: code, reason, and headers. code returns the HTTP status code, reason returns the cause of the error, and headers returns the response headers.

Note that the reason attribute may return either a string or an exception object.
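
For instance, when a request times out, reason is a socket.timeout object rather than a string, so isinstance() can be used to pin down the error type. A minimal sketch (the URL and the tiny timeout are chosen only to force the error):

import socket
from urllib import request, error

try:
    response = request.urlopen('http://httpbin.org/get', timeout=0.01)
except error.URLError as e:
    print(type(e.reason))                     # e.g. <class 'socket.timeout'>
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')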

In practice, catch the subclass error (HTTPError) first and the parent error (URLError) afterwards.

Example:

from urllib import request, error
'''
The above import is equivalent to
import urllib.request
import urllib.error
'''

url = 'http://www.baidu.com'
try:
    response = request.urlopen(url)
except error.HTTPError as e:
    # HTTPError is caught first, so its extra attributes can be printed
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

 

3. URL Link Resolution

The urllib library also provides the parse module, which defines the standard interface for handling URLs; it is described below.

urlparse() - implements URL recognition and segmentation, splitting a URL into six parts: scheme, netloc, path, params, query, and fragment, which together form a URL of the form

          scheme://netloc/path;params?query#fragment

The API of urlparse() is as follows:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Detailed parameters:

urlstring: required, URL to resolve

scheme: Optional, default protocol, which will be used when the link has no protocol information

allow_fragments: optional; whether to parse the fragment. When allow_fragments=False, the fragment part is parsed as part of the path, params, or query, and the fragment field becomes empty

In addition, the return value of urlparse() is a named tuple, so we can retrieve each part either by index or by attribute name
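
A minimal sketch of the six-part split and the two access styles (the URL is made up so that every component is present):

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
print(result.netloc, result[1], sep='\n')   # attribute access and index access return the same value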

Example:

from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html#comment'
result = urlparse(url, allow_fragments=False)

print(result.scheme, result[0], sep='\n')

Next, let's look at other ways to resolve links:

urlunparse() - constructs a link from its parts, the reverse of the urlparse() method. Note that the iterable passed to this method must have a length of 6.

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

urlsplit() - similar to urlparse(), it decomposes URLs. The difference is that the result has only 5 parts; params is no longer parsed separately but is kept inside path.

urlunsplit() - similar to urlunparse(), it is the reverse of urlsplit() and takes an iterable of length 5; a sketch of the pair follows.
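
A minimal sketch (the URL and the list contents are made up):

from urllib.parse import urlsplit, urlunsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
# SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))   # -> http://www.baidu.com/index.html?a=6#comment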

urljoin() - another way to generate links. Its API is as follows:

urllib.parse.urljoin(base_url,url)

base_url: the base link. The method parses the scheme, netloc, and path of base_url

url: the new link to process. It can take many forms, containing all six parts of a URL or only a few consecutive parts.

When parts of the new link are missing, the method fills them in from base_url and returns the completed link; if nothing needs to be filled in, the new link is returned as-is.

Example:

from urllib.parse import urljoin
print(urljoin('http://baidu.com/index.html', 'FAQ.html'))   # -> http://baidu.com/FAQ.html

It is worth noting that the params, query, and fragment of base_url have no effect at all; only its scheme, netloc, and path are used.
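
A couple of sketches of this behaviour (the URLs are made up):

from urllib.parse import urljoin

# The query and fragment of base_url are ignored; only scheme, netloc and path are used
print(urljoin('http://www.baidu.com/about.html?wd=abc#comment', 'FAQ.html'))
# -> http://www.baidu.com/FAQ.html

# A new link that already carries its own scheme and netloc is returned unchanged
print(urljoin('http://www.baidu.com/about.html', 'http://httpbin.org/index.html'))
# -> http://httpbin.org/index.html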

urlencode() - serializes GET request parameters, i.e. converts a dictionary into a query string of parameters.

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)   # -> http://www.baidu.com?name=germey&age=22

parse_qs() - deserialization; converts a query string back into a dictionary

parse_qsl() - converts a query string into a list of parameter tuples

from urllib.parse import parse_qs, parse_qsl

query = 'name=germey&age=22'
print(parse_qs(query))    # {'name': ['germey'], 'age': ['22']}
print(parse_qsl(query))   # [('name', 'germey'), ('age', '22')]

quote() - converts content (for example Chinese characters) into URL-encoded form

unquote() - decodes a URL-encoded string

from urllib.parse import quote, unquote

keyword = '壁纸'   # Chinese for "wallpaper"
url = 'http://www.baidu.com/s?wd=' + quote(keyword)
print(url)            # the Chinese characters are percent-encoded
print(unquote(url))   # decoding restores them

  

4. Robots Protocol

After all this learning we have finally reached the last module of the urllib library, the robotparser module, with which we can analyze a website's Robots protocol.

First, let's look at what the Robots protocol is

The Robots protocol is also known as the crawler protocol; its full name is the Robots Exclusion Protocol. Its purpose is to tell crawlers and search engines which pages may be fetched and which may not. It is usually a text file called robots.txt placed in the root directory of a website. When a search crawler visits a site, it first checks whether this file exists and, if so, crawls within the range the file defines.

Example of robots.txt

User-agent: *
Disallow: /
Allow: /public/

 

Here User-agent names a search crawler; its value can be BaiduSpider, Googlebot, and so on. Disallow specifies a directory that must not be crawled; '/' means no page may be crawled. Allow carves out exceptions and is usually combined with Disallow; '/public/' means the public directory may be crawled, which effectively acts as a whitelist.
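
For instance, a robots.txt that admits only one named crawler and shuts out all others could look like this (the crawler name is made up):

User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /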

After learning about the Robots protocol, we can parse robots.txt with the robotparser module. The robotparser module's API is as follows:

urllib.robotparser.RobotFileParser(url='')

When using this class, we can either pass in the URL of robots.txt directly or set it later with the set_url() method. The commonly used methods are:

set_url() - sets the link to the robots.txt file

read() - fetches robots.txt and analyzes it. Note that this method must be called to actually read the file, even though it returns nothing

parse() - parses the robots.txt content; the argument is a list of lines from robots.txt

can_fetch() - takes two arguments, a User-agent and the URL to fetch, and returns a Boolean indicating whether that User-agent may fetch the page

mtime() - returns the time robots.txt was last fetched and analyzed, useful for long-running crawlers that need to re-check robots.txt periodically

modified() - sets the last fetched-and-analyzed time to the current time

Example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
# Equivalent alternative using parse() on the downloaded file contents:
# rp.parse(urllib.request.urlopen('http://www.jianshu.com/robots.txt').read().decode('utf8').split('\n'))
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))

Finally, the author has finished writing this part and is ready to take a rest.

The author also takes this chance to plug his new WeChat public account and welcomes everyone to come and discuss.
