Crawler: introduction to the basic request library urllib

Keywords: Python crawler http

Contents

1. Send request  

  data parameter

  timeout parameter

Other parameters  

  Comparison between urlopen and Request

Use Request for flexible configuration

Advanced Usage

Password verification

Use of proxies

Cookie processing

In Python 2, there were two libraries for sending requests, urllib and urllib2. In Python 3, urllib2 no longer exists; everything has been unified into the urllib library. It is Python's built-in HTTP request library and consists mainly of four modules:

Request: the most basic HTTP request module. It can be used to simulate sending a request, just as you would type a URL into the browser's address bar and press Enter. You only need to pass the URL and additional parameters to the library's methods to simulate this process

Error: the exception handling module. If a request error occurs, you can catch these exceptions and retry or take other actions to ensure that the program does not terminate unexpectedly

Parse: a tool module that provides many URL processing methods, such as splitting, parsing, and merging

Robotparser: mainly used to parse a website's robots.txt file and then determine which pages can be crawled and which cannot. In practice it is used less often; a brief sketch of the parse and robotparser modules follows below
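The request and error modules are covered in detail in the rest of this article. As a quick sketch only (the URLs here are just examples, not taken from the original text), the parse and robotparser modules can be used like this:

import urllib.parse
import urllib.robotparser

# urllib.parse: split a URL into components and merge relative URLs
parts = urllib.parse.urlparse("http://httpbin.org/get?word=hello")
print(parts.netloc, parts.path, parts.query)                  # httpbin.org /get word=hello
print(urllib.parse.urljoin("http://httpbin.org/", "post"))    # http://httpbin.org/post

# urllib.robotparser: check robots.txt before crawling
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://httpbin.org/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://httpbin.org/get"))            # True/False depending on the site's rules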

1. Send request  

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):

The urllib.request module provides the most basic method for constructing HTTP requests. It can be used to simulate the request process initiated by a browser, and it also handles authorization authentication, redirection, browser Cookies and other matters.

import urllib.request

response_data = urllib.request.urlopen("https://www.baidu.com")
print(type(response_data))    # <class 'http.client.HTTPResponse'>
print(response_data.read().decode("utf-8"))   # Print the full source code of the page
print(response_data.status)    # Response status code
print(response_data.getheaders())    # All response headers, as a list of tuples
print(response_data.getheader("Server"))    # Value of a single response header
print(dir(response_data))                    # List all attributes and methods of the response object

  data parameter

The data parameter is optional. If you want to add it, it must be of the bytes (byte stream) type; if it is not, you need to convert it with the bytes() method first

import urllib.request
import urllib.parse

data = bytes(urllib.parse.urlencode({"word":"hello"}),encoding="utf-8")
response_data = urllib.request.urlopen("http://httpbin.org/post",data=data)
print(response_data.read())

Here we passed a parameter word with the value hello, which needs to be transcoded into bytes (a byte stream). The conversion uses the bytes() method: its first parameter must be of str (string) type, so the urlencode() method in the urllib.parse module is used first to convert the parameter dictionary into a string; the second parameter specifies the encoding format, which here is utf-8

  timeout parameter

The timeout parameter sets a timeout in seconds: if the request receives no response within the set time, an exception is thrown

import urllib.request
import urllib.parse

response_data = urllib.request.urlopen("http://httpbin.org/get",timeout=0.1)
print(response_data.read())

The result will throw: urllib.error.URLError: <urlopen error timed out>

After the exception is thrown, you can handle it with a try/except statement

import urllib.request
import socket
import urllib.error
import urllib.parse

try:
    response_data = urllib.request.urlopen("http://httpbin.org/get",timeout=0.1)
    print(response_data.read())
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print("TIME OUT")

The exception's reason attribute here is of the socket.timeout type, which means the error was a timeout

Other parameters  

In addition to the data parameter and timeout parameter, there is also a context parameter, which must be of type ssl.SSLContext to specify SSL settings

In addition, cafile and capath parameters specify the CA certificate and its path respectively, which will be used when requesting HTTPS connection

The cadefault parameter has been deprecated and its default value is False
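As a brief sketch of how the context parameter might be used (this is not from the original example; ssl.create_default_context() is the standard-library helper assumed here):

import ssl
import urllib.request

# Pass a custom SSLContext through the context parameter for an HTTPS request
context = ssl.create_default_context()     # verifies certificates with the default CA bundle
response = urllib.request.urlopen("https://www.baidu.com", context=context)
print(response.status)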

  Comparison between urlopen and Request

The urlopen () method can initiate the most basic request, but these simple parameters are not enough to build a complete request

If you need to add Headers and other information to the Request, you can use a more powerful Request class to build it

urlopen: it simply sends a request and cannot handle the request header

Request: the request headers and other information of the request being sent can be customized, so it is widely used

Use Request for flexible configuration

   urllib.request.Request(url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None)

url: the URL used for the request. This is the only required parameter; the others are optional

Data: if data is to be transmitted, it must be of bytes (byte stream) type. If it is a dictionary, it can be encoded with urlencode () in urllib.parse module first

  Headers: a dictionary containing the request headers. We can pass the headers parameter directly when constructing the Request, or add headers afterwards by calling the add_header() method of the Request instance. The most common use is to modify the User-Agent to disguise the request as a browser. The default User-Agent is Python-urllib; to disguise the request as Firefox, for example, you can set it to:   Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11

origin_req_host: refers to the host name or IP address of the requestor

Unverifiable: indicates whether the request is unverifiable. The default is False. An unverifiable request is one whose URL the user had no option to approve; for example, when an image embedded in an HTML document is fetched automatically without the user's permission, the value of unverifiable is True

Method: a string indicating the HTTP method used in the request, such as GET, POST, or PUT

import urllib.request
import urllib.parse

url = "http://httpbin.org/post"
headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36",
    "Host": "httpbin.org"
}
params = {
    "name": "zhangsan"
}

data = bytes(urllib.parse.urlencode(params), encoding="utf-8")
request_data = urllib.request.Request(url=url, data=data, headers=headers, method="POST")
response_data = urllib.request.urlopen(request_data)
print(response_data.read().decode("utf-8"))

In addition, headers can also be added with the add_header() method:

req = urllib.request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36')

Advanced Usage

For handling Cookies, proxy settings, and so on, urllib provides Handler classes. The BaseHandler class in the urllib.request module is the parent class of all other Handlers; it provides the most basic methods, such as default_open() and protocol_request()

HTTPDefaultErrorHandler: used to handle HTTP response errors; all errors raise exceptions of the HTTPError type
HTTPRedirectHandler: used to handle redirection
HTTPCookieProcessor: used to process Cookies
ProxyHandler: used to set a proxy; the default proxy is empty
HTTPPasswordMgr: used to manage passwords; it maintains a table of user names and passwords
HTTPBasicAuthHandler: used to manage authentication; if authentication is required when a link is opened, it can be used to solve the authentication problem

Another important class is OpenerDirector, which we can call an Opener. The urlopen() method we used before is in fact an Opener provided by urllib;

An Opener can use the open() method, and its return type is the same as that of urlopen(). So what does it have to do with Handlers? In short, Handlers are used to build Openers
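For example, a minimal sketch (assuming the standard build_opener() and install_opener() helpers) of how a Handler turns into an Opener, and how that Opener can even replace the default one used by urlopen():

import http.cookiejar
import urllib.request

# Build an Opener from a Handler
handler = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")    # use the Opener directly

# Optionally install it as the global default, so urlopen() also goes through it
urllib.request.install_opener(opener)
response = urllib.request.urlopen("http://www.baidu.com")
print(response.status)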

Password verification

Some websites will pop up a prompt box when opening, which directly prompts you to enter your user name and password. You can view the page only after successful verification

from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

user_name = "username"
password = "password"
url = 'http://localhost:5000/'
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None,url,user_name,password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode("utf-8")
    print(html)
except URLError as e:
    print(e.reason)
    

1. Here, the HTTPBasicAuthHandler object is first instantiated; its parameter is an HTTPPasswordMgrWithDefaultRealm object, to which the user name and password are added with add_password(). This establishes a Handler that handles authentication

2. Use this Handler with the build_opener() method to build an Opener; requests sent through this Opener are already authenticated

3. Use the Opener's open() method to open the link; the verification completes, and the result obtained is the source code of the verified page

Use of proxies

When writing a crawler, you often have to use proxies. To add a proxy, you can do it like this

from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener

proxy_handler = ProxyHandler({
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743"
})
opener = build_opener(proxy_handler)
try:
    response = opener.open("https://www.baidu.com")
    print(response.read().decode("utf-8"))
except URLError as e:
    print(e.reason)

1. Here we set up a local proxy, running on port 9743.

2. ProxyHandler is used; its parameter is a dictionary whose key names are the protocol types (such as http or https) and whose key values are the proxy links. Multiple proxies can be added

3. Use this Handler and the build_opener() method to construct an Opener, and then send the request
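If the proxy requires authentication, one option (a sketch only; the address and credentials below are placeholders, not real values from this article) is to embed the user name and password in the proxy URL:

from urllib.request import ProxyHandler, build_opener

# Placeholder proxy address with embedded credentials (user:password@host:port)
proxy_handler = ProxyHandler({
    "http": "http://username:password@127.0.0.1:9743",
    "https": "http://username:password@127.0.0.1:9743"
})
opener = build_opener(proxy_handler)
response = opener.open("https://www.baidu.com")
print(response.status)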

Cookie processing

We also need the relevant Handler for processing Cookies

import http.cookiejar,urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)
    

Here a CookieJar object is first declared. Next, HTTPCookieProcessor is used to build a Handler, and finally the build_opener() method constructs an Opener, whose open() method is executed

#Operation results:
BAIDUID=30172F7439FC0FEED244E455C6D8204E:FG=1
BIDUPSID=30172F7439FC0FEE227F1F995DE95884
H_PS_PSSID=35292_34442_35104_35240_35048_35096_34584_34504_35167_35318_26350_35145_35301
PSTM=1638079765
BDSVRTM=0
BD_HOME=1

Cookies can actually also be saved to a text file

import http.cookiejar,urllib.request

filename = "cookies.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True,ignore_expires=True)

Here CookieJar needs to be replaced by MozillaCookieJar, which is used when generating the file. It is a subclass of CookieJar that can handle Cookies and file-related events, such as reading and saving Cookies, and it saves Cookies in the cookie format of the Mozilla browser. After running the code, you will find that a cookies.txt file has been generated

  LWPCookieJar can also read and save Cookies, but the saved format differs from that of MozillaCookieJar: it saves Cookies in libwww-perl (LWP) format.

import http.cookiejar,urllib.request

filename = "cookies.txt"
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True,ignore_expires=True)

The generated cookies.txt file is now in LWP cookie format

  Read Cookies file

import http.cookiejar,urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load("cookies.txt",ignore_expires=True,ignore_discard=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
print(response.read().decode("utf-8"))

Here, the load() method is called to read the local Cookies file and obtain its contents. The premise is that we first generated the file in LWP format and saved it; after reading the Cookies, the Handler and Opener are built in the same way as before
