Web crawler basics: the requests library

Keywords: Python crawler

1, Installation of requests Library

On Windows: press Win+R, run cmd, and execute

pip install requests

Official documentation: https://docs.python-requests.org/zh_CN/latest/
Mine is already installed

Test after a successful installation:

import requests
r = requests.get("http://www.baidu.com")  # request Baidu's homepage
print(r.text)  # print the page content

2, The seven main methods of the requests Library and their basic use

(1) Seven methods

• requests.request(): constructs a request; the base method that supports all the methods below
• requests.get(): the main method for obtaining HTML pages; corresponds to HTTP GET
• requests.head(): obtains the header information of an HTML page; corresponds to HTTP HEAD
• requests.post(): submits a POST request to an HTML page; corresponds to HTTP POST
• requests.put(): submits a PUT request to an HTML page; corresponds to HTTP PUT
• requests.patch(): submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
• requests.delete(): submits a DELETE request to an HTML page; corresponds to HTTP DELETE
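
The six convenience methods are thin wrappers around request(); for example, requests.get(url) simply issues requests.request('GET', url). A minimal sketch of the equivalence, using httpbin.org as a test endpoint:

import requests

# requests.get() is a thin wrapper over requests.request('GET', ...)
r1 = requests.get('http://httpbin.org/get')
r2 = requests.request('GET', 'http://httpbin.org/get')
print(r1.status_code, r2.status_code)  # both should print 200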

The most commonly used methods are request(), get(), and post().

(2) Simple use

1. request() method

r = requests.request(method, url, **kwargs)  # constructs a request; the base method that underlies the six methods below
  • method: the request method, corresponding to the seven types such as GET/PUT/POST

  • url: the url link of the page to be obtained

  • **kwargs: 13 parameters controlling access

    • method: the request method; its seven possible values are
      r = requests.request('GET', url, **kwargs)
      r = requests.request('HEAD', url, **kwargs)
      r = requests.request('POST', url, **kwargs)
      r = requests.request('PUT', url, **kwargs)
      r = requests.request('PATCH', url, **kwargs)
      r = requests.request('DELETE', url, **kwargs)
      r = requests.request('OPTIONS', url, **kwargs)
  • params: dictionary or byte sequence, which is added to the url as a parameter

Examples of the access-control parameters, i.e. the third argument **kwargs:

import requests

kv1 = {'key1': 'value1', 'key2': 'value2'}
rs1 = requests.request('GET', 'http://python123.io/ws', params=kv1)   # params: dict appended to the url as the query string
print(rs1.text)
rs2 = requests.request('POST', 'http://python123.io/ws', data=kv1)    # data: dict/bytes/file sent as the request body
print(rs2.text)
kv2 = {'key1': 'value1'}
rs3 = requests.request('POST', 'http://python123.io/ws', json=kv2)    # json: data sent as a JSON body
print(rs3.text)
hd = {'user-agent': 'Chrome/10'}
rs4 = requests.request('POST', 'http://python123.io/ws', headers=hd)  # headers: dict of custom HTTP headers
# cookies takes a dict or a CookieJar; auth takes a tuple, e.g. ('user', 'pass'):
# rs5 = requests.request('POST', url, cookies=...)
# rs6 = requests.request('POST', url, auth=...)
fs = {'file': open('data.xls', 'rb')}                                 # open in binary mode for upload
rs7 = requests.request('POST', 'http://python123.io/ws', files=fs)    # files: transfer files
rs8 = requests.request('GET', 'http://www.baidu.com', timeout=10)     # timeout in seconds
pxs = {'http': 'http://user:pass@10.10.10.1:1234', 'https': 'https://10.10.10.1:4321'}
rs9 = requests.request('GET', 'http://www.baidu.com', proxies=pxs)    # proxies: dict, sets proxy servers for access
• files: dictionary type; transfers files
• timeout: sets the timeout, in seconds
• proxies: dictionary type; sets proxy servers for access and can include login authentication
• allow_redirects: True/False, default True; redirect switch
• stream: True/False, default False; if True, downloading the response body is deferred and streamed on demand (see the sketch below)
• verify: True/False, default True; SSL certificate verification switch
• cert: path to a local SSL certificate
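
As an illustration of the stream switch, here is a minimal sketch that defers the body download and reads it in chunks (httpbin.org/bytes returns random bytes; the output filename is arbitrary):

import requests

# stream=True defers downloading the body; iter_content() pulls it in chunks
r = requests.get('http://httpbin.org/bytes/1024', stream=True, timeout=10)
with open('out.bin', 'wb') as f:
    for chunk in r.iter_content(chunk_size=256):
        f.write(chunk)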

2. get() method

r = requests.get(url, params=None, **kwargs)  # signature
r = requests.get(url)  # the main method for obtaining HTML pages; corresponds to HTTP GET
print(r.url)
• url: the url link of the page to be obtained
• params: additional parameters appended to the url; dictionary or byte-stream format, optional (see the example below)
• **kwargs: 12 access-control parameters
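
A small example of how params is appended to the url, with httpbin standing in for a real page:

import requests

kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=kv)
print(r.url)          # http://httpbin.org/get?key1=value1&key2=value2
print(r.status_code)  # 200 on success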

3. head() method

r = requests.head('http://httpbin.org/get')  # gets the header information of an HTML page; corresponds to HTTP HEAD
print(r.headers)
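
Note that a HEAD response carries no body, so r.text here would be an empty string; that is what makes head() a cheap way to inspect a resource.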

4. post() method

r = requests.post(url, data=None, json=None, **kwargs)
• url: the url link of the page to be updated
• data: dictionary, byte sequence, or file; the content of the Request
• json: data in JSON format; the content of the Request

Example

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=payload)  # submits a POST request; corresponds to HTTP POST
r = requests.post('http://httpbin.org/post', data='abc')    # a plain string is sent as the raw body
print(r.text)
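
In httpbin's echoed response, the dictionary passed via data= is form-encoded and appears under the "form" field, while the raw string 'abc' appears under the "data" field.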

5. put() method

r = requests.put(url, data=None, **kwargs)  # signature
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.put('http://httpbin.org/put', data=payload)  # submits a PUT request; corresponds to HTTP PUT
print(r.text)

6. patch() method

r = requests.patch(url)  # submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
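
A runnable sketch against httpbin (the payload is only illustrative):

import requests

payload = {'key1': 'new_value'}
r = requests.patch('http://httpbin.org/patch', data=payload)  # partial modification
print(r.status_code)  # 200 on success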

7. delete() method

r = requests.delete(url, **kwargs)
r = requests.delete(url)   # Submit a DELETE request to the HTML page, corresponding to the DELETE of HTTP

url: the url link of the page to be deleted

3, Two important objects of requests Library

Response and Request

(1) Request

  • requests.get(url) constructs a Request object that requests resources from the server
  • r is the returned Response object, which contains the resources provided by the server
  • The flow goes from Request to Response; both objects can be inspected, as shown below
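
A minimal sketch for inspecting both objects on a live call; r.request holds the prepared request that was actually sent:

import requests

r = requests.get('http://www.baidu.com')
print(type(r))          # <class 'requests.models.Response'>
print(type(r.request))  # <class 'requests.models.PreparedRequest'>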

(2) Response object

The Response object contains all the information returned by the server, as well as the corresponding Request information.

• r.status_code: the HTTP status of the request; 200 means a successful connection, while a code such as 404 means failure
• r.text: the HTTP response content as a string, i.e. the page content corresponding to the url
• r.encoding: the response content encoding guessed from the HTTP headers
• r.apparent_encoding: the response content encoding inferred from the content itself (a fallback)
• r.content: the HTTP response content in binary form


Sometimes the printed page content comes out garbled. To fix this, you need to understand how Response handles encoding:

• r.encoding: the encoding guessed from the HTTP headers
• r.apparent_encoding: the encoding inferred from the content (a fallback)

1) r.encoding: if there is no charset field in the headers, the encoding is assumed to be ISO-8859-1
2) r.text renders the page content according to r.encoding
3) r.apparent_encoding: the encoding inferred from the page content itself; a fallback for r.encoding (demonstrated below)
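
For example, with Baidu's homepage the header-based guess and the content-based guess typically differ (the exact values depend on the server's response):

import requests

r = requests.get('http://www.baidu.com')
print(r.encoding)           # typically 'ISO-8859-1': no charset in the headers
print(r.apparent_encoding)  # typically 'utf-8', inferred from the content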

After changing the encoding, the content prints correctly. There are two ways to do it:

r.encoding = 'utf-8'               # set the encoding directly
r.encoding = r.apparent_encoding   # use the fallback inferred from the content

4, Exception of requests Library

Network connections carry risk, so exception handling is essential.
(1) Requests exception

• requests.ConnectionError: network connection errors and exceptions, such as DNS lookup failure or connection refused
• requests.HTTPError: HTTP error exception
• requests.URLRequired: missing-URL exception
• requests.TooManyRedirects: the maximum number of redirects was exceeded
• requests.ConnectTimeout: connecting to the remote server timed out
• requests.Timeout: the request to the URL timed out

(2) Response exception

r.raise_for_status(): raises requests.HTTPError if the status code is not 200

r.raise_for_status() checks internally whether r.status_code equals 200, so there is no need to add an extra if statement; this makes exception handling with try-except convenient.
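
A minimal sketch of the pattern, using httpbin's /status/404 endpoint to force a failure:

import requests

try:
    r = requests.get('http://httpbin.org/status/404', timeout=10)
    r.raise_for_status()  # the status is 404, so this raises requests.HTTPError
except requests.HTTPError as e:
    print('HTTP error:', e)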

5, HTTP protocol

(1) HTTP is a stateless application-layer protocol based on the "request-response" model

  • The HTTP protocol uses the URL as the identifier for locating network resources. The URL format is as follows:
    - http://host[:port][path]
    - host: a legal Internet host domain name or IP address
    - port: the port number; the default port is 80
    - path: the path of the requested resource

URL is the Internet path to access resources through HTTP protocol. A URL corresponds to a data resource

(2) HTTP operations on resources

Resources are managed through URLs and HTTP methods; each operation is independent and stateless, so the network channel and the server can be treated as black boxes.

6, General code framework for crawling web pages

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()  # If the status is not 200, an HTTPError exception is thrown
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return 'An exception occurred'

if __name__ == '__main__':
    url = input('Please enter url address:')
    print(getHTMLText(url))

7, Restrictions on web crawlers

Robots protocol

Using the Robots protocol

Web crawlers should automatically or manually check robots.txt and then crawl accordingly.
Binding force: the Robots protocol is advisory rather than binding. A web crawler may choose not to follow it, but doing so carries legal risk.
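
A site's Robots protocol can be inspected by simply fetching its robots.txt (a minimal sketch, using Baidu as an example):

import requests

r = requests.get('http://www.baidu.com/robots.txt', timeout=10)
r.encoding = r.apparent_encoding
print(r.text[:300])  # print the first few rules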

Tip:
Crawling, as a computer technology, is neutral in itself: crawlers are not prohibited by law, and collecting publicly available data is acceptable. Improper use, however, may risk breaking the law or even committing a crime, so crawl reasonably and within the rules.
