Web crawler basics: the requests library

Keywords: Python crawler

1, Installation of requests Library

On Windows: press Win+R, run cmd, and execute

pip install requests

Official documentation: https://docs.python-requests.org/zh_CN/latest/
Mine is already installed

Test after a successful installation:

import requests
r = requests.get("http://www.baidu.com")  # request Baidu's homepage
print(r.text)  # print the page content

2, The seven main methods of the requests Library and their basic use

(1) Seven methods

• requests.request(): constructs a request; the base method that supports all the methods below
• requests.get(): the main method for obtaining HTML pages; corresponds to HTTP GET
• requests.head(): obtains the header information of an HTML page; corresponds to HTTP HEAD
• requests.post(): submits a POST request to an HTML page; corresponds to HTTP POST
• requests.put(): submits a PUT request to an HTML page; corresponds to HTTP PUT
• requests.patch(): submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
• requests.delete(): submits a DELETE request to an HTML page; corresponds to HTTP DELETE
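
The six convenience methods are thin wrappers around request(); for example, requests.get(url) simply issues requests.request('GET', url). A minimal sketch of the equivalence, using httpbin.org as a test endpoint:

import requests

# requests.get() is a thin wrapper over requests.request('GET', ...)
r1 = requests.get('http://httpbin.org/get')
r2 = requests.request('GET', 'http://httpbin.org/get')
print(r1.status_code, r2.status_code)  # both should print 200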

The most commonly used methods are request(), get(), and post().

(2) Simple use

1. request() method

r = requests.request(method, url, **kwargs)  # constructs a request; the base method that underlies the six methods below
  • method: the request method, corresponding to the seven types such as GET/PUT/POST

  • url: the url link of the page to be obtained

  • **kwargs: 13 parameters controlling access

    • method: the request method; its seven possible values are
      r = requests.request('GET', url, **kwargs)
      r = requests.request('HEAD', url, **kwargs)
      r = requests.request('POST', url, **kwargs)
      r = requests.request('PUT', url, **kwargs)
      r = requests.request('PATCH', url, **kwargs)
      r = requests.request('DELETE', url, **kwargs)
      r = requests.request('OPTIONS', url, **kwargs)
  • params: dictionary or byte sequence, which is added to the url as a parameter

Examples of the access-control parameters, i.e. the third argument **kwargs:

import requests

kv1 = {'key1': 'value1', 'key2': 'value2'}
rs1 = requests.request('GET', 'http://python123.io/ws', params=kv1)   # params: dict appended to the url as the query string
print(rs1.text)
rs2 = requests.request('POST', 'http://python123.io/ws', data=kv1)    # data: dict/bytes/file sent as the request body
print(rs2.text)
kv2 = {'key1': 'value1'}
rs3 = requests.request('POST', 'http://python123.io/ws', json=kv2)    # json: data sent as a JSON body
print(rs3.text)
hd = {'user-agent': 'Chrome/10'}
rs4 = requests.request('POST', 'http://python123.io/ws', headers=hd)  # headers: dict of custom HTTP headers
# cookies takes a dict or a CookieJar; auth takes a tuple, e.g. ('user', 'pass'):
# rs5 = requests.request('POST', url, cookies=...)
# rs6 = requests.request('POST', url, auth=...)
fs = {'file': open('data.xls', 'rb')}                                 # open in binary mode for upload
rs7 = requests.request('POST', 'http://python123.io/ws', files=fs)    # files: transfer files
rs8 = requests.request('GET', 'http://www.baidu.com', timeout=10)     # timeout in seconds
pxs = {'http': 'http://user:pass@10.10.10.1:1234', 'https': 'https://10.10.10.1:4321'}
rs9 = requests.request('GET', 'http://www.baidu.com', proxies=pxs)    # proxies: dict, sets proxy servers for access
• files: dictionary type; transfers files
• timeout: sets the timeout, in seconds
• proxies: dictionary type; sets proxy servers for access and can include login authentication
• allow_redirects: True/False, default True; redirect switch
• stream: True/False, default False; if True, downloading the response body is deferred and streamed on demand (see the sketch below)
• verify: True/False, default True; SSL certificate verification switch
• cert: path to a local SSL certificate
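
As an illustration of the stream switch, here is a minimal sketch that defers the body download and reads it in chunks (httpbin.org/bytes returns random bytes; the output filename is arbitrary):

import requests

# stream=True defers downloading the body; iter_content() pulls it in chunks
r = requests.get('http://httpbin.org/bytes/1024', stream=True, timeout=10)
with open('out.bin', 'wb') as f:
    for chunk in r.iter_content(chunk_size=256):
        f.write(chunk)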

2. get() method

r = requests.get(url, params=None, **kwargs)  # signature
r = requests.get(url)  # the main method for obtaining HTML pages; corresponds to HTTP GET
print(r.url)
• url: the url link of the page to be obtained
• params: additional parameters appended to the url; dictionary or byte-stream format, optional (see the example below)
• **kwargs: 12 access-control parameters
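
A small example of how params is appended to the url, with httpbin standing in for a real page:

import requests

kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=kv)
print(r.url)          # http://httpbin.org/get?key1=value1&key2=value2
print(r.status_code)  # 200 on success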

3. head() method

r = requests.head('http://httpbin.org/get')  # gets the header information of an HTML page; corresponds to HTTP HEAD
print(r.headers)
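
Note that a HEAD response carries no body, so r.text here would be an empty string; that is what makes head() a cheap way to inspect a resource.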

4. post() method

r = requests.post(url, data=None, json=None, **kwargs)
• url: the url link of the page to be updated
• data: dictionary, byte sequence, or file; the content of the Request
• json: data in JSON format; the content of the Request

Example

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=payload)  # submits a POST request; corresponds to HTTP POST
r = requests.post('http://httpbin.org/post', data='abc')    # a plain string is sent as the raw body
print(r.text)
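
In httpbin's echoed response, the dictionary passed via data= is form-encoded and appears under the "form" field, while the raw string 'abc' appears under the "data" field.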

5. put() method

r = requests.put(url, data=None, **kwargs)  # signature
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.put('http://httpbin.org/put', data=payload)  # submits a PUT request; corresponds to HTTP PUT
print(r.text)

6. patch() method

r = requests.patch(url)  # submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
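
A runnable sketch against httpbin (the payload is only illustrative):

import requests

payload = {'key1': 'new_value'}
r = requests.patch('http://httpbin.org/patch', data=payload)  # partial modification
print(r.status_code)  # 200 on success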

7. delete() method

r = requests.delete(url, **kwargs)
r = requests.delete(url)   # Submit a DELETE request to the HTML page, corresponding to the DELETE of HTTP

url: the url link of the page to be deleted

3, Two important objects of requests Library

Response and Request

(1) Request

  • requests.get(url) constructs a Request object that requests resources from the server
  • r is the returned Response object, which contains the resources provided by the server
  • The flow goes from Request to Response; both objects can be inspected, as shown below
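
A minimal sketch for inspecting both objects on a live call; r.request holds the prepared request that was actually sent:

import requests

r = requests.get('http://www.baidu.com')
print(type(r))          # <class 'requests.models.Response'>
print(type(r.request))  # <class 'requests.models.PreparedRequest'>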

(2) Response object

The Response object contains all the information returned by the server, as well as the corresponding Request information.

• r.status_code: the HTTP status of the request; 200 means a successful connection, while a code such as 404 means failure
• r.text: the HTTP response content as a string, i.e. the page content corresponding to the url
• r.encoding: the response content encoding guessed from the HTTP headers
• r.apparent_encoding: the response content encoding inferred from the content itself (a fallback)
• r.content: the HTTP response content in binary form


Sometimes the printed page content comes out garbled. To fix this, you need to understand how Response handles encoding:

• r.encoding: the encoding guessed from the HTTP headers
• r.apparent_encoding: the encoding inferred from the content (a fallback)

1) r.encoding: if there is no charset field in the headers, the encoding is assumed to be ISO-8859-1
2) r.text renders the page content according to r.encoding
3) r.apparent_encoding: the encoding inferred from the page content itself; a fallback for r.encoding (demonstrated below)
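
For example, with Baidu's homepage the header-based guess and the content-based guess typically differ (the exact values depend on the server's response):

import requests

r = requests.get('http://www.baidu.com')
print(r.encoding)           # typically 'ISO-8859-1': no charset in the headers
print(r.apparent_encoding)  # typically 'utf-8', inferred from the content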

After changing the encoding, the content prints correctly. There are two ways to do it:

r.encoding = 'utf-8'               # set the encoding directly
r.encoding = r.apparent_encoding   # use the fallback inferred from the content

4, Exception of requests Library

Network connections carry risk, so exception handling is essential.
(1) Requests exception

• requests.ConnectionError: network connection errors and exceptions, such as DNS lookup failure or connection refused
• requests.HTTPError: HTTP error exception
• requests.URLRequired: missing-URL exception
• requests.TooManyRedirects: the maximum number of redirects was exceeded
• requests.ConnectTimeout: connecting to the remote server timed out
• requests.Timeout: the request to the URL timed out

(2) Response exception

r.raise_for_status(): raises requests.HTTPError if the status code is not 200

r.raise_for_status() checks internally whether r.status_code equals 200, so there is no need to add an extra if statement; this makes exception handling with try-except convenient.
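
A minimal sketch of the pattern, using httpbin's /status/404 endpoint to force a failure:

import requests

try:
    r = requests.get('http://httpbin.org/status/404', timeout=10)
    r.raise_for_status()  # the status is 404, so this raises requests.HTTPError
except requests.HTTPError as e:
    print('HTTP error:', e)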

5, HTTP protocol

(1) HTTP is a stateless application-layer protocol based on the "request-response" model

  • The HTTP protocol uses the URL as the identifier for locating network resources. The URL format is as follows:
    - http://host[:port][path]
    - host: a legal Internet host domain name or IP address
    - port: the port number; the default port is 80
    - path: the path of the requested resource

URL is the Internet path to access resources through HTTP protocol. A URL corresponds to a data resource

(2) HTTP operations on resources

Resources are managed through URLs and HTTP methods; each operation is independent and stateless, so the network channel and the server can be treated as black boxes.

6, General code framework for crawling web pages

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()  # If the status is not 200, an HTTPError exception is thrown
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return 'An exception occurred'

if __name__ == '__main__':
    url = input('Please enter url address:')
    print(getHTMLText(url))

7, Restrictions on web crawlers

Robots protocol

Using the Robots protocol

Web crawlers should automatically or manually check robots.txt and then crawl accordingly.
Binding force: the Robots protocol is advisory rather than binding. A web crawler may choose not to follow it, but doing so carries legal risk.
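
A site's Robots protocol can be inspected by simply fetching its robots.txt (a minimal sketch, using Baidu as an example):

import requests

r = requests.get('http://www.baidu.com/robots.txt', timeout=10)
r.encoding = r.apparent_encoding
print(r.text[:300])  # print the first few rules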

Tip:
Crawling, as a computer technology, is neutral in itself: crawlers are not prohibited by law, and collecting publicly available data is acceptable. Improper use, however, may risk breaking the law or even committing a crime, so crawl reasonably and within the rules.
