Crawler: requests module

Keywords: Python encoding JSON pip network

1. requests module

1.1 introduction to requests

Requests is a powerful yet simple and easy-to-use HTTP library. Compared with the urllib module used previously, the API of the requests module is more convenient (under the hood it is a wrapper around urllib3).

You can install it with the pip install requests command, but downloads from the default index can be slow or fail because of network problems, so I use a domestic mirror to speed things up.

Here we use the Douban mirror:

pip install Package name -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

Just replace the package name, and the module downloads quickly.

1.2 requests

There are many request methods, but we only talk about the two most commonly used: GET request and POST request.

1.2.1 GET request

requests.get() sends a GET request to the target URL. It returns a Response object, which is explained in section 1.3.

Parameters of the GET method:

url: required, specifies the URL to request

params: dictionary type, specifies the query parameters that are appended to the URL. It is often used when sending GET requests

Example:

import requests
url = 'http://www.httpbin.org/get'
params = {
    'key1':'value1',
    'key2':'value2'
}
response = requests.get(url=url,params=params)
print(response.text)
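
To see what params does without touching the network, the request can be prepared but not sent. This is a minimal sketch, assuming requests is installed; requests.Request and prepare() belong to the library, and the variable name prepared is only illustrative:

```python
import requests

url = 'http://www.httpbin.org/get'
params = {
    'key1': 'value1',
    'key2': 'value2'
}

# Build the request and prepare it, but do not send it
prepared = requests.Request('GET', url, params=params).prepare()

# The params dictionary is URL-encoded and appended as the query string
print(prepared.url)
# http://www.httpbin.org/get?key1=value1&key2=value2
```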

headers: dictionary type, specifies the request headers

Example:

import requests
url = 'http://www.httpbin.org/headers'
headers = {
    'USER-AGENT':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
response = requests.get(url=url,headers=headers)
print(response.text)
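
Header names are case-insensitive, which is why 'USER-AGENT' works as a key. The sketch below checks this offline by preparing the request without sending it; the User-Agent value is a hypothetical placeholder:

```python
import requests

headers = {'USER-AGENT': 'my-crawler/0.1'}  # hypothetical User-Agent value
prepared = requests.Request(
    'GET', 'http://www.httpbin.org/headers', headers=headers
).prepare()

# The prepared headers ignore case, so any spelling finds the same entry
print(prepared.headers['User-Agent'])
# my-crawler/0.1
```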

proxies: dictionary type, maps a URL scheme (http or https) to the proxy to use for it

Example:

import requests
url = 'http://www.httpbin.org/ip'
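proxies = {
    # One entry per URL scheme; a repeated 'http' key would silently overwrite the first one
    'http':'http://113.116.127.164:8123',
    'https':'http://113.116.127.164:80'
}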
response = requests.get(url=url,proxies=proxies)
print(response.text)
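
The proxies dictionary is looked up by URL scheme, and a Python dictionary keeps only one value per key, so http and https each need their own entry. As a rough sketch, the helper requests.utils.select_proxy shows which proxy would be picked without sending anything. The proxy addresses are the ones from the example above and are probably no longer live:

```python
from requests.utils import select_proxy

proxies = {
    'http': 'http://113.116.127.164:8123',
    'https': 'http://113.116.127.164:80'
}

# Show which proxy would be chosen for each URL (nothing is sent)
print(select_proxy('http://www.httpbin.org/ip', proxies))
# http://113.116.127.164:8123
print(select_proxy('https://www.httpbin.org/ip', proxies))
# http://113.116.127.164:80
```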

cookies: dictionary type, specifies the cookies to send

Example:

import requests
url = 'http://www.httpbin.org/cookies'
cookies = {
    'name1':'value1',
    'name2':'value2'
}
response = requests.get(url=url,cookies=cookies)
print(response.text)
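
On the wire, the cookies dictionary becomes a single Cookie request header. A small offline sketch, again preparing the request without sending it (the variable names are illustrative):

```python
import requests

cookies = {'name1': 'value1', 'name2': 'value2'}
prepared = requests.Request(
    'GET', 'http://www.httpbin.org/cookies', cookies=cookies
).prepare()

# Each name=value pair ends up in the Cookie header, separated by "; "
cookie_header = prepared.headers['Cookie']
print(cookie_header)
```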

auth: tuple type, specifies the username and password to log in with (sent as HTTP Basic authentication)

Example:

import requests
url = 'http://www.httpbin.org/basic-auth/user/password'
auth = ('user','password')
response = requests.get(url=url,auth=auth)
print(response.text)
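
A (username, password) tuple is turned into an HTTP Basic Authorization header. The sketch below inspects that header offline. It also shows that the credentials are only base64-encoded, not encrypted, so they are best sent over HTTPS:

```python
import base64
import requests

auth = ('user', 'password')
prepared = requests.Request(
    'GET', 'http://www.httpbin.org/basic-auth/user/password', auth=auth
).prepare()

# The tuple is sent as an HTTP Basic Authorization header
print(prepared.headers['Authorization'])
# Basic dXNlcjpwYXNzd29yZA==

# The token is just base64 of "user:password"
token = prepared.headers['Authorization'].split(' ', 1)[1]
decoded = base64.b64decode(token).decode('utf-8')
print(decoded)
# user:password
```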

verify: Boolean type, specifies whether the server's certificate is verified when requesting an HTTPS website. The default value is True, which means the certificate is verified. To skip certificate verification, set it to False

import requests
response = requests.get(url='https://www.httpbin.org/',verify=False)

In this case an InsecureRequestWarning is usually shown, because urllib3 wants us to use certificate verification.

If you do not want to see the warning, you can suppress it with the following code:

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

timeout: number or tuple type, specifies how many seconds to wait for the server. If no response is received within the specified time, a requests.exceptions.Timeout exception is raised
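
A minimal sketch of handling a timeout. The helper name get_or_none and its default of 3 seconds are assumptions for illustration; requests.exceptions.Timeout is the library's base class for both connect and read timeouts:

```python
import requests

def get_or_none(url, timeout=3):
    """Return the Response, or None if the server did not answer in time."""
    try:
        # timeout may be one number, or a (connect, read) tuple of seconds
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Raised for both connect timeouts and read timeouts
        return None
```

For example, get_or_none('http://www.httpbin.org/delay/5', timeout=1) should return None, assuming the httpbin delay endpoint is reachable and really waits 5 seconds before answering.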

1.2.2 POST request

The difference between a POST request and a GET request is that POST data is sent in the request body rather than in the URL, so it does not appear in the address bar and is not subject to URL length limits (a server may still cap the size of the body).

POST accepts almost the same parameters as GET. In addition to params, POST can take a data parameter.

data: dictionary type, specifies the form fields to send in the request body, commonly used when sending POST requests

Example:

import requests
url = 'http://www.httpbin.org/post'
data = {
    'key1':'value1',
    'key2':'value2'
}
response = requests.post(url=url,data=data)
print(response.text)
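
To confirm that data travels in the request body and not in the URL, the POST request can be prepared without being sent. A short offline sketch, with illustrative variable names:

```python
import requests

data = {'key1': 'value1', 'key2': 'value2'}
prepared = requests.Request(
    'POST', 'http://www.httpbin.org/post', data=data
).prepare()

# The URL carries no query string
print(prepared.url)
# http://www.httpbin.org/post

# The form fields are URL-encoded into the body
print(prepared.body)
# key1=value1&key2=value2

# requests sets the matching Content-Type automatically
print(prepared.headers['Content-Type'])
# application/x-www-form-urlencoded
```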

1.3 requests response

1.3.1 response attribute

After a GET or POST request is sent, a Response object is returned. Its commonly used attributes and methods are listed below:

response.url: return the URL of the requested website

response.status_code: return the status code of the response

response.encoding: return the encoding method of the response

response.cookies: return the Cookie information of the response

response.headers: return response headers

response.content: returns the response body of bytes type

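response.text: returns the response body of str type, decoded with response.encoding (if that is None, requests guesses the encoding from the content)

response.json(): parses the response body as JSON and returns the result (a dict for a JSON object), roughly equivalent to json.loads(response.text)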

import requests
response = requests.get('http://www.httpbin.org/get')
print(type(response))
# <class 'requests.models.Response'>
print(response.url) # Return the requested URL
# http://www.httpbin.org/get
print(response.status_code) # Return the status code of the response
# 200
print(response.encoding) # Return the encoding of the response
# None
print(response.cookies) # Return the Cookie information
# <RequestsCookieJar[]>
print(response.headers) # Return response header
# {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Mon, 16 Dec 2019 03:16:22 GMT', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Server': 'nginx', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '189', 'Connection': 'keep-alive'}
print(type(response.content)) # Return the response body as bytes
# <class 'bytes'>
print(type(response.text)) # Return the response body as str
# <class 'str'>
print(type(response.json())) # Return the response body parsed from JSON
# <class 'dict'>
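
The relationship between content, text and json() can be checked without a network call by filling in a Response by hand. This is only a sketch: setting the private _content attribute is a demo shortcut, not normal usage, and the JSON body is made up for illustration:

```python
import json
import requests

# Build a Response by hand instead of fetching one (demo-only shortcut)
response = requests.models.Response()
response.status_code = 200
response._content = '{"city": "北京"}'.encode('utf-8')  # made-up body
response.encoding = 'utf-8'

print(type(response.content))
# <class 'bytes'>

# text is content decoded with response.encoding
print(response.text == response.content.decode(response.encoding))
# True

# json() parses the body, much like json.loads on the text
print(response.json() == json.loads(response.text))
# True
```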

1.3.2 encoding problems

# Encoding problem
import requests
response=requests.get('http://www.autohome.com/news/')
# The page returned by the Autohome website is gb2312-encoded, but when the
# Content-Type header of a text response names no charset, requests falls back
# to ISO-8859-1. Uncomment the next line, otherwise the Chinese text is garbled.
# response.encoding='gbk'
print(response.text)
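
The garbling can be reproduced with plain bytes, without any request. In this sketch the sample string is an assumption chosen for illustration; it shows why decoding gb2312 bytes as ISO-8859-1 fails quietly while gbk works:

```python
title = '汽车之家'              # sample Chinese text, assumed for illustration
raw = title.encode('gb2312')   # the bytes a gb2312 page would send

# Decoding with the wrong charset raises no error but yields garbled text
garbled = raw.decode('iso-8859-1')
print(garbled == title)
# False

# gbk is a superset of gb2312, so it decodes the same bytes correctly
fixed = raw.decode('gbk')
print(fixed == title)
# True
```

Setting response.encoding = 'gbk' makes response.text perform the second kind of decoding.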

Posted by neogemima on Sun, 15 Dec 2019 21:06:05 -0800