The Requests library is the most important and most commonly used library for writing crawlers in Python, so you should master it thoroughly.
Let's get to know this library.
1. A first request

import requests

url = 'http://www.baidu.com'
r = requests.get(url)
print type(r)
print r.status_code
print r.encoding
#print r.content
print r.cookies

//Get:
<class 'requests.models.Response'>
200
ISO-8859-1
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
2. GET requests
values = {'user':'aaa','id':'123'}
url = 'http://www.baidu.com'
r = requests.get(url, params=values)
print r.url

//Get:
http://www.baidu.com/?user=aaa&id=123
3. POST requests
values = {'user':'aaa','id':'123'}
url = 'http://www.baidu.com'
r = requests.post(url, data=values)
print r.url
#print r.text

//Get:
http://www.baidu.com/
4. Request header processing
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.baidu.com/'
r = requests.get(url, headers=header)
print r.content
Pay attention to how request headers are handled.
In many cases the server checks whether a request really comes from a browser, so we need to disguise our request as a browser request in the headers. In general, it is best to disguise every request as a browser request to avoid access denials and other errors; this check is a common anti-crawler strategy.
In particular, whatever request we make from now on, always carry the headers along; don't skip them to save effort. Think of it as a traffic rule: running a red light is not necessarily dangerous, but it is unsafe, so we stop at red and go at green. The same goes for crawler requests: always add the headers to avoid mistakes.
import urllib2

user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.qq.com/'
request = urllib2.Request(url, headers=header)
response = urllib2.urlopen(request)
print response.read().decode('gbk')  # Note: the page content must be decoded; check the page's charset first
Open www.qq.com in a browser and press F12 to view the User-Agent:
User-Agent: some servers or proxies use this value to determine whether the request was made by a browser.
Content-Type: when calling a REST interface, the server checks this value to decide how the content in the HTTP body should be parsed.
application/xml: used in XML RPC, such as RESTful/SOAP calls
application/json: used in JSON RPC calls
application/x-www-form-urlencoded: used by browsers when submitting Web forms
When using a RESTful or SOAP service provided by the server, an incorrect Content-Type setting will cause the server to refuse the request.
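As a concrete illustration, the sketch below posts a JSON body with the Content-Type set explicitly. The httpbin.org endpoint is only an assumed echo service used for testing; any interface that accepts JSON works the same way.

import json
import requests

# Assumed test endpoint (httpbin.org simply echoes the request back);
# substitute the real REST interface you are calling.
url = 'http://httpbin.org/post'
payload = {'user': 'aaa', 'id': '123'}

# Explicitly declare the body as JSON so the server knows how to parse it
header = {'Content-Type': 'application/json'}
r = requests.post(url, data=json.dumps(payload), headers=header)
print r.status_code
print r.text

# Newer versions of requests can set the Content-Type for you via the json parameter:
# r = requests.post(url, json=payload)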
5. Response code and response header processing
url = 'http://www.baidu.com'
r = requests.get(url)
if r.status_code == requests.codes.ok:
    print r.status_code
    print r.headers
    print r.headers.get('content-type')  # the recommended way to read a header field
else:
    r.raise_for_status()

//Get:
200
{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:57 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Wed, 17 Jan 2018 07:21:21 GMT', 'Content-Type': 'text/html'}
text/html
6. Cookie processing
url = 'https://www.zhihu.com/'
r = requests.get(url)
print r.cookies
print r.cookies.keys()

//Get:
<RequestsCookieJar[<Cookie aliyungf_tc=AQAAACYMglZy2QsAEnaG2yYR0vrtlxfz for www.zhihu.com/>]>
['aliyungf_tc']
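Besides reading cookies from a response, you can also send cookies with a request, or let a Session keep them across requests. A minimal sketch follows; the cookie name and the httpbin.org echo endpoint are assumptions made only for illustration.

import requests

# Send a cookie explicitly with a single request
url = 'http://httpbin.org/cookies'        # assumed echo endpoint
r = requests.get(url, cookies={'uid': '123'})
print r.text

# A Session keeps cookies set by the server and re-sends them automatically
s = requests.Session()
s.get('http://www.baidu.com')             # the server sets the BDORZ cookie here
print s.cookies.keys()                    # names of the stored cookies
r = s.get('http://www.baidu.com')         # the cookie is sent back automatically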
7. Redirection and history
To handle redirection, you only need to set the allow_redirects field: setting allow_redirects to True allows redirection, and setting it to False disables redirection.
url = 'http://www.baidu.com'
r = requests.get(url, allow_redirects=True)
print r.url
print r.status_code
print r.history

//Get:
http://www.baidu.com/
200
[]
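The history above is empty because http://www.baidu.com does not redirect. Here is a sketch that actually triggers a redirect; it assumes http://github.com still issues a 301 to its HTTPS address.

import requests

url = 'http://github.com'

# Follow the redirect (the default behaviour)
r = requests.get(url, allow_redirects=True)
print r.url        # final URL after the redirect
print r.history    # intermediate Response objects, e.g. [<Response [301]>]

# Refuse to follow it
r = requests.get(url, allow_redirects=False)
print r.status_code                 # the 301 itself
print r.headers.get('location')     # where the server wanted to send us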
8. Timeout setting
The timeout is set with the timeout parameter (in seconds).
url = 'http://www.baidu.com'
r = requests.get(url, timeout=2)
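If the server does not answer within the limit, requests raises an exception instead of returning a response. A minimal sketch of catching it; the 0.001-second limit is deliberately tiny just to force the timeout.

import requests

url = 'http://www.baidu.com'
try:
    r = requests.get(url, timeout=0.001)   # unrealistically short on purpose
except requests.exceptions.Timeout:
    print 'the request timed out'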
9. Proxy settings
# The proxies dict maps each URL scheme to one proxy address; the keys must be
# unique, and the values should point to real proxy servers (the addresses
# below are only placeholders).
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
url = 'http://www.baidu.com'
r = requests.get(url, proxies=proxies)
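If the proxy requires HTTP Basic authentication, the credentials can be embedded in the proxy URL. A short sketch; the host, port, user and password below are all placeholders.

import requests

proxies = {
    'http': 'http://user:password@10.10.1.10:3128/',   # placeholder credentials and address
}
url = 'http://www.baidu.com'
r = requests.get(url, proxies=proxies)
print r.status_code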
Author: Ni Ping Yu