The Requests library is the most important and most commonly used library for writing crawlers in Python, so you should master it thoroughly.
Let's get to know this library.
1. A first request

import requests

url = 'http://www.baidu.com'
r = requests.get(url)
print type(r)
print r.status_code
print r.encoding
#print r.content
print r.cookies

//Get:
<class 'requests.models.Response'>
200
ISO-8859-1
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
2. GET requests
values = {'user':'aaa','id':'123'}
url = 'http://www.baidu.com'
r = requests.get(url, params=values)
print r.url

//Get:
http://www.baidu.com/?user=aaa&id=123
3. POST requests
values = {'user':'aaa','id':'123'}
url = 'http://www.baidu.com'
r = requests.post(url, data=values)
print r.url
#print r.text

//Get:
http://www.baidu.com/
4. Request header processing
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.baidu.com/'
r = requests.get(url, headers=header)
print r.content
Pay attention to how request headers are handled.
In many cases the server checks whether a request really comes from a browser, so we need to disguise our request as a browser request in the headers. In general, it is best to disguise every request as a browser request to avoid access denials and other errors; this check is a common anti-crawler strategy.
In particular, whatever request we make from now on, always carry the headers along; don't skip them to save effort. Think of it as a traffic rule: running a red light is not necessarily dangerous, but it is unsafe, so we stop at red and go at green. The same goes for crawler requests: always add the headers to avoid mistakes.
import urllib2

user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.qq.com/'
request = urllib2.Request(url, headers=header)
response = urllib2.urlopen(request)
print response.read().decode('gbk')  # Note: the page content must be decoded; check the page's charset first
Open www.qq.com in a browser and press F12 to view the User-Agent:
User-Agent: some servers or proxies use this value to determine whether the request was made by a browser.
Content-Type: when calling a REST interface, the server checks this value to decide how the content in the HTTP body should be parsed.
application/xml: used in XML RPC, such as RESTful/SOAP calls
application/json: used in JSON RPC calls
application/x-www-form-urlencoded: used by browsers when submitting Web forms
When using a RESTful or SOAP service provided by the server, an incorrect Content-Type setting will cause the server to refuse the request.
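As a concrete illustration, the sketch below posts a JSON body with the Content-Type set explicitly. The httpbin.org endpoint is only an assumed echo service used for testing; any interface that accepts JSON works the same way.

import json
import requests

# Assumed test endpoint (httpbin.org simply echoes the request back);
# substitute the real REST interface you are calling.
url = 'http://httpbin.org/post'
payload = {'user': 'aaa', 'id': '123'}

# Explicitly declare the body as JSON so the server knows how to parse it
header = {'Content-Type': 'application/json'}
r = requests.post(url, data=json.dumps(payload), headers=header)
print r.status_code
print r.text

# Newer versions of requests can set the Content-Type for you via the json parameter:
# r = requests.post(url, json=payload)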
5. Response code and response header processing
url = 'http://www.baidu.com'
r = requests.get(url)
if r.status_code == requests.codes.ok:
    print r.status_code
    print r.headers
    print r.headers.get('content-type')  # the recommended way to read a header field
else:
    r.raise_for_status()

//Get:
200
{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:57 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Wed, 17 Jan 2018 07:21:21 GMT', 'Content-Type': 'text/html'}
text/html
6. Cookie processing
url = 'https://www.zhihu.com/'
r = requests.get(url)
print r.cookies
print r.cookies.keys()

//Get:
<RequestsCookieJar[<Cookie aliyungf_tc=AQAAACYMglZy2QsAEnaG2yYR0vrtlxfz for www.zhihu.com/>]>
['aliyungf_tc']
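Besides reading cookies from a response, you can also send cookies with a request, or let a Session keep them across requests. A minimal sketch follows; the cookie name and the httpbin.org echo endpoint are assumptions made only for illustration.

import requests

# Send a cookie explicitly with a single request
url = 'http://httpbin.org/cookies'        # assumed echo endpoint
r = requests.get(url, cookies={'uid': '123'})
print r.text

# A Session keeps cookies set by the server and re-sends them automatically
s = requests.Session()
s.get('http://www.baidu.com')             # the server sets the BDORZ cookie here
print s.cookies.keys()                    # names of the stored cookies
r = s.get('http://www.baidu.com')         # the cookie is sent back automatically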
7. Redirection and history
To handle redirection, you only need to set the allow_redirects field: setting allow_redirects to True allows redirection, and setting it to False disables redirection.
url = 'http://www.baidu.com'
r = requests.get(url, allow_redirects=True)
print r.url
print r.status_code
print r.history

//Get:
http://www.baidu.com/
200
[]
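The history above is empty because http://www.baidu.com does not redirect. Here is a sketch that actually triggers a redirect; it assumes http://github.com still issues a 301 to its HTTPS address.

import requests

url = 'http://github.com'

# Follow the redirect (the default behaviour)
r = requests.get(url, allow_redirects=True)
print r.url        # final URL after the redirect
print r.history    # intermediate Response objects, e.g. [<Response [301]>]

# Refuse to follow it
r = requests.get(url, allow_redirects=False)
print r.status_code                 # the 301 itself
print r.headers.get('location')     # where the server wanted to send us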
8. Timeout setting
The timeout is set with the timeout parameter (in seconds).
url = 'http://www.baidu.com'
r = requests.get(url, timeout=2)
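If the server does not answer within the limit, requests raises an exception instead of returning a response. A minimal sketch of catching it; the 0.001-second limit is deliberately tiny just to force the timeout.

import requests

url = 'http://www.baidu.com'
try:
    r = requests.get(url, timeout=0.001)   # unrealistically short on purpose
except requests.exceptions.Timeout:
    print 'the request timed out'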
9. Proxy settings
# The proxies dict maps each URL scheme to one proxy address; the keys must be
# unique, and the values should point to real proxy servers (the addresses
# below are only placeholders).
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
url = 'http://www.baidu.com'
r = requests.get(url, proxies=proxies)
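If the proxy requires HTTP Basic authentication, the credentials can be embedded in the proxy URL. A short sketch; the host, port, user and password below are all placeholders.

import requests

proxies = {
    'http': 'http://user:password@10.10.1.10:3128/',   # placeholder credentials and address
}
url = 'http://www.baidu.com'
r = requests.get(url, proxies=proxies)
print r.status_code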
Author: Ni Ping Yu