Crawler learning (using the requests library)

Keywords: Python Session Windows encoding JSON

requests Library

Although the urllib module in Python's standard library already covers most of the functionality we usually need, its API is not very friendly. requests bills itself as "HTTP for Humans", and it is indeed more concise and convenient to use.
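
To see the difference, here is a minimal sketch of the same GET request written first with urllib and then with requests; the URL and query string are only illustrative:

import requests
from urllib import request, parse

# urllib: build the query string and the request by hand
params = parse.urlencode({'wd': 'China'})
resp = request.urlopen('http://www.baidu.com/s?' + params)
print(resp.read())

# requests: one call does the same job
resp = requests.get('http://www.baidu.com/s', params={'wd': 'China'})
print(resp.text)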

Installation and documentation address:

Installing with pip is straightforward:

pip install requests

Chinese documentation: http://docs.python-requests.org/zh_CN/latest/index.html
GitHub address: https://github.com/requests/requests

Send GET request:

  1. The simplest way to send a GET request is to call requests.get:

    response = requests.get("http://www.baidu.com/")
    
  2. Add headers and query parameters:
    If you want to add request headers, pass the headers parameter. If you want to pass query parameters in a URL, use the params parameter. The relevant example code is as follows:

     import requests
    
     kw = {'wd':'China'}
    
     headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    
     # params accepts the query parameters as a dict or string; a dict is automatically URL-encoded, so urlencode() is not needed
     response = requests.get("http://www.baidu.com/s", params=kw, headers=headers)

     # View the response content; response.text returns the decoded text (str)
     print(response.text)

     # View the response content as raw bytes; response.content returns the byte stream
     print(response.content)
    
     # View the full URL
     print(response.url)

     # View the character encoding guessed from the response headers
     print(response.encoding)

     # View the response status code
     print(response.status_code)
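
If a page comes back garbled, it is usually because requests guessed the wrong encoding. You can set response.encoding yourself before reading response.text; a minimal sketch, with the URL and encoding just as examples:

import requests

response = requests.get("http://www.baidu.com/")
# Override the guessed encoding before reading .text
response.encoding = 'utf-8'
print(response.text)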
    

Send POST request:

  1. The most basic POST request uses the post method:

    response = requests.post("http://www.baidu.com/", data=data)
    
  2. Passing in data:
    There is no need to use urlencode for encoding anymore; just pass in a dictionary. For example, here is the code for requesting job data from Lagou (lagou.com):

     import requests
    
     url = "https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false&isSchoolJob=0"
    
     headers = {
         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
         'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
     }
    
     data = {
         'first': 'true',
         'pn': 1,
         'kd': 'python'
     }
    
     resp = requests.post(url, headers=headers, data=data)
     # If the response is JSON data, you can call the json method directly
     print(resp.json())
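
A side note: data= sends a form-encoded body. If the server expects a JSON body instead, requests can serialize the dictionary for you via the json parameter, which also sets the Content-Type header. A minimal sketch, using httpbin.org only as a test endpoint:

import requests

# json= serializes the dict and sets Content-Type: application/json
resp = requests.post('http://httpbin.org/post', json={'first': 'true', 'pn': 1, 'kd': 'python'})
print(resp.json())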
    

Using a proxy:

Adding a proxy with requests is also very simple: just pass the proxies parameter to the request method (such as get or post). The sample code is as follows:

import requests

url = "http://httpbin.org/get"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
}

proxy = {
    'http': '171.14.209.180:27829'
}

resp = requests.get(url, headers=headers, proxies=proxy)
with open('xx.html','w',encoding='utf-8') as fp:
    fp.write(resp.text)
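
Note that the keys of the proxies dictionary select the proxy by the scheme of the target URL, so an 'http' entry alone does not cover https sites. A minimal sketch covering both schemes; the proxy address above is a placeholder and is unlikely to still work:

proxy = {
    'http': 'http://171.14.209.180:27829',
    'https': 'http://171.14.209.180:27829',
}
resp = requests.get('https://httpbin.org/ip', proxies=proxy)
print(resp.text)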

cookie:

If a cookie is included in a response, you can use the cookies property to get the returned cookie values:

import requests

resp = requests.get('http://www.baidu.com/')
print(resp.cookies)
print(resp.cookies.get_dict())
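
Conversely, to send cookie values with a request, you can pass a dictionary through the cookies parameter. A minimal sketch; the cookie name and value are made up for illustration:

import requests

# The cookies parameter accepts a plain dict of name -> value
resp = requests.get('http://httpbin.org/cookies', cookies={'token': 'abc123'})
print(resp.text)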

session:

Earlier, with the urllib library, we could use an opener to send multiple requests that share cookies. To share cookies across requests with the requests library, use the session object it provides. Note that this session is not the session concept in web development; it is just a session object. Take logging in to Renren as an example again, this time implemented with requests. The sample code is as follows:

import requests

url = "http://www.renren.com/PLogin.do"
data = {"email":"970138074@qq.com",'password':"pythonspider"}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}

# Sign in
session = requests.Session()
session.post(url, data=data, headers=headers)

# Visit Dapeng's personal profile page
resp = session.get('http://www.renren.com/880151247/profile')

print(resp.text)
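
Since every request in the example above repeats the same headers, it is worth knowing that a session can carry default headers itself. A minimal sketch:

import requests

session = requests.Session()
# Headers set here are sent with every request made through this session
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
})
resp = session.get('http://www.baidu.com/')
print(resp.status_code)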

Handling untrusted SSL certificates:

For websites with trusted SSL certificates, such as https://www.baidu.com/, requests returns the response normally without any extra work. For websites whose certificates are not trusted (the old https://www.12306.cn/mormhweb/ is a classic example), pass verify=False to skip certificate verification. The sample code is as follows:

resp = requests.get('https://www.12306.cn/mormhweb/', verify=False)

print(resp.content.decode('utf-8'))
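
With verify=False, urllib3 (the library underneath requests) prints an InsecureRequestWarning on every request. If the risk is understood, the warning can be silenced. A minimal sketch:

import urllib3
import requests

# Acknowledge the risk and silence the InsecureRequestWarning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

resp = requests.get('https://www.12306.cn/mormhweb/', verify=False)
print(resp.status_code)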
