Requests Library - Web Crawler

Keywords: Session, encoding, GitHub, JSON

Introduction to the requests Library
Official documentation: the requests Quickstart.
It is written in great detail, and reading the official documentation is recommended.

Get started quickly

Import the requests library:

import requests

Send request:

r = requests.get('url')
r = requests.post("http://httpbin.org/post")
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")

The above are the HTTP request methods; GET and POST are the most commonly used. POST is generally used to send data to the server in order to get content back.

Pass URL Parameters
You often want to pass some data in the query string of a URL. If you build the URL by hand, this data is placed after a question mark as key/value pairs, for example httpbin.org/get?key=val.
Requests lets you provide these parameters as a dictionary via the params keyword argument. For example, to pass key1=value1 and key2=value2 to httpbin.org/get, you can use the following code:

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
By printing the URL, you can see that it has been encoded correctly:

print(r.url)
http://httpbin.org/get?key2=value2&key1=value1
Note that keys with a value of None in the dictionary will not be added to the query string of the URL.
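A quick sketch of that rule (httpbin.org is used here only to echo the request back):

payload = {'key1': 'value1', 'key2': None}
r = requests.get('http://httpbin.org/get', params=payload)
print(r.url)
# http://httpbin.org/get?key1=value1 -- key2 was dropped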

You can also pass in a list as a value:

>>> payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
>>> r = requests.get('http://httpbin.org/get', params=payload)
>>> print(r.url)
http://httpbin.org/get?key1=value1&key2=value2&key2=value3

Customize Headers
When crawling a network resource, the server can usually identify a crawler easily and deny it access. In that case we generally want the crawler to simulate a browser by customizing the request headers.

url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}

r = requests.get(url, headers=headers)
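For instance, a crawler might present a browser-style User-Agent. This is only a minimal sketch: the User-Agent string below is an arbitrary example, and httpbin.org is used just to echo the request headers back:

url = 'http://httpbin.org/headers'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = requests.get(url, headers=headers)
print(r.text)  # the echoed headers include our User-Agent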

Response Content

import requests
r = requests.get('https://github.com/timeline.json')
r.text
u'[{"repository":{"open_issues":0,"url":"https://github.com/.

#Or use .content to access the response body as bytes
r.content
#Requests automatically decodes gzip- and deflate-encoded response data for you.

Requests automatically decodes content from the server, and most Unicode character sets can be decoded seamlessly.
After the request is sent, Requests makes an educated guess about the encoding of the response based on the HTTP headers. When you access r.text, Requests uses that guessed text encoding. You can find out which encoding Requests is using, and change it, via the r.encoding attribute:

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
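If the guessed encoding turns out to be wrong (r.text comes out garbled), one common pattern is to fall back on r.apparent_encoding, which Requests detects from the response body itself. A small sketch:

r = requests.get('http://httpbin.org/get')
if r.encoding != r.apparent_encoding:
    r.encoding = r.apparent_encoding  # r.text is now decoded with the detected charset
print(r.text)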

POST method
Many websites now require a login before they can be crawled, so we have to send some data to the server to get at the content.
To do this, simply pass a dictionary to the data argument. Your data dictionary is automatically form-encoded when the request is made:

>>> payload = {'key1': 'value1', 'key2': 'value2'}

>>> r = requests.post("http://httpbin.org/post", data=payload)
>>> print(r.text)
{
  ...
  "form": {
    "key2": "value2",
    "key1": "value1"
  },
  ...
}
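Besides form data, Requests can also send a JSON-encoded body: pass the dictionary to the json argument instead of data and it is serialized for you. A small sketch against httpbin.org, where r.json() parses the JSON response body:

>>> r = requests.post('http://httpbin.org/post', json={'key1': 'value1'})
>>> r.json()['json']
{'key1': 'value1'}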

Session
Session objects let you persist certain parameters across requests. A Session also keeps cookies across all requests made from the same instance, and it uses urllib3's connection pooling.

s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")

print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'

Response Status Code
We can check the response status code:

>>> r = requests.get('http://httpbin.org/get')
>>> r.status_code
200
>>> r.status_code == requests.codes.ok  # built-in status code lookup object
True

If a bad request was made (a 4XX client error or a 5XX server error response), we can raise an exception with Response.raise_for_status():

>>> bad_r = requests.get('http://httpbin.org/status/404')
>>> bad_r.status_code
404

>>> bad_r.raise_for_status()
Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error
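In a crawler this check is usually wrapped in try/except, so that one failed page does not abort the whole run. A minimal sketch:

import requests

try:
    r = requests.get('http://httpbin.org/status/404', timeout=5)
    r.raise_for_status()
except requests.exceptions.HTTPError as e:
    print('request failed:', e)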

Response Headers

>>> r.headers
{
    'content-encoding': 'gzip',
    'transfer-encoding': 'chunked',
    'connection': 'close',
    'server': 'nginx/1.0.4',
    'x-runtime': '148ms',
    'etag': '"e1ca502697e5c9317743dc078f67693f"',
    'content-type': 'application/json'
}
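The headers object is a special case-insensitive dictionary, so a header can be looked up with any capitalization:

>>> r.headers['Content-Type']
'application/json'
>>> r.headers.get('content-type')
'application/json'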

Cookies

#If a response contains cookies, you can quickly access them:
>>> url = 'http://example.com/some/cookie/setting/url'
>>> r = requests.get(url)

>>> r.cookies['example_cookie_name']
'example_cookie_value'

#To send your cookies to the server, use the cookies parameter:
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')
>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'

If you want to save cookies across runs, you usually use the cookielib library (renamed http.cookiejar in Python 3).

import requests
try:
    import cookielib                    # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3

#Set up a file-backed cookie jar; filename is just a path string
session = requests.Session()
session.cookies = cookielib.LWPCookieJar(filename='temp/cookie.txt')

#On a later run, load the saved cookies so the account and password
#do not have to be entered again
session.cookies.load(ignore_discard=True)

#After logging in and getting cookies, save them to the file
session.cookies.save(ignore_discard=True)

Timeout

You can tell Requests to stop waiting for a response after a given number of seconds with the timeout parameter:

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
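The timeout can also be given as a (connect, read) tuple to set the connection timeout and the read timeout separately:

>>> r = requests.get('http://github.com', timeout=(3.05, 27))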

Proxies
If you need to use a proxy, you can configure individual requests by providing the proxies argument to any request method:

import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

You can also configure the proxy through environment variables HTTP_PROXY and HTTPS_PROXY.

$ export HTTP_PROXY="http://10.10.1.10:3128"
$ export HTTPS_PROXY="http://10.10.1.10:1080"

$ python
>>> import requests
>>> requests.get("http://example.org")

To set up a proxy for a particular scheme or host, use scheme://hostname as the key; the proxy then applies only to requests to that host over that scheme.

proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

SOCKS
In addition to basic HTTP proxies, Requests also supports proxies using the SOCKS protocol. This is an optional feature, and you need to install a third-party library to use it.
You can install the dependency with pip:

$ pip install requests[socks]

Usage:

proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
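Sending a request through the SOCKS proxy then looks the same as with an HTTP proxy (the user:pass@host:port values above are placeholders to fill in):

r = requests.get('http://httpbin.org/ip', proxies=proxies)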

Comprehensive example

The sketch below pulls the earlier pieces together: custom headers, proxies, a timeout, and a login POST. The Zhihu URL, the proxy addresses, and the login fields are only examples.

import requests
url = 'https://www.zhihu.com/' 

headers =  {
        'Accept': '*/*',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'https://www.zhihu.com/',
        'Accept-Language': 'en-GB,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
        'Accept-Encoding': 'gzip, deflate, br',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
        'Host': 'www.zhihu.com'
        }

proxies = {
    "http": "http://116.28.206.126:8998",
    "https": "http://182.240.62.187:8998"
}

r = requests.get(url, timeout=10, proxies=proxies, headers=headers)

#Placeholder values: in a real crawler, _xsrf is parsed from the login page
_xsrf = '...'
username = 'your_account'
password = 'your_password'

post_data = {
    '_xsrf': _xsrf,
    'username': username,  #the account field name depends on the site's login form
    'password': password,
    'remember_me': 'true',
}
r1 = requests.post(url, data=post_data, timeout=10, proxies=proxies, headers=headers)
content = r1.content.decode('utf-8')

