The Python crawler here is based on the requests package. The official documentation is handy for looking things up:
http://docs.python-requests.org/en/master/
To crawl the product reviews of a tourism website, analysis of the page shows that a POST request is needed to get the JSON data. In short:
- GET appends the information to be sent directly to the URL
- POST sends the content in the body of the request to the server (see the sketch right after this list)
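As a minimal sketch of that difference with requests (httpbin.org is only a placeholder test service, not the target site):

import requests

# GET: the parameters end up in the query string, appended to the URL
r = requests.get('https://httpbin.org/get', params={'key': 'value'})
print(r.url)  # https://httpbin.org/get?key=value

# POST: the payload travels in the request body instead of the URL
r = requests.post('https://httpbin.org/post', data={'key': 'value'})
print(r.request.body)  # key=value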
There are roughly three types of content that can be sent through POST: form, JSON and multipart. Only the first two are covered here.
1. Content in form
Content-Type: application/x-www-form-urlencoded
Put the content into a dict and pass it to the data parameter.
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(url, data=payload)
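If you want to confirm what was actually sent, the prepared request is attached to the response object; for example, continuing from the r above:

print(r.request.headers['Content-Type'])  # application/x-www-form-urlencoded
print(r.request.body)                     # key1=value1&key2=value2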
2. Content in JSON
Content-Type: application/json
Convert the dict to a JSON string and pass it to the data parameter.
import json

payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
Or pass the dict to the json parameter directly.
payload = {'some': 'data'}
r = requests.post(url, json=payload)
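One practical difference worth knowing: the json parameter sets the Content-Type: application/json header automatically, while data=json.dumps(...) does not, so in that case the header is usually added by hand. A small sketch (the url here is only a placeholder):

import json
import requests

url = 'https://example.com/api'  # placeholder endpoint
payload = {'some': 'data'}

# json= serializes the dict and sets the Content-Type header for you
r1 = requests.post(url, json=payload)

# data=json.dumps(...) sends the same body, but the header must be set explicitly
r2 = requests.post(url, data=json.dumps(payload),
                   headers={'Content-Type': 'application/json'})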
Below is a simple piece of code for reference.
import requests

def getCommentStr():
    url = r"https://package.com/user/comment/product/queryComments.json"
    # Headers copied from the browser's developer tools; Content-Length is omitted
    # because requests calculates it automatically from the body.
    header = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': r'en-US,en;q=0.5',
        'Accept-Encoding': r'gzip, deflate, br',
        'Content-Type': r'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With': r'XMLHttpRequest',
        'DNT': '1',
        'Connection': r'keep-alive',
        'TE': r'Trailers'
    }
    # Form fields observed in the POST request: page number, page size, product id, etc.
    params = {
        'pageNo': '2',
        'pageSize': '10',
        'productId': '2590732030',
        'rateStatus': 'ALL',
        'type': 'all'
    }
    r = requests.post(url, headers=header, data=params)
    print(r.text)

getCommentStr()
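Since the endpoint returns JSON, the response can also be parsed directly instead of only printing the raw text. A small sketch using the url, header and params defined above (the structure of the parsed object depends on what the site actually returns):

r = requests.post(url, headers=header, data=params)
r.raise_for_status()  # raise an exception on HTTP errors instead of silently printing an error page
comments = r.json()   # parse the JSON body into Python objects (usually a dict)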
A few additional notes:
- For cookies, you can use the browser's editing tools to remove parts of the cookie sent with each request one at a time and work out which parts are actually unnecessary.
- During the testing stage, I am still in the habit of saving the crawled data as a string, which also reduces the load on the server (see the sketch below).
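As a sketch of that habit (the file name is only an example): save the raw response text once, then parse it offline without hitting the server again.

import json

# Save the raw response text once...
with open('comments_page2.json', 'w', encoding='utf-8') as f:
    f.write(r.text)

# ...then experiment with parsing locally, without sending more requests
with open('comments_page2.json', encoding='utf-8') as f:
    comments = json.load(f)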