Python crawler series - First Exploration: scraping travel reviews

Keywords: Python JSON Windows Firefox

This crawler series is currently based on the requests package. Its documentation is linked below and is handy for looking things up.

http://docs.python-requests.org/en/master/

To crawl the product reviews of a travel website, analysis of the site's traffic shows that a POST request is needed to retrieve the JSON data. In short:

  • GET appends the data to be sent directly to the URL
  • POST sends the data to the server in the request body
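The difference is easy to see by preparing both kinds of request with requests (the URL and parameter names below are placeholders, not the travel site's real ones):

```python
import requests

# Hypothetical parameters, just to illustrate where the data ends up.
params = {'q': 'hotel', 'page': '1'}

get_req = requests.Request('GET', 'https://example.com/search', params=params).prepare()
post_req = requests.Request('POST', 'https://example.com/search', data=params).prepare()

# GET: the data is appended to the URL as a query string.
print(get_req.url)    # https://example.com/search?q=hotel&page=1
# POST: the URL stays clean and the data travels in the request body.
print(post_req.url)   # https://example.com/search
print(post_req.body)  # q=hotel&page=1
```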

Content sent through POST generally comes in three types: form, JSON, and multipart. The first two are introduced here.

1. Content in form

Content-Type: application/x-www-form-urlencoded

Put the content into a dict and pass it to the data parameter.

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(url, data=payload)
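As a sanity check (with a placeholder URL), preparing such a request shows that requests sets the form Content-Type header and URL-encodes the body automatically:

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
# prepare() builds the request without sending it, so we can inspect it.
req = requests.Request('POST', 'https://example.com/api', data=payload).prepare()

print(req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(req.body)                     # key1=value1&key2=value2
```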

2. Content in JSON

Content-Type: application/json

Convert the dict to a JSON string and pass it to the data parameter.

import json

payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))

Alternatively, pass the dict to the json parameter; requests will serialize it and set the Content-Type header for you.

payload = {'some': 'data'}
r = requests.post(url, json=payload)
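The two JSON variants produce the same request. Preparing one (with a placeholder URL) shows the header requests fills in:

```python
import requests

payload = {'some': 'data'}
# prepare() builds the request without sending it, so we can inspect it.
req = requests.Request('POST', 'https://example.com/api', json=payload).prepare()

print(req.headers['Content-Type'])  # application/json
# Depending on the requests version, the body may be str or bytes.
body = req.body.decode() if isinstance(req.body, bytes) else req.body
print(body)                         # {"some": "data"}
```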

Finally, here is a simple working example for reference.

import requests

def getCommentStr():
    url = r"https://package.com/user/comment/product/queryComments.json"

    # Headers copied from the browser's network panel. Content-Length is
    # omitted deliberately: requests calculates it automatically, and a
    # hard-coded value would be wrong for other parameter sets.
    header = {
        'User-Agent':           r'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Accept':               r'application/json, text/javascript, */*; q=0.01',
        'Accept-Language':      r'en-US,en;q=0.5',
        'Accept-Encoding':      r'gzip, deflate, br',
        'Content-Type':         r'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With':     r'XMLHttpRequest',
        'DNT':                  '1',
        'Connection':           r'keep-alive',
        'TE':                   r'Trailers'
    }

    # Form fields observed in the POST request: page number, page size,
    # product id, and review filters.
    params = {
        'pageNo':               '2',
        'pageSize':             '10',
        'productId':            '2590732030',
        'rateStatus':           'ALL',
        'type':                 'all'
    }

    # Passing the dict via data= sends it form-encoded, matching the
    # Content-Type header above.
    r = requests.post(url, headers=header, data=params)
    print(r.text)

getCommentStr()

Notes

  • For cookies, you can use the browser's editing tools to delete the cookie fields sent with each request one by one, and work out which ones the server actually ignores.
  • During the testing stage, I still prefer to store the crawled data as a string and work from that copy, which also reduces the load on the server.
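A minimal sketch of that habit (the cache file name and helper names are my own assumptions, not from the original post): fetch once, save the raw text to disk, and parse the cached copy on later runs.

```python
import json
import os

CACHE = 'comments_cache.json'  # hypothetical file name

def get_comment_str(fetch):
    """Return the raw response text, calling the server only on a cache miss.

    `fetch` is any zero-argument function that returns the response text,
    e.g. lambda: requests.post(url, headers=header, data=params).text
    """
    if os.path.exists(CACHE):
        with open(CACHE, encoding='utf-8') as f:
            return f.read()
    text = fetch()
    with open(CACHE, 'w', encoding='utf-8') as f:
        f.write(text)
    return text

# Later runs can parse the cached string without another request:
# comments = json.loads(get_comment_str(my_fetch))
```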

Posted by newbiez on Fri, 13 Dec 2019 13:31:54 -0800