Writing a 360 Image Gallery crawler in Python 3


Preface

Requirements: search the 360 Gallery by keyword and save the pictures.
Gallery address: http://image.so.com (360 Gallery)

Basic approach

A quick analysis shows that the 360 image gallery renders its pages dynamically and loads more pictures as you scroll down. There are three common ways to deal with this: working with the page's JavaScript, simulating a browser, or calling the Ajax interface directly. Browser simulation is expensive to build and slow (though more stable), but the same analysis shows that the pictures are loaded through Ajax, so we can simply call the Ajax interface ourselves.
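
For example, once the Ajax request shows up in the browser's network panel, a minimal probe can confirm that the endpoint returns JSON rather than rendered HTML. This is only a sketch; the endpoint http://image.so.com/j and the q/sn/pn fields are taken from the captured request used later in this post.

import requests

# Probe the Ajax endpoint spotted in the network panel (sketch).
# q = search keyword, sn = start offset, pn = pictures per request.
resp = requests.get(
    "http://image.so.com/j",
    params={"q": "nba", "src": "srp", "sn": 0, "pn": 60},
    headers={"User-Agent": "Mozilla/5.0"},
)
print(resp.headers.get("Content-Type"))  # expect a JSON content type
print(list(resp.json().keys()))          # top-level keys of the payload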

Analyze the request

View the response of the request

Looking at the response, we find that it is a JSON string, and inside it there are fields like "img": "http://image2.sina.com.cn/ty/k/2007-02-28/U1830P6T12D2772429F44DT20070228124853.jpg". That "img" value is exactly the picture address we want to crawl.
The problem then becomes simple: simulate the request with requests, parse the JSON, extract the picture addresses, and save them.
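
As a quick illustration (a sketch that assumes the response has the same shape used in the full script below: a top-level "list" array whose items carry an "img" field), pulling out the picture addresses is just a matter of walking that list:

import requests

# Fetch one page of results and collect the picture URLs (sketch).
data = requests.get(
    "http://image.so.com/j",
    params={"q": "nba", "src": "srp", "sn": 0, "pn": 60},
    headers={"User-Agent": "Mozilla/5.0"},
).json()
image_urls = [item.get("img") for item in data.get("list", [])]
print(image_urls[:3])  # first few picture addresses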

Analyzing the request, the important query fields are:

q: the search keyword
sn, ps: offsets marking where the next batch of pictures starts (each grows by 60 per request; the paging arithmetic is sketched just below)
pn: how many pictures to return per request (60 here)
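
A tiny sketch of the paging arithmetic (the starting values 88 and 81 are simply the ones captured in the request and reused in main() below, not documented constants):

# sn and ps both advance by 60 for each successive request.
sn_start, ps_start = 88, 81
for page in range(3):
    print("page", page, "sn =", sn_start + page * 60, "ps =", ps_start + page * 60)
# page 0 sn = 88 ps = 81
# page 1 sn = 148 ps = 141
# page 2 sn = 208 ps = 201
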
After this simple analysis, write the code

import os
import requests
from urllib.request import urlretrieve
import time


def image_spider(key_q, key_ps, key_sn):  # Request function
    url = "http://image.so.com/j?q=nba&src=srp&correct=nba&pn=60&ch=&sn=208&ps=201&pc=60&pd=1&prevsn=148" \
          "&sid=39b8dabeae51031efce5a8d8b1fc6957&ran=0&ras=6&cn=0&gn=0&kn=8&comm=1"
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
        'Referer': "http://image.so.com/i?q=nba&src=srp"
    }
    params = {
        'q': key_q,
        'src': 'srp',
        'pn': '60',
        'ch': '',
        'sn': key_sn,
        'ps': key_ps,
        'pc': '60',
        'pd': '1',
        'prevsn': '0',
        'sid': 'd34ad660320d853f00ea2d476ba2eaa5',
        'ran': '0',
        'ras': '6',
        'cn': '0',
        'gn': '0',
        'kn': '8',
        'comm': '1'
    }

    response = requests.get(url, headers=headers, params=params).json()  # Send the request and parse the JSON response
    lists = response.get('list')  # The 'list' key holds the picture entries

    for lis in lists or []:  # Traverse the entries (guard against a missing list)
        open_image(lis.get('img'))  # Each entry's 'img' field is the picture URL
    time.sleep(2)  # Be polite between page requests


def open_image(image_url):
    os.makedirs('image', exist_ok=True)  # Make sure the target folder exists before saving
    image_name = hash(image_url)  # Picture name (hash of the picture URL)
    path = 'image/%s.jpg' % str(image_name)  # Local path for the picture
    urlretrieve(image_url, path)  # Download and save the picture
    print(image_url, 'Download successful')
    time.sleep(1)


def main():
    key_q = input("Enter the image keywords to crawl")
    ps = 81  # Starting offsets taken from the captured request
    sn = 88

    for i in range(0, 10):  # How many pages to crawl (each request returns 60 pictures)
        key_ps = ps + i * 60
        key_sn = sn + i * 60
        image_spider(key_q, key_ps, key_sn)


if __name__ == '__main__':
    main()
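
One practical note: urlretrieve sends Python's default User-Agent, and some image hosts reject it or hang. A slightly more robust variant (a sketch, not part of the original script) downloads through requests with an explicit User-Agent and a timeout, and skips a picture on failure instead of crashing the whole run:

import os
import requests


def save_image(image_url, folder='image'):
    # Download one picture with requests; skip it on any network error (sketch).
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, '%s.jpg' % hash(image_url))
    try:
        resp = requests.get(image_url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
        resp.raise_for_status()
        with open(path, 'wb') as f:
            f.write(resp.content)
        print(image_url, 'Download successful')
    except requests.RequestException as exc:
        print(image_url, 'skipped:', exc)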

Summary

The 360 Gallery crawler is not difficult; as long as you carefully analyze the Ajax requests and the JSON they return, the picture addresses are right there.
