Crawling the 100 GB photo galleries of the Mzitu site with 100 lines of Python. I hope your network drive has enough space [source code included]

Keywords: Programming, Windows, Python, Linux, PyCharm

Preface

Recently, while working on monitoring-related supporting infrastructure, I noticed that many of the scripts involved are written in Python. I had long heard the saying "Life is short, I use Python", and it is no joke. With the rise of artificial intelligence, machine learning and deep learning, most of the AI code on the market is written in Python. So in the age of artificial intelligence, it is time to learn Python.

Basic environment configuration

  • Python3

  • PyCharm
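The script below also depends on two third-party packages, requests and BeautifulSoup (it uses the lxml parser), which can be installed with pip:

pip install requests beautifulsoup4 lxml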

Implementation steps

Take the Mzitu picture site as an example. The crawl is very simple and breaks down into the following four steps:

  • Fetch the index page and create a folder for each gallery title it lists

  • Collect the address (href) of each gallery on that page

  • Enter the gallery and read its page count (each gallery holds multiple pictures, shown one per page)

  • Grab the image from the corresponding tag on each page and download it
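Each step maps onto one method of the mzitu class in the code section below: all_url() walks the index page and creates the folders (steps 1 and 2), html() reads a gallery's page count from its pagenavi block (step 3), and img() plus save() pull the img tag's src attribute and write the file to disk (step 4).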

Points to note

Keep the following points in mind while crawling; they may save you some trouble (a sketch illustrating points 4 to 6 follows the list):

1) Importing libraries is much like pulling in a framework or utility class in Java: the low-level plumbing is already encapsulated for you

2) Split the work into functions; a crawler can easily run to hundreds of lines, so try not to write it as one blob

3) Define global variables for state that several functions share

4) Handle the anti-hotlink check (the site validates the Referer header)

5) Rotate the browser version in the User-Agent header

6) Catch exceptions so that one bad page does not kill the whole run
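To make points 4 to 6 concrete, here is a minimal, self-contained sketch. The helper name fetch_image and the retry count are my own choices for illustration, not part of the original script: it sends the gallery page as the Referer to satisfy the hotlink check, picks a random User-Agent per request, and retries on request errors instead of crashing.

import random
import requests

USER_AGENTS = [  ##two representative entries; the full list is in the script below
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

def fetch_image(img_url, referer, retries=3):  ##hypothetical helper, for illustration only
    headers = {
        'User-Agent': random.choice(USER_AGENTS),  ##point 5: rotate the UA version per request
        'Referer': referer,                        ##point 4: pass the page URL to beat the anti-hotlink check
    }
    for attempt in range(retries):
        try:
            resp = requests.get(img_url, headers=headers, timeout=10)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException as e:     ##point 6: catch and retry instead of crashing
            print('Attempt', attempt + 1, 'failed:', e)
    return None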

Code implementation

import requests
from bs4 import BeautifulSoup
import os
import random


class mzitu():
    def all_url(self, url):  ##Walk the index page: one <a> per gallery
        html = self.request(url)
        all_a = BeautifulSoup(html.text, 'lxml').find('div', class_='all').find_all('a')
        for a in all_a:
            title = a.get_text()
            print(u'Start saving:', title)
            path = str(title).replace("?", '_')  ##'?' is not allowed in Windows folder names
            if not self.mkdir(path):  ##Skip folders that already exist
                print(u'Skipped:', title)
                continue
            href = a['href']
            self.html(href)

    def html(self, href):  ##Walk every page of one gallery
        html = self.request(href)
        max_span = BeautifulSoup(html.text, 'lxml').find('div', class_='pagenavi').find_all('span')[-2].get_text()  ##The second-to-last <span> holds the page count
        for page in range(1, int(max_span) + 1):
            page_url = href + '/' + str(page)
            self.img(page_url)

    def img(self, page_url):  ##Pull the image URL out of one page
        img_html = self.request(page_url)
        img_url = BeautifulSoup(img_html.text, 'lxml').find('div', class_='main-image').find('img')['src']
        self.save(img_url, page_url)

    def save(self, img_url, page_url):
        name = img_url[-9:-4]  ##Derive a file name from the tail of the image URL
        try:
            img = self.requestpic(img_url, page_url)
            with open(name + '.jpg', 'ab') as f:
                f.write(img.content)
        except FileNotFoundError:  ##Catch the exception and move on
            print(u'Picture does not exist, skipped:', img_url)
            return False

    def mkdir(self, path):  ##This function creates a folder per gallery
        path = path.strip()
        isExists = os.path.exists(os.path.join(r"C:\d\mzitu", path))  ##Base download directory; adjust to taste
        if not isExists:
            print(u'Created a folder named', path)
            os.makedirs(os.path.join(r"C:\d\mzitu", path))
            os.chdir(os.path.join(r"C:\d\mzitu", path))  ##Switch into the new directory so images are saved there
            return True
        else:
            print(u'A folder named', path, u'already exists!')
            return False

    def requestpic(self, url, Referer):  ##Fetch an image response with a rotated User-Agent and a Referer header
        user_agent_list = [ 
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1" 
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", 
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", 
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", 
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", 
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", 
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", 
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", 
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", 
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", 
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", 
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
        ua = random.choice(user_agent_list)
        headers = {'User-Agent': ua, "Referer": Referer}  ##The Referer header is the key anti-hotlink parameter for fetching the pictures
        content = requests.get(url, headers=headers)
        return content

    def request(self, url):  ##Fetch a web page with a fixed User-Agent and return the response
        headers = {
            'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
        content = requests.get(url, headers=headers)
        return content


Mzitu = mzitu()  ##Instantiate the crawler
Mzitu.all_url('http://www.mzitu.com/all')  ##Passing the start URL to all_url kicks off the crawl; this is the entry point
print(u'Congratulations, the download is finished!')
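To smoke-test a single gallery rather than crawl the whole site, one can call the html method directly; the gallery URL below is purely illustrative:

Mzitu.html('http://www.mzitu.com/56789')  ##hypothetical gallery URL, for a quick test only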

Now, open your eyes wide and enjoy a sample of the results.

Summary

The script itself is quite simple. The whole process, from configuring the environment and installing the IDE to writing the script and getting it to run end to end, took four or five hours, and once it was working the script ran unattended. Limited by server bandwidth and machine configuration, the first 17 GB took three or four hours to download; the remaining 83 GB I leave to you.

