Introduction to Python Crawlers [2]: Crawling the Meizitu Site

Keywords: Python Linux Firefox yum

Meizitu Website Crawling - Preface

Starting today, I'm going to roll up my sleeves and write Python crawlers directly. The best way to learn a language is to use it for a real purpose, so over the next 10+ posts I'll write about crawling pictures. I hope I do it well.

To write a good crawler, we need to prepare the Firefox browser and a packet capture tool. For packet capture I use tcpdump on CentOS, plus Wireshark. I suggest you learn how to install and use both of these, as we'll need them later.

Meizitu Website Crawling - The Network Request Module requests

Meizitu Website Crawling - Installing requests

Open a terminal and run:

pip3 install requests

Wait for the installation to complete.
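If you'd like to confirm the module is available before moving on (an optional check, not part of the original steps), you can print its version from Python:

import requests               # if this import fails, the installation didn't work
print(requests.__version__)   # prints something like 2.x.y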

Next, type the following commands in the terminal:

# mkdir demo  
# cd demo
# touch down.py

The Linux commands above create a folder named demo and then create a down.py file inside it. You can also use GUI tools and, as on Windows, right-click to create files.

To improve development efficiency on Linux, we'll install the Visual Studio Code editor.

For how to install VS Code, refer to the official documentation at https://code.visualstudio.com/docs/setup/linux, which provides detailed instructions.

On CentOS, it looks like this:

sudo rpm --import https://packages.microsoft.com/keys/microsoft.asc
sudo sh -c 'echo -e "[code]\nname=Visual Studio Code\nbaseurl=https://packages.microsoft.com/yumrepos/vscode\nenabled=1\ngpgcheck=1\ngpgkey=https://packages.microsoft.com/keys/microsoft.asc" > /etc/yum.repos.d/vscode.repo'

Then install it with the yum command

yum check-update
sudo yum install code

After a successful installation, VS Code will appear on your CentOS desktop.

A note on the steps that follow: since I'm using the GNOME graphical interface here, I'll describe some of the later operations in the same style as on Windows.

Open VS Code > File > Open File > find the down.py file we just created

Then type the following into VS Code:

import requests   #Import module

def run():        #Declare a run method
    print("Running code file")    #print contents

if __name__ == "__main__":   #Main Program Entry
    run()    #Call the run method above

Tips: this tutorial is not a basic introduction to Python 3, so I assume some coding fundamentals. For example, Python statements don't end with semicolons, and blocks are defined by indentation. I'll try to make the comments as complete as possible.
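As a minimal illustration of that point (my own toy example, not part of the project code), blocks in Python are delimited purely by indentation:

def greet(name):                   # no braces, no semicolons
    if name:                       # this line belongs to greet()
        print("Hello, " + name)   # this line belongs to the if-block

greet("Python")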

Press Ctrl+S to save the file. If you're told you have insufficient permissions, enter your password when prompted.

Change into the demo directory in the terminal, then run

python3 down.py

If you see the following output, the script runs fine:

[root@bogon demo]# python3 down.py
 Running code file

Next, let's start testing whether the requests module can be used

Modify the above code

import requests

def run():
    response = requests.get("http://www.baidu.com")
    print(response.text)

if __name__ == "__main__":
    run()

Run it; if the terminal prints a dump of HTML from baidu.com, the request succeeded.
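If you want a more precise success signal than eyeballing raw HTML (this check is my addition, not part of the original code), inspect the response object's attributes:

import requests

response = requests.get("http://www.baidu.com")
print(response.status_code)   # 200 means the request succeeded
print(response.encoding)      # the text encoding requests guessed for the body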

Next, let's actually download a picture.

We'll modify the code, but first let's change one thing.

Because every time you modify a file you're prompted for administrator privileges, you can use a Linux command to change the permissions:

[root@bogon linuxboy]# chmod -R 777 demo/

import requests

def run():
    response = requests.get("http://www.newsimg.cn/big201710leaderreports/xibdj20171030.jpg") 
    with open("xijinping.jpg","wb") as f :
        f.write(response.content)   
        f.close

if __name__ == "__main__":
    run()

After running the code, you'll find a file has been generated in the folder.

But when you open the file, it can't be displayed, which means the image wasn't actually downloaded.

We'll keep modifying the code: the server puts some restrictions on the image, so we can open it in a browser, but plain Python code can't download it completely.

Modify the code

import requests

def run():
    # Header file, header is the dictionary type
    headers = {
        "Host":"www.newsimg.cn",
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5383.400 QQBrowser/10.0.1313.400"
    }
    response = requests.get("http://www.newsimg.cn/big201710leaderreports/xibdj20171030.jpg",headers=headers) 
    with open("xijinping.jpg","wb") as f :
        f.write(response.content)   
        f.close

if __name__ == "__main__":
    run()

Now run the Python file from the terminal again:

python3 down.py

This time, the picture downloads correctly.

Focus on the requests.get call in the code above: we added a headers argument, and that's why our program could download the complete picture.
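If you're curious what was actually sent, requests keeps the prepared request on the response object. A small sketch (the inspection step is my addition; the UA string here is an arbitrary browser-like example):

import requests

headers = {"User-Agent": "Mozilla/5.0"}   # any browser-like UA string
response = requests.get("http://www.baidu.com", headers=headers)
print(response.request.headers)   # the headers that actually went over the wire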

Python Crawler Page Analysis

With this simple case done, the next operations will be much easier. How does a crawler work?

Enter domain name -> download source code -> analyse picture paths -> download pictures

Those are its steps.
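As a rough sketch, the whole pipeline fits in a few lines. The regex and the filename handling here are simplified placeholders for illustration, not the final code we'll build below:

import re
import requests

def crawl(list_url):
    html = requests.get(list_url).text                    # download source code
    img_urls = re.findall(r'<img .*?src="(.*?)"', html)   # analyse picture paths
    for url in img_urls:                                  # download pictures
        data = requests.get(url).content
        with open(url.split("/")[-1], "wb") as f:
            f.write(data)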

Meizitu Website Crawling - Entering the Domain Name

The website we're going to crawl today is http://www.meizitu.com/a/pure.html

Why crawl this website? Because it's easy to crawl.

Okay, let's analyze this page.

One of the most important things in crawling is finding the pagination area, because pagination implies a pattern, and a pattern makes crawling easy. (You can do it more intelligently: enter the home page address and let the crawler discover all of the site's URLs by itself.)

Looking at the page, we find the pagination, so let's find the pattern.

Using Firefox's developer tools, we discover the paging pattern:

http://www.meizitu.com/a/pure_1.html
http://www.meizitu.com/a/pure_2.html
http://www.meizitu.com/a/pure_3.html
http://www.meizitu.com/a/pure_4.html

Okay, now let's implement this part in Python. (The following uses object-oriented style; if you lack the basics, search for an introductory tutorial first, but for anyone willing to learn, this is very simple.)

import requests
all_urls = []  # the list of photo-set list-page URLs we build up
class Spider():
    # Constructor, used to initialize the data
    def __init__(self,target_url,headers):
        self.target_url = target_url
        self.headers = headers

    # Get all the URLs you want to crawl
    def getUrls(self,start_page,page_num):

        global all_urls
        #Loop to get the URL
        for i in range(start_page,page_num+1):
            url = self.target_url  % i
            all_urls.append(url)

if __name__ == "__main__":
    headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST':'www.meizitu.com'
    }
    target_url = 'http://www.meizitu.com/a/pure_%d.html'   # photo-set list-page URL pattern

    spider = Spider(target_url,headers)
    spider.getUrls(1,16)
    print(all_urls)

The above code may need some Python fundamentals to understand, but if you read it carefully, there are a few points to note.

The first is class Spider(): we declare a class, then use def __init__ to declare a constructor; with a tutorial, I think you can learn this in 30 minutes.

For building the URLs there are many approaches; here I use the most direct one, string formatting.
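For example, the %d placeholder is filled in like this; the str.format line is just an equivalent alternative I'm showing for comparison:

target_url = 'http://www.meizitu.com/a/pure_%d.html'
print(target_url % 3)   # http://www.meizitu.com/a/pure_3.html
print('http://www.meizitu.com/a/pure_{}.html'.format(3))   # same result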

Notice that there is a global variable all_urls in the above code, which I use to store all of our paging URLs.

Next, it's the core part of the crawler code.

We need to analyze the logic in the page. First open http://www.meizitu.com/a/pure_1.html, then right-click and choose Inspect Element.

In the developer tools we find the links to the photo-set pages.

Clicking a picture takes us to a picture details page, which contains a whole group of pictures. Now the problem is this:

we first need to crawl pages like http://www.meizitu.com/a/pure_1.html and collect all the details-page addresses like http://www.meizitu.com/a/5585.html.

Here we crawl in a multi-threaded way, using a producer-consumer arrangement (a pattern related to the observer pattern).

import threading   # multithreading module
import re          # regular expression module
import time        # time module

First, three modules are imported: threading, regular expressions, and time.

A new global variable is added, and since this is a multithreaded operation, we need a thread lock.

all_img_urls = []       # array of picture details-page URLs

g_lock = threading.Lock()  # initialize a lock

Declare a producer class that continually fetches image details-page addresses and adds them to the global variable all_img_urls.

# Producer, responsible for extracting details-page links from each list page
class Producer(threading.Thread):

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST':'www.meizitu.com'
        }
        global all_urls
        while True:
            g_lock.acquire()  # when accessing all_urls, use the lock mechanism
            if len(all_urls) == 0:   # check inside the lock, so another thread can't empty the list under us
                g_lock.release()
                break
            page_url = all_urls.pop()   # pop removes the last element and returns it
            g_lock.release()  # release the lock promptly after use, so other threads can proceed

            try:
                print("Analysis " + page_url)
                response = requests.get(page_url, headers=headers, timeout=3)
                all_pic_link = re.findall('<a target=\'_blank\' href="(.*?)">', response.text, re.S)
                global all_img_urls
                g_lock.acquire()   # lock again before touching the shared result list
                all_img_urls += all_pic_link   # note: += extends the list with all elements, unlike append
                print(all_img_urls)
                g_lock.release()   # release the lock
                time.sleep(0.5)
            except:
                pass

The above code uses inheritance: our class is a subclass of threading.Thread. You can review the basics of inheritance at http://www.runoob.com/python3/python3-class.html; a beginner's course will do.

About the thread lock: in the code above, when we call all_urls.pop() we don't want other threads operating on the list at the same time, otherwise accidents happen. So we use g_lock.acquire() to lock the resource, and after using it we must remember to call g_lock.release() immediately, otherwise the resource stays occupied and the program can't continue.
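Incidentally, a Lock can also be used as a context manager, which releases it automatically even if an exception occurs in between. A minimal sketch (not the style used in this article's code):

import threading

g_lock = threading.Lock()
shared = ['a', 'b', 'c']

with g_lock:              # acquired here
    item = shared.pop()   # safe to touch the shared list
                          # released automatically at the end of the block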

I use regular expressions to match URLs in the web pages. Later we'll use other matching methods.

The re.findall() method returns all the matched content. For regular expressions, find a 30-minute introductory tutorial and skim it.
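Here's a tiny self-contained example of that exact pattern in action, using a made-up page snippet (example.com is a placeholder, not the real site):

import re

html = '<a target=\'_blank\' href="http://example.com/a/1.html">pic</a>'
links = re.findall('<a target=\'_blank\' href="(.*?)">', html, re.S)
print(links)   # ['http://example.com/a/1.html']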

Where the code is error-prone, I put it inside try: except:. Of course, you can also define custom error handling.

If the above code is all right, then we can write it at the entrance of the program.

for x in range(2):
    t = Producer()
    t.start()

Execute the program. Because our Producer inherits from the threading.Thread class, the one method you must implement is def run, which you've seen in the code above. Then we can run it.

Run it, and you'll see the analysis output scroll by.

In this way, the list of picture details pages has been collected and stored.

Next we need one more step: I want to wait until all the picture details pages have been collected before starting the next analysis step.

Add code here

#threads= []   
#Open two threads to access
for x in range(2):
    t = Producer()
    t.start()
    #threads.append(t)

# for tt in threads:
#     tt.join()

print("Come to me.")

With the key lines commented out, running gives:

[linuxboy@bogon demo]$ python3 down.py
Analysis of http://www.meizitu.com/a/pure_2.html
Analysis of http://www.meizitu.com/a/pure_1.html
Come to me.
['http://www.meizitu.com/a/5585.html', 

Now uncomment the tt.join() lines above and run again:

[linuxboy@bogon demo]$ python3 down.py
Analysis of http://www.meizitu.com/a/pure_2.html
Analysis of http://www.meizitu.com/a/pure_1.html
['http://www.meizitu.com/a/5429.html', ......
Come to me.

The essential difference: because ours is a multithreaded program, print("Come to me.") doesn't wait for the other threads to finish. But when we change it to the code above, adding the key call tt.join() in the main thread, the code waits until all the sub-threads have finished running before going on. This satisfies the requirement I just mentioned: first get the collection of all the picture details pages.

What join() does is thread synchronization: when the main thread reaches a join, it blocks until the joined sub-thread finishes executing, and only then does the main thread continue. You'll encounter this often in the future.
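A minimal standalone demonstration of join(), with a hypothetical worker function of my own:

import threading
import time

def work():
    time.sleep(1)
    print("worker done")

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()                    # main thread blocks here until this worker finishes
print("all workers finished")   # guaranteed to print last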

Let's write a consumer/observer that constantly watches the array of picture details pages we just collected.

Add a global variable to store the acquired image links

pic_links = []            # list of picture addresses

# Consumer
class Consumer(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST':'www.meizitu.com'
        }
        global all_img_urls   # use the global array of picture details pages
        print("%s is running " % threading.current_thread)   # note: without (), this prints the function object; call current_thread() to show the actual thread
        while True:
            g_lock.acquire()
            if len(all_img_urls) == 0:   # check inside the lock, as in the producer
                g_lock.release()
                break
            img_url = all_img_urls.pop()
            g_lock.release()
            try:
                response = requests.get(img_url, headers=headers)
                response.encoding = 'gb2312'   # the pages are encoded in GB2312, so set it explicitly
                title = re.search('<title>(.*?) | Beauty Girl</title>', response.text).group(1)
                all_pic_src = re.findall('<img alt=.*?src="(.*?)" /><br />', response.text, re.S)

                pic_dict = {title: all_pic_src}   # a Python dictionary
                global pic_links
                g_lock.acquire()
                pic_links.append(pic_dict)    # a list of dictionaries
                print(title + " fetched successfully")
                g_lock.release()

            except:
                pass
            time.sleep(0.5)

See, the code above is very similar to what we just wrote. Later I'll revise this part of the code on GitHub to be more concise, but this is only the second lesson; we have a long way to go.

I've commented the more important parts of the code, so you can refer to them directly. Note that I used two regular expressions above: one to match the title and one to match the image URLs. The title is used later to create the different folders, so pay attention to it.

#Open 10 threads to get links
for x in range(10):
    ta = Consumer()
    ta.start()

Operation results:

[linuxboy@bogon demo]$ python3 down.py
 Analysis of http://www.meizitu.com/a/pure_2.html
 Analysis of http://www.meizitu.com/a/pure_1.html
['http://www.meizitu.com/a/5585.html', ......
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
Come to me.
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
Pure and picturesque, the photographer succeeded in using Madou
 Ye Zixuan, the goddess of mansion and mansion, has succeeded in shooting a group of popular portraits recently.
The Queen of America (bao) Chest (ru) brings the temptation of uniform to succeed
 Open your eyes and see you every day, that is happiness and success.
Lovely girl, may the warm wind protect innocence and perseverance to achieve success
 Pure girls like a ray of sunshine warm this winter to achieve success...

Feels like a big step toward success, doesn't it?

Next comes storing the pictures, as mentioned earlier. Same procedure as before: write a custom class.

class DownPic(threading.Thread):   # relies on the os module, imported below

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST':'mm.chinasareview.com'
        }
        while True:   # deliberately an endless loop, to keep watching the picture-link array for updates
            global pic_links
            # lock up
            g_lock.acquire()
            if len(pic_links) == 0:   # no pictures yet, so release the lock and try again
                # release the lock in every branch
                g_lock.release()
                time.sleep(0.5)   # short pause so the loop doesn't spin at full speed
                continue
            else:
                pic = pic_links.pop()
                g_lock.release()
                # iterate over the dictionary
                for key, values in pic.items():
                    path = key.rstrip("\\")   # the title, used as the directory name
                    is_exists = os.path.exists(path)
                    # check the result
                    if not is_exists:
                        # create the directory if it does not exist
                        os.makedirs(path)
                        print(path + ' directory created successfully')
                    else:
                        # if the directory exists, don't create it, just report that it already exists
                        print(path + ' directory already exists')
                    for pic_url in values:
                        filename = path + "/" + pic_url.split('/')[-1]
                        if os.path.exists(filename):   # skip files we already downloaded
                            continue
                        else:
                            response = requests.get(pic_url, headers=headers)
                            with open(filename, 'wb') as f:
                                f.write(response.content)   # the with statement closes the file for us

After we get the links to the pictures, we need to download them. The code above creates a directory from the title we extracted earlier, and then creates the files inside that directory with the code below.

This involves file manipulation, so we introduce a new module:

import os  #Directory Operating Module

# iterate over the dictionary
for key, values in pic.items():
    path = key.rstrip("\\")
    is_exists = os.path.exists(path)
    # check the result
    if not is_exists:
        # create the directory if it does not exist
        os.makedirs(path)
        print(path + ' directory created successfully')
    else:
        # if the directory exists, don't create it, just report that it already exists
        print(path + ' directory already exists')
    for pic_url in values:
        filename = path + "/" + pic_url.split('/')[-1]
        if os.path.exists(filename):
            continue
        else:
            response = requests.get(pic_url, headers=headers)
            with open(filename, 'wb') as f:
                f.write(response.content)   # closed automatically by the with statement
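By the way, the exists-check above can be avoided: since Python 3.2, os.makedirs accepts exist_ok=True, which makes it a no-op for directories that already exist. A small alternative sketch ("some_title" is a hypothetical directory name):

import os

os.makedirs("some_title", exist_ok=True)   # creates it, or silently does nothing if it's already there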

Because our picture-link array holds dictionaries, in the following format:

[{"Sister 1":["http://Mm.chinasareview.com/wp-content/uploads/2016a/08/24/08/24/01.jpg "",,,http://mm.chinasareview.com/wp-content/uploads/2012016a//mm.chinasareview.com/08/08/24/02.jpg "." http://mm.chinasareview.com/wp-content/uploads/2012016a/08/24/03.jpg "]]]]}, {sister Figure 2:[" http://http://mm.chinasareview/view.view.chinasview.view.view.chinasview.com/www.view/wp-content/wp-2016a/08/24/01.jpg","http://mm.chinasa" Review.com/wp-content/uploads/2016a/08/08/24/02.jpg "." http:///mm.chinasareview.com/wp-content/uploads/2012016a/08/08/24/03.jpg "]]]]], {sister's Figure 3":["http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/01.jpg/01.jpg",http:///mm.chinasareview/mm.chinasareview.com/wp-content/uploads/2012012016a/08/24/01/01/01.jpg",,http:/// 24/02.jpg". "http://mm.chinasareview.com / wp-content/uploads/2016a/08/24/03.jpg"]}]

First we loop over the outer layer to get the title and create the directory, then download the pictures in the inner layer. Let's modify the code to add exception handling:

try:
    response = requests.get(pic_url, headers=headers)
    with open(filename, 'wb') as f:
        f.write(response.content)   # closed automatically by the with statement
except Exception as e:
    print(e)
    pass

Then add the following to the main program:

#Open 10 threads to save pictures
for x in range(10):
    down = DownPic()
    down.start()

Operation results:

[linuxboy@bogon demo]$ python3 down.py
 Analysis of http://www.meizitu.com/a/pure_2.html
 Analysis of http://www.meizitu.com/a/pure_1.html
['http://www.meizitu.com/a/5585.html', 'http://www.meizitu.com/a/5577.html', 'http://www.meizitu.com/a/5576.html', 'http://www.meizitu.com/a/5574.html', 'http://www.meizitu.com/a/5569.html', .......
<function current_thread at 0x7fa5121f2268> is running 
<function current_thread at 0x7fa5121f2268> is running 
<function current_thread at 0x7fa5121f2268> is running 
Come to me.
Pure girls like a ray of sunshine warm this winter to achieve success
 Pure girls like a ray of sunshine warm this winter directory created successfully
 Lovely girl, may the warm wind protect innocence and perseverance to achieve success
 Lovely girl, may the warm wind protect the innocence and perseverance of directory creation be successful
 Super beauty, pure you and blue sky complement each other to achieve success
 Super-beautiful, pure you and blue sky complement each other to create a successful directory
 Beautiful and frozen, Taekwondo girls in the snow are successful
 Beautiful eyebrows with delicate facial features are like the success of a fairy tale princess
 Have a confident and charming smile, every day is a brilliant success
 Beautiful eyebrows with delicate facial features are like the success of the creation of the princess catalogue in fairy tales
 Have a confident and charming smile, every day is a brilliant directory to create success
 Pure and picturesque, the photographer succeeded in using Madou

At the same time, the directories appear on disk.


Click into a directory and you'll see the downloaded pictures.

Well, that completes today's simple crawler.

Finally, we add this at the top of the file:

# -*- coding: UTF-8 -*-   

This prevents the "Non-ASCII character '\xe5' in file" error when the source contains non-ASCII characters.
