Meizitu Website Crawling - Preface
Starting today, I'm going to roll up my sleeves and write Python crawlers directly. The best way to learn a language is to use it purposefully, so over the next 10+ posts I'll write about crawling pictures. I hope I can do it well.
To write a good crawler, we need to prepare a Firefox browser and a packet capture tool. For packet capture I use tcpdump on CentOS, plus Wireshark. I suggest you learn how to install and use these two tools, as we will need them later.
Meizitu Website Crawling - The Network Request Module requests
Meizitu Website Crawling - Installing requests
Open a terminal and run the command:
pip3 install requests
Wait for the installation to complete.
Next, type the following commands in the terminal:
mkdir demo
cd demo
touch down.py
The Linux commands above create a folder named demo and then create a down.py file inside it. You can also use GUI tools and, as on Windows, right-click to create files.
To improve development efficiency on Linux, we will install the Visual Studio Code editor.
For how to install VS Code, refer to the official documentation at https://code.visualstudio.com/docs/setup/linux, which provides detailed instructions.
For CentOS, it looks like this:
sudo rpm --import https://packages.microsoft.com/keys/microsoft.asc
sudo sh -c 'echo -e "[code]\nname=Visual Studio Code\nbaseurl=https://packages.microsoft.com/yumrepos/vscode\nenabled=1\ngpgcheck=1\ngpgkey=https://packages.microsoft.com/keys/microsoft.asc" > /etc/yum.repos.d/vscode.repo'
Then install it with yum:
yum check-update
sudo yum install code
After a successful installation, VS Code will show up in your CentOS.
A note on the steps that follow: since we are using the GNOME graphical interface here, I'll describe some of the later operations in the same style as on Windows.
Open VS Code > File > Open File > find the down.py file we just created.
Then enter the following in VS Code:
import requests  # Import the module

def run():  # Declare a run method
    print("Running code file")  # Print contents

if __name__ == "__main__":  # Main program entry
    run()  # Call the run method above
tips: This tutorial is not a basic introduction to Python 3, so some coding basics are assumed. For example, Python statements do not end with semicolons, and code blocks are delimited by indentation. I'll try to make the comments as complete as possible.
Press Ctrl+S to save the file. If you are prompted that permissions are insufficient, enter your password as prompted.
In the terminal, change into the demo directory, then run:
python3 down.py
If the following output is displayed, the script runs fine:
[root@bogon demo]# python3 down.py
Running code file
Next, let's test whether the requests module works. Modify the code above:
import requests

def run():
    response = requests.get("http://www.baidu.com")
    print(response.text)

if __name__ == "__main__":
    run()
If it runs successfully, the terminal prints the HTML source of the Baidu homepage.
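If you would rather not print the whole page, you can inspect a few attributes of the response object instead; here is a minimal sketch (same Baidu URL as above, attribute names are standard requests API):

import requests

def run():
    response = requests.get("http://www.baidu.com")
    print(response.status_code)   # 200 means the server answered OK
    print(response.encoding)      # The text encoding requests inferred from the headers
    print(len(response.content))  # Size of the downloaded body in bytes

if __name__ == "__main__":
    run()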
Next, let's actually download a picture.
Before modifying the code, let's change one more thing first.
Because every modification of the file prompts for administrator privileges, you can use a Linux command to change the permissions:
[root@bogon linuxboy]# chmod -R 777 demo/
import requests

def run():
    response = requests.get("http://www.newsimg.cn/big201710leaderreports/xibdj20171030.jpg")
    with open("xijinping.jpg", "wb") as f:
        f.write(response.content)  # the with block closes the file automatically

if __name__ == "__main__":
    run()
After running the code, a file appears in the folder. But when you try to open it, it cannot be displayed, which means the image was not actually downloaded. We need to keep modifying the code: the server puts some restrictions on the image, so a browser can open it, but plain Python code cannot download it completely.
Modify the code:
import requests

def run():
    # Request headers; headers is a dictionary
    headers = {
        "Host": "www.newsimg.cn",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5383.400 QQBrowser/10.0.1313.400"
    }
    response = requests.get("http://www.newsimg.cn/big201710leaderreports/xibdj20171030.jpg", headers=headers)
    with open("xijinping.jpg", "wb") as f:
        f.write(response.content)  # the with block closes the file automatically

if __name__ == "__main__":
    run()
Now run the Python file in the terminal again:
python3 down.py
This time, the picture downloads correctly.
Focus on the requests.get part of the code above: we added a headers argument, and with it our program downloads the complete picture.
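If you are curious about what was actually sent, requests keeps the prepared request on the response object; a small sketch to verify that our User-Agent went out (the header values are the ones from the code above, shortened here):

import requests

headers = {
    "Host": "www.newsimg.cn",
    "User-Agent": "Mozilla/5.0 ..."  # use the full User-Agent string from the code above
}
response = requests.get("http://www.newsimg.cn/big201710leaderreports/xibdj20171030.jpg", headers=headers)
# response.request is the PreparedRequest that was actually sent to the server
print(response.request.headers["User-Agent"])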
Python Crawler - Page Analysis
With this simple case done, the next operations will be much simpler. How does a crawler work?
Enter domain name -> download source code -> analyse picture paths -> download pictures
These are its steps.
Meizitu Website Crawling - Entering the Domain Name
The website we are going to crawl today is http://www.meizitu.com/a/pure.html
Why crawl this website? Because it is easy to crawl.
Okay, let's analyze this page.
One of the most important things in crawling is to find the pagination area, because pagination implies regularity, and regularity makes crawling easy. (You can be more intelligent: enter the home page address and let the crawler discover all the addresses on the site by itself.)
On the page we find the pagination, so let's look for the pattern.
Using the Firefox developer tools, we discover the pagination pattern:
http://www.meizitu.com/a/pure_1.html
http://www.meizitu.com/a/pure_2.html
http://www.meizitu.com/a/pure_3.html
http://www.meizitu.com/a/pure_4.html
Okay, let's implement this part in Python. (The following uses object-oriented style; if you lack the basics, search for an introductory tutorial first, but for anyone willing to learn, this is all very simple.)
import requests

all_urls = []  # The list-page URLs we build up

class Spider():
    # Constructor: initialize the data we use
    def __init__(self, target_url, headers):
        self.target_url = target_url
        self.headers = headers

    # Build all the URLs we want to crawl
    def getUrls(self, start_page, page_num):
        global all_urls
        # Loop to build each URL
        for i in range(start_page, page_num + 1):
            url = self.target_url % i
            all_urls.append(url)

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
        'HOST': 'www.meizitu.com'
    }
    target_url = 'http://www.meizitu.com/a/pure_%d.html'  # Pattern for the photo-set list pages
    spider = Spider(target_url, headers)
    spider.getUrls(1, 16)
    print(all_urls)
The code above may require some Python background to understand, but look at it carefully and note a few points.
The first is class Spider(): we declare a class, then use def __init__ to declare a constructor; with a tutorial you can learn this in 30 minutes.
There are many ways to stitch the URLs together; here I use the most direct one, string formatting.
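Besides %-formatting, Python offers a couple of equivalent ways to build these URLs; a quick sketch of the alternatives (all three produce the same string):

# %-formatting, as used in the code above
url1 = 'http://www.meizitu.com/a/pure_%d.html' % 1
# str.format
url2 = 'http://www.meizitu.com/a/pure_{}.html'.format(1)
# f-string (Python 3.6+)
page = 1
url3 = f'http://www.meizitu.com/a/pure_{page}.html'
print(url1 == url2 == url3)  # True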
Note the global variable all_urls in the code above; I use it to store all of our paging URLs.
Next comes the core part of the crawler code.
We need to analyze the logic in the page. First open http://www.meizitu.com/a/pure_1.html, then right-click and inspect the elements.
In the inspector we discover the links to the picture detail pages.
After clicking a picture, I land on a picture detail page and find that it holds a whole group of pictures. So now the problem is: the first step is to crawl every list page like http://www.meizitu.com/a/pure_1.html and collect all the detail-page addresses like http://www.meizitu.com/a/5585.html.
Here we crawl in a multi-threaded way (we also use a design pattern called the observer pattern).
import threading  # Multithreading module
import re         # Regular expression module
import time       # Time module
First, three modules are imported: threading, re (regular expressions), and time.
We also add a new global variable, and since this is a multithreaded operation, we need a thread lock.
all_img_urls = []          # List of picture detail-page URLs
g_lock = threading.Lock()  # Initialize a lock
Declare a producer class that continually fetches picture detail-page addresses and appends them to the global variable all_img_urls:
# Producer, responsible for extracting detail-page links from each list page
class Producer(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST': 'www.meizitu.com'
        }
        global all_urls
        while len(all_urls) > 0:
            g_lock.acquire()  # Accessing all_urls requires the lock mechanism
            page_url = all_urls.pop()  # pop removes the last element and returns its value
            g_lock.release()  # Release the lock promptly so other threads can use it
            try:
                print("Analysis of " + page_url)
                response = requests.get(page_url, headers=headers, timeout=3)
                all_pic_link = re.findall('<a target=\'_blank\' href="(.*?)">', response.text, re.S)
                global all_img_urls
                g_lock.acquire()  # Lock again before touching the shared list
                all_img_urls += all_pic_link  # Note: += extends the list with all new links at once, instead of appending them one by one
                print(all_img_urls)
                g_lock.release()  # Release the lock
                time.sleep(0.5)
            except:
                pass
The code above uses inheritance: I derive a subclass from threading.Thread. For the basics of inheritance, a beginner's tutorial such as http://www.runoob.com/python3/python3-class.html will do.
About the thread lock: in the code above, when we call all_urls.pop() we do not want other threads to operate on the list at the same time, otherwise accidents happen. So we use g_lock.acquire() to lock the resource, and after using it, remember to call g_lock.release() immediately; otherwise the resource stays occupied and the program cannot continue.
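As a side note, a Lock can also be used as a context manager, which releases it automatically even if an exception is raised in between; a minimal sketch of that alternative form (not what the code above uses, just equivalent):

import threading

g_lock = threading.Lock()
all_urls = ['http://www.meizitu.com/a/pure_1.html']

# Equivalent to g_lock.acquire() / g_lock.release(), but the release
# happens automatically when the with block exits, even on an exception
with g_lock:
    page_url = all_urls.pop()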
I use regular expressions to match the URLs in the web page. Later we will use other matching methods.
The re.findall() method returns every match of the pattern. For regular expressions themselves, a 30-minute introductory tutorial is enough.
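To see what the producer's pattern extracts, you can run it against a hand-written HTML fragment; the fragment below is made up for illustration, the pattern is the one from the code above:

import re

html = ('<li><a target=\'_blank\' href="http://www.meizitu.com/a/5585.html">...</a></li>'
        '<li><a target=\'_blank\' href="http://www.meizitu.com/a/5577.html">...</a></li>')
links = re.findall('<a target=\'_blank\' href="(.*?)">', html, re.S)
print(links)  # ['http://www.meizitu.com/a/5585.html', 'http://www.meizitu.com/a/5577.html']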
Where the code is error-prone, I wrap it in try/except; of course, you can also handle specific errors yourself.
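If you want something more precise than a bare except, requests groups its errors under requests.exceptions.RequestException; a hedged sketch of what that could look like:

import requests

try:
    response = requests.get("http://www.meizitu.com/a/pure_1.html", timeout=3)
except requests.exceptions.Timeout:
    print("request timed out")
except requests.exceptions.RequestException as e:
    # Base class for connection errors, invalid URLs and so on
    print("request failed:", e)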
If the code above is all in place, we can write this at the program's entry point:
for x in range(2):
    t = Producer()
    t.start()
Because our Producer inherits from the threading.Thread class, the method you must implement is def run, which you have seen in the code above. Now we can execute the program.
Operation results:
In this way, the list of picture detail pages is stored.
Next, we need one more step: I want to wait until all the picture detail pages have been collected before starting the next analysis step. Add this code:
# threads = []
# Start two threads
for x in range(2):
    t = Producer()
    t.start()
    # threads.append(t)
# for tt in threads:
#     tt.join()
print("Come to me.")
With the key lines commented out, it runs as follows:
[linuxboy@bogon demo]$ python3 down.py
Analysis of http://www.meizitu.com/a/pure_2.html
Analysis of http://www.meizitu.com/a/pure_1.html
Come to me.
['http://www.meizitu.com/a/5585.html',
Now uncomment the tt.join() lines and the related code:
[linuxboy@bogon demo]$ python3 down.py
Analysis of http://www.meizitu.com/a/pure_2.html
Analysis of http://www.meizitu.com/a/pure_1.html
['http://www.meizitu.com/a/5429.html', ......
Come to me.
The essential difference: because this is a multithreaded program, print("Come to me.") does not wait for the other threads to finish. Once we add the key call tt.join() in the main thread, the main thread waits until all the child threads have finished before continuing. That satisfies the requirement I just described: first get the complete collection of picture detail pages.
What join does is thread synchronization: after the main thread calls join, it blocks until the child thread finishes executing, then continues. You will encounter this often in the future.
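Here is a minimal, self-contained sketch of join semantics, independent of the crawler code:

import threading
import time

def worker():
    time.sleep(1)
    print("child thread finished")

t = threading.Thread(target=worker)
t.start()
t.join()  # The main thread blocks here until worker() returns
print("main thread continues")  # Always printed after the child's message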
Now let's write a consumer/observer, which constantly watches the array of picture detail pages we just collected.
Add a global variable to store the acquired picture links:
pic_links = []  # List of picture addresses
# Consumer
class Consumer(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST': 'www.meizitu.com'
        }
        global all_img_urls  # The global array of picture detail pages
        # Note: current_thread without parentheses prints the function object itself,
        # which is why the output below shows "<function current_thread at ...>"
        print("%s is running " % threading.current_thread)
        while len(all_img_urls) > 0:
            g_lock.acquire()
            img_url = all_img_urls.pop()
            g_lock.release()
            try:
                response = requests.get(img_url, headers=headers)
                response.encoding = 'gb2312'  # The page we request is encoded in GB2312, so set it explicitly
                title = re.search('<title>(.*?) | Beauty Girl</title>', response.text).group(1)
                all_pic_src = re.findall('<img alt=.*?src="(.*?)" /><br />', response.text, re.S)
                pic_dict = {title: all_pic_src}  # A Python dictionary
                global pic_links
                g_lock.acquire()
                pic_links.append(pic_dict)  # A list of dictionaries
                print(title + " fetched successfully")
                g_lock.release()
            except:
                pass
            time.sleep(0.5)
See? The code above is very similar to what we just wrote. Later I will make this part of the code more concise on GitHub, but this is only the second lesson; we have a long way to go.
I have commented the more important parts of the code, so you can refer to them directly. Note that I used two regular expressions above, to match the title and the URLs of the images. The title is used later to create the different folders, so pay attention to it.
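To check the second regular expression (the one that extracts the image URLs), you can test it against a hand-written tag; the sample below is made up for illustration:

import re

html = '<img alt="pic" src="http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/01.jpg" /><br />'
pics = re.findall('<img alt=.*?src="(.*?)" /><br />', html, re.S)
print(pics)  # ['http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/01.jpg']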
# Start 10 threads to fetch links
for x in range(10):
    ta = Consumer()
    ta.start()
Operation results:
[linuxboy@bogon demo]$ python3 down.py
Analysis of http://www.meizitu.com/a/pure_2.html
Analysis of http://www.meizitu.com/a/pure_1.html
['http://www.meizitu.com/a/5585.html', ......
<function current_thread at 0x7f7caef851e0> is running
<function current_thread at 0x7f7caef851e0> is running
<function current_thread at 0x7f7caef851e0> is running
<function current_thread at 0x7f7caef851e0> is running
<function current_thread at 0x7f7caef851e0> is running
<function current_thread at 0x7f7caef851e0> is running
<function current_thread at 0x7f7caef851e0> is running
<function current_thread at 0x7f7caef851e0> is running
Come to me.
<function current_thread at 0x7f7caef851e0> is running
<function current_thread at 0x7f7caef851e0> is running
Pure and picturesque, the photographer made good use of the model fetched successfully
Ye Zixuan, the goddess next door, recently shot a group of popular portraits fetched successfully
The queen's chest brings the temptation of uniforms fetched successfully
Open your eyes and see you every day, that is happiness fetched successfully
Lovely girl, may the warm wind protect innocence and perseverance fetched successfully
Pure girls like a ray of sunshine warm this winter fetched successfully
......
Doesn't it feel like a big step toward success?
Next comes storing the pictures, and the approach is the same: write a custom thread class.
class DownPic(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST': 'mm.chinasareview.com'
        }
        while True:  # An endless loop that keeps watching the picture-link array for updates
            global pic_links
            # Lock up
            g_lock.acquire()
            if len(pic_links) == 0:
                # No pictures yet; release the lock either way
                g_lock.release()
                continue
            else:
                pic = pic_links.pop()
                g_lock.release()
                # Iterate over the dictionary
                for key, values in pic.items():
                    path = key.rstrip("\\")
                    is_exists = os.path.exists(path)
                    # Check the result
                    if not is_exists:
                        # Create the directory if it does not exist
                        os.makedirs(path)
                        print(path + ' directory created successfully')
                    else:
                        # If the directory exists, do not create it and report that it already exists
                        print(path + ' directory already exists')
                    for pic in values:
                        filename = path + "/" + pic.split('/')[-1]
                        if os.path.exists(filename):
                            continue
                        else:
                            response = requests.get(pic, headers=headers)
                            with open(filename, 'wb') as f:
                                f.write(response.content)  # the with block closes the file automatically
After we get the picture links, we need to download them. The code above creates a directory from the title we extracted earlier, and then creates files inside that directory with the code shown below.
File manipulation requires a new module:
import os  # Directory operations module
# Iterate over the dictionary
for key, values in pic.items():
    path = key.rstrip("\\")
    is_exists = os.path.exists(path)
    # Check the result
    if not is_exists:
        # Create the directory if it does not exist
        os.makedirs(path)
        print(path + ' directory created successfully')
    else:
        # If the directory exists, do not create it and report that it already exists
        print(path + ' directory already exists')
    for pic in values:
        filename = path + "/" + pic.split('/')[-1]
        if os.path.exists(filename):
            continue
        else:
            response = requests.get(pic, headers=headers)
            with open(filename, 'wb') as f:
                f.write(response.content)
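As an aside, on Python 3.2+ the exists-check plus makedirs pair can be collapsed into one call with the exist_ok flag; a sketch of that alternative:

import os

path = "some title"  # hypothetical title used as a directory name
# Creates the directory (and any missing parents); does nothing if it already exists
os.makedirs(path, exist_ok=True)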
Because our picture-link array contains dictionaries, in the following format:
[{"Sister 1":["http://Mm.chinasareview.com/wp-content/uploads/2016a/08/24/08/24/01.jpg "",,,http://mm.chinasareview.com/wp-content/uploads/2012016a//mm.chinasareview.com/08/08/24/02.jpg "." http://mm.chinasareview.com/wp-content/uploads/2012016a/08/24/03.jpg "]]]]}, {sister Figure 2:[" http://http://mm.chinasareview/view.view.chinasview.view.view.chinasview.com/www.view/wp-content/wp-2016a/08/24/01.jpg","http://mm.chinasa" Review.com/wp-content/uploads/2016a/08/08/24/02.jpg "." http:///mm.chinasareview.com/wp-content/uploads/2012016a/08/08/24/03.jpg "]]]]], {sister's Figure 3":["http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/01.jpg/01.jpg",http:///mm.chinasareview/mm.chinasareview.com/wp-content/uploads/2012012016a/08/24/01/01/01.jpg",,http:/// 24/02.jpg". "http://mm.chinasareview.com / wp-content/uploads/2016a/08/24/03.jpg"]}]
First we loop through the outer layer to get the title and create the directory, then download the pictures in the inner layer. We also modify the code to add exception handling:
try:
    response = requests.get(pic, headers=headers)
    with open(filename, 'wb') as f:
        f.write(response.content)
except Exception as e:
    print(e)
    pass
Then write this in the main program:
# Start 10 threads to save pictures
for x in range(10):
    down = DownPic()
    down.start()
Operation results:
[linuxboy@bogon demo]$ python3 down.py
Analysis of http://www.meizitu.com/a/pure_2.html
Analysis of http://www.meizitu.com/a/pure_1.html
['http://www.meizitu.com/a/5585.html', 'http://www.meizitu.com/a/5577.html', 'http://www.meizitu.com/a/5576.html', 'http://www.meizitu.com/a/5574.html', 'http://www.meizitu.com/a/5569.html', .......
<function current_thread at 0x7fa5121f2268> is running
<function current_thread at 0x7fa5121f2268> is running
<function current_thread at 0x7fa5121f2268> is running
Come to me.
Pure girls like a ray of sunshine warm this winter fetched successfully
Pure girls like a ray of sunshine warm this winter directory created successfully
Lovely girl, may the warm wind protect innocence and perseverance fetched successfully
Lovely girl, may the warm wind protect innocence and perseverance directory created successfully
Super beautiful, pure you and the blue sky complement each other fetched successfully
Super beautiful, pure you and the blue sky complement each other directory created successfully
Beautiful and frozen, Taekwondo girls in the snow fetched successfully
Beautiful eyebrows with delicate features, like a fairy-tale princess fetched successfully
Have a confident and charming smile, every day is brilliant fetched successfully
Beautiful eyebrows with delicate features, like a fairy-tale princess directory created successfully
Have a confident and charming smile, every day is brilliant directory created successfully
Pure and picturesque, the photographer made good use of the model fetched successfully
At the same time, directories appear under the working directory.
Click into one of the directories to see the downloaded pictures.
Well, that's today's simple crawler done.
Finally, we write this at the head of the code:
# -*- coding: UTF-8 -*-
This prevents the "Non-ASCII character '\xe5' in file" error.
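For reference, here is a minimal sketch of how the pieces in this post could be assembled in one main entry. It assumes the classes (Spider, Producer, Consumer, DownPic) and the global variables defined above, and uses the thread counts from the article's examples:

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
        'HOST': 'www.meizitu.com'
    }
    spider = Spider('http://www.meizitu.com/a/pure_%d.html', headers)
    spider.getUrls(1, 16)       # Fill all_urls with the 16 list pages

    threads = []
    for x in range(2):          # Producers: list pages -> detail-page URLs
        t = Producer()
        t.start()
        threads.append(t)
    for tt in threads:
        tt.join()               # Wait until every detail page has been collected

    for x in range(10):         # Consumers: detail pages -> picture URLs
        ta = Consumer()
        ta.start()

    for x in range(10):         # Downloaders: picture URLs -> files on disk
        down = DownPic()
        down.start()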