Python Crawler Warfare: Bulk Download of Site Pictures

Preface

The text and pictures in this article are from the Internet. They are for study and communication only. They do not have any commercial use. The copyright is owned by the original author. If you have any questions, please contact us in time for processing.

Author: GitPython

PS: If you need Python learning materials for your child, click on the link below to get them

http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956cef

1. Get url links to pictures

First, open the first page of Baidu Pictures, notice the index in the url below

Next, switch the page to a traditional flip, because it's good for us to crawl pictures!

Comparing several url s, it is found that the pn parameter is the number requested.By modifying the pn parameter and observing the data returned, it is found that each page can only have up to 60 pictures.

Note: The gsm parameter is a hexadecimal representation of the pn parameter, so it is possible to remove it

Then, right-click the source code of the page and search objURL directly (ctrl+F)

In this way, we find URLs that need pictures.

2. Save picture links locally

Now, all we have to do is crawl out this information.

Note: There are objURLs, hoverURL s in the web page...But we're using objURL because this is the original image

So, how do I get the objURL?Use regular expressions!

So how do we do that with regular expressions?Actually, only one line of code is needed...

results = re.findall('"objURL":"(.*?)",', html)

Core Code:

1. Get the picture url code:

 1 # Get Pictures url Connect
 2 def get_parse_page(pn,name):
 3 
 4     for i in range(int(pn)):
 5         # 1.Get Web Page
 6         print('Getting #{}page'.format(i+1))
 7 
 8         # Baidu Picture Home Page url
 9         # name Is the keyword you want to search for
10         # pn Is the number of pages you want to download
11 
12         url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%s&pn=%d' %(name,i*20)
13 
14         headers = {
15             'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4843.400 QQBrowser/9.7.13021.400'}
16 
17         # Send the request and get the corresponding
18         response = requests.get(url, headers=headers)
19         html = response.content.decode()
20         # print(html)
21 
22         # 2.Regular Expression Parsing Web Page
23         # "objURL":"http://n.sinaimg.cn/sports/transform/20170406/dHEk-fycxmks5842687.jpg"
24         results = re.findall('"objURL":"(.*?)",', html) # Return to a list
25 
26         # Save the picture locally based on the link to the picture you get
27         save_to_txt(results, name, i)

2. Save the picture to the local code:

 1 # Save Picture to Local
 2 def save_to_txt(results, name, i):
 3 
 4     j = 0
 5     # Create folder under directory
 6     if not os.path.exists('./' + name):
 7         os.makedirs('./' + name)
 8 
 9     # Download Pictures
10     for result in results:
11         print('Saving section{}individual'.format(j))
12         try:
13             pic = requests.get(result, timeout=10)
14             time.sleep(1)
15         except:
16             print('Current picture cannot be downloaded')
17             j += 1
18             continue
19 
20         # Ignorable, this code has bug
21         # file_name = result.split('/')
22         # file_name = file_name[len(file_name) - 1]
23         # print(file_name)
24         #
25         # end = re.search('(.png|.jpg|.jpeg|.gif)$', file_name)
26         # if end == None:
27         #     file_name = file_name + '.jpg'
28 
29         # Save Pictures to Folder
30         file_full_name = './' + name + '/' + str(i) + '-' + str(j) + '.jpg'
31         with open(file_full_name, 'wb') as f: 
32             f.write(pic.content)
33 
34         j += 1

3. Main function code:

1 # Principal function
2 if __name__ == '__main__':
3 
4     name = input('Enter the keywords you want to download:')
5     pn = input('You want to download the first few pages (60 pages per page):')
6     get_parse_page(pn,

Instructions:

1 # Configure the following modules
2 import requests 
3 import re
4 import os
5 import time
6 
7 # 1.Function py source file
8 # 2.Enter the keywords you want to search for, such as "Kirky", "Teddy", etc.
9 # 3.Enter the number of pages you want to download, such as 5, which is download 5 x 60=300 Picture

Posted by Scip on Wed, 18 Dec 2019 23:43:39 -0800

Programmer Group