A Python crawler that downloads pictures locally

Keywords: Python crawler

1: Target Site Information

The target site is netbian.com, a site offering HD desktop wallpapers for download: http://www.netbian.com/

2: Target Site Analysis

(1): Construct a list of page URLs

Our goal is to crawl the pictures from a given number of pages on the site, so we first need to look at how the links to successive pages relate to each other and build a list of the page URLs to crawl.

 

Links on the first page: http://www.netbian.com/

Links on page 2: http://www.netbian.com/index_2.htm

As you can see, from the second page onward the links differ only by a number, so a few lines of code are enough to build the list of page URLs:

# Initialize the list of page links with the first page
page_links_list = ['http://www.netbian.com/']

# Ask how many pages to crawl and build the remaining page links
pages = int(input('Enter the number of pages you want to crawl:'))
for page in range(2, pages + 1):
    url = 'http://www.netbian.com/index_' + str(page) + '.htm'
    page_links_list.append(url)
print(page_links_list)
Enter the number of pages you want to crawl:5
['http://www.netbian.com/', 'http://www.netbian.com/index_2.htm', 'http://www.netbian.com/index_3.htm', 'http://www.netbian.com/index_4.htm', 'http://www.netbian.com/index_5.htm']
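The same list can also be built in one step with a list comprehension, a common idiom for this kind of pattern; a minimal sketch with the page count hard-coded:

# The first page plus index_2.htm ... index_N.htm
pages = 5
page_links_list = ['http://www.netbian.com/'] + \
    ['http://www.netbian.com/index_' + str(page) + '.htm' for page in range(2, pages + 1)]
print(page_links_list)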

(2): Get links to all pictures on a page

We now have the links to all the pages, but not yet the link to each picture, so the next step is to extract the links to all the pictures on a page. Let's take the first page as an example; the other pages work the same way.

First right-click the page and choose Inspect Element, then click the element-picker cursor at the top left of the inspector and hover over any picture; this jumps to the picture's position in the HTML. Each page contains 18 pictures, so we need a selector that pins down exactly those elements. The CSS selector div.list li a img matches exactly 18 img tags, and the src attribute of each img tag holds the image link we need.

The following code extracts the link to each picture on the first page:

# python3 -m pip install bs4
# python3 -m pip install lxml
import requests
from bs4 import BeautifulSoup

# URL of the first page
url = 'http://www.netbian.com/'
# List of picture links
img_links_list = []

# Fetch the page (the site is encoded in GBK), parse it, and select the img tags
html = requests.get(url).content.decode('gbk')
soup = BeautifulSoup(html, 'lxml')
imgs = soup.select("div.list li a img")
for img in imgs:
    img_link = img['src']
    img_links_list.append(img_link)

print(img_links_list)
print(len(img_links_list))
['http://img.netbian.com/file/2019/0817/smalle213d95e54c5b4fb355b710a473292ea1566035585.jpg', 'http://img.netbian.com/file/2019/0817/small15ca224d7c4c119affe2cfd2d811862e1566035332.jpg', 'http://img.netbian.com/file/2018/1225/604a688cd6f79161236e6250189bc25b.jpg', 'http://img.netbian.com/file/2019/0817/smallab7249d18e67c9336109e3bedc094f381566034907.jpg', 'http://img.netbian.com/file/2019/0817/small5816e940e6957f7db5e499de9978bda41566031298.jpg', 'http://img.netbian.com/file/2019/0817/smalladda3febb072e9103f8f06f27dcb19c21566031139.jpg', 'http://img.netbian.com/file/2019/0817/small0e9f43492debe6dc2ce7a3e6cc48c1ad1566030965.jpg', 'http://img.netbian.com/file/2019/0817/smallcfd5b4c6fa10ffcbcdcc8b1b9e6db91a1566030209.jpg', 'http://img.netbian.com/file/2019/0817/smalld1f07e215f0da059b44d27623ec6fa8f1566029835.jpg', 'http://img.netbian.com/file/2019/0817/small1674b7b97714672be3165bd31de418eb1566014363.jpg', 'http://img.netbian.com/file/2019/0814/small1a5c2fe49dec02929f219d0bdb680e9c1565786931.jpg', 'http://img.netbian.com/file/2019/0814/smalle333c0a8e9fe18324d793ce7258abbbf1565786718.jpg', 'http://img.netbian.com/file/2019/0814/smallb0c9494b4042ac9c9d25b6e4243facfd1565786402.jpg', 'http://img.netbian.com/file/2019/0814/small19dfd078dd820bb1598129bbe4542eff1565786204.jpg', 'http://img.netbian.com/file/2019/0808/smallea41cb48c796ffd3020514994fc3e8391565274057.jpg', 'http://img.netbian.com/file/2019/0808/small3998c40805ea6811d81b7c57d8d235fc1565273792.jpg', 'http://img.netbian.com/file/2019/0808/smallb505448b1318dbb2820dcb212eb39e191565273639.jpg', 'http://img.netbian.com/file/2019/0808/small0f04af422502a40b6c8dc19d53d1f3481565273554.jpg']
18

(3): Download pictures locally

With the picture links in hand, we need to download the pictures locally. Here we download the first picture as an example:

url: http://img.netbian.com/file/2019/0817/smalle213d95e54c5b4fb355b710a473292ea1566035585.jpg
import urllib.request

url = 'http://img.netbian.com/file/2019/0817/smalle213d95e54c5b4fb355b710a473292ea1566035585.jpg'
# Save the image as test.jpg in the current directory
urllib.request.urlretrieve(url, filename='test.jpg')
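urlretrieve is the simplest way to save a file, but some servers reject clients that send no User-Agent header. If that happens, the same download can be done with requests; a minimal sketch (the header value is just an illustrative example):

import requests

url = 'http://img.netbian.com/file/2019/0817/smalle213d95e54c5b4fb355b710a473292ea1566035585.jpg'
# A generic desktop User-Agent; any realistic value works
headers = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get(url, headers=headers, timeout=5)
resp.raise_for_status()  # raise an error for a non-2xx response
with open('test.jpg', 'wb') as f:
    f.write(resp.content)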

(4): A simple crawler to get pictures

Combining the three parts above (building the list of page URLs, extracting the picture links from each page, and downloading the pictures locally) gives a complete, if inefficient, crawler.

import requests
from bs4 import BeautifulSoup
import urllib.request
import os
import time

# Fetch one page, extract the picture links, and download each picture
def GetImages(url):
    html = requests.get(url, timeout=2).content.decode('gbk')
    soup = BeautifulSoup(html, 'lxml')
    imgs = soup.select("div.list li a img")
    for img in imgs:
        link = img['src']
        display = link.split('/')[-1]
        print('Downloading:', display)
        filename = './images/' + display
        urllib.request.urlretrieve(link, filename)

# Ask how many pages to crawl and append the extra page links in place
def GetUrls(page_links_list):
    pages = int(input('Enter the number of pages you want to crawl:'))
    for page in range(2, pages + 1):
        url = 'http://www.netbian.com/index_' + str(page) + '.htm'
        page_links_list.append(url)


if __name__ == '__main__':
    page_links_list = ['http://www.netbian.com/']
    GetUrls(page_links_list)
    # exist_ok avoids an error if ./images already exists
    os.makedirs('./images', exist_ok=True)
    print("Start downloading pictures!!!")
    start = time.time()
    for url in page_links_list:
        GetImages(url)
    print('Picture download successful!!!')
    end = time.time() - start
    print('The time consumed is:', end)
Enter the number of pages you want to crawl:5
Start downloading pictures!!!
Downloading: smalle213d95e54c5b4fb355b710a473292ea1566035585.jpg
Downloading: small15ca224d7c4c119affe2cfd2d811862e1566035332.jpg
Downloading: 604a688cd6f79161236e6250189bc25b.jpg
Downloading: smallab7249d18e67c9336109e3bedc094f381566034907.jpg
Downloading: small5816e940e6957f7db5e499de9978bda41566031298.jpg
Downloading: smalladda3febb072e9103f8f06f27dcb19c21566031139.jpg
Downloading: small0e9f43492debe6dc2ce7a3e6cc48c1ad1566030965.jpg
Downloading: smallcfd5b4c6fa10ffcbcdcc8b1b9e6db91a1566030209.jpg
. . . . . .
Picture download successful!!!
The time consumed is: 21.575999975204468

The code above runs fine, but it is not efficient: the pictures are downloaded one after another in a single thread, so each download waits for the previous one to finish. To speed this up, the code below uses multithreading to crawl and download the pictures.

(5): Use Python multithreading to crawl pictures and download them locally

For multithreading we use Python's built-in threading module, together with the producer-consumer model: producer threads extract the picture download links from each page and store them in a global list, while consumer threads take links from that global list and download the images.

Note that the global lists are shared between threads, so every access is protected with a lock to keep the data consistent.
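As a side note, threading.Lock also works as a context manager: a with block acquires the lock on entry and releases it on exit, even if an exception is raised in between; a minimal sketch of the same append-under-lock pattern:

import threading

gLock = threading.Lock()
shared_list = []

def append_safely(item):
    # The with statement pairs acquire() and release() automatically
    with gLock:
        shared_list.append(item)

The crawler below keeps the explicit acquire()/release() calls so each critical section is easy to see.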

import urllib.request
import threading
from bs4 import BeautifulSoup
import requests
import os
import time

# Initial list of page links
page_links_list = ['http://www.netbian.com/']
# List of picture links
img_links_list = []

# Ask how many pages to crawl and append the extra page links in place
def GetUrls(page_links_list):
    pages = int(input('Enter the number of pages you want to crawl:'))
    for page in range(2, pages + 1):
        url = 'http://www.netbian.com/index_' + str(page) + '.htm'
        page_links_list.append(url)

# Create a lock to protect the shared lists
gLock = threading.Lock()

# Producer: extracts the picture links from each page
class Producer(threading.Thread):
    def run(self):
        while len(page_links_list) > 0:
            # Acquire the lock
            gLock.acquire()
            # pop() removes the last element of the list
            page_url = page_links_list.pop()
            # Release the lock
            gLock.release()

            # Fetch the page and select the img tags
            html = requests.get(page_url).content.decode('gbk')
            soup = BeautifulSoup(html, 'lxml')
            imgs = soup.select("div.list li a img")

            # Acquire the lock before touching the shared list
            gLock.acquire()
            for img in imgs:
                img_link = img['src']
                img_links_list.append(img_link)
            # Release the lock
            gLock.release()

# Consumer: downloads the pictures from the collected picture links
class Consumer(threading.Thread):
    def run(self):
        print("%s is running" % threading.current_thread())
        while True:
            # Acquire the lock
            gLock.acquire()
            if len(img_links_list) == 0:
                # The list is empty (the producers may still be working);
                # release the lock and check again
                gLock.release()
                continue
            else:
                img_url = img_links_list.pop()
                gLock.release()
                filename = img_url.split('/')[-1]
                print('Downloading:', filename)
                path = './images/' + filename
                urllib.request.urlretrieve(img_url, filename=path)
                if len(img_links_list) == 0:
                    end = time.time()
                    print("The time consumed is:", (end - start))
                    # exit() raises SystemExit, which only ends this thread
                    exit()


if __name__ == '__main__':
    GetUrls(page_links_list)
    # exist_ok avoids an error if ./images already exists
    os.makedirs('./images', exist_ok=True)
    start = time.time()
    # 5 producer threads to crawl the picture links from the pages
    for x in range(5):
        Producer().start()

    # 10 consumer threads to take picture links and download the pictures
    for x in range(10):
        Consumer().start()
Enter the number of pages you want to crawl:5
<Consumer(Thread-6, started 5764)> is running
<Consumer(Thread-7, started 9632)> is running
<Consumer(Thread-8, started 9788)> is running
<Consumer(Thread-9, started 5468)> is running
<Consumer(Thread-10, started 9832)> is running
<Consumer(Thread-11, started 3320)> is running
<Consumer(Thread-12, started 6456)> is running
<Consumer(Thread-13, started 10080)> is running
<Consumer(Thread-14, started 3876)> is running
<Consumer(Thread-15, started 9224)> is running
Downloading: small9e75d6ac9506efe1d87e96062791fb261564149099.jpg
Downloading: small02961cac5e02a901b77779eaed43c6f91564156941.jpg
Downloading: small117c84b2f427c981bf33184c1c5c4bb91564193581.jpg
Downloading: smallfedb420af6f753512c169021587982621564455847.jpg
Downloading: small14d3739bf11dd92055abb56e3f792d3f1564456102.jpg
Downloading: smallaf755644af7d114d4cbf36fbe0e84d0c1564456347.jpg
Downloading: small9f7af6d0e2372a4d9e536d8ea9fc40341564456537.jpg
. . . . . .
The time consumed is: 1.635000228881836       # each of the 10 consumer threads prints its own end time
The time consumed is: 1.6419999599456787
The time consumed is: 1.6560001373291016
The time consumed is: 1.684000015258789
The time consumed is: 1.7009999752044678
The time consumed is: 1.7030000686645508
The time consumed is: 1.7060000896453857
The time consumed is: 1.7139999866485596
The time consumed is: 1.7350001335144043
The time consumed is: 1.748000144958496
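For reference, the standard library's queue.Queue implements this producer-consumer hand-off with the locking built in, and its blocking get() avoids the consumer's busy-wait loop above; a minimal sketch of the idea (the link fed in here is a placeholder, not a real image URL):

import queue
import threading

img_links_queue = queue.Queue()

def consumer():
    while True:
        img_url = img_links_queue.get()  # blocks until an item is available
        if img_url is None:              # sentinel value: stop this consumer
            break
        print('Would download:', img_url.split('/')[-1])

threads = [threading.Thread(target=consumer) for _ in range(10)]
for t in threads:
    t.start()
# Producers would call img_links_queue.put(link) for every picture link;
# here we feed one placeholder link, then one sentinel per consumer.
img_links_queue.put('http://img.netbian.com/file/example.jpg')
for _ in range(10):
    img_links_queue.put(None)
for t in threads:
    t.join()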
