Getting started with Python crawler [12]: multi thread crawling of doutula expression pack

Keywords: Python Windows

Doutula emoticon bag multi thread crawling - write in front

I found that many people are crawling through a website called doutula. There are many expression packs in it, and then I look at them. There are various ways to realize them. Today I will implement a multi-threaded version for you. Key technology point aiohttp, you can take a look at my previous article, and then learn it.

The website will not be analyzed, but only to find the rule, splicing URL, matching key points, and then crawling.

Doutula emoticon package multi thread crawling and rolling code

First, quickly import the modules we need. Unlike other articles, I put the same expression under the same folder, so I need to import the os module.

import asyncio
import aiohttp
from lxml import etree
import os

Write the main entry method

if __name__ == '__main__':
    url_format = "http://www.doutula.com/article/list/?page={}"
    urls = [url_format.format(index) for index in range(1,586)]
    loop = asyncio.get_event_loop()
    tasks = [x_get_face(url) for url in urls]
    results = loop.run_until_complete(asyncio.wait(tasks))
Python Resource sharing qun 784758214 ,There is an installation package inside. PDF,Learning video, here is Python Learners' gathering place, zero foundation, advanced, welcome

We want to learn, not attack other servers, so limit the number of concurrent servers.

sema = asyncio.Semaphore(3)

async def x_get_face(url):
    with(await sema):
        await get_face(url)

Finally, the operation is as fierce as a tiger. You can complete all the codes and get it done. There is nothing new in this part. Look for the picture link and download it.

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}
async def get_face(url):
    print("In operation{}".format(url))
    async with aiohttp.ClientSession() as s:
        async with s.get(url,headers=headers,timeout=5) as res:
            if res.status==200:
                html = await res.text()
                html_format = etree.HTML(html)

                hrefs = html_format.xpath("//a[@class='list-group-item random_list']")

                for link in hrefs:
                    url = link.get("href")
                    title = link.xpath("div[@class='random_title']/text()")[0]  # Get file header

                    path = './biaoqings/{}'.format(title.strip())  # Hard coded, you need to create a folder of biaoqings in the root directory of the project first

                    if not os.path.exists(path):
                        os.mkdir(path)
                    else:
                        pass

                    async with s.get(url, headers=headers, timeout=3) as res:
                        if res.status == 200:
                            new_html = await res.text()

                            new_html_format = etree.HTML(new_html)
                            imgs = new_html_format.xpath("//div[@class='artile_des']")
                            for img in imgs:
                                try:
                                    img = img.xpath("table//img")[0]
                                    img_down_url = img.get("src")
                                    img_title = img.get("alt")
                                except Exception as e:
                                    print(e)

                                async with s.get(img_down_url, timeout=3) as res:
                                    img_data = await res.read()
                                    try:
                                        with open("{}/{}.{}".format(path,img_title.replace('\r\n',""),img_down_url.split('.')[-1]),"wb+") as file:
                                            file.write(img_data)
                                    except Exception as e:
                                        print(e)

                        else:
                            pass

            else:
                print("Page access failed")
Python Resource sharing qun 784758214 ,There is an installation package inside. PDF,Learning video, here is Python Learners' gathering place, zero foundation, advanced, welcome

Wait, a large number of expression packs come to my bowl.

Posted by coder500 on Thu, 17 Oct 2019 09:05:13 -0700