Threads, coroutines & asynchrony in crawlers

Keywords: Python, network, Windows, pip

Thread pool

  • from multiprocessing.dummy import Pool
  • pool.map(callback, iterable) asynchronously applies the callback to every element of the iterable
    • Note: the callback must take exactly one parameter
  • Asynchrony pays off mainly for time-consuming operations (a runnable sketch follows the snippet below)
from multiprocessing.dummy import Pool

pool = Pool(3)  # Instantiate the thread pool object; 3 is the maximum number of threads in the pool
# Parameter 1: the callback function (function name only, no parentheses); parameter 2: an iterable (e.g. a list)
# map passes each element of the iterable to the callback, which performs some operation on it
pool.map(callback, iterable)
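
A minimal runnable sketch (the square function and the numbers list are illustrative, not from the original):

from multiprocessing.dummy import Pool  # thread-based Pool exposing the multiprocessing API

def square(n):  # the callback: exactly one parameter
    return n * n

pool = Pool(3)  # at most 3 worker threads
results = pool.map(square, [1, 2, 3, 4])  # blocks until every element has been processed
print(results)  # [1, 4, 9, 16]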

Test: synchronous & asynchronous efficiency

Build a Flask app, start the server yourself, and test the execution time.

  • Create a new server.py
from flask import Flask, render_template
import time

app = Flask(__name__)


@app.route('/xx')
def index_1():
    time.sleep(2)
    return render_template('test.html')


@app.route('/yy')
def index_2():
    time.sleep(2)
    return render_template('test.html')


@app.route('/oo')
def index_3():
    time.sleep(2)
    return render_template('test.html')


if __name__ == '__main__':
    app.run(debug=True)
  • Create a new templates folder and put an HTML file in it. I use test.html, with some sample data:
<html lang="en">
<head>
    <meta charset="UTF-8"/>
    <title>test</title>
</head>
<body>
<div>
    <p>Baili Shouyue</p>
</div>
<div class="song">
    <p>Li Qingzhao</p>
    <p>Wang Anshi</p>
    <p>Su Shi</p>
    <p>Liu Zongyuan</p>
    <a href="http://Www.song.com / "title =" Zhao Kuangyin "target ="
        <span>this is span</span>
        The Song Dynasty is the most powerful Dynasty, not the military, but the economy is very strong, the people are very rich</a>
    <a href="" class="du">Clouds can block out the sun,It's sad to see Chang'an</a>
    <img src="http://www.baidu.com/meinv.jpg" alt=""/>
</div>
<div class="tang">
    <ul>
        <li><a href="http://Www.baidu.com "title =" Qing "> it rains in the clear and bright season, and pedestrians on the road want to break their souls. Ask where there is a restaurant. Shepherd boy points to Xinghua village</a></li>
        <li><a href="http://Www.163.com "title =" Qin "> in the Qin Dynasty, the moon was in the Ming Dynasty, and in the Han Dynasty, people from the long march had not yet returned, but the flying General of the dragon city was there, and Hu Madu was not taught Yinshan Mountain</a></li>
        <li><a href="http://Www.126.com "id =" Qi "> it's a common sight in Qiwang's house. Cui jiutang has heard it several times before. It's just a beautiful scenery in the south of the Yangtze River. It's the time to meet you when the flowers fall</a></li>
        <li><a href="http://Www.sina.com "class =" Du "> Du Fu</a></li>
        <li><a href="http://Www.du du.com "class =" Du "> Du Mu</a></li>
        <li><b>Du Xiaoyue</b></li>
        <li><i>Spend honeymoon</i></li>
        <li><a href="http://Www.haha. Com "id =" Feng "> Phoenix on the stage, Phoenix flows to Taikong River, Wu palace, flowers and plants bury the path, Jin Dynasty, chengguqiu</a></li>
    </ul>
</div>
</body>
</html>
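
With server.py and templates/test.html in place, start the server (python server.py) and sanity-check a route (a sketch; it assumes the Flask app is running locally):

import requests

# Each route sleeps 2 seconds before rendering test.html
resp = requests.get('http://127.0.0.1:5000/xx')
print(resp.status_code)  # expect 200 after roughly 2 seconds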

Synchronous & asynchronous execution time

import requests
from bs4 import BeautifulSoup
import time
# Thread pool module
from multiprocessing.dummy import Pool

urls = [
    'http://127.0.0.1:5000/xx',
    'http://127.0.0.1:5000/yy',
    'http://127.0.0.1:5000/oo',
]

# Data crawling: return the source code of the requested page
def get_request(url):
    page_text = requests.get(url=url).text
    return page_text

# Data parsing: return the tag's text
def parse(page_text):
    soup = BeautifulSoup(page_text, 'lxml')
    return soup.select('#feng')[0].text

# Synchronous version
if __name__ == '__main__':
    start = time.time()
    for url in urls:
        page_text = get_request(url)
        text_data = parse(page_text)
        print(text_data)
    print(time.time() - start)
"""
Execution result:
Phoenix on the stage, Phoenix on the Taikong River, Wu palace, flowers and plants, ancient hills in the Jin Dynasty
 Phoenix on the stage, Phoenix on the Taikong River, Wu palace, flowers and plants, ancient hills in the Jin Dynasty
 Phoenix on the stage, Phoenix on the Taikong River, Wu palace, flowers and plants, ancient hills in the Jin Dynasty
6.056272029876709
"""

# Asynchronous version
if __name__ == '__main__':
    start = time.time()
    pool = Pool(3)  # Instantiate the thread pool object
    # Parameter 1: the callback function (function name only, no parentheses); parameter 2: a list
    # map passes each element of the list to the callback, which processes it
    page_text_list = pool.map(get_request, urls)
    text_data = pool.map(parse, page_text_list)
    for i in text_data:
        print(i)
    print(time.time() - start)
"""
Execution result:
Phoenix on the stage, Phoenix on the Taikong River, Wu palace, flowers and plants, ancient hills in the Jin Dynasty
 Phoenix on the stage, Phoenix on the Taikong River, Wu palace, flowers and plants, ancient hills in the Jin Dynasty
 Phoenix on the stage, Phoenix on the Taikong River, Wu palace, flowers and plants, ancient hills in the Jin Dynasty
2.0537397861480713

Can increase 0.01 seconds without for
"""

To sum up: with three 2-second requests, the synchronous version takes about 6 s while the thread pool overlaps them into about 2 s; asynchronous execution is significantly more efficient.

Case: crawling Pear Video with a thread pool

  • Approach
    • Crawl the URLs of the video detail pages and store them in an iterable
    • Request each detail page to obtain the real video address
      • Note: the video on the detail page is generated dynamically by JS code, so regular-expression parsing is required
    • Write a callback that downloads each video's binary data and persists it to disk
import requests
from lxml import etree
from multiprocessing.dummy import Pool
import re
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

# URL of the Pear Video "fortune" (finance) channel
main_url = 'https://www.pearvideo.com/category_3'
# Parse out the detail-page URL of each video listed on this channel page
main_page_text = requests.get(url=main_url, headers=headers).text
tree = etree.HTML(main_page_text)
li_list = tree.xpath('//*[@id="listvideoListUl"]/li')
# Collect the {url, name} dicts that the thread pool will consume
video_urls = []
for li in li_list:
    # Specific address and video title of video detail page
    detail_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
    name = li.xpath('./div/a/div[2]/text()')[0]
    # Request details page
    page_text = requests.get(url=detail_url, headers=headers).text
    # The video of the video detail page is generated dynamically by js code, using regular parsing
    ex = 'srcUrl="(.*?)",vdoUrl='
    video_url = re.findall(ex, page_text, re.S)[0]  # re.findall returns a list; take the first match
    dic = {
        'url': video_url,
        'name': name,
    }
    video_urls.append(dic)

# Callback function; receives one dict like {'url': ..., 'name': ...}
def get_video(video_info):
    # Request the video address and persist the binary data
    video_data = requests.get(url=video_info['url'], headers=headers).content
    file_name = "./video/" + video_info['name'] + ".mp4"
    with open(file_name, 'wb') as f:
        f.write(video_data)
        print(video_info['name'], "Download completed!")

# Create a folder to store videos
dir_name = 'video'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
# Instantiate thread pool
pool = Pool(4)
pool.map(get_video, video_urls)
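
pool.map() blocks until every video has been processed; as optional cleanup (not in the original code), the pool can be closed explicitly afterwards:

pool.close()  # no more tasks may be submitted to the pool
pool.join()   # wait for all worker threads to exit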

Single thread + multi-task asynchronous coroutines

asyncio (key point)

Special functions

  • If a function's definition is modified by the async keyword, it is a "special function".
  • What makes it special:
    • Calling the function does not immediately execute its body.
    • The call returns a coroutine object.

Coroutines

  • A coroutine is an object: calling a special function returns a coroutine object.

  • Coroutine object == special function

    import asyncio
    from time import sleep
    
    async def get_request(url):
        print('Requesting:', url)
        sleep(2)
        print('Request succeeded:', url)
        return '666'
    # Calling the special function returns a coroutine object
    g = get_request("https://www.qq.com")
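    # A quick check (not in the original): printing g shows a coroutine object,
    # and "Requesting:" was never printed -- the function body did not run
    print(g)  # <coroutine object get_request at 0x...>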

Task object

  • A further encapsulation of the coroutine object (i.e. a higher-level coroutine object)

  • Task object == coroutine object == special function (all represent the same fixed unit of work)

    asyncio.ensure_future(coroutine_object)

    task = asyncio.ensure_future(g)

    # g: the coroutine object created above
  • Bind callback:

    # Define a callback function for a task
    def callback(task):
        task.result()  # The return value of the special function bound to this task object
        print("I'm callback:", task)

    task.add_done_callback(funcName)

    # task: the task object
    # funcName: the name of the callback function
    • The callback funcName must take exactly one parameter, which is the current task object (a runnable sketch follows)
      • parameter.result(): returns the return value of the special function bound to the current task object
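
A compact, self-contained sketch of the callback flow (the URL is a placeholder; it uses the same pre-3.10 asyncio style as the rest of this post):

    import asyncio

    async def get_request(url):
        return 'page of ' + url

    def callback(task):
        # task.result() is the special function's return value
        print("I'm callback:", task.result())

    loop = asyncio.get_event_loop()
    task = asyncio.ensure_future(get_request('https://www.qq.com'))
    task.add_done_callback(callback)
    loop.run_until_complete(task)  # prints: I'm callback: page of https://www.qq.com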

Event loop object

  • Create event loop object

  • The task object needs to be registered with the event loop object

    # Create event loop object
    loop = asyncio.get_event_loop()
    # Register/load the task object into the event loop, then start the loop
    loop.run_until_complete(task)  # Loads the task and starts the event loop

    # task: the task object
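
On Python 3.7+, asyncio.run() bundles loop creation, running, and closing into one call; a minimal sketch (an alternative to the explicit loop style used throughout this post):

    import asyncio

    async def get_request(url):
        return '666'

    # asyncio.run() creates an event loop, runs the coroutine to completion, and closes the loop
    result = asyncio.run(get_request('https://www.qq.com'))
    print(result)  # 666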

await

await: when the blocking operation finishes, it lets the loop come back and execute the code after the blocking point.

Suspending

asyncio.wait(): suspends the task objects, making each give up the CPU so the loop can switch among tasks.

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

asyncio.wait   # Suspend operation
tasks   # Task object list

Key points for attention

  • Code from modules that do not support asynchrony must not appear inside a special function's body, otherwise the asynchronous effect is broken (a comparison sketch follows below)
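
The sketch below (not from the original) contrasts a blocking time.sleep() with the awaitable asyncio.sleep(): with two tasks, the blocking version runs them back-to-back (~4 s) while the awaitable version overlaps them (~2 s).

import asyncio
import time

async def blocking(n):
    time.sleep(2)           # blocking: does not yield to the event loop

async def non_blocking(n):
    await asyncio.sleep(2)  # awaitable: yields control while sleeping

loop = asyncio.get_event_loop()

start = time.time()
loop.run_until_complete(asyncio.wait([asyncio.ensure_future(blocking(i)) for i in range(2)]))
print('time.sleep:', time.time() - start)     # ~4 s: the tasks ran sequentially

start = time.time()
loop.run_until_complete(asyncio.wait([asyncio.ensure_future(non_blocking(i)) for i in range(2)]))
print('asyncio.sleep:', time.time() - start)  # ~2 s: the tasks overlapped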

aiohttp (key point)

  • requests: does not support asynchrony and must not appear inside special functions.

  • aiohttp: a network-request module that supports asynchrony, used together with asyncio

    • pip install aiohttp
  • Code writing

    • Write out the basic structure
    import asyncio
    import aiohttp
    
    # Asynchronous network request based on aiohttp
    async def get_requests(url):
        # Instantiate a session object
        with aiohttp.ClientSession() as aio:
            # with aio.get/post(url=url, headers=headers, data/params, proxy='http://ip:port') as response:
            with aio.get(url=url) as response:
                # text() gets the response data as a string
                # read() gets response data of type bytes
                page_text = await response.text()
                return page_text
    • Details to add (see the complete code below)
      • Add the async keyword before each with
      • Add the await keyword before each blocking operation
  • Complete code

    import asyncio
    import aiohttp
    
    # Asynchronous network request based on aiohttp
    async def get_requests(url):
        # Instantiate a session object
        async with aiohttp.ClientSession() as aio:
            # async with aio.get/post(url=url, headers=headers, data/params, proxy='http://ip:port') as response:
            async with await aio.get(url=url) as response:
                # text() gets the response data as a string
                # read() gets response data of type bytes
                page_text = await response.text()
                return page_text
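
  • A quick way to exercise the function on its own (a sketch; any reachable URL works, e.g. a route on the Flask test server above):

    loop = asyncio.get_event_loop()
    page_text = loop.run_until_complete(get_requests('http://127.0.0.1:5000/xx'))
    print(page_text[:60])  # the first characters of the fetched page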

Single-task coroutine run

import asyncio
from time import sleep

async def get_request(url):
    print('Requesting:', url)
    sleep(2)
    print('Request succeeded:', url)
    return '666'

# Define a callback function for a task
def callback(task):
    print("I'm callback:", task)

# Calling the special function returns a coroutine object
g = get_request("https://www.qq.com")

# Create a task object
task = asyncio.ensure_future(g)
"""

# Bind callback function to task object
task.add_done_callback(callback)

# Create event loop object
loop = asyncio.get_event_loop()
# Register / load the task object into the event loop object, and then you need to start the loop object
loop.run_until_complete(task)  # Used to load and start an event cycle
"""
//Execution result:
//Requesting: www,qq.com
//Requesting: www,qq.com
"""

Multi-task coroutine run

import asyncio
import time

start = time.time()
async def get_request(url):
    print('Requesting:', url)
    # await let the loop go back to execute the code after blocking when the blocking operation is over
    await asyncio.sleep(2)
    print('Request succeeded:', url)
    return '666'

urls = [
    'http://127.0.0.1:5000/xx',
    'http://127.0.0.1:5000/yy',
    'http://127.0.0.1:5000/oo',
]
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
# When registering a task list with the event loop, the list must be suspended via asyncio.wait()
# asyncio.wait() suspends the tasks so each gives up the CPU while blocked
loop.run_until_complete(asyncio.wait(tasks))
print('Total time consumption:', time.time() - start)
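
"""
Expected result (approximate; not captured from a real run):
Requesting: http://127.0.0.1:5000/xx
Requesting: http://127.0.0.1:5000/yy
Requesting: http://127.0.0.1:5000/oo
Request succeeded: ... (all three, roughly 2 seconds later)
Total time consumption: ~2 s, not 6 s, because the three sleeps overlap
"""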

Single-thread & multi-task asynchronous crawler

Self test based on Flask

  • Reuse the setup from "Test: synchronous & asynchronous efficiency" above: start the Flask project as described there, then run the following code.
import asyncio
import time
import aiohttp
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

urls = [
    'http://127.0.0.1:5000/xx',
    'http://127.0.0.1:5000/yy',
    'http://127.0.0.1:5000/oo',
]

start = time.time()

"""
# Naive version: initiate the request and obtain the response data (does NOT achieve asynchrony)
async def get_requests(url):
    # requests is a module that does not support asynchrony
    page_text = requests.get(url).text
    return page_text
"""

async def get_requests(url):
    """
    Asynchronous network request based on aiohttp
    :param url: 
    :return: 
    """
    # Instantiate a session object
    async with aiohttp.ClientSession() as aio:
        # async with aio.get/post(url=url, headers=headers, data/params, proxy='http://ip:port') as response:
        async with await aio.get(url=url) as response:
            # text() gets the response data as a string
            # read() gets response data of type bytes
            page_text = await response.text()
            return page_text

def parse(task):
    """
    Define the callback function
    :param task:
    :return:
    """
    page_text = task.result()  # Get the return value of the special function (the source code data of the requested page)
    tree = etree.HTML(page_text)
    content = tree.xpath('//*[@id="feng"]/text()')[0]
    print(content)

tasks = []
for url in urls:
    c = get_requests(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print('Total time consumption:', time.time() - start)

Case: crawling Pear Video with single-thread, multi-task asynchrony

  • Follows the same approach as the thread-pool Pear Video case above
import asyncio
import time
import aiohttp
from lxml import etree
import re
import os
import requests

# The time module measures how long the whole crawl takes
start = time.time()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
# URL of the Pear Video "fortune" (finance) channel
main_url = 'https://www.pearvideo.com/category_3'
main_page_text = requests.get(url=main_url, headers=headers).text
tree = etree.HTML(main_page_text)
li_list = tree.xpath('//*[@id="listvideoListUl"]/li')
urls = []  # [{'url': video_url,'name': name},{}...]
for li in li_list:
    detail_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
    name = li.xpath('./div/a/div[2]/text()')[0]
    page_text = requests.get(url=detail_url, headers=headers).text
    # The video of the video detail page is generated dynamically by js code
    ex = 'srcUrl="(.*?)",vdoUrl='
    video_url = re.findall(ex, page_text, re.S)[0]  # re.findall returns a list; take the first match
    dic = {
        'url': video_url,
        'name': name,
    }
    urls.append(dic)

# Asynchronous network request based on aiohttp; receives one dict like {'url': ..., 'name': ...}
async def get_requests(video_info):
    # Instantiate a session object
    async with aiohttp.ClientSession() as aio:
        # async with aio.get/post(url=url, headers=headers, data/params, proxy='http://ip:port') as response:
        async with await aio.get(url=video_info['url'], headers=headers) as response:
            # text() gets the response data as a string
            # read() gets the response data as bytes
            page_read = await response.read()
            dic = {
                "page_read": page_read,
                "name": video_info['name']
            }
            return dic


def parse(task):
    """
    Define the callback function
    :param task:
    :return:
    """
    dic_info = task.result()  # Get the special function's return value (a dict with the video bytes and name)
    file_name = "./video/" + dic_info["name"] + ".mp4"
    with open(file_name, 'wb') as f:
        f.write(dic_info['page_read'])
        print(dic_info["name"], "Download completed!")

tasks = []
for url in urls:
    c = get_requests(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)
    tasks.append(task)

dir_name = 'video'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print('Total time consumption:', time.time() - start)
