Python crawler -- multi-task asynchronous coroutines, hurry up, hurry up

Keywords: Python Windows network

Multi-task asynchronous coroutines: asyncio

Special functions:
    - A function whose definition is decorated with the async keyword
    - Special points:
        - Calling a special function returns a coroutine object
        - The statements inside a special function are not executed immediately when it is called

- Coroutine object
    - coroutine object == the call of a special function; a coroutine represents a specific set of operations (see the sketch below).
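
For example, a minimal sketch (the function name demo is only an illustration): calling a special function builds a coroutine object but does not run its body.

import asyncio

# Special function: defined with async, so calling it returns a coroutine object
async def demo():
    print('running')
    return 'done'

c = demo()        # nothing is printed here; c is only a coroutine object
print(c)          # <coroutine object demo at 0x...>
asyncio.run(c)    # only now is the function body executed (Python 3.7+)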
    
- Task object
    - A higher-level coroutine (a further encapsulation of a coroutine object)
        - Task object == coroutine object == the call of a special function
    - Bind a callback:
        - task.add_done_callback(callback)
            - The callback receives one parameter: the task object it is bound to
            - task.result(): returns the return value of the special function corresponding to the task object
            
- Event loop object
    - Create an event loop object
    - Register task objects into the loop object and start it
    - Role: the loop executes all task objects registered in it asynchronously

- Suspend: means handing over the right to use the CPU so another task can run.

await: used inside a special function before every blocking (awaitable) operation
asyncio.wait(): gives each task permission to be suspended

# [Key] Inside a special function, no code from modules that do not support async (such as time or requests) may appear. If it does, the whole asynchronous effect is broken!!!

Use of asyncio

import asyncio
import time
from time import sleep

# Special functions
async def get_request(url):
    print('Downloading: ',url)
    sleep(2)
    print('Download completed: ',url)
    return 'page_text'

# Callback function
def parse(task):
    # Parameter represents task object
    print('i am callback',task.result())


start = time.time()
# Call special functions
func = get_request('www.xx.com')

# Create task object
task = asyncio.ensure_future(func)

# Bind callback function to task object
task.add_done_callback(parse)

# Create an event loop object
loop = asyncio.get_event_loop()

# Let loop perform a task
loop.run_until_complete(task)

print("Total time consuming:",time.time()-start) #Total time: 2.0017831325531006

Multi-task coroutines

import asyncio
import time

# Special functions
async def get_request(url):
    print('Downloading',url)
    # time.sleep(2)  -- modules that do not support async would break the whole asynchronous effect
    await asyncio.sleep(2)
    print('Download complete',url)
    return 'page_text'

def parse(task):
    print(task.result())


start = time.time()
urls = ['www.xxx1.com','www.xxx2.com','www.xxx3.com']

tasks = []  # stores the task objects
for url in urls:
    # Call special functions
    func = get_request(url)

    # Create task object
    task = asyncio.ensure_future(func)

    # Bind callback function to task object
    task.add_done_callback(parse)
    tasks.append(task)

# Create event loop object
loop = asyncio.get_event_loop()

# Perform tasks
loop.run_until_complete(asyncio.wait(tasks))
print('Total time consuming:',time.time()-start) #2.0015313625335693
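
An alternative to asyncio.wait() is asyncio.gather(), which also collects the return values of the special functions in order, so the callbacks become optional. A minimal sketch, assuming the same get_request and urls from the script above:

# a minimal sketch using asyncio.gather instead of asyncio.wait
loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*[get_request(url) for url in urls]))
print(results)  # ['page_text', 'page_text', 'page_text'], in the same order as urls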

Use of aiohttp

- requests does not support async
- aiohttp is a network request module that does support async
    - Environment setup: pip install aiohttp
    - Coding process:
        - General structure:
             with aiohttp.ClientSession() as s:
                #s.get(url,headers,params,proxy="http://ip:port")
                with s.get(url) as response:
                    #response.read() binary (. content)
                    page_text = response.text()
                    return page_text
                
    - Additional details:
         - Add async in front of every with
         - Add await in front of every blocking operation
                async with aiohttp.ClientSession() as s:
                    #s.get(url,headers,params,proxy="http://ip:port")
                    async with await s.get(url) as response:
                        #response.read() binary (. content)
                        page_text = await response.text()
                        return page_text
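
Put together, a minimal runnable sketch of this structure (the URL is only an example):

import asyncio
import aiohttp

# Special function: fetch one page with aiohttp
async def get_request(url):
    async with aiohttp.ClientSession() as s:
        # headers, params and proxy="http://ip:port" can be passed to s.get() as well
        async with await s.get(url) as response:
            # use await response.read() for binary content (like .content in requests)
            page_text = await response.text()
            return page_text

loop = asyncio.get_event_loop()
page_text = loop.run_until_complete(get_request('https://www.baidu.com'))
print(len(page_text))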

Asynchronous coroutine crawler example

# Use multi-task asynchronous coroutines to fetch the page source of Baidu, Sogou, JD and Taobao, then do a simple parse
import asyncio
import requests
import time
from lxml import etree
urls = ['https://www.baidu.com','http://www.taobao.com/','http://www.jd.com/','https://www.sogou.com/']
headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
# Special functions
async def get_request(url):
    print('Downloading',url)
    page_text = requests.get(url,headers=headers).text
    print(url,'Download complete')
    return page_text

# Callback function
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    div = tree.xpath('//div')
    print(div)

start = time.time()
tasks = []  # stores the task objects
for url in urls:
    func = get_request(url)
    task = asyncio.ensure_future(func)
    task.add_done_callback(parse)
    tasks.append(task)

# Create event loop object
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print('Total time consuming:',time.time()-start)

# The result shows that execution is not asynchronous: requests is not an async module, so the whole program does not run asynchronously
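
If you must keep using requests, one workaround (not the approach this article takes) is to push each blocking call into a thread pool with loop.run_in_executor(). A minimal sketch, assuming the same urls and headers defined above:

import asyncio
import requests
import time

# Special function: wrap blocking requests.get in the loop's default thread pool
async def get_request(url):
    loop = asyncio.get_event_loop()
    response = await loop.run_in_executor(None, lambda: requests.get(url, headers=headers))
    return response.text

start = time.time()
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(get_request(url)) for url in urls]
loop.run_until_complete(asyncio.wait(tasks))
print('Total time consuming:', time.time() - start)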

Multi-task coroutine crawler based on aiohttp

# Use multi-task asynchronous coroutines to fetch the page source of Baidu, Sogou, JD and Taobao, then do a simple parse
import asyncio
import time
import aiohttp
from lxml import etree
urls = ['https://www.baidu.com','http://www.taobao.com/','http://www.jd.com/','https://www.sogou.com/']
headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
# Special functions
async def get_request(url):
    async with aiohttp.ClientSession() as s:
        # s.get(url,headers,params,proxy="http://ip:port")
        async with await s.get(url,headers=headers) as response:
            print('Downloading', url)
            # response.read() binary (. content)
            page_text = await response.text()
            print(url, 'Download complete')
            return page_text

# Callback function
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    div = tree.xpath('//div')
    print(div)

start = time.time()
tasks = []  # stores the task objects
for url in urls:
    func = get_request(url)
    task = asyncio.ensure_future(func)
    task.add_done_callback(parse)
    tasks.append(task)

# Create event loop object
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print('Total time consuming:',time.time()-start) #Total time: 3.0848371982574463
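
One possible refinement (not in the original article): the script above opens a new ClientSession for every URL, while aiohttp recommends sharing one session across requests. A minimal sketch of that variant, assuming the same urls and headers defined above:

import asyncio
import aiohttp

# Special function: reuse one shared ClientSession for all requests
async def get_request(s, url):
    async with s.get(url, headers=headers) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as s:
        tasks = [asyncio.ensure_future(get_request(s, url)) for url in urls]
        return await asyncio.gather(*tasks)

pages = asyncio.run(main())
print([len(p) for p in pages])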
