Thread pool
- from multiprocessing.dummy import Pool
- The callback function asynchronously performs some operation on each element of an iterable object
- Note: the callback must take exactly one parameter
- Asynchronous execution is mainly useful for time-consuming operations
from multiprocessing.dummy import Pool

pool = Pool(3)  # Instantiate the thread pool object; 3 is the maximum number of threads in the pool

# Parameter 1: callback function (function name only, without parentheses); parameter 2: a list
# The callback receives one element of the list at a time and performs some operation on it
pool.map(callback, list)
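For reference, a minimal runnable sketch of this pattern (the callback square and the list nums are hypothetical names used only for illustration):

from multiprocessing.dummy import Pool  # thread pool with the multiprocessing-style API

def square(n):
    # The callback takes exactly one parameter: one element of the list
    return n * n

nums = [1, 2, 3, 4]

pool = Pool(3)                    # at most 3 worker threads
results = pool.map(square, nums)  # blocks until every element has been processed
print(results)                    # [1, 4, 9, 16]
pool.close()
pool.join()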
Test: synchronous & asynchronous efficiency
Build a Flask app, start the service yourself, and test the execution time
- Create a new server.py
from flask import Flask, render_template
import time

app = Flask(__name__)

@app.route('/xx')
def index_1():
    time.sleep(2)
    return render_template('test.html')

@app.route('/yy')
def index_2():
    time.sleep(2)
    return render_template('test.html')

@app.route('/oo')
def index_3():
    time.sleep(2)
    return render_template('test.html')

if __name__ == '__main__':
    app.run(debug=True)
- Create a new templates folder and add an HTML file inside it. Here I use test.html and fill it with some sample data
<html lang="en"> <head> <meta charset="UTF-8"/> <title>test</title> </head> <body> <div> <p>Baili Convention</p> </div> <div class="song"> <p>Li Qingzhao</p> <p>Wang Anshi</p> <p>Su Shi</p> <p>Liu Zongyuan</p> <a href="http://Www.song.com / "title =" Zhao Kuangyin "target =" <span>this is span</span> The Song Dynasty is the most powerful Dynasty, not the military, but the economy is very strong, the people are very rich</a> <a href="" class="du">Clouds can block out the sun,It's sad to see Chang'an</a> <img src="http://www.baidu.com/meinv.jpg" alt=""/> </div> <div class="tang"> <ul> <li><a href="http://Www.baidu.com "title =" Qing "> it rains in the clear and bright season, and pedestrians on the road want to break their souls. Ask where there is a restaurant. Shepherd boy points to Xinghua village</a></li> <li><a href="http://Www.163.com "title =" Qin "> in the Qin Dynasty, the moon was in the Ming Dynasty, and in the Han Dynasty, people from the long march had not yet returned, but the flying General of the dragon city was there, and Hu Madu was not taught Yinshan Mountain</a></li> <li><a href="http://Www.126.com "id =" Qi "> it's a common sight in Qiwang's house. Cui jiutang has heard it several times before. It's just a beautiful scenery in the south of the Yangtze River. It's the time to meet you when the flowers fall</a></li> <li><a href="http://Www.sina.com "class =" Du "> Du Fu</a></li> <li><a href="http://Www.du du.com "class =" Du "> Du Mu</a></li> <li><b>Du Xiaoyue</b></li> <li><i>Spend honeymoon</i></li> <li><a href="http://Www.haha. Com "id =" Feng "> Phoenix on the stage, Phoenix flows to Taikong River, Wu palace, flowers and plants bury the path, Jin Dynasty, chengguqiu</a></li> </ul> </div> </body> </html>
Synchronous & asynchronous execution time
import requests
from bs4 import BeautifulSoup
import time
# Thread pool module
from multiprocessing.dummy import Pool

urls = [
    'http://127.0.0.1:5000/xx',
    'http://127.0.0.1:5000/yy',
    'http://127.0.0.1:5000/oo',
]

# Data crawling: return the page source of the crawled page
def get_request(url):
    page_text = requests.get(url=url).text
    return page_text

# Data parsing: return the text of the target tag
def parse(page_text):
    soup = BeautifulSoup(page_text, 'lxml')
    return soup.select('#feng')[0].text

# Synchronous code
if __name__ == '__main__':
    start = time.time()
    for url in urls:
        page_text = get_request(url)
        text_data = parse(page_text)
        print(text_data)
    print(time.time() - start)
    """
    Execution result (the line below is printed three times, once per URL):
    Phoenixes played on Phoenix Terrace; the phoenixes are gone, the terrace is empty, the river flows on. Wu palace flowers and grass bury the quiet paths; Jin Dynasty caps and gowns have become ancient mounds
    6.056272029876709
    """

# Asynchronous code
if __name__ == '__main__':
    start = time.time()
    pool = Pool(3)  # Instantiate the thread pool object
    # Parameter 1: callback function (function name only, without parentheses); parameter 2: a list
    # The callback receives one element of the list at a time and performs some operation on it
    page_text_list = pool.map(get_request, urls)
    text_data = pool.map(parse, page_text_list)
    for i in text_data:
        print(i)
    print(time.time() - start)
    """
    Execution result (the line below is printed three times, once per URL):
    Phoenixes played on Phoenix Terrace; the phoenixes are gone, the terrace is empty, the river flows on. Wu palace flowers and grass bury the quiet paths; Jin Dynasty caps and gowns have become ancient mounds
    2.0537397861480713
    (about 0.01 s faster without the final for loop)
    """
To sum up: the asynchronous version is significantly more efficient
Case: crawling Pear Video videos based on a thread pool
- Approach analysis
- Crawl the URLs of the video detail pages and store them in an iterable object
- Send requests to those detail pages to get the real video addresses
- Note: the video URL on the detail page is generated dynamically by JS code, so regex parsing is required
- Write a callback that downloads the video's binary data and persists it to disk
import requests
from lxml import etree
from multiprocessing.dummy import Pool
import re
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
# URL of the Pear Video category page
main_url = 'https://www.pearvideo.com/category_3'

# Parse out the detail-page links of the videos in this category
main_page_text = requests.get(url=main_url, headers=headers).text
tree = etree.HTML(main_page_text)
li_list = tree.xpath('//*[@id="listvideoListUl"]/li')

# Build the list of videos to hand to the thread pool
video_urls = []
for li in li_list:
    # Detail-page URL and title of the video
    detail_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
    name = li.xpath('./div/a/div[2]/text()')[0]
    # Request the detail page
    page_text = requests.get(url=detail_url, headers=headers).text
    # The video URL on the detail page is generated dynamically by JS code, so parse it with a regex
    ex = 'srcUrl="(.*?)",vdoUrl='
    video_url = re.findall(ex, page_text, re.S)[0]  # re.findall returns a list
    dic = {
        'url': video_url,
        'name': name,
    }
    video_urls.append(dic)

# Callback function
def get_video(url):
    # Request the video address and persist the binary file
    video_data = requests.get(url=url['url'], headers=headers).content
    file_name = "./video/" + url['name'] + ".mp4"
    with open(file_name, 'wb') as f:
        f.write(video_data)
    print(url['name'], "Download completed!")

# Create a folder to store the videos
dir_name = 'video'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)

# Instantiate the thread pool and download the videos concurrently
pool = Pool(4)
pool.map(get_video, video_urls)
Single thread + multi-task asynchronous coroutines
asyncio (key)
Special functions
- A function whose definition is modified by the async keyword is a special function.
- Special points:
- After the function is called, the statements inside it are not executed immediately.
- Calling the function returns a coroutine object.
Coroutine
A coroutine is an object. When a special function is called, it returns a coroutine object.
- Coroutine object == special function
import asyncio
from time import sleep

async def get_request(url):
    print('Requesting:', url)
    sleep(2)
    print('Request succeeded:', url)
    return '666'

# Returns a coroutine object
g = get_request("https://www.qq.com")
Task object
It further wraps the coroutine object (i.e., it is a higher-level coroutine object)
- Task object == coroutine object == special function (all represent a fixed form of task)
asyncio.ensure_future(coroutine_object)

task = asyncio.ensure_future(g)  # g: coroutine object
- Bind a callback:
# Define a callback function for a task
def callback(task):
    task.result()  # The return value of the special function bound to the current task object
    print("I'm callback:", task)

task.add_done_callback(funcName)
# task: task object
# funcName: the name of the callback function (here it would be callback)
- The callback function funcName must take exactly one parameter, which represents the current task object
- parameter.result(): returns the return value of the special function bound to the current task object
Event loop object
Create event loop object
- The task object needs to be registered with the event loop object
# Create an event loop object
loop = asyncio.get_event_loop()

# Register / load the task object into the event loop object, then start the loop
loop.run_until_complete(task)  # Loads the task and starts the event loop
# task: task object
await
await: when the blocking operation finishes, the event loop comes back and executes the code after the blocking point.
Suspend
asyncio.wait(): a suspend operation; it makes the current task objects give up control of the CPU.
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
# asyncio.wait: suspend operation
# tasks: list of task objects
Key points for attention
- Code from modules that do not support asynchrony must not appear inside the body of a special function, otherwise the asynchronous effect is interrupted (see the sketch below)
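As an illustration of this point, here is a minimal sketch written in the same get_event_loop / ensure_future style used throughout this section (timings are approximate and depend on your machine): replacing the awaitable asyncio.sleep with the blocking time.sleep inside the special function makes the three tasks run one after another instead of concurrently.

import asyncio
import time

async def blocking_version(n):
    time.sleep(1)           # blocking call from a module with no async support: freezes the event loop
    return n

async def async_version(n):
    await asyncio.sleep(1)  # awaitable: the event loop can switch to other tasks in the meantime
    return n

loop = asyncio.get_event_loop()

start = time.time()
tasks = [asyncio.ensure_future(blocking_version(i)) for i in range(3)]
loop.run_until_complete(asyncio.wait(tasks))
print('blocking version:', time.time() - start)  # roughly 3 seconds

start = time.time()
tasks = [asyncio.ensure_future(async_version(i)) for i in range(3)]
loop.run_until_complete(asyncio.wait(tasks))
print('async version:', time.time() - start)     # roughly 1 second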
aiohttp (key)
- requests: does not support asynchrony and must not appear inside special functions.
- aiohttp: a network request module that does support asynchrony, used together with asyncio.
- pip install aiohttp
- Code writing
- Write out the basic structure
import asyncio
import aiohttp

# Asynchronous network request based on aiohttp
async def get_requests(url):
    # Instantiate a request (session) object
    # Note: the async / await details still need to be added, see the detail supplement below
    with aiohttp.ClientSession() as aio:
        # with aio.get/post(url=url, headers=headers, data/params, proxy='http://ip:port') as response:
        with aio.get(url=url) as response:
            # text() returns the response data as a string
            # read() returns the response data as bytes
            page_text = await response.text()
            return page_text
- Detail supplement (see the complete code below)
- Add async keyword before each with
- Add the await keyword before each blocking operation
- Complete code
import asyncio
import aiohttp

# Asynchronous network request based on aiohttp
async def get_requests(url):
    # Instantiate a request (session) object
    async with aiohttp.ClientSession() as aio:
        # with aio.get/post(url=url, headers=headers, data/params, proxy='http://ip:port') as response:
        async with await aio.get(url=url) as response:
            # text() returns the response data as a string
            # read() returns the response data as bytes
            page_text = await response.text()
            return page_text
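A minimal usage sketch for this special function, assuming the local Flask test server from the earlier section is running (the URL below is just an example):

if __name__ == '__main__':
    c = get_requests('http://127.0.0.1:5000/xx')  # coroutine object
    task = asyncio.ensure_future(c)               # task object
    loop = asyncio.get_event_loop()
    loop.run_until_complete(task)
    print(task.result())                          # the page source returned by get_requests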
Single-task coroutine operation
import asyncio
from time import sleep

async def get_request(url):
    print('Requesting:', url)
    sleep(2)
    print('Request succeeded:', url)
    return '666'

# Define a callback function for the task
def callback(task):
    print("I'm callback:", task)

# Returns a coroutine object
g = get_request("https://www.qq.com")

# Create a task object
task = asyncio.ensure_future(g)

# Bind the callback function to the task object
task.add_done_callback(callback)

# Create an event loop object
loop = asyncio.get_event_loop()

# Register / load the task object into the event loop object, then start the loop
loop.run_until_complete(task)  # Loads the task and starts the event loop

"""
Execution result:
Requesting: https://www.qq.com
Request succeeded: https://www.qq.com
I'm callback: <Task finished ...>
"""
Multi-task coroutine operation
import asyncio
import time

start = time.time()

async def get_request(url):
    print('Requesting:', url)
    # await: when the blocking operation finishes, the loop comes back and executes the code after the blocking point
    await asyncio.sleep(2)
    print('Request succeeded:', url)
    return '666'

urls = [
    'http://127.0.0.1:5000/xx',
    'http://127.0.0.1:5000/yy',
    'http://127.0.0.1:5000/oo',
]

tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
# When registering a task list with the event loop, the task list must be suspended
# asyncio.wait() suspends the tasks, making them give up control of the CPU
loop.run_until_complete(asyncio.wait(tasks))

print('Total time consumption:', time.time() - start)
Single-thread & multi-task asynchronous crawler
Self-test based on Flask
- Use the Flask project from the "Test: synchronous & asynchronous efficiency" section above: start the service as described there, then run the following code.
import asyncio
import time
import aiohttp
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
urls = [
    'http://127.0.0.1:5000/xx',
    'http://127.0.0.1:5000/yy',
    'http://127.0.0.1:5000/oo',
]
start = time.time()

"""
# Initiate the request and obtain the response data (not allowed: requests does not support async)
async def get_requests(url):
    page_text = requests.get(url).text
    return page_text
"""

async def get_requests(url):
    """
    Asynchronous network request based on aiohttp
    :param url:
    :return:
    """
    # Instantiate a request (session) object
    async with aiohttp.ClientSession() as aio:
        # with aio.get/post(url=url, headers=headers, data/params, proxy='http://ip:port') as response:
        async with await aio.get(url=url) as response:
            # text() returns the response data as a string
            # read() returns the response data as bytes
            page_text = await response.text()
            return page_text

def parse(task):
    """
    Callback function
    :param task:
    :return:
    """
    page_text = task.result()  # The return value of the special function (the source of the requested page)
    tree = etree.HTML(page_text)
    content = tree.xpath('//*[@id="feng"]/text()')[0]
    print(content)

tasks = []
for url in urls:
    c = get_requests(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print('Total time consumption:', time.time() - start)
Case: single-thread, multi-task asynchronous crawling of Pear Video videos
- The approach is the same as in the thread-pool Pear Video case above
import asyncio
import time
import aiohttp
from lxml import etree
import re
import os
import requests

# The time module is used to measure how long crawling the videos takes
start = time.time()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
# URL of the Pear Video category page
main_url = 'https://www.pearvideo.com/category_3'

main_page_text = requests.get(url=main_url, headers=headers).text
tree = etree.HTML(main_page_text)
li_list = tree.xpath('//*[@id="listvideoListUl"]/li')

urls = []  # [{'url': video_url, 'name': name}, {} ...]
for li in li_list:
    detail_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
    name = li.xpath('./div/a/div[2]/text()')[0]
    page_text = requests.get(url=detail_url, headers=headers).text
    # The video URL on the detail page is generated dynamically by JS code
    ex = 'srcUrl="(.*?)",vdoUrl='
    video_url = re.findall(ex, page_text, re.S)[0]  # re.findall returns a list
    dic = {
        'url': video_url,
        'name': name,
    }
    urls.append(dic)

# Asynchronous network request based on aiohttp
async def get_requests(url):
    # Instantiate a request (session) object
    async with aiohttp.ClientSession() as aio:
        # with aio.get/post(url=url, headers=headers, data/params, proxy='http://ip:port') as response:
        async with await aio.get(url=url['url'], headers=headers) as response:
            # text() returns the response data as a string
            # read() returns the response data as bytes
            page_read = await response.read()
            dic = {
                "page_read": page_read,
                "name": url['name']
            }
            return dic

def parse(task):
    """
    Callback function
    :param task:
    :return:
    """
    dic_info = task.result()  # The return value of the special function (the video bytes plus its name)
    file_name = "./video/" + dic_info["name"] + ".mp4"
    with open(file_name, 'wb') as f:
        f.write(dic_info['page_read'])
    print(dic_info["name"], "Download completed!")

tasks = []
for url in urls:
    c = get_requests(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)
    tasks.append(task)

dir_name = 'video'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print('Total time consumption:', time.time() - start)