Preface
Recently, I followed the Dark Horse Programmer requests-crawler course and successfully batch-processed NCBI papers. The problem was that crawling was too slow, so after listening to the multi-threaded crawler lessons I adapted my code; this post records the result.
Tip: this article mainly introduces the **queue.Queue** queue and the **threading.Thread()** module in detail.
1, What is a queue?
In my understanding, a Queue from the queue module is a one-way channel: first in, first out. The methods used here are recorded below:
- Queue.empty() returns a boolean, checking whether the queue is empty;
- Queue.full() returns a boolean, checking whether the queue is full;
- Queue.put() puts an item into the queue, and the count of unfinished tasks increases by 1;
- Queue.put_nowait() puts an item into the queue without waiting; if the queue is full, a queue.Full error is raised;
- Queue.get() takes an item out of the queue, but the count of unfinished tasks is not automatically reduced by 1 (that is what task_done() is for);
- Queue.get_nowait() takes an item out of the queue without waiting; if the queue is empty, a queue.Empty error is raised;
- Queue.task_done() marks one task as finished, telling the queue that the item has been handled and does not need to be counted again;
- Queue.join() blocks the calling thread until all items in the queue have been marked as done.

Queue.task_done() and Queue.join() need to be used together: each call to task_done() removes one finished item from the queue's unfinished count, while join() blocks the caller until that count reaches zero. So if task_done() is never called, join() never learns that the work is finished and blocks forever.
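To make the relationship between task_done() and join() concrete, here is a minimal sketch (not part of the crawler below; the queue and worker names are made up for illustration):

```python
from queue import Queue
import threading

q = Queue()

def worker():
    while True:
        item = q.get()       # take one item out of the queue
        print("processing", item)
        q.task_done()        # tell the queue this item is finished

# daemon=True: the worker is killed automatically when the main thread exits
threading.Thread(target=worker, daemon=True).start()

for i in range(5):
    q.put(i)                 # put 5 tasks into the queue

q.join()                     # blocks until task_done() has been called 5 times
print("all tasks done")
```

If the q.task_done() line is removed, q.join() blocks forever, which is exactly the behaviour described above.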
2, What is threading?
The following passage, quoted from another CSDN article (attribution below), introduces the difference between threads and processes:
- Multi-process programs are more robust than multi-threaded ones, but they consume more resources, and switching between processes is less efficient. For concurrent operations that must run simultaneously and share variables, however, only threads can be used, not processes.
- In short, a program has at least one process, and a process has at least one thread.
- Threads are divided at a smaller scale than processes, which gives multi-threaded programs high concurrency.
- In addition, a process has its own independent memory space while it runs, whereas multiple threads share the same memory, which greatly improves running efficiency.
- Threads and processes also differ during execution: each independent thread has its own program entry point, sequential execution order, and exit point. A thread cannot execute on its own, though; it must live inside an application, and the application controls the execution of its threads.
- From a logical point of view, the significance of multithreading is that an application can have several parts executing at the same time. The operating system, however, does not treat multiple threads as multiple independent applications when it schedules, manages, and allocates resources; this is the important difference between processes and threads.
---
Copyright notice: the quoted passage above is from an original article by the CSDN blogger "Black_God1", under the CC 4.0 BY-SA license; please attach the original source link and this notice when reprinting.
Original link: https://blog.csdn.net/Black_God1/article/details/81876754
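As a toy illustration of the "threads share memory" point in the quoted list (this sketch is mine, not from the quoted article; the counter name is invented), several threads can update the same module-level variable, protected by a lock:

```python
import threading

counter = 0                  # one variable, shared by every thread in the process
lock = threading.Lock()

def add_many():
    global counter
    for _ in range(100_000):
        with lock:           # the lock keeps the increment atomic
            counter += 1

threads = [threading.Thread(target=add_many) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)               # 400000: all four threads updated the same memory
```

Separate processes would each get their own copy of counter and would need an explicit mechanism (a pipe, a multiprocessing.Value, etc.) to share it.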
```
class threading.Thread(group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None)
```
target specifies the callable the thread will run.
daemon (or the setDaemon() method) marks the child thread as a daemon thread: a daemon thread ends as soon as the main thread ends. Setting it to True means the thread is a daemon; in current Python versions, assigning t.daemon = True is the preferred spelling.
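A minimal sketch of how those constructor arguments are used (the download function and its page argument are invented for illustration; they are not part of the crawler):

```python
import threading

def download(page):
    print("would fetch page", page)

# target is the callable the thread runs, args are its positional arguments,
# daemon=True means the thread is killed when the main thread exits
t = threading.Thread(target=download, args=(1,), name="download-1", daemon=True)
t.start()
t.join()   # wait for this single thread to finish
```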
3, Use steps
1. Import the libraries
The pandas and numpy libraries are needed, together with requests, re, lxml, queue and threading:
```python
# Author : cxnie66
# Date : 2021/9/9
# Position : Shanghai
import requests
import re
from lxml import etree
import numpy as np
import pandas as pd
from queue import Queue
import threading
```
2. The crawler code
The code is as follows:
```python
# Author : cxnie66
# Date : 2021/9/9
# Position : Shanghai
class NCBISpider:
    def __init__(self):
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"}
        self.start_url = "https://pubmed.ncbi.nlm.nih.gov/?term=AHLs&page=1"
        self.url = "https://pubmed.ncbi.nlm.nih.gov/?term=AHLs&page={}"
        self.url_list_queue = Queue()   # queue of page urls to fetch
        self.parase_queue = Queue()     # queue of downloaded html pages
        self.content_queue = Queue()    # queue of extracted records

    def get_url_total_num(self):
        """Get the total number of pages, used to build the url addresses."""
        global total_nums  # must be declared before use, otherwise there will be problems
        # Crawl the first page to get its html
        start_response = requests.get(self.start_url, headers=self.headers).content.decode()
        # Use a regular expression to get the total number of result pages
        total_num = re.findall(r'totalResults: parseInt\("(.*?)", 10\)', start_response, re.S)[0]
        total_nums = int(total_num) + 1  # must be converted to int
        # return total_num  # returning it caused problems, hence the global variable

    def url_lists(self):
        """Build the url list and save it into the queue."""
        print(total_nums)
        for i in range(total_nums):
            url_list = self.url.format(i)
            self.url_list_queue.put(url_list)  # save in the queue

    def parase_url(self):
        """Download the pages."""
        while True:  # while True keeps the worker running
            url = self.url_list_queue.get()
            print(url)
            response = requests.get(url, headers=self.headers, timeout=8)
            self.parase_queue.put(response.content.decode())
            self.url_list_queue.task_done()

    def save_csv_title(self):
        """Write the csv header row."""
        columns = ["PMID", "title", "paper_citation", "author", "Abstract", "paper_url"]
        title_csv = pd.DataFrame(columns=columns)
        title_csv.to_csv('AHLs_paper.csv', mode="a", index=False, header=1, encoding="utf-8")

    def get_content(self):
        """Extract the relevant content from each downloaded page."""
        while True:
            html = self.parase_queue.get()
            nodes = etree.HTML(html)
            articel = nodes.xpath('//div[@class="search-results-chunk results-chunk"]/article')
            ret = []
            for art in articel:
                item = {}
                # strip newlines and blanks from the title, then join the pieces
                item["title"] = art.xpath(
                    './div[@class="docsum-wrap"]/div[@class="docsum-content"]/a[@class="docsum-title"]//text()')
                item["title"] = [i.replace("\n", "").strip() for i in item["title"]]
                item["title"] = [''.join(item["title"])]
                item["PMID"] = art.xpath('./div[@class="docsum-wrap"]//span[@class="citation-part"]/span/text()')
                # journal related information
                item["paper_citation"] = art.xpath(
                    './div[@class="docsum-wrap"]//span[@class="docsum-journal-citation full-journal-citation"]/text()')
                # authors
                item["author"] = art.xpath(
                    './div[@class="docsum-wrap"]//span[@class="docsum-authors full-authors"]/text()')
                # abstract
                item["Abstract"] = art.xpath('./div[@class="docsum-wrap"]//div[@class="full-view-snippet"]//text()')
                item["Abstract"] = [i.replace("\n", "").strip() for i in item["Abstract"]]
                item["Abstract"] = [''.join(item["Abstract"])]
                # article address (key name matches the csv header column)
                item["paper_url"] = art.xpath(
                    './div[@class="docsum-wrap"]//div[@class="share"]/button/@data-permalink-url')
                ret.append(item)
            self.content_queue.put(ret)
            self.parase_queue.task_done()

    def save_content(self):
        """Append the extracted records to the csv file."""
        while True:
            ret = self.content_queue.get()
            # keep the same column order as the header written by save_csv_title
            pf = pd.DataFrame(ret, columns=["PMID", "title", "paper_citation", "author", "Abstract", "paper_url"])
            pf.to_csv('AHLs_paper.csv', mode="a", index=False, header=0, encoding="utf-8")
            self.content_queue.task_done()

    def run(self):
        """Main logic."""
        # 1. Preparation: get the total number of pages, to build the url addresses later
        self.get_url_total_num()
        # 2. Preparation: write the header row of the csv file
        self.save_csv_title()
        # 3. Main flow: collect the threads in a list and start them in turn,
        #    otherwise each one has to be started by hand, which is troublesome
        threading_list = []
        # 3.1 One thread to build the url list
        t_url = threading.Thread(target=self.url_lists)
        threading_list.append(t_url)
        # 3.2 Five threads to download pages and store them in the queue
        for i in range(5):
            t_parase = threading.Thread(target=self.parase_url)
            threading_list.append(t_parase)
        # 3.3 Five threads to extract the data and store it in the queue
        for i in range(5):
            t_content = threading.Thread(target=self.get_content)
            threading_list.append(t_content)
        # 3.4 One thread to save the results to csv
        t_save = threading.Thread(target=self.save_content)
        threading_list.append(t_save)
        # 3.5 Mark every child thread as a daemon and start it; a daemon thread ends
        #     when the main thread ends, otherwise the process would never stop
        for t in threading_list:
            t.setDaemon(True)
            t.start()
        # 3.6 queue.join() makes the main thread wait until the queued tasks are finished;
        #     otherwise the main thread would exit immediately and the child threads
        #     could only run part of the work
        for q in [self.url_list_queue, self.parase_queue, self.content_queue]:
            q.join()
        print("End of main thread!!!!")


if __name__ == "__main__":
    ncbi_spider = NCBISpider()
    ncbi_spider.run()
```
The url used here is the request address actually sent over the network (the one visible in the browser's network panel).
Summary
The run() method is the part worth explaining in detail here:
- A threading list is built because start() has to be called on every thread; collecting them in a list lets them be started in turn instead of one by one.
- setDaemon() marks these child threads as daemon threads, which means that when the main thread ends, the child threads stop executing immediately. If it is not set, the child threads keep running after the main thread ends and the process never stops.
- The **join()** method (called on the queues here) blocks the main thread until the queued tasks have all been processed. Without it, the main thread would end immediately, the daemon child threads would be killed with it, and the program could not run to completion.
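Putting those points together, the shape of run() boils down to the following pattern (a stripped-down sketch with made-up names, assuming the same daemon-thread plus Queue.join() approach as the crawler above):

```python
from queue import Queue
import threading

task_queue = Queue()
for i in range(10):
    task_queue.put(i)            # fill the queue with work up front

def consumer():
    while True:                  # infinite loop, so the thread never exits on its own
        item = task_queue.get()
        print("handled", item)
        task_queue.task_done()

for _ in range(3):
    t = threading.Thread(target=consumer)
    t.daemon = True              # daemon: the thread dies when the main thread exits
    t.start()

task_queue.join()                # blocks until every put() has a matching task_done()
print("End of main thread!!!!")
```

The daemon flag is what lets the while True consumers be discarded once join() returns and the main thread finishes.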