Multi-threaded crawling of NCBI database literature

Keywords: Big Data, multithreading, crawler


Preface

Recently, I followed the Heima ("dark horse") programmer course on the requests crawler and successfully completed batch processing of NCBI literature. The problem was that crawling was too slow, so after taking the multi-threaded crawler course I adapted my code and recorded the result here.

Tip: This article mainly introduces the queue.Queue and threading.Thread() modules in detail.

1, What is a queue?

In my understanding, the queue module provides a one-way channel that is first in, first out (FIFO). The methods used here are:
Queue.empty() returns a boolean indicating whether the queue is empty;
Queue.full() returns a boolean indicating whether the queue is full;
Queue.put() puts data into the queue and increases the count of unfinished tasks by 1;
Queue.put_nowait() puts data into the queue without waiting; if the queue is full, an error is raised;
Queue.get() takes data out of the queue, but the count of unfinished tasks is not decreased automatically;
Queue.get_nowait() takes data out of the queue without waiting; if the queue is empty, an error is raised;
Queue.task_done() tells the queue that one fetched item has been processed and does not need to be counted again;
Queue.join() blocks until every item put into the queue has been marked as done;
Queue.task_done() and Queue.join() need to be used together: each call to task_done() removes one finished item from the queue's unfinished count, while join() blocks until that count reaches zero. So if task_done() is never called, join() never learns that the work has finished and blocks forever.
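
To make the relationship concrete, here is a minimal sketch of put/get/task_done/join working together (the worker function and the numbers put into the queue are made up for illustration):

from queue import Queue
import threading

q = Queue()

def worker():
    while True:
        item = q.get()          # blocks until an item is available
        print(f"processing {item}")
        q.task_done()           # tell the queue this item is finished

# daemon thread: it dies automatically when the main thread exits
threading.Thread(target=worker, daemon=True).start()

for i in range(5):
    q.put(i)                    # unfinished-task count goes up by 1 each time

q.join()                        # blocks until task_done() has been called 5 times
print("all items processed")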

2, What is threading?

This section introduces the difference between a thread and a process.
Multi-process programs are more robust than multi-threaded programs, but they consume more resources and switching between processes is less efficient. However, concurrent operations that must run simultaneously and share variables can only use threads, not processes.

  1. In short, a program has at least one process, and a process has at least one thread.

  2. Threads are a smaller unit of scheduling than processes, which is why multi-threaded programs can achieve high concurrency.

  3. In addition, each process has its own independent memory space during execution, while the threads of one process share memory, which greatly improves the running efficiency of the program (see the short sketch after the attribution below).

  4. Threads differ from processes during execution. Each independent thread has its own entry point, sequential execution order and exit point. However, threads cannot execute on their own; they must live inside an application, and the application provides control over the execution of its threads.

  5. From a logical point of view, the significance of multithreading is that an application can have multiple parts executing at the same time. However, the operating system does not treat multiple threads as multiple independent applications when it performs scheduling, management and resource allocation. This is the important difference between processes and threads.
    --------
    Copyright notice: The section above is quoted from an original article by the CSDN blogger "Black_God1" under the CC 4.0 BY-SA license. Please include the original source link and this notice when reprinting.

Original link: https://blog.csdn.net/Black_God1/article/details/81876754
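
As a small illustration of point 3 above, the following sketch (with an invented counter variable and add_many function) shows several threads updating the same variable in shared memory; a Lock is used because shared state still needs synchronization:

from threading import Thread, Lock

counter = 0                 # shared by all threads in this process
lock = Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:          # protect the shared variable from race conditions
            counter += 1

threads = [Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)              # 40000: every thread updated the same variable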

class threading.Thread(group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None)

target specifies the callable that the thread will execute.
daemon (or the older setDaemon() method) marks the child thread as a daemon thread; a daemon thread ends as soon as the main thread ends. Setting it to True means the thread is a daemon thread.
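
A minimal sketch of creating a thread with target and daemon (the download function here is invented for illustration):

import threading
import time

def download(page):
    print(f"downloading page {page}")
    time.sleep(1)

# daemon=True: the thread is killed when the main thread exits
t = threading.Thread(target=download, args=(1,), daemon=True)
t.start()
t.join()    # wait for this thread to finish before continuing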

3, Use steps

1. Import the libraries

The requests, re, lxml, numpy, pandas, queue and threading libraries are imported:

# Author   : cxnie66
# Date     : 2021/9/9
# Position : Shanghai
import requests
import re
from lxml import etree
import numpy as np
import pandas as pd
from queue import Queue
import threading

2. Crawler code

The code is as follows:

# Author   : cxnie66
# Date     : 2021/9/9
# Position : Shanghai
class NCBISpider:
    def __init__(self):
        self.headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"}
        self.start_url = "https://pubmed.ncbi.nlm.nih.gov/?term=AHLs&page=1"
        self.url="https://pubmed.ncbi.nlm.nih.gov/?term=AHLs&page={}"
        self.url_list_queue = Queue()
        self.parase_queue = Queue()
        self.content_queue = Queue()

    def get_url_total_num(self):  # Get the total number of pages used to build the url address
        global total_nums  # The global declaration must come before the assignment, otherwise a local variable is created; noting this here
        # Crawl the first content to get the html page
        start_response = requests.get(self.start_url, headers=self.headers).content.decode()
        # Use regular expressions to get the total number of web pages
        total_num = re.findall('totalResults: parseInt\("(.*?)", 10\)', start_response, re.S)[0]
        total_nums = int(total_num)+1  # Conversion format, must be int
        # return total_num  # not needed once total_nums is set as a global variable above

    def url_lists(self): # Build the url list and save it in the queue
        print(total_nums)
        for i in range(total_nums):
            url_list = self.url.format(i)
            self.url_list_queue.put(url_list)  # Save in queue

    def parase_url(self):  # Crawling content
        while True:  # must use while True so the thread keeps fetching from the queue
            # print(url)
            url = self.url_list_queue.get()
            print(url)
            response = requests.get(url, headers=self.headers, timeout=8)
            self.parase_queue.put(response.content.decode())
            self.url_list_queue.task_done()

    def save_csv_title(self):  # Save data to csv
        columns = ["PMID", "title", "paper_citation", "author", "Abstract", "paper_url"]
        title_csv = pd.DataFrame(columns=columns)
        title_csv.to_csv('AHLs_paper.csv', mode="a", index=False, header=1, encoding="utf-8")
        # pass

    def get_content(self):  # Get related content
        while True:
            html = self.parase_queue.get()
            nodes = etree.HTML(html)
            articles = nodes.xpath('//div[@class="search-results-chunk results-chunk"]/article')
            # print(articles)
            ret = []
            for art in articles:
                item = {}
                # Strip newlines and surrounding whitespace from the title fragments, then join them into one string
                item["title"] = art.xpath(
                    './div[@class="docsum-wrap"]/div[@class="docsum-content"]/a[@class="docsum-title"]//text()')
                item["title"] = [i.replace("\n", "").strip() for i in item["title"]]
                item["title"] = [''.join(item["title"])]

                item["PMID"] = art.xpath('./div[@class="docsum-wrap"]//span[@class="citation-part"]/span/text()')

                # Journal related information
                item["paper_citation"] = art.xpath(
                    './div[@class="docsum-wrap"]//span[@class="docsum-journal-citation full-journal-citation"]/text()')

                # author
                item["author"] = art.xpath('./div[@class="docsum-wrap"]//span[@class="docsum-authors full-authors"]/text()')
                # abstract
                item["Abstract"] = art.xpath('./div[@class="docsum-wrap"]//div[@class="full-view-snippet"]//text()')
                item["Abstract"] = [i.replace("\n", "").strip() for i in item["Abstract"]]
                item["Abstract"] = [''.join(item["Abstract"])]
                # Article address (stored as paper_url so the key matches the csv header)
                item["paper_url"] = art.xpath('./div[@class="docsum-wrap"]//div[@class="share"]/button/@data-permalink-url')
                ret.append(item)
            self.content_queue.put(ret)
            self.parase_queue.task_done()

    def save_content(self):  # Save the extracted content to the csv file
        while True:
            ret = self.content_queue.get()
            pf = pd.DataFrame(ret, columns=["PMID", "title", "paper_citation", "author", "Abstract", "paper_url"])  # keep the column order consistent with the csv header
            pf.to_csv('AHLs_paper.csv', mode="a", index=False, header=0, encoding="utf-8")
            # print(ret)
            self.content_queue.task_done()

    def run(self):  # Implement main logic

        # 1, Preparation: get the total number of pages to prepare for building the url address later
        self.get_url_total_num()
        # 2, Preparation: write the header row of the csv file
        self.save_csv_title()

        # 3, Carry out main process continuation
        threading_list = []  # Build a list of threads so they can all be started in one loop; otherwise each one would have to be started separately

        # 1. First, one thread to build the url list
        t_url = threading.Thread(target=self.url_lists)
        threading_list.append(t_url)

        # 2. Open 5 threads to get the crawling content and store it in the queue
        for i in range(5):
            t_parase = threading.Thread(target=self.parase_url)
            threading_list.append(t_parase)

        # 3. Open 5 threads for data filtering and store them in the queue
        for i in range(5):
            t_content = threading.Thread(target=self.get_content)
            threading_list.append(t_content)

        # 4. Store in csv
        t_save = threading.Thread(target=self.save_content)
        threading_list.append(t_save)

        # 5. Traverse the threads and start them one by one. daemon=True marks each child thread as a daemon thread; without it, the process would not stop when the main thread finishes
        for t in threading_list:
            t.daemon = True  # the child thread is a daemon thread and ends when the main thread ends
            t.start()

        # 6. Queue.join() makes the main thread block until every task in the queues has been marked done; otherwise the main thread would end immediately and the daemon child threads would only execute part of the work
        for q in [self.url_list_queue, self.parase_queue, self.content_queue]:
            q.join()  # Let the main thread wait for blocking and wait for the tasks in the queue to complete

        print("End of main thread!!!!")

if __name__ =="__main__":
    ncbi_spider = NCBISpider()
    ncbi_spider.run()

The URL used here is the request URL captured from the browser's network panel.

Summary

The run() method is explained in detail here:
  • A list of threads is built and each one is started with start(); building the list first means all the threads can be started in turn in a single loop.
  • daemon = True (the older setDaemon()) marks these child threads as daemon threads, meaning that when the main thread ends, the child threads stop immediately. Without it, the child threads would keep running after the main thread ends and the process would not exit.
  • The join() calls on the queues block the main thread until every queued task has been processed. Without them, the main thread would end immediately, the daemon child threads would be killed, and the program would only execute part of the work. A condensed skeleton of this pattern follows.
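
Putting these three points together, a condensed skeleton of the pattern used in run() looks roughly like this (the worker function and the task values are placeholders):

from queue import Queue
import threading

q = Queue()

def worker():
    while True:
        task = q.get()
        # ... process task ...
        q.task_done()

for i in range(10):
    q.put(i)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.daemon = True   # child threads die when the main thread ends
    t.start()

q.join()              # main thread waits until every task has been marked done
print("End of main thread!!!!")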
