Python thread pool / process pool memory management

Keywords: JSON encoding Python

concurrent.futures thread pool / process pool memory management

Cause

Some time ago I had a crawler task: a machine with 1 GB of memory ran a crawler that scraped data from a website. Originally this approach was used:

from concurrent.futures import ThreadPoolExecutor, as_completed
import json
import os

futures = list()
with ThreadPoolExecutor(max_workers=20) as exc:
    for tr in table.select("tr"):
        # Submit the task and keep the returned Future
        future = exc.submit(self.get_asn, tr.strings)
        futures.append(future)

# Use as_completed to process the finished futures asynchronously
for future in as_completed(futures):
    result = future.result()
    # Build the path of <asn>.json
    file_path = result["asn"] + ".json"
    asn_file = os.path.join(self.base_data_path, file_path)

    with open(asn_file, "w", encoding="utf8") as f:
        f.write(json.dumps(result))

The code uses the submit method of concurrent.futures.ThreadPoolExecutor. Because 20 threads crawl the site at the same time, connections are fast and tasks are processed quickly. Each result is written to a file as soon as its task finishes, yet the program was killed after about 2 minutes: monitoring showed that its memory usage had reached 80%, and it was killed by the system.
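
For reference, one way to watch the crawler's memory from inside the process is a sketch like the one below. It assumes the third-party psutil package, which the original post does not mention:

# Rough sketch: periodically print this process's memory usage while the
# crawler runs (assumes `pip install psutil`; not part of the original code).
import os
import threading
import time

import psutil


def report_memory(interval=5):
    proc = psutil.Process(os.getpid())
    while True:
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        print("memory: {:.1f} MB ({:.1f}%)".format(rss_mb, proc.memory_percent()))
        time.sleep(interval)


# Run the reporter in a daemon thread alongside the crawler
threading.Thread(target=report_memory, daemon=True).start()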

  • Why does memory blow up? After watching the memory for a while, it turns out that memory is not released immediately after a task has been processed; it is only released after a long delay (Python GC). On top of that, the futures list keeps a reference to every Future, and each Future holds its result, so nothing can be reclaimed until the whole crawl is finished, as the snippet below illustrates.
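
To make that point concrete, here is a small illustrative snippet (not from the original post) showing that every result stays alive as long as its Future is still referenced in the futures list:

from concurrent.futures import ThreadPoolExecutor, as_completed


def make_blob(i):
    # Each task returns roughly 10 MB of data
    return bytearray(10 * 1024 * 1024)


futures = list()
with ThreadPoolExecutor(max_workers=4) as exc:
    for i in range(20):
        futures.append(exc.submit(make_blob, i))

# All 20 results (~200 MB) are now held by the Future objects in `futures`.
for future in as_completed(futures):
    blob = future.result()
    del blob  # frees nothing: the Future itself still references the result

# Only once the Futures are dropped can the memory actually be reclaimed.
futures.clear()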

Improved approach:

Reference: https://stackoverflow.com/questions/34770169/using-concurrent-futures-without-running-out-of-ram

# Maximum number of jobs allowed in the queue at once
MAX_JOBS_IN_QUEUE = 1000

tr_list = table.select("tr")
tr_left = len(tr_list)   # <----
tr_iter = iter(tr_list)  # <------
jobs = dict()

with ThreadPoolExecutor(max_workers=20) as exc:
    while tr_left:
        print("#" * 100, "TASK: {} <===>  JOB: {}".format(tr_left, len(jobs)))
        for tr in tr_iter:
            # Submit the task and keep the returned Future
            job = exc.submit(self.get_asn, tr.strings)
            jobs[job] = tr
            if len(jobs) > MAX_JOBS_IN_QUEUE:
                break  # limit the job submissions for now

        # Process completed futures asynchronously with as_completed
        for job in as_completed(jobs):
            tr_left -= 1  # one down - many to go...   <---
            result = job.result()
            # Remove the Future from the dictionary because we don't need to keep it
            del jobs[job]

            # Build the path of <asn>.json
            file_path = result["asn"] + ".json"
            asn_file = os.path.join(self.base_data_path, file_path)

            with open(asn_file, "w", encoding="utf8") as f:
                f.write(json.dumps(result))
            break  # give a chance to submit more jobs

With the improved scheme, memory usage only rises a little occasionally when the crawler switches to a new HTML page, peaking at about 65% and averaging around 35%.

  • The same approach works with ProcessPoolExecutor; a sketch follows below.
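
A minimal, self-contained sketch of the same bounded-submission pattern with ProcessPoolExecutor might look like the following. The fetch_page function, the URL list, and the worker/queue sizes are made-up placeholders; note that with processes the submitted callable must be a picklable top-level function rather than a bound method like self.get_asn.

from concurrent.futures import ProcessPoolExecutor, as_completed

MAX_JOBS_IN_QUEUE = 100


def fetch_page(url):
    # Placeholder for the real crawling / parsing work done in a child process
    return {"url": url, "length": len(url)}


def crawl(urls):
    urls_left = len(urls)
    urls_iter = iter(urls)
    jobs = dict()

    with ProcessPoolExecutor(max_workers=4) as exc:
        while urls_left:
            # Top up the queue, but never keep more than MAX_JOBS_IN_QUEUE pending
            for url in urls_iter:
                jobs[exc.submit(fetch_page, url)] = url
                if len(jobs) > MAX_JOBS_IN_QUEUE:
                    break

            # Consume one finished job, drop its Future, then go submit more
            for job in as_completed(jobs):
                urls_left -= 1
                result = job.result()
                del jobs[job]
                print(result)
                break


if __name__ == "__main__":
    crawl(["https://example.com/%d" % i for i in range(500)])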

Posted by knowj on Fri, 03 Apr 2020 00:03:35 -0700