Multi-process batch PDF-to-Word conversion in Python


At work you often need to extract text from PDF files. For a single PDF, copying and pasting doesn't take long, but what if you need to convert a large number of PDFs to Word?

Today we'll show you how to batch-convert PDFs to Word in parallel with about 60 lines of code. If you're not interested in the details, skip straight to the end, where the complete code is linked.

Breaking down the task

How many steps does it take to convert a PDF to Word? Two: first read the PDF file, then write its text to a Word file.

Yes, it's that simple. Two third-party Python packages cover both steps: pdfminer3k for reading PDFs and python-docx for writing Word files.

Read PDF

from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

resource_manager = PDFResourceManager()
return_str = StringIO()  # in-memory buffer that receives the extracted text
lap_params = LAParams()  # default layout-analysis parameters

device = TextConverter(resource_manager, return_str, laparams=lap_params)
# file is a PDF file handle opened in binary mode, e.g. open(path, 'rb')
process_pdf(resource_manager, device, file)
device.close()

# content is the PDF content converted to text
content = return_str.getvalue()

The content variable now holds the text read from the PDF file. As you can see, pdfminer3k makes this step easy. Next, we need to write that text to a Word file.

Write to Word

from docx import Document

doc = Document()
for line in content.split('\n'):
    paragraph = doc.add_paragraph()
    paragraph.add_run(remove_control_characters(line))
doc.save(file_path)

content is the text we read earlier. Because the whole PDF comes back as a single string, we split it on newlines and write each line to the Word file as its own paragraph; otherwise all the text would end up on the same line. The code also calls remove_control_characters, a helper we have to write ourselves, to strip control characters (tab, escape, and so on), because python-docx does not support writing them.

def remove_control_characters(content):
    # Map code points 0-31 to None so that translate() deletes them
    mapping = dict.fromkeys(range(32))
    return content.translate(mapping)

Control characters are the ASCII characters with codes below 32, so we use str's translate method to delete every character in that range.
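To make the behavior of the translation table concrete, here is a minimal sketch that runs the helper on made-up sample strings (the sample text is an assumption for illustration, not taken from a real PDF). It also shows why the lines are split before cleaning: newline is itself below 32, so cleaning the whole string first would merge every line.

```python
# Map every code point 0..31 to None; str.translate deletes characters
# that map to None.
CONTROL_CHARS = dict.fromkeys(range(32))


def remove_control_characters(content):
    return content.translate(CONTROL_CHARS)


# Form feed (\x0c), tab (\t) and escape (\x1b) are all below 32.
sample = "Page one\x0c\tindented\x1b[0m text"
cleaned = remove_control_characters(sample)
print(repr(cleaned))

# Split first, clean second: cleaning the whole string first would also
# delete the newlines and merge every line into one paragraph.
content = "first line\nsecond\x0c line\n"
paragraphs = [remove_control_characters(line) for line in content.split("\n")]
print(paragraphs)
print(repr(remove_control_characters(content)))
```

Note that only the control characters themselves are deleted; printable leftovers such as the "[0m" that followed the escape character stay in the text.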

It works, but it's too slow!

If we use the code above to convert 100 PDF files, it is unacceptably slow, because each PDF takes a long time to convert. What can we do? Don't worry. Next, we bring in multiple worker processes and convert several PDFs at the same time, which can effectively speed up the conversion.

import os
from concurrent.futures import ProcessPoolExecutor

# config (a dict) and pdf_to_word (a function) are defined elsewhere
tasks = []
with ProcessPoolExecutor(max_workers=int(config['max_worker'])) as executor:
    for file in os.listdir(config['pdf_folder']):
        extension_name = os.path.splitext(file)[1]
        if extension_name != '.pdf':
            continue
        file_name = os.path.splitext(file)[0]
        pdf_file = os.path.join(config['pdf_folder'], file)
        word_file = os.path.join(config['word_folder'], file_name + '.docx')
        print('Processing: ', file)
        result = executor.submit(pdf_to_word, pdf_file, word_file)
        tasks.append(result)
# Poll until every submitted task reports that it is done
while True:
    exit_flag = True
    for task in tasks:
        if not task.done():
            exit_flag = False
    if exit_flag:
        print('complete')
        exit(0)

config is a dictionary that holds the paths of the PDF folder and the Word folder, along with the maximum number of workers. The code uses the concurrent.futures package from the Python standard library to run the conversions in multiple processes. pdf_to_word is a function that wraps the PDF-reading and Word-writing logic. The final while loop checks whether all the tasks have completed.
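As a variation on the loop above, the sketch below separates building the list of conversions from running them, and waits with concurrent.futures.as_completed instead of polling. The names build_jobs, run_jobs and fake_convert are hypothetical helpers introduced here for illustration, and matching the extension case-insensitively is an assumption rather than the article's behavior. A thread pool is used only so the example runs without any PDFs; ProcessPoolExecutor offers the same submit() interface.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_jobs(file_names, pdf_folder, word_folder):
    """Return (pdf_path, docx_path) pairs for every .pdf name in file_names."""
    jobs = []
    for name in sorted(file_names):
        stem, extension = os.path.splitext(name)
        if extension.lower() != ".pdf":  # assumption: accept .PDF as well
            continue
        jobs.append((os.path.join(pdf_folder, name),
                     os.path.join(word_folder, stem + ".docx")))
    return jobs


def run_jobs(executor, worker, jobs):
    """Submit worker(pdf_path, docx_path) per job and wait for all of them.

    Returns a dict mapping each pdf_path to None on success, or to the
    exception that the worker raised.
    """
    futures = {executor.submit(worker, pdf, docx): pdf for pdf, docx in jobs}
    outcome = {}
    for future in as_completed(futures):
        outcome[futures[future]] = future.exception()
    return outcome


def fake_convert(pdf_path, docx_path):
    """Hypothetical stand-in for pdf_to_word; fails on 'broken' files."""
    if "broken" in pdf_path:
        raise ValueError("cannot parse " + pdf_path)


jobs = build_jobs(["a.pdf", "notes.txt", "broken.PDF"], "pdfs", "docs")
with ThreadPoolExecutor(max_workers=2) as executor:
    outcome = run_jobs(executor, fake_convert, jobs)
for pdf_path in sorted(outcome):
    print(pdf_path, "->", outcome[pdf_path] or "ok")
```

Collecting each future's exception has a practical benefit: a single PDF that fails to parse is reported by name instead of going unnoticed, which the done() check alone would not reveal.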


Result

At this point, we have a parallel batch converter from PDF to Word documents. Let's try it on a well-known article. The result is as follows (the converted Word document is on the left, and the PDF is on the right):

I don't want to write code, I just want to use it.

All the code from this article has been packaged into a standalone, runnable project hosted on GitHub. If you don't want to write the code yourself, you can clone or download the project and run it directly. The project is linked below (remember to star it):

simpleapples/pdf2word (github.com)

Posted by cihan on Sat, 07 Sep 2019 00:06:26 -0700
