At work we often need to extract text from PDF files. With a single PDF, copying and pasting is fine and doesn't take long. But what if you need to convert a large number of PDFs to Word?
Today we'll show you how to do it in about 60 lines of code: batch PDF-to-Word conversion that handles several files in parallel. If you're not interested in the details, skip straight to the end, where the code is.
How many steps does it take to convert a PDF to Word? Two: first read the PDF file, then write the Word file.
Step one, reading the PDF with pdfminer3k, looks like this:

    from io import StringIO

    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import process_pdf
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams

    resource_manager = PDFResourceManager()
    return_str = StringIO()
    lap_params = LAParams()
    device = TextConverter(resource_manager, return_str, laparams=lap_params)
    # file is a PDF file handle opened in binary mode with open()
    process_pdf(resource_manager, device, file)
    device.close()

    # content is the PDF content converted to text.
    content = return_str.getvalue()
The content variable now holds the text read from the PDF file. As you can see, pdfminer3k makes this easy. Next, we write that text into a Word file using python-docx:

    from docx import Document

    doc = Document()
    for line in content.split('\n'):
        paragraph = doc.add_paragraph()
        paragraph.add_run(remove_control_characters(line))
    doc.save(file_path)

Here content is the text we read earlier. Because the whole PDF is read as a single string, we use the split method to separate it into lines and write each line as its own paragraph; otherwise all the text would end up in one paragraph. This code also calls a remove_control_characters function, which you need to implement yourself. It strips control characters (newline, tab, escape, and so on), because python-docx does not support writing them.

    def remove_control_characters(content):
        mpa = dict.fromkeys(range(32))
        return content.translate(mpa)

Control characters are the characters with ASCII codes below 32, so we use str's translate method to delete every character whose code is below 32.
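To see what this mapping does in practice, here is a small sketch that runs the same translate call on a made-up string. The sample text and the clean_lines helper are illustrative assumptions, not part of the original project; the form feed (\x0c) stands in for the kind of page-break character that PDF text extraction can emit.

```python
def remove_control_characters(content):
    # Map every code point below 32 to None so that translate() deletes it.
    mpa = dict.fromkeys(range(32))
    return content.translate(mpa)


def clean_lines(content):
    # Hypothetical helper: split on newlines first, then clean each line,
    # mirroring the order used when writing one paragraph per line.
    return [remove_control_characters(line) for line in content.split('\n')]


# Made-up sample: a tab, a form feed, and an escape sequence mixed into text.
sample = 'Title\tpage 1\x0c\nSecond line\x1b[0m\n'
lines = clean_lines(sample)
print(lines)
```

Note that tabs are deleted rather than replaced, so two words separated only by a tab end up joined. If that matters for your documents, one option is to replace the tab with a space before calling the function.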
If we use the code above to convert 100 PDF files one after another, it is unacceptably slow, because each PDF takes a long time to convert. What should we do? Don't worry. Next we bring in multiple worker processes and convert several PDFs at the same time, which speeds up the conversion considerably.

    import os
    from concurrent.futures import ProcessPoolExecutor

    tasks = []
    with ProcessPoolExecutor(max_workers=int(config['max_worker'])) as executor:
        for file in os.listdir(config['pdf_folder']):
            # splitext returns a (name, extension) tuple
            extension_name = os.path.splitext(file)[1]
            if extension_name != '.pdf':
                continue
            file_name = os.path.splitext(file)[0]
            pdf_file = config['pdf_folder'] + '/' + file
            word_file = config['word_folder'] + '/' + file_name + '.docx'
            print('Processing: ', file)
            result = executor.submit(pdf_to_word, pdf_file, word_file)
            tasks.append(result)
    # Check whether every submitted task has finished
    while True:
        exit_flag = True
        for task in tasks:
            if not task.done():
                exit_flag = False
        if exit_flag:
            print('complete')
            exit(0)

config is a dictionary that holds the paths of the PDF folder and the Word folder, along with the maximum number of workers. The code uses the concurrent package from the Python standard library to run the conversions in multiple processes. The pdf_to_word function wraps the PDF-reading and Word-writing logic described earlier. The while loop at the end checks whether all the tasks have completed.
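The polling loop can also be written with concurrent.futures.as_completed, which yields each future as it finishes instead of checking done() repeatedly. The following is a minimal sketch under stated assumptions, not the project's actual code: collect_jobs and run_all are hypothetical names, fake_convert is a stand-in for pdf_to_word so the example runs without any real PDF files, and a ThreadPoolExecutor is used in the demo only to keep it self-contained (ProcessPoolExecutor offers the same submit interface).

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_jobs(pdf_folder, word_folder):
    """Return (pdf_path, docx_path) pairs for every PDF in pdf_folder."""
    jobs = []
    for name in sorted(os.listdir(pdf_folder)):
        base, ext = os.path.splitext(name)  # e.g. ('report', '.pdf')
        # Assumption: also accept '.PDF'; the article's code matches '.pdf' only.
        if ext.lower() != '.pdf':
            continue
        jobs.append((os.path.join(pdf_folder, name),
                     os.path.join(word_folder, base + '.docx')))
    return jobs


def run_all(executor, convert, jobs):
    """Submit convert(pdf, docx) for each job and wait for all of them."""
    futures = {executor.submit(convert, pdf, docx): pdf for pdf, docx in jobs}
    done, failed = [], []
    for future in as_completed(futures):
        pdf = futures[future]
        try:
            future.result()  # re-raises any exception from the worker
            done.append(pdf)
        except Exception as exc:
            failed.append((pdf, exc))
    return done, failed


def fake_convert(pdf_path, docx_path):
    # Stand-in for pdf_to_word: it only creates an empty output file.
    with open(docx_path, 'w'):
        pass


with tempfile.TemporaryDirectory() as pdf_dir, \
        tempfile.TemporaryDirectory() as word_dir:
    # Create placeholder input files; notes.txt should be skipped.
    for name in ('a.pdf', 'b.PDF', 'notes.txt'):
        open(os.path.join(pdf_dir, name), 'w').close()
    jobs = collect_jobs(pdf_dir, word_dir)
    with ThreadPoolExecutor(max_workers=2) as executor:
        done, failed = run_all(executor, fake_convert, jobs)
    outputs = sorted(os.listdir(word_dir))
    print(len(done), 'converted,', len(failed), 'failed:', outputs)
```

One benefit of calling future.result() is that an exception raised inside a worker surfaces in the main program. With done() alone, a conversion that crashed looks the same as one that succeeded.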
At this point we have a batch converter that turns PDFs into Word documents in parallel. Let's try it on a well-known article. The result is shown below (the converted Word document is on the left, the PDF on the right):
All the code in this article has been packaged into a standalone, runnable project on GitHub. If you don't want to write the code yourself, you can clone or download the project and run it directly. The project address is as follows (remember to give it a star):