Using Python's zipfile module to batch-generate Word documents

Keywords: Python

  • Platform: Windows 10
  • Interpreter: Python 3.7

Task requirements

An existing Excel workbook holds the target data. Each row of data must be substituted into the corresponding places in a given Word document, and the result saved as a separate file, one per row.

Task breakdown

First, check the format of the Word document. Its file extension is .doc, and the parts to be replaced are the English strings inside the red box in the body of the document.

The target data is an Excel workbook; the data under each column is substituted into the Word document wherever the matching string appears.

The Excel data is fairly regular and needs no further processing. If a column name does not match, or does not appear in, the Word document, the document has to be adjusted or the name added. That leaves only one question: how to read the .doc file and perform the replacement according to some rule.
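
Checking that every column name actually appears in the Word text can be done before any file is generated. The sketch below is only an illustration: the function names, column names and template text are all hypothetical, and the template text is assumed to be available as a plain string.

```python
def missing_placeholders(column_names, template_text):
    """Return the column names that never occur in the template text."""
    return [name for name in column_names if name not in template_text]


def ambiguous_placeholders(column_names):
    """Return names that are substrings of another name.

    With plain string replacement, such a name would also be replaced
    inside the longer name, so the result depends on the replacement order.
    """
    return [a for a in column_names
            if any(a != b and a in b for b in column_names)]


# Hypothetical column names and template text, for illustration only.
columns = ["#Company#", "#Address#", "#Contact#"]
template = "Party A: #Company#, located at #Address#."

print(missing_placeholders(columns, template))
print(ambiguous_placeholders(["Company", "CompanyAddress"]))
```

A non-empty result from either function means the Word template or the Excel column names should be adjusted before running the batch.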

Solutions

Scheme 1: read the Word document with python-docx's Document

Before this task I had never used Python to work with Word documents, so thanks to everyone in the chat group for their help. I wrote the code step by step following the article "Python+Excel+Word: produce 100 contracts in one second", but it still went wrong at run time: errors were raised, the target strings were not replaced, and so on.

  • Problem 1: the wrong module was installed. Following the article, I mistakenly assumed that pip install docx was enough. When I tried to use the Document class, I found that the module had no such class, so I searched on Baidu. The correct commands are pip install python-docx to install and import docx to import.
  • Problem 2: python-docx cannot handle .doc files. As mentioned above, the Word documents in this task have the .doc extension and must be converted to .docx before they can be processed. Saving the file as .docx by hand in Word would work, but that is not elegant in a Python program. The real reason is that I am lazy: the most I am willing to do is paste the target file path into the code. So I use the win32com module to drive Word and convert the .doc file to .docx.
  • Problem 3: the approach in "Python+Excel+Word: produce 100 contracts in one second" is to locate specific text runs and replace them. On the first attempt the replacement did not succeed, so I stepped through the code to find out why. Document splits the whole Word document into paragraphs, and each paragraph into multiple text blocks (runs). The boundaries between runs are not obvious: typing a word with different input methods (Chinese or English) can split it across runs, and so can a formatting change, for example when only part of the word is underlined; text typed in one go with uniform formatting is not split. Replacing through paragraph.text does work, but the underline formatting is discarded, so the replacement has to go through the text attribute of each run. The fix is to type each placeholder into the original Word file in one go with a single input method. The placeholder corresponds to an Excel column name and must not appear anywhere else in the Word document; Chinese works too. Recommended form: #column name#.
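
To see why a per-run replacement can miss a placeholder, here is a minimal sketch that uses only the standard library. The paragraph XML is hand-written for illustration; it is an assumption about what Word might store when only part of a placeholder is underlined, not an extract from the real template.

```python
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p, w:r and w:t.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical paragraph: the placeholder "#Company#" is split over two runs
# because only its first half carries an underline.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t>Party A: </w:t></w:r>
  <w:r><w:rPr><w:u w:val="single"/></w:rPr><w:t>#Comp</w:t></w:r>
  <w:r><w:t>any#</w:t></w:r>
</w:p>
""".strip()

paragraph = ET.fromstring(PARAGRAPH_XML)
# One entry per run, in document order.
run_texts = [t.text for t in paragraph.iter(f"{{{W}}}t")]

found_in_single_run = any("#Company#" in text for text in run_texts)
found_in_paragraph = "#Company#" in "".join(run_texts)

print(run_texts)
print(found_in_single_run, found_in_paragraph)
```

The placeholder exists in the paragraph as a whole but in no single run, which is why it has to be typed in one go with uniform formatting. The same holds for any plain string search over the raw XML.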

With the problems above solved one by one, fill in the source file paths and the output folder.
Source code:

from copy import deepcopy
from pathlib import Path
from win32com import client as wc
from docx import Document  # pip install python-docx
import pandas as pd


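# python-docx cannot process .doc files, so use win32com to save them as .docx first
def doctransform2docx(doc_path):
    doc_path = Path(doc_path)
    suffix = doc_path.suffix.lower()  # robust even if the folder names contain dots
    assert suffix in ('.doc', '.docx'), 'The file passed in is not a Word document, please re-enter!'
    if suffix == '.docx':
        return Document(str(doc_path))  # already .docx, open it directly
    docx_path = doc_path.with_suffix('.docx')
    word = wc.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(str(doc_path))  # Word needs an absolute path
        doc.SaveAs2(str(docx_path), 16)  # 16 = wdFormatDocumentDefault (.docx)
        doc.Close()
    finally:
        word.Quit()  # always close Word, even if the conversion fails
    return Document(str(docx_path))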

# Replace the placeholders in the docx. Replacing through paragraph.text drops formatting such as
# underlines, so the replacement is done inside the text of each run instead
def replace_docx(name, values, wordfile, path_name='Company'):
    wordfile_copy = deepcopy(wordfile)  # Work on a deep copy so the original template is not modified
    for col_name, value in zip(name, values):
        if col_name == 'Company':
            path_name = str(value)  # Name the output file after the value in the Company column
        for paragraph in wordfile_copy.paragraphs:
            for run in paragraph.runs:
                run.text = run.text.replace(col_name, str(value))
    # Save the replaced document under a new name; be sure to use an absolute path
    # (save_folder is the module-level variable defined in the __main__ block)
    wordfile_copy.save(f'{save_folder}/{path_name}.docx')

if __name__ == '__main__':
    # Define the file path to process
    doc_path = r"D:\solve_path\Company.doc"
    excel_path = r"D:\solve_path\information.xls"
    save_folder = Path('D:/docx_save')
    save_folder.mkdir(parents=True, exist_ok=True)  # Automatically create a folder when it does not exist
    # Get excel data
    data = pd.read_excel(excel_path)
    wordfile = doctransform2docx(doc_path)
    data_save = data.apply(lambda x: replace_docx(x.index, x.values, wordfile), axis=1)

Just when I thought I was finished, a new problem appeared: the box in the original file had disappeared from the output!
Result:

I was sure python-docx had a way to handle this, but this was my first time using it and its source code was impenetrable to me. After trying many of its methods, I found one attribute whose output looked promising, wordfile.part.blob:

The output reminded me of the XML I had seen at the start of the file when decrypting an Excel workbook. I first tried to replace the text in it directly, expecting it to work like run.text = run.text.replace(col_name, str(value)). However, an error was raised: the attribute cannot be modified.

Scheme 2: handle the Word document neatly with zipfile

Just when I thought there was no other way and was about to give up, Baidu Baike came to the rescue:

A docx document is essentially a ZIP package of XML files. Emmmm, wonderful. I had once unzipped an xlsx file to extract the pictures in it, so the same trick should work here, and indeed it does. The main file to modify is document.xml in the word folder.
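
The package structure can be inspected with zipfile alone. The sketch below builds a tiny stand-in archive in memory so that it runs anywhere; it only mimics the layout of a docx package and is not a valid Word document. With a real file, the path of the .docx would be passed to zipfile.ZipFile instead.

```python
import io
import zipfile

# Stand-in archive for illustration: same layout idea as a docx package,
# but NOT a complete Word document.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document>Hello #Company#</document>")

buffer.seek(0)
# A ZIP-based package passes this check; a legacy binary .doc file is not a ZIP archive.
is_zip = zipfile.is_zipfile(buffer)

with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()                                # every file in the package
    body = archive.read("word/document.xml").decode("utf-8")  # the main body text

print(is_zip)
print(names)
print(body)
```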

Of course, before writing any code, first check by hand that the unzipped folder can be turned back into a docx. Compressing with 7z's default parameters fails; after a lot of searching I found that compressing as the zip type works, whichever software is used. Once unzipping, replacing the characters and re-compressing by hand succeeded, I started writing the code.
Apart from pandas, which I habitually use to read Excel files, and win32com for the .doc conversion as in Scheme 1, no other packages need to be installed; everything else is built into Python 3.7.
The zipfile code for unzipping and zipping comes from the article "How to compress and decompress files in python", which is very detailed; I only changed os.path to pathlib.
However, two problems came up when compressing the files in the directory back into a docx document:

  • Problem 1: the compression code in "How to compress and decompress files in python" opens the archive with zipfile.ZIP_DEFLATED and puts every traversed file into a single directory level, so the file hierarchy in the rebuilt docx no longer matches and the docx cannot be opened. I got it to compress by hierarchy after switching to zipfile.zlib.DEFLATED, although that constant has the same value as zipfile.ZIP_DEFLATED, so what actually preserves the hierarchy is the name each file is stored under (see Problem 2).
  • Problem 2: when adding a file to a zipfile archive, write() takes both the file's path and an alias (the name stored inside the archive), and the alias must not be an absolute path. For the docx to be restored correctly, the alias must be the file's original path relative to the unzipped folder. In the code this is f.write(file path, alias).
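
The effect of the alias can be checked in isolation. The following is a minimal sketch under my own assumptions: the helper name zip_folder and the two dummy files are made up for illustration, and the archive it produces is not a valid Word document. It stores each file under its path relative to the folder and then lists the names kept in the archive.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_folder(folder, archive_path):
    """Zip every file under `folder`, storing paths relative to `folder`."""
    folder = Path(folder)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(folder.rglob("*")):
            if file.is_file():
                # arcname is the alias: the name stored inside the archive.
                archive.write(file, arcname=file.relative_to(folder).as_posix())


with tempfile.TemporaryDirectory() as tmp:
    # Dummy folder that imitates an unzipped docx (contents are placeholders).
    root = Path(tmp) / "unzipped"
    (root / "word").mkdir(parents=True)
    (root / "[Content_Types].xml").write_text("<Types/>", encoding="utf-8")
    (root / "word" / "document.xml").write_text("<document/>", encoding="utf-8")

    target = Path(tmp) / "rebuilt.docx"
    zip_folder(root, target)
    with zipfile.ZipFile(target) as archive:
        stored_names = archive.namelist()

print(stored_names)
```

Because the stored names are relative, word/document.xml stays inside its word folder when the archive is opened again.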

Source code:

from win32com import client as wc
from shutil import rmtree
import zipfile
from copy import deepcopy
from pathlib import Path
import pandas as pd


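# A .doc file does not contain the required XML files, so use win32com to save it as .docx
def doctransform2docx(doc_path):
    doc_path = Path(doc_path)
    suffix = doc_path.suffix.lower()  # robust even if the folder names contain dots
    assert suffix in ('.doc', '.docx'), 'The file passed in is not a Word document, please re-enter!'
    if suffix == '.docx':
        return doc_path  # already .docx, nothing to convert
    docx_path = doc_path.with_suffix('.docx')
    word = wc.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(str(doc_path))  # Word needs an absolute path
        doc.SaveAs2(str(docx_path), 16)  # 16 = wdFormatDocumentDefault (.docx)
        doc.Close()
    finally:
        word.Quit()  # always close Word, even if the conversion fails
    return docx_path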

# docx document decompression
def docx_unzip(docx_path):
    docx_path = Path(docx_path) if isinstance(docx_path, str) else docx_path
    upzip_path = docx_path.with_name(docx_path.stem)
    with zipfile.ZipFile(docx_path, 'r') as f:
        for file in f.namelist():
            f.extract(file, path=upzip_path)
    xml_path = upzip_path.joinpath('word/document.xml')
    with xml_path.open(encoding='utf-8') as f:
        xml_file = f.read()
    return upzip_path, xml_path, xml_file

# Compress all files in the folder into a docx document
def docx_zipped(docx_path, zipped_path):
    docx_path = Path(docx_path) if isinstance(docx_path, str) else docx_path
    # zipfile.ZIP_DEFLATED is the same constant as zipfile.zlib.DEFLATED
    with zipfile.ZipFile(zipped_path, 'w', zipfile.ZIP_DEFLATED) as f:
        for file in docx_path.rglob('*'):
            if file.is_file():
                # Store each file under its path relative to the folder so the hierarchy is kept
                f.write(file, file.relative_to(docx_path).as_posix())

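# Delete the generated unzipped folder
def remove_folder(path):
    path = Path(path) if isinstance(path, str) else path
    if path.exists():
        rmtree(path)
    else:
        # Raising a plain string is a TypeError in Python 3, so raise a real exception
        raise FileNotFoundError(f'The system cannot find the specified folder: {path}')
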
# Replace the placeholders in the docx XML and re-save document.xml into the folder to be compressed
def replace_docx(name, values, xml_file, xml_path, unzip_path, path_name='Company'):
    xml_path = Path(xml_path) if isinstance(xml_path, str) else xml_path
    xml_file_copy = deepcopy(xml_file)  # Work on a copy so the original template text stays unchanged
    for col_name, value in zip(name, values):
        if col_name == 'Company':
            path_name = str(value)  # Name the output file after the value in the Company column
        xml_file_copy = xml_file_copy.replace(col_name, str(value))
    with xml_path.open(mode='w', encoding='utf-8') as f:
        f.write(xml_file_copy)
    # After document.xml is replaced, re-compress the folder with zipfile and save it as a docx document
    # (save_folder is the module-level variable defined in the __main__ block)
    docx_zipped(unzip_path, f'{save_folder}/{path_name}.docx')


if __name__ == '__main__':
    # Define the file path to process
    doc_path = r"D:\solve_path\Company.doc"
    excel_path = r"D:\solve_path\information.xls"
    save_folder = Path('D:/docx_save')
    save_folder.mkdir(parents=True, exist_ok=True)  # Automatically create a folder when it does not exist

    # Get excel data
    data = pd.read_excel(excel_path)
    docx_path = doctransform2docx(doc_path)
    unzip_path, xml_path, xml_file = docx_unzip(docx_path)
    data_save = data.apply(lambda x: replace_docx(x.index, x.values, xml_file, xml_path, unzip_path), axis=1)
    remove_folder(unzip_path)

Open the generated files: the box has not disappeared, and the underline is still there.
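
One caveat with plain string replacement in document.xml: a value containing &, < or > would leave the XML malformed. The sketch below is an alternative I did not use in the code above, shown only as an idea. It escapes each value with xml.sax.saxutils.escape and rebuilds the package in memory instead of unzipping to disk. The function name fill_template and the stand-in archive are hypothetical, and document.xml is assumed to be UTF-8 encoded.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_template(docx_bytes, mapping):
    """Return new package bytes with each placeholder in word/document.xml replaced."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            data = source.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")  # assumption: UTF-8 encoded XML
                for placeholder, value in mapping.items():
                    # escape() turns &, < and > into entities so the XML stays well-formed
                    text = text.replace(placeholder, escape(str(value)))
                data = text.encode("utf-8")
            # Reusing the original entry name keeps the package hierarchy intact.
            target.writestr(item.filename, data)
    return output.getvalue()


# Stand-in archive for illustration only; it is not a valid Word document.
template = io.BytesIO()
with zipfile.ZipFile(template, "w") as archive:
    archive.writestr("word/document.xml", "<document>#Company#</document>")

filled = fill_template(template.getvalue(), {"#Company#": "Smith & Sons"})
with zipfile.ZipFile(io.BytesIO(filled)) as archive:
    result_xml = archive.read("word/document.xml").decode("utf-8")

print(result_xml)
```

With a real template, the bytes would come from reading the .docx file, and the returned bytes would be written to a new .docx file for each Excel row.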

Summary

My knowledge here is not deep, and it took several attempts. I stumbled my way to a scheme that replaces strings in a docx without changing the original formatting. I believe a better scheme exists, but I have not found it yet. Time keeps moving forward, and I cannot afford to fall behind on the road to common prosperity. If you have any questions, please leave a message in the comment area.

Made on 28 November 2021

Posted by Pellefant on Sun, 28 Nov 2021 00:31:55 -0800