Turn your video into text (3: export to Word)

Keywords: Java xml encoding

Hello, this is the last article in the series. We will export the text records to a well-organized Word document for easy reading and sharing. The source code is available in the thomas open source project.

Overall structure

This article covers the third step of the overall conversion process, as shown in the following figure:

Introduction to docx document format

First, a brief introduction to the docx document format. A docx file is actually a zip archive: after manually changing the file extension to .zip, you can extract it. The main content is usually in the word/document.xml file inside the archive.
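Since a docx file is a zip archive, renaming is not strictly necessary: the same entry can be read in code. The following is a minimal sketch using only the JDK (Java 9+ for readAllBytes). The class name DocxZipSketch and the method readEntry are made up for this illustration and are not part of the thomas project or docx4j.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only
class DocxZipSketch {

    /** Returns the UTF-8 text of the named entry in a zip/docx stream, or null if it is absent. */
    static String readEntry(InputStream docx, String entryName) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    // readAllBytes stops at the end of the current zip entry
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to any .docx file, supplied by the caller
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(readEntry(in, "word/document.xml"));
        }
    }
}
```

Run it with the path of a docx file as the only argument to print the raw XML of the document body.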

For example, the following figure shows the simplest possible Word document, with only "Hello" in the body:

After changing the extension of the document to .zip and extracting it, you will find word/document.xml. Its main contents are as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
    mc:Ignorable="w14 w15 w16se w16cid w16 w16cex wp14">
    <w:body>
        <w:p w14:paraId="6D5AFF05" w14:textId="678C6FAC" w:rsidR="000933A6" w:rsidRDefault="008D746B">
            <w:r>
                <w:rPr>
                    <w:rFonts w:hint="eastAsia"/>
                </w:rPr>
                <w:t>Hello</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:document>

From the above file, we can roughly see the basic structure of a Word document:

  • <w:p> is a paragraph
  • <w:r> is a run, a stretch of text within the paragraph that shares the same formatting
  • <w:rPr> holds the run's style properties
  • <w:t> is the text content
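To make this structure concrete, the sketch below walks document.xml with the JDK's DOM parser and joins the text of each paragraph's runs. It is only an illustration of the element hierarchy, not how docx4j works internally. The class name DocumentXmlTextSketch and the method paragraphTexts are made up for this example, and it assumes paragraphs are not nested (as they can be inside tables or text boxes).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only
class DocumentXmlTextSketch {
    // Namespace bound to the "w" prefix in document.xml
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one string per w:p element, built by joining the w:t text of its runs. */
    static List<String> paragraphTexts(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed so elements can be looked up by namespace
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));

        List<String> result = new ArrayList<>();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                sb.append(texts.item(j).getTextContent());
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:rPr/><w:t>Hello</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        System.out.println(paragraphTexts(xml)); // prints [Hello]
    }
}
```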

The docx4j library mirrors this XML structure: it maps the XML elements above to corresponding Java objects and methods, which are then used to generate and edit documents.

docx4j document operation

Next, we implement the Word document operations with the docx4j library.

First, add the docx4j dependency:

<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
    <version>8.1.6</version>
</dependency>

For each video file, we write the recorded dialogue into a table of the following form:

The processing logic of the corresponding table is:

// Create the table
Tbl tbl = Context.getWmlObjectFactory().createTbl();
// Set the basic style of the table, including the borders
String strTblPr = "<w:tblPr "
        + Namespaces.W_NAMESPACE_DECLARATION
        + ">"
        + "<w:tblStyle w:val=\"TableGrid\"/>"
        + "<w:tblW w:w=\"0\" w:type=\"auto\"/>"
        + "<w:tblLook w:val=\"04A0\"/>"
        + "</w:tblPr>";
try {
    TblPr tblPr = (TblPr) XmlUtils.unmarshalString(strTblPr);
    tbl.setTblPr(tblPr);
} catch (JAXBException e) {
    log.error("Failed to build TblPr from XML", e);
}

// Set header row
Tr hearTr = Context.getWmlObjectFactory().createTr();
tbl.getContent().add(hearTr);
geneTblHearderCell(hearTr, "D9D9D9", 2629, docPart.createParagraphOfText("time"));
geneTblHearderCell(hearTr, "D9D9D9", 5667, docPart.createParagraphOfText("content"));

// Add one content row per result
taskResultRepo.findByTaskIdEqualsOrderByBeginTimeAsc(taskId).forEach(result -> {
    Tr tr = Context.getWmlObjectFactory().createTr();
    tbl.getContent().add(tr);

    //Create first cell
    Tc tc1 = Context.getWmlObjectFactory().createTc();
    tc1.getContent().add(docPart.createParagraphOfText(formatSecond(result.getBeginTime())));

    //Create second cell
    Tc tc2 = Context.getWmlObjectFactory().createTc();
    tc2.getContent().add(docPart.createParagraphOfText(result.getWords()));

    //Add cells to the table
    tr.getContent().addAll(Arrays.asList(tc1, tc2));
});
// Add the table to the document
docPart.getContent().add(tbl);
// Add a page break
docPart.getContent().add(createNextPage());
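The formatSecond helper used for the "time" column is not shown in this article. As a rough sketch only, and assuming getBeginTime() returns a non-negative offset in whole seconds (an assumption, since the real type and unit are not shown here), it could look like this. The class name FormatSecondSketch is made up for the example.

```java
// Hypothetical stand-in for the project's formatSecond helper
class FormatSecondSketch {

    /** Formats a non-negative number of seconds as HH:mm:ss. */
    static String formatSecond(long totalSeconds) {
        long hours = totalSeconds / 3600;
        long minutes = (totalSeconds % 3600) / 60;
        long seconds = totalSeconds % 60;
        return String.format("%02d:%02d:%02d", hours, minutes, seconds);
    }

    public static void main(String[] args) {
        System.out.println(formatSecond(3725)); // prints 01:02:05
    }
}
```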

As a special reminder: avoid using XmlUtils.unmarshalString to generate objects wherever possible. Apart from the TblPr above, which follows the official example, all other structures in this project are built from Java objects. The reason is that parsing XML strings directly easily leads to namespace errors.
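To illustrate the kind of namespace error meant here, the sketch below uses only the JDK's XML parser, not docx4j itself, so it shows the general problem rather than the exact behavior of XmlUtils.unmarshalString. A fragment that uses the w: prefix without declaring it is rejected by a namespace-aware parser, which is why the TblPr string above includes the namespace declaration. The class name NamespaceBindingSketch and the method parses are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only
class NamespaceBindingSketch {
    static final String W_DECL =
            "xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"";

    /** Returns true if the fragment parses with a namespace-aware parser, false otherwise. */
    static boolean parses(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String body = "<w:tblStyle w:val=\"TableGrid\"/>";
        // Prefix declared on the root element: parses
        System.out.println(parses("<w:tblPr " + W_DECL + ">" + body + "</w:tblPr>"));
        // Prefix never declared: rejected
        System.out.println(parses("<w:tblPr>" + body + "</w:tblPr>"));
    }
}
```

Building the structure from Java objects avoids this class of mistake, because the library takes care of the namespaces when it writes the XML.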

docx4j also supports inserting pictures into documents, for example:

// Write a picture into the Word document
Inline inline = null;
try {
    BinaryPartAbstractImage imagePart = BinaryPartAbstractImage.createImagePart(wordPackage,
            Files.readAllBytes(Paths.get("doc", "thomas-gitee.png")));
    inline = imagePart.createImageInline("Open source project address", "QR code picture", 1, 2, false);
} catch (Exception e) {
    log.error("Exception creating picture object", e);
}

ObjectFactory factory = Context.getWmlObjectFactory();
P p = factory.createP();
R r = factory.createR();
p.getContent().add(r);
Drawing drawing = factory.createDrawing();
r.getContent().add(drawing);
drawing.getAnchorOrInline().add(inline);

Next, add the document title and the chapter heading, styled as Title and Heading1 respectively:

//Set document title
mainDocumentPart.addStyledParagraphOfText("Title", THOMAS_DOCX_NAME);
//Take the first line as the chapter name
mainDocumentPart.addStyledParagraphOfText("Heading1", taskInfo.getTaskName());

Generating a table of contents is also simple:

// Generate the table of contents; do this last, after the rest of the content has been added
Toc.setTocHeadingText("catalog");
TocGenerator tocGenerator = new TocGenerator(wordPackage);
tocGenerator.generateToc(5, " TOC \\o \"1-3\" \\h \\z \\u ", true);

Note that the first parameter of the generateToc method is the position in the document content at which the table of contents is inserted. The code above inserts it at index 5.

After the document structure is assembled, call the save method of WordprocessingMLPackage to save the document.

Conclusion

At this point, we have extracted the dialogue from an MP4 video, converted it into text, and exported it as a standard Word document. If you find any mistakes or omissions in the implementation, please send feedback. Thank you.

This series uses the "Thomas and Friends" animated series as its material. My children love this show and especially like listening to Thomas's stories. I built these functions on a whim so that I could tell them better Thomas bedtime stories, and I hope they help you too.

Posted by talltorp on Thu, 11 Jun 2020 22:56:14 -0700