Reading text and images from a PDF in Java

Keywords: Java Maven

This article shows how to read the text and images in a PDF document from a Java program. The text is read by calling extractText() on each page, and the images by calling extractImages().

Tool used: Free Spire.PDF for Java (free version)
Getting and importing the jar file:
Method 1: Download the jar package from the official website. After downloading, unzip the package and import the Spire.Pdf.jar file from the lib folder into the Java project.

Method 2: Install and import the library through the Maven repository, following the vendor's Maven import instructions.

Java code examples
[Example 1] Read text from a PDF

import com.spire.pdf.*;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        //Load the test document
        PdfDocument pdf = new PdfDocument("sample.pdf");

        //Collect the text of all pages in a StringBuilder
        StringBuilder sb = new StringBuilder();

        //Traverse each page of the PDF document
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            PdfPageBase page = pdf.getPages().get(i);
            //Call the extractText() method to extract the text of the page
            sb.append(page.extractText(true));
        }
        pdf.close();

        //Write the collected text to a txt file once, after the loop;
        //try-with-resources closes the writer even if the write fails
        try (FileWriter writer = new FileWriter("ExtractText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
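
If the text of each page should stay distinguishable in the output, the writing step can be factored into a small helper that inserts a marker line before each page. The following is only a sketch using the JDK alone: PageTextWriter, joinPages and writePages are hypothetical names that are not part of Spire.PDF, and the strings in the list stand in for the values returned by page.extractText(true).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spire.PDF
class PageTextWriter {
    // Joins the page texts, putting a "--- Page N ---" marker line before each one
    static String joinPages(List<String> pageTexts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pageTexts.size(); i++) {
            sb.append("--- Page ").append(i + 1).append(" ---\n");
            sb.append(pageTexts.get(i)).append("\n");
        }
        return sb.toString();
    }

    // Writes the joined text in a single pass; try-with-resources closes the writer
    static void writePages(List<String> pageTexts, Path target) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(joinPages(pageTexts));
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder strings; in practice these would come from extractText()
        List<String> pages = Arrays.asList("First page text", "Second page text");
        writePages(pages, Paths.get("ExtractTextByPage.txt"));
    }
}
```

Passing the character set explicitly keeps the output independent of the platform default encoding, which matters when the PDF contains non-ASCII text.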

Text reading result: the extracted text is saved to ExtractText.txt.

[Example 2] Read images from a PDF

import com.spire.pdf.*;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;

public class ExtractImg {
    public static void main(String[] args) throws Exception {
        //Load the test document
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("test.pdf");

        //Counter used to number the output images across all pages
        int index = 0;

        //Traverse the PDF pages
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            //Get the PDF page
            PdfPageBase page = pdf.getPages().get(i);

            //Use the extractImages() method to get the images on the page
            for (BufferedImage image : page.extractImages()) {

                //Specify the name of the output image
                File output = new File(String.format("Image_%d.png", index++));
                //Save the image as a PNG file
                ImageIO.write(image, "PNG", output);
            }
        }
        //Release the document once all images have been saved
        pdf.close();
    }
}
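
ImageIO.write() returns false when no suitable writer is found, so the saving step can be wrapped in a helper that checks the return value and collects the files it wrote. This is a minimal sketch using the JDK alone: ImageSaver, imageName and saveAll are hypothetical names that are not part of Spire.PDF, and the blank BufferedImage in main stands in for the images returned by page.extractImages().

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Spire.PDF
class ImageSaver {
    // Builds the output name for the image with the given running index
    static String imageName(int index) {
        return String.format("Image_%d.png", index);
    }

    // Saves each image as a PNG file in dir and returns the files written
    static List<File> saveAll(List<BufferedImage> images, File dir) throws IOException {
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IOException("Cannot create directory " + dir);
        }
        List<File> written = new ArrayList<>();
        int index = 0;
        for (BufferedImage image : images) {
            File output = new File(dir, imageName(index++));
            // ImageIO.write returns false when no suitable writer is found
            if (!ImageIO.write(image, "PNG", output)) {
                throw new IOException("No PNG writer available for " + output);
            }
            written.add(output);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder image; in practice the list would hold the extracted images
        BufferedImage placeholder = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        List<File> files = saveAll(Collections.singletonList(placeholder), new File("images"));
        System.out.println(files);
    }
}
```

Writing into a dedicated directory keeps the extracted images apart from the source files, and the returned list makes it easy to report how many images were saved.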

Image reading result: each extracted image is saved as a PNG file named Image_0.png, Image_1.png, and so on.


Posted by moagrius on Fri, 04 Oct 2019 01:08:30 -0700