Python software design basics section 8 - Tesseract OCR video caption extraction

Keywords: Python image processing OCR

Contents

1, Tesseract OCR overview and environment configuration

(1) Introduction to Tesseract OCR

(2) Tesseract OCR installation

1. Program download and installation

2. Configure environment variables

3. Language configuration and program testing

2, Implementation of video caption extraction

(1) Implementation principle

(2) Code implementation

3, Summary

1, Tesseract OCR overview and environment configuration

(1) Introduction to Tesseract OCR

Tesseract is an open source OCR (Optical Character Recognition) engine that was developed by HP Labs and is now maintained by Google. Tesseract can handle many natural languages, such as English and Portuguese. As of 2015 it supported more than 100 written languages, and it can be trained to recognize others.

(2) Tesseract OCR installation

1. Program download and installation

Official website: https://github.com/tesseract-ocr/tesseract
Official documents: https://github.com/tesseract-ocr/tessdoc
Language pack address: https://github.com/tesseract-ocr/tessdata
Download address: https://digi.bib.uni-mannheim.de/tesseract/

On the download page, download the 64-bit Windows installer. The file used in this article is "tesseract-ocr-w64-setup-v5.0.0.20190623.exe".

After downloading, run the installer on the PC. Make sure the installation path contains no Chinese characters, to avoid problems.

During installation you can select additional language packs to install, such as Simplified Chinese. However, downloading them through the installer is slow. It is recommended to download the language pack separately and install it locally.

2. Configure environment variables

Press "Win+R" to open the Run dialog, and enter "sysdm.cpl" to open the System Properties window.

Select the Advanced tab, then Environment Variables.

Add the installation path of Tesseract OCR to the Path variable.
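Whether the Path change took effect can also be checked from Python. The sketch below uses only the standard library. `find_tesseract` is a hypothetical helper name, and it assumes the executable is called `tesseract`. Note that a command window opened before the change will not see the new Path.

```python
import shutil


def find_tesseract(name="tesseract"):
    """Return the full path of the executable found on Path, or None."""
    return shutil.which(name)


location = find_tesseract()
if location is None:
    print("tesseract is not on Path; check the environment variable")
else:
    print("tesseract found at", location)
```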

3. Language configuration and program testing

Copy the language file "chi_sim.traineddata" to the tessdata folder under the Tesseract OCR installation directory so that Tesseract can recognize Simplified Chinese. Then open a command window in the Tesseract OCR installation directory and enter the "tesseract -v" command to check the installation.

If the version information is displayed, as in the following figure, the installation and configuration are complete.
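The language check can be scripted as well. The sketch below runs `tesseract --list-langs` and parses the result. `parse_langs` and `installed_langs` are hypothetical helper names, and the parser assumes the output is one header line followed by one language name per line.

```python
import subprocess


def parse_langs(output):
    """Extract language names from the text printed by --list-langs."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header, the rest are names.
    return lines[1:]


def installed_langs(cmd="tesseract"):
    """Run the Tesseract CLI and return the languages it reports."""
    result = subprocess.run([cmd, "--list-langs"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,  # merge both streams
                            text=True, check=True)
    return parse_langs(result.stdout)
```

With Tesseract on Path, `"chi_sim" in installed_langs()` should be true once the language file has been copied into the tessdata folder.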

2, Implementation of video caption extraction

(1) Implementation principle

1. Read the video and obtain the video size to find the caption position

2. Capture the area where the caption is located and save it as a variable

3. Convert caption area to grayscale image

4. Binarize the grayscale caption area with cv2 (a fixed threshold is used in the code below)

5. Recognize the processed caption area with the text recognition module of Tesseract OCR

6. Output the recognized subtitle text
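Steps 2 to 4 can be illustrated on a synthetic frame without opening a video. The sketch below uses NumPy only, so it is not the cv2 code of this article: `to_gray` applies the usual BGR-to-gray weights and `binarize` mimics a fixed binary threshold. The frame size, the white bar standing in for a caption, and the helper names are assumptions chosen for illustration.

```python
import numpy as np


def crop(frame, y0, y1, x0, x1):
    """Step 2: cut out the caption area (rows y0:y1, columns x0:x1)."""
    return frame[y0:y1, x0:x1]


def to_gray(bgr):
    """Step 3: weighted sum of the B, G, R channels, rounded to uint8."""
    weights = np.array([0.114, 0.587, 0.299])  # B, G, R
    return np.rint(bgr.astype(np.float64) @ weights).astype(np.uint8)


def binarize(gray, thresh=220, maxval=255):
    """Step 4: pixels brighter than thresh become maxval, the rest 0."""
    return np.where(gray > thresh, maxval, 0).astype(np.uint8)


# Synthetic 720x1280 frame: dark background with a white bar as "caption".
frame = np.full((720, 1280, 3), 40, dtype=np.uint8)
frame[660:690, 300:900] = 255

caption = crop(frame, 635, 715, 100, 1200)
binary = binarize(to_gray(caption))
print(caption.shape, binary.shape)
```

After binarization only the bright bar remains white, which is the kind of high-contrast input the recognition step works on.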

(2) Code implementation

A clip from the film Let the Bullets Fly is used for recognition, and its subtitle text is output.

The following are video images:

The implementation code is as follows:

import cv2
import matplotlib.pyplot as plt
import pytesseract
# Import third-party libraries

if __name__ == '__main__':
    # Path of the video to read
    path = "Let the bullet fly.mp4"

    cap = cv2.VideoCapture(path)
    # Total number of frames in the video
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    print(frame_count)

    # Tesseract options: language data directory, --psm 7 (treat the image
    # as a single text line), and keep the spaces between words
    tessdata_dir_config = '--tessdata-dir "D:\\Python lib\\Tesseract-OCR\\tessdata"  --psm 7 -c preserve_interword_spaces=1'

    # Index of the first frame to process
    i = 0
    while i < frame_count:
        # Jump to frame i and read it; stop if the frame cannot be read
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if not ret:
            break

        # Output the size information of the frame being processed
        print(frame.shape)

        # Crop the caption area (rows 635-714, columns 100-1199)
        img = frame[635:715, 100:1200]
        # Display the cropped picture (OpenCV stores BGR, matplotlib expects RGB)
        plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        plt.axis("off")
        plt.show()

        # Convert the cropped image into a grayscale image
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        #cv2.imshow("Frame-2:Gray", img)
        # Binarize: pixels brighter than 220 become 255, all others become 0
        _, img = cv2.threshold(img, 220, 255, cv2.THRESH_BINARY)
        #cv2.imshow("Frame-3:Binary", img)

        # Recognize the text using the Simplified Chinese language data
        word = pytesseract.image_to_string(img,
                                           lang='chi_sim',
                                           config=tessdata_dir_config)
        # Output the recognized text
        print(word)

        # Move on 120 frames, i.e. 5 seconds at 24 frames per second
        i = i + 24 * 5

        # Press "q" to stop; this only works while a cv2.imshow window is open
        if cv2.waitKey(10) & 0xff == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()

The output results are as follows:

On this clip, the recognition quality and accuracy are good.
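The loop above advances by 24 × 5 = 120 frames, which samples one frame every 5 seconds only if the video runs at 24 frames per second. A minimal sketch of deriving the step from the frame rate instead is shown here. `frame_indices` is a hypothetical helper, and in the script the rate could be read with `cap.get(cv2.CAP_PROP_FPS)`.

```python
def frame_indices(frame_count, fps, seconds=5.0):
    """Return the indices of one frame every `seconds` seconds of video."""
    # Fall back to 24 fps (the value assumed in the script) if unknown.
    if fps <= 0:
        fps = 24.0
    step = max(1, round(fps * seconds))
    return range(0, frame_count, step)


print(list(frame_indices(300, 24.0)))   # 24 fps video
print(list(frame_indices(300, 30.0)))   # 30 fps video
```

The `while` loop could then be replaced by `for i in frame_indices(frame_count, fps):` without changing the rest of the processing.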

3, Summary

When using the above code for subtitle recognition, the crop position of the subtitles has to be adjusted for each video. Different videos place subtitles at different positions and use different aspect ratios, so each video needs its own parameters.
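One way to reduce this manual tuning is to express the caption area relative to the frame size. The sketch below assumes that the crop `frame[635:715, 100:1200]` from the code above was chosen for a 1280x720 frame (the article does not state the video size) and rescales it to other resolutions. `caption_region` is a hypothetical helper, and it only helps when the subtitles sit at the same relative position.

```python
REF_HEIGHT, REF_WIDTH = 720, 1280          # assumed reference frame size
REF_BOX = (635, 715, 100, 1200)            # y0, y1, x0, x1 from the script


def caption_region(height, width):
    """Scale the reference caption box to a frame of the given size."""
    y0, y1, x0, x1 = REF_BOX
    return (y0 * height // REF_HEIGHT, y1 * height // REF_HEIGHT,
            x0 * width // REF_WIDTH, x1 * width // REF_WIDTH)


y0, y1, x0, x1 = caption_region(1080, 1920)
print(y0, y1, x0, x1)
```

With a frame read by cv2, the size comes from `height, width = frame.shape[:2]` and the crop becomes `frame[y0:y1, x0:x1]`.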

After several comparisons, recognition was found to work best when the caption is centered in the cropped image and does not overlap with the picture content.

The image processing method should also be chosen according to the caption type. For example, some subtitles are light-colored but not pure white. In that case, binarization tends to leave the subtitles incomplete, and recognition suffers.

Solution: before the final recognition, output both the grayscale and the binarized caption crops, compare them, and then choose the appropriate processing method. For example, in the second image below, the upper picture is the grayscale version and the lower picture is the binarized one. After binarization, parts of the characters are clearly missing and hard to recognize, so in this case binarization should be skipped and only the grayscale image used.
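That visual comparison can be backed by a rough numeric check. The sketch below is a heuristic and not part of the original script: it compares how many pixels survive the threshold of 220 with how many are brighter than a lower level, on the assumption that light, non-white subtitles lose most of their strokes at 220. The function name, the lower level of 150 and the ratio of 0.5 are illustrative assumptions.

```python
import numpy as np


def binarization_loses_text(gray, thresh=220, low=150, min_ratio=0.5):
    """Return True if thresholding at `thresh` drops most bright pixels.

    Compares the pixel count above `thresh` with the count above `low`.
    """
    bright = int((gray > low).sum())
    kept = int((gray > thresh).sum())
    if bright == 0:
        return False          # nothing bright at all, so nothing to lose
    return kept / bright < min_ratio


# Synthetic caption strips: white text (255) versus light gray text (200).
white = np.full((80, 1100), 30, dtype=np.uint8)
white[30:50, 200:900] = 255
light = np.full((80, 1100), 30, dtype=np.uint8)
light[30:50, 200:900] = 200

print(binarization_loses_text(white), binarization_loses_text(light))
```

When the check returns True for a real caption crop, passing the grayscale image to Tesseract instead of the binarized one is the option described above.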

Posted by kykin on Tue, 30 Nov 2021 06:36:28 -0800
