Parsing Word .doc/.docx files with POI and storing the results on a schedule

Spring Boot code structure:

Required pom dependencies

Design of database tables
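
The table design itself is not reproduced here. As a rough guide, the columns can be inferred from the setters that the service code calls. The sketch below is such an inferred entity: the `id` field, the field types, and the meaning given in the comments are assumptions, not the original table definition.

```java
import java.util.Date;

// Hypothetical entity inferred from the setters used by the service code.
// The id field and all field types are assumptions.
class Meteorological {
    private Long id;           // assumed primary key
    private Date nowtime;      // forecast time
    private String alert;      // alert text, if the bulletin has one
    private String weather;    // weather description
    private String windforce;  // wind force, stored as text
    private String maximum;    // maximum value, stored as text
    private String minimum;    // minimum value, stored as text

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public Date getNowtime() { return nowtime; }
    public void setNowtime(Date nowtime) { this.nowtime = nowtime; }
    public String getAlert() { return alert; }
    public void setAlert(String alert) { this.alert = alert; }
    public String getWeather() { return weather; }
    public void setWeather(String weather) { this.weather = weather; }
    public String getWindforce() { return windforce; }
    public void setWindforce(String windforce) { this.windforce = windforce; }
    public String getMaximum() { return maximum; }
    public void setMaximum(String maximum) { this.maximum = maximum; }
    public String getMinimum() { return minimum; }
    public void setMinimum(String minimum) { this.minimum = minimum; }

    public static void main(String[] args) {
        Meteorological m = new Meteorological();
        m.setWeather("cloudy");   // made-up sample value
        System.out.println(m.getWeather());
    }
}
```

Each field would map to one column of the table, with the three numeric-looking values kept as text because the service stores them as strings.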

Without further ado, here is the code, starting with MeteorologicalService:

public void testReadByDoc(String path) throws Exception {
    Meteorological meteorological = new Meteorological();
    String[] content;
    // Index of the last "." in the path, i.e. where the file extension starts
    int i = path.lastIndexOf(".");
    // try-with-resources closes the stream even if parsing fails
    try (InputStream is = new FileInputStream(path)) {
        if (path.length() - i == 4) {   // .doc file
            HWPFDocument doc = new HWPFDocument(is);
            Range range = doc.getRange();
            // Read the document paragraph by paragraph
            content = MeteorologicalUtil.printInfo(range);
        } else {                        // .docx file
            XWPFDocument xdoc = new XWPFDocument(is);
            content = MeteorologicalUtil.printInfox(xdoc);
        }
    }
    // Remove empty paragraphs
    String[] contenta = MeteorologicalUtil.removeArrayEmptyTextBackNewArray(content);
    // Number of remaining paragraphs
    int len = contenta.length;
    if (len < 6) {   // The fields below are read from the last six paragraphs
        throw new IllegalArgumentException("Unexpected document layout: " + path);
    }
    Date time = MeteorologicalUtil.getTime(contenta);  // Time of the weather forecast
    String s = contenta[len - 6];
    if (s.contains("and")) {   // Check whether the paragraph contains "and"
        meteorological.setAlert(contenta[len - 5]);
    }
    // The last three paragraphs end with one extra character, which is dropped
    String minimum = contenta[len - 1].substring(0, contenta[len - 1].length() - 1);
    String maximum = contenta[len - 2].substring(0, contenta[len - 2].length() - 1);
    String windforce = contenta[len - 3].substring(0, contenta[len - 3].length() - 1);
    meteorological.setWeather(contenta[len - 4]);
    meteorological.setMaximum(maximum);
    meteorological.setMinimum(minimum);
    meteorological.setNowtime(time);
    meteorological.setWindforce(windforce);
    meteorologicalMapper.insert(meteorological);   // Save the populated entity
}

What this code does:
First, it parses the Word document and reads its content paragraph by paragraph.
The document may contain blank paragraphs and carriage returns that would throw off position-based processing, so these are removed first (part of that code is posted below).
With the empty paragraphs gone, the remaining paragraphs are processed according to their position (that code is also posted below).
Finally, the extracted data is set on the entity object and stored in the database.
Since the files should be processed on a schedule, we also need a timer (the one built into Spring), which is added after the processing code.
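
As a standalone illustration of the cleanup and position-based steps, here is a minimal sketch that needs no POI or database. The class name `TailFieldsDemo`, the helper names, and the sample paragraphs are all made up for the example; only the idea (drop empty paragraphs, then read fields counted from the end of the array) comes from the service method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original project.
class TailFieldsDemo {

    // Drop paragraphs that are empty once carriage returns are removed.
    static String[] removeEmpty(String[] paragraphs) {
        List<String> kept = new ArrayList<>();
        for (String p : paragraphs) {
            if (p == null) {
                continue;
            }
            String cleaned = p.replace("\r", "");
            if (!cleaned.isEmpty()) {
                kept.add(cleaned);
            }
        }
        return kept.toArray(new String[0]);
    }

    // Remove the last character, for example a trailing unit or punctuation mark.
    static String dropLastChar(String s) {
        return s.substring(0, s.length() - 1);
    }

    public static void main(String[] args) {
        // Made-up sample paragraphs, not taken from a real bulletin.
        String[] raw = {"Title\r", "\r", "cloudy\r", "", "3.\r", "30.\r", "21.\r"};
        String[] p = removeEmpty(raw);
        int len = p.length;
        System.out.println("weather   = " + p[len - 4]);
        System.out.println("windforce = " + dropLastChar(p[len - 3]));
        System.out.println("maximum   = " + dropLastChar(p[len - 2]));
        System.out.println("minimum   = " + dropLastChar(p[len - 1]));
    }
}
```

Counting from the end of the array only works because empty paragraphs are removed first; a single stray blank line would otherwise shift every index.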

To keep the service method short, some methods were moved into a utility class:

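import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class MeteorologicalUtil {
// Read the paragraphs of a .doc document into an array
public static String[] printInfo(Range range) {
    // Number of paragraphs
    int paraNum = range.numParagraphs();
    String[] paragraphArr = new String[paraNum];
    for (int i = 0; i < paraNum; i++) {
        paragraphArr[i] = range.getParagraph(i).text();
    }
    return paragraphArr;
}
// Read the paragraphs of a .docx document into an array
public static String[] printInfox(XWPFDocument xwpfDocument) {
    List<XWPFParagraph> paragraphs = xwpfDocument.getParagraphs();
    // Number of paragraphs
    int paraNum = paragraphs.size();
    String[] paragraphArr = new String[paraNum];
    for (int i = 0; i < paraNum; i++) {
        paragraphArr[i] = paragraphs.get(i).getParagraphText();
    }
    return paragraphArr;
}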

/**
 * Obtaining meteorological time
 * @param arr
 * @return
 */
public  static Date getTime(String [] arr){

    SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy year MM month dd day HH:mm");
    simpleDateFormat.setTimeZone(TimeZone.getTimeZone("GMT+0"));  //Eliminate time difference
    for (String ph :arr){
        String timeStr = patternTime(ph);
        if(timeStr!=null){
            try {
                return  simpleDateFormat.parse(timeStr);
            } catch (ParseException e) {
                e.printStackTrace();
                return null;
            }
        }
    }
    return null;
}


/**
 * Remove empty passages
 * @param strArray
 * @return
 */
public static String[] removeArrayEmptyTextBackNewArray(String[] strArray) {
   //Converting arrays to list objects
    List<String> strList= Arrays.asList(strArray);
    List<String> strListNew=new ArrayList<>();

    for (int i = 0; i <strList.size(); i++) {
        strList.set(i,strList.get(i).replaceAll("\b","").replaceAll("\r",""));
        //strList.set(i,strList.get(i).substring(0,strList.get(i).length()-1));
        if (strList.get(i)!=null&&!strList.get(i).equals("")){
            strListNew.add(strList.get(i));            }
    }
   //Converting a collection into an array
    String[] strNewArray = strListNew.toArray(new String[strListNew.size()]);
    return   strNewArray;

}

public static String patternTime(String content){

    //In the format of **** year ** month ** day ** year * month * day, we can change different filtering rules, and filter regular expressions in different formats to match the time in the text.
    Pattern pattern = Pattern.compile("((([0-9]{4})year([0-9]{2}|[1-9]))month([0-9]{2}|[1-9]))day([0-9]{2}|[1-9]):([0-9]{2}|[1-9])"); //Attempt to extract data of this type
    Matcher matcher = pattern.matcher(content);
    if (matcher.find()) {  //Determine whether the text finds a regular string and extract it
        String str_ymd =  matcher.group(0);
        return str_ymd;
    }
    return null;
}
}
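
To try the time extraction on its own, the sketch below finds the first timestamp in a list of paragraphs and parses it. It swaps `SimpleDateFormat` for `java.time`, which is a substitution on my part, and it assumes timestamps of the form year mark, month mark, day mark, then `HH:mm`. The class name `ForecastTimeDemo` and the sample text are made up, and the regular expression is a simplified variant written for this example.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original project.
class ForecastTimeDemo {
    // Unicode escapes keep this source ASCII-only:
    // U+5E74 is the year mark, U+6708 the month mark, U+65E5 the day mark.
    private static final Pattern TIME = Pattern.compile(
            "\\d{4}\u5e74\\d{1,2}\u6708\\d{1,2}\u65e5\\d{1,2}:\\d{2}");

    // Single pattern letters (M, d, H) accept one- or two-digit values.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy\u5e74M\u6708d\u65e5H:mm");

    // Return the first timestamp found in the paragraphs, if any.
    static Optional<LocalDateTime> findTime(String[] paragraphs) {
        for (String p : paragraphs) {
            Matcher m = TIME.matcher(p);
            if (m.find()) {
                return Optional.of(LocalDateTime.parse(m.group(), FORMAT));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample text, not taken from a real bulletin.
        String[] paragraphs = {
            "Weather bulletin",
            "Issued at 2019\u5e748\u67085\u65e510:30"
        };
        System.out.println(findTime(paragraphs));
    }
}
```

Returning an `Optional` instead of `null` makes the "no time found" case explicit for the caller. `LocalDateTime` carries no time zone, so no offset adjustment is needed in this variant.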

That covers the parsing code. Next we add the timer. (It is easy to understand because it ships with Spring.)

import java.io.File;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;

@Configuration    // Declares this class as a configuration class
@EnableScheduling // Enables scheduled tasks
public class MeteorologicalTask {

@Autowired
MeteorologicalService meteorologicalService;

// @Scheduled(cron = "0 0 1 * * ?") // Runs at 1:00 a.m. every day (look up cron expressions online if the syntax is unfamiliar)
@Scheduled(cron = "*/5 * * * * ?") // Runs every 5 seconds
public void work() {
    // Folder to scan
    File file = new File("C:\\Users\\qps12\\Desktop\\Meteorological Bureau 2");
    // All files and folders directly under the folder
    File[] fileList = file.listFiles();
    if (fileList == null) {   // The folder does not exist or cannot be read
        return;
    }

    for (File currentFile : fileList) {
        if (currentFile.isFile()) {      // Process files only, skip sub-folders
            String path = currentFile.getAbsolutePath();  // Absolute path of the current file
            try {
                meteorologicalService.testReadByDoc(path);  // Parse the file, populate the entity, and save it
            } catch (Exception e) {
                e.printStackTrace();
                return;   // Stop this run; the file is kept and picked up again on the next run
            }
            currentFile.delete();   // Delete the file once its data has been stored
        }
    }
}

}
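
The task above hands every regular file in the folder to the parser, so a stray non-Word file would end up in the .docx branch and fail. One possible refinement, sketched here with plain JDK classes, is to keep only .doc and .docx files before parsing. The class name `WordFileFilterDemo` and both helper methods are made up for this example.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original project.
class WordFileFilterDemo {

    // True for names ending in .doc or .docx, ignoring case.
    static boolean isWordFile(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".doc") || lower.endsWith(".docx");
    }

    // List the Word files directly under dir; empty if dir is missing or unreadable.
    static List<File> listWordFiles(File dir) {
        List<File> result = new ArrayList<>();
        File[] children = dir.listFiles();   // null if dir is not a readable directory
        if (children == null) {
            return result;
        }
        for (File f : children) {
            if (f.isFile() && isWordFile(f.getName())) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Scans the directory given as the first argument, or the current one.
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : listWordFiles(dir)) {
            System.out.println(f.getAbsolutePath());
        }
    }
}
```

In the scheduled task, the loop could then iterate over `listWordFiles(file)` instead of the raw `listFiles()` result.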
Note: this approach only works for documents that share an almost identical layout. For documents without a fixed layout, fuzzy matching would be a better fit (I have not looked into it yet, so it is not covered here). For reference, here is the document that I process:

Posted by The_Walrus on Mon, 05 Aug 2019 19:30:27 -0700
