Lucene Notes 17: Introduction to Chinese Word Segmentation in Lucene

I. The Function of a Word Segmenter

A word segmenter (analyzer) produces a TokenStream, which carries the information generated during tokenization; the details of each token can be read through attribute objects attached to the stream.
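Conceptually, a TokenStream is a pull-based iterator: you register the attributes you care about up front, then call incrementToken() repeatedly, and after each call the attribute objects describe the current token. The following is only a plain-Java model of that contract (the class and field names here are made up for illustration; this is not Lucene's API):

```java
// A minimal pull-based token stream modeled on Lucene's contract:
// register/inspect attributes, then call incrementToken() until it returns false.
class SimpleTokenStream {
    private final String[] tokens;
    private int index = -1;

    // "Attributes" exposed for the current token
    String term;   // analogous to CharTermAttribute
    int position;  // analogous to an absolute token position

    SimpleTokenStream(String text) {
        this.tokens = text.toLowerCase().split("\\W+");
    }

    boolean incrementToken() {
        index++;
        if (index >= tokens.length) return false;
        term = tokens[index];
        position = index + 1;
        return true;
    }
}

public class TokenStreamDemo {
    public static void main(String[] args) {
        SimpleTokenStream stream = new SimpleTokenStream("How are you");
        while (stream.incrementToken()) {
            System.out.println(stream.position + ":" + stream.term);
        }
        // prints 1:how, 2:are, 3:you (one per line)
    }
}
```

The real Lucene API works the same way, except the attributes are separate objects obtained via addAttribute(), as the displayAllToken method below shows.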

II. A Custom Stop-Word Analyzer

package com.wsy;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;

public class MyStopAnalyzer extends Analyzer {
    private Set set;

    public MyStopAnalyzer(String[] stopWords) {
        // Print the built-in English stop words in StopAnalyzer
        System.out.println(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        set = StopFilter.makeStopSet(Version.LUCENE_35, stopWords, true);
        // Also add the built-in English stop words
        set.addAll(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    public MyStopAnalyzer() {
        set = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(Version.LUCENE_35, new LowerCaseFilter(Version.LUCENE_35, new LetterTokenizer(Version.LUCENE_35, reader)), set);
    }

    public static void displayAllToken(String string, Analyzer analyzer) {
        try {
            TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(string));
            // Register attributes to read token details from the stream
            // Position increment: the distance between this token and the previous one
            PositionIncrementAttribute positionIncrementAttribute = tokenStream.addAttribute(PositionIncrementAttribute.class);
            // Start and end character offsets of each token
            OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
            // The text of each token
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            // The token's type, as reported by the tokenizer
            TypeAttribute typeAttribute = tokenStream.addAttribute(TypeAttribute.class);
            while (tokenStream.incrementToken()) {
                System.out.println(positionIncrementAttribute.getPositionIncrement() + ":" + charTermAttribute + "[" + offsetAttribute.startOffset() + "-" + offsetAttribute.endOffset() + "]-->" + typeAttribute.type());
            }
            System.out.println("----------------------------");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // Custom stop-word analyzer
        Analyzer analyzer1 = new MyStopAnalyzer(new String[]{"I", "you", "hate"});
        // Built-in analyzer with the default stop words
        Analyzer analyzer2 = new StopAnalyzer(Version.LUCENE_35);
        String string = "how are you, thank you. I hate you.";
        MyStopAnalyzer.displayAllToken(string, analyzer1);
        MyStopAnalyzer.displayAllToken(string, analyzer2);
    }
}

The following statement is the key one: it sets up the Tokenizer and filter chain for the analyzer. If you need additional filters, simply wrap another one around the chain.

new StopFilter(Version.LUCENE_35, new LowerCaseFilter(Version.LUCENE_35, new LetterTokenizer(Version.LUCENE_35, reader)), set);
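The chain is a decorator stack: a Tokenizer sits at the bottom producing raw tokens, and each filter wraps the stream below it, transforming or dropping tokens on the way up. A plain-Java sketch of the same wrapping pattern (again not Lucene's classes, just the structure they follow):

```java
import java.util.*;

// Each stage wraps the one below it, mirroring
// new StopFilter(..., new LowerCaseFilter(..., new LetterTokenizer(...)), set)
interface Stage {
    String next(); // next token, or null when exhausted
}

class LetterTokenizerStage implements Stage {
    private final Iterator<String> it;
    LetterTokenizerStage(String text) {
        // split on anything that is not a letter, dropping empty pieces
        List<String> parts = new ArrayList<>();
        for (String p : text.split("[^A-Za-z]+")) if (!p.isEmpty()) parts.add(p);
        it = parts.iterator();
    }
    public String next() { return it.hasNext() ? it.next() : null; }
}

class LowerCaseStage implements Stage {
    private final Stage in;
    LowerCaseStage(Stage in) { this.in = in; }
    public String next() {
        String t = in.next();
        return t == null ? null : t.toLowerCase();
    }
}

class StopStage implements Stage {
    private final Stage in;
    private final Set<String> stopWords;
    StopStage(Stage in, Set<String> stopWords) { this.in = in; this.stopWords = stopWords; }
    public String next() {
        String t;
        while ((t = in.next()) != null) {
            if (!stopWords.contains(t)) return t; // drop stop words
        }
        return null;
    }
}

public class ChainDemo {
    public static void main(String[] args) {
        Stage chain = new StopStage(
                new LowerCaseStage(new LetterTokenizerStage("I hate Mondays")),
                new HashSet<>(Arrays.asList("i", "hate")));
        for (String t; (t = chain.next()) != null; ) System.out.println(t);
        // prints: mondays
    }
}
```

Adding another filter in Lucene is exactly like adding another Stage here: wrap the existing chain in one more layer.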

III. Chinese Word Segmenter

There are many Chinese word segmenters, such as paoding, mmseg, and IK, but some of them are no longer maintained.

Here is a demo of mmseg4j, whose lexicon is based on Sogou's dictionary. Download the mmseg4j-1.8.5 archive and look inside: the data directory holds the dictionaries, and the jars go into the project. Of the two mmseg4j-all jars, one bundles the dictionaries (with dic) and one does not (without dic); here we use the one without. Testing the default lexicon on the sentence "I come from Liaocheng, Shandong Province. My name is Wang Shaoyang.", I found that "Liaocheng" was not kept as a single token but split apart. Surely the lexicon should know "Liaocheng"? So we add it ourselves: open words-my.dic in the data folder, add "Liaocheng", and run the segmentation again; "Liaocheng" now comes out as one word.
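As I understand it, the mmseg4j user dictionaries (the words*.dic files in the data directory) are plain UTF-8 text files with one entry per line, so adding the city name is just appending a line like:

```
聊城
```

After saving the file, re-create the MMSegAnalyzer pointing at that data directory so the updated dictionary is loaded.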

// Note: this snippet assumes import java.io.File; plus the MMSegAnalyzer class from the mmseg4j jar.
public static void main(String[] args) {
    // mmseg4j analyzer
    // With no dictionary directory specified, the text is segmented character by character
    Analyzer analyzer3 = new MMSegAnalyzer();
    // With a local dictionary directory, words are segmented against that lexicon
    Analyzer analyzer4 = new MMSegAnalyzer(new File("E:\\Lucene\\mmseg4j-1.8.5\\data"));
    String string2 = "I come from Liaocheng, Shandong Province. My name is Wang Shaoyang.";
    MyStopAnalyzer.displayAllToken(string2, analyzer3);
    MyStopAnalyzer.displayAllToken(string2, analyzer4);
}


Posted by programmingjeff on Thu, 24 Jan 2019 06:24:14 -0800