I. The Function of a Word Segmenter (Analyzer)
The job of an analyzer is to produce a TokenStream, which carries the information generated during word segmentation; detailed per-token information can then be read through the stream's Attribute classes.
II. A Custom Stop-Word Analyzer
package com.wsy;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;

public class MyStopAnalyzer extends Analyzer {
    private Set set;

    public MyStopAnalyzer(String[] stopWords) {
        // Print the built-in English stop words in StopAnalyzer
        System.out.println(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        set = StopFilter.makeStopSet(Version.LUCENE_35, stopWords, true);
        // Add the original stop-word set as well
        set.addAll(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    public MyStopAnalyzer() {
        set = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(Version.LUCENE_35,
                new LowerCaseFilter(Version.LUCENE_35,
                        new LetterTokenizer(Version.LUCENE_35, reader)), set);
    }

    public static void displayAllToken(String string, Analyzer analyzer) {
        try {
            TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(string));
            // Register attributes to inspect the information in the stream
            // Position increment: the distance between lexical units
            PositionIncrementAttribute positionIncrementAttribute =
                    tokenStream.addAttribute(PositionIncrementAttribute.class);
            // Offset of each lexical unit within the text
            OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
            // Term text of each lexical unit
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            // Type of each lexical unit
            TypeAttribute typeAttribute = tokenStream.addAttribute(TypeAttribute.class);
            while (tokenStream.incrementToken()) {
                System.out.println(positionIncrementAttribute.getPositionIncrement() + ":"
                        + charTermAttribute + "[" + offsetAttribute.startOffset() + "-"
                        + offsetAttribute.endOffset() + "]-->" + typeAttribute.type());
            }
            System.out.println("----------------------------");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // Custom stop-word analyzer
        Analyzer analyzer1 = new MyStopAnalyzer(new String[]{"I", "you", "hate"});
        // Built-in StopAnalyzer with the default stop words
        Analyzer analyzer2 = new StopAnalyzer(Version.LUCENE_35);
        String string = "how are you, thank you. I hate you.";
        MyStopAnalyzer.displayAllToken(string, analyzer1);
        MyStopAnalyzer.displayAllToken(string, analyzer2);
    }
}
The following statement is the heart of the analyzer: it sets up the Tokenizer and the filter chain. To apply more filters, simply wrap additional ones around the chain.
new StopFilter(Version.LUCENE_35,
        new LowerCaseFilter(Version.LUCENE_35,
                new LetterTokenizer(Version.LUCENE_35, reader)), set);
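As a sketch of "just continue to add it", the class below wraps one more filter (Lucene 3.5's PorterStemFilter, for English stemming) around the same chain. The class name MyStemStopAnalyzer is made up for this example, and the code assumes the Lucene 3.5 jar is on the classpath:

```java
package com.wsy;

import org.apache.lucene.analysis.*;
import org.apache.lucene.util.Version;

import java.io.Reader;
import java.util.Set;

// Hypothetical example: the same chain as MyStopAnalyzer, with one
// extra filter (PorterStemFilter) wrapped around it. Lucene 3.5 API.
public class MyStemStopAnalyzer extends Analyzer {
    private final Set set = StopAnalyzer.ENGLISH_STOP_WORDS_SET;

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // The chain runs inside-out: tokenize letters, lowercase,
        // drop stop words, then stem with the Porter algorithm
        return new PorterStemFilter(
                new StopFilter(Version.LUCENE_35,
                        new LowerCaseFilter(Version.LUCENE_35,
                                new LetterTokenizer(Version.LUCENE_35, reader)), set));
    }
}
```

Each filter takes the previous TokenStream as its constructor argument, so extending the chain never touches the Tokenizer at the core.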
III. Chinese Word Segmenter
When it comes to Chinese word segmentation there are many segmenters, such as paoding, mmseg4j, and IK, but some of them are no longer maintained.
Here is a demo of mmseg4j, which is based on the Sogou lexicon. Download the mmseg4j-1.8.5 archive and open it: the data directory holds the dictionaries, and there are two mmseg4j-all jar packages, one with the dictionaries bundled and one without; here we use the one without and add the jar to the project. Testing the phrase "I come from Liaocheng, Shandong Province. My name is Wang Shaoyang." against the default dictionary, "Liaocheng" was split into the single characters "Liao" and "Cheng" ("city"). Not convinced that "Liaocheng" is not a word? Then add it ourselves: open words-my.dic in the data folder, add "Liaocheng", and run the segmentation again; this time "Liaocheng" comes out as a single word.
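Concretely, the dictionary change is a single line appended to the file; mmseg4j's user dictionaries are plain text with one word per line (the path follows the mmseg4j-1.8.5 layout described above):

```
# E:\Lucene\mmseg4j-1.8.5\data\words-my.dic -- one word per line
聊城
```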
public static void main(String[] args) {
    // mmseg4j word segmenter
    // With no dictionary directory specified, text is split character by character
    Analyzer analyzer3 = new MMSegAnalyzer();
    // With a local dictionary directory specified, words are segmented against it
    Analyzer analyzer4 = new MMSegAnalyzer(new File("E:\\Lucene\\mmseg4j-1.8.5\\data"));
    String string2 = "I come from Liaocheng, Shandong Province. My name is Wang Shaoyang.";
    MyStopAnalyzer.displayAllToken(string2, analyzer3);
    MyStopAnalyzer.displayAllToken(string2, analyzer4);
}