Full Text Retrieval Engine and Tools
Lucene
Lucene is a full-text search engine library.
At the code level, the steps for using Lucene are as follows (a minimal sketch follows the list):
- Create a document (org.apache.lucene.document.Document) and add fields (org.apache.lucene.document.Field) to it through Document's add method
- Create an org.apache.lucene.index.IndexWriter and add the built Documents through its addDocument or addDocuments methods
- Close the IndexWriter with its close method
- Create the index searcher org.apache.lucene.search.IndexSearcher, passing in an index repository reader (org.apache.lucene.index.DirectoryReader) as the parameter
- Execute the query on the searcher through its search method, whose parameter is an org.apache.lucene.search.Query object. A Query is constructed by the parse(String) instance method of the query parser org.apache.lucene.queryparser.classic.QueryParser; alternatively, the newer flexible parser org.apache.lucene.queryparser.flexible.standard.StandardQueryParser can be used.
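To make the five steps concrete, here is a minimal, self-contained Scala sketch (Lucene 6.x API; the field name "body", the sample text, and the in-memory RAMDirectory are choices made for this illustration, not requirements):

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field, TextField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig}
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.RAMDirectory

val dir = new RAMDirectory() // in-memory index, enough for a sketch
val analyzer = new StandardAnalyzer()
val writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))

val doc = new Document() // step 1: a Document with Fields
doc.add(new TextField("body", "hello lucene world", Field.Store.YES))
writer.addDocument(doc) // step 2: add it through the IndexWriter
writer.close() // step 3: close the writer

val searcher = new IndexSearcher(DirectoryReader.open(dir)) // step 4: searcher over a reader
val query = new QueryParser("body", analyzer).parse("lucene") // step 5: build and run a Query
searcher.search(query, 10).scoreDocs.foreach { sd =>
  println(searcher.doc(sd.doc).get("body"))
}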
Chinese Text Index Construction and Query Example
(The following applies to Lucene version 6.4.2.)
StandardAnalyzer cannot be used to build an index over Chinese text: it splits Chinese into single characters rather than words, so a Chinese word segmenter and a custom Analyzer are required. The Maven package org.apache.lucene:lucene-analyzers-common (developed by the Lucene project) ships an analyzer, org.apache.lucene.analysis.cjk.CJKAnalyzer, that claims to handle CJK (Chinese, Japanese, Korean) text, but it merely emits overlapping two-character fragments (bigrams), so its results are poor and nearly useless. For example, for the text "What is the flower language of white roses? White rose flower language: innocence, purity, respect, humility, I am enough to match you", it produces a stream of overlapping fragments such as "white rose / rose / fresh flower / flower / flower language / ..." rather than words. In addition there is smartcn (for version 6.4.2, the Maven package org.apache.lucene:lucene-analyzers-smartcn:6.4.2; note its artifactId differs from version 3.6.2, where it was lucene-smartcn), an analyzer package for Chinese text released together with Lucene. Its segmentation leans toward fairly fine-grained words, e.g. white / rose / flower / language / innocence / purity / respect / humility / I / enough / match / you.
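You can reproduce such comparisons yourself by printing the tokens an Analyzer emits for a given string. A small sketch (the helper name printTokens is ours; the tokenStream/attribute iteration pattern is the standard Lucene way to consume a token stream):

import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.cjk.CJKAnalyzer
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

def printTokens(analyzer: Analyzer, text: String): Unit = {
  val ts = analyzer.tokenStream("title", text) // the field name is irrelevant here
  val term = ts.addAttribute(classOf[CharTermAttribute])
  ts.reset()
  while (ts.incrementToken()) print(term.toString + " / ")
  ts.end()
  ts.close()
  println()
}

printTokens(new CJKAnalyzer(), "白玫瑰的花语是什么") // overlapping bigrams
printTokens(new SmartChineseAnalyzer(), "白玫瑰的花语是什么") // word-level segmentation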
- Create an index with the bundled smartcn analyzer
The following code is written in Scala. Its use is not complicated: even if you don't know Scala syntax, you can still follow the logical flow of building the index and running queries.
The code satisfies this scenario: suppose the file $lucene.wikiIdTitle has many rows, each with two columns separated by a TAB character; the first column is a number (the unique id of a document) and the second is a string (the document's text content). We need to index the documents so that, given keywords, a query outputs the unique ids of the matching documents.
import org.slf4j.LoggerFactory
import com.typesafe.config.{ConfigFactory, _}
import org.apache.lucene
import lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig, _}
import lucene.document.{Field, FieldType, _}
import lucene.analysis.{CharArraySet, _}
import lucene.search.{IndexSearcher, _}
import lucene.store.{RAMDirectory, _}
import lucene.analysis.cn.smart.{SmartChineseAnalyzer, _}
import org.apache.lucene.index.IndexWriterConfig.OpenMode
import org.apache.lucene.queryparser.classic.QueryParser
import scala.collection.convert.ImplicitConversions._
import scala.io.Source
import scala.util.{Failure, Success, Try}

val log = LoggerFactory.getLogger(this.getClass)
// Load the configuration; for Java programmers this is the equivalent of reading an app.properties file
val conf = ConfigFactory.load("app")
log.info("creating lucene index for wikipedia titles...")

// A directory for saving Lucene index files, such as /path/to/index/
val indexDir = conf.getString("lucene.indexDir")
log.debug(s"lucene index writing directory: $indexDir")

// Create a Directory object: either an FSDirectory that writes to disk (it needs the
// index directory as a parameter) or a RAMDirectory that operates purely in memory.
// val idxDir = FSDirectory.open(java.nio.file.Paths.get(indexDir))
// This is a small example, so operate directly in memory:
val idxDir = new RAMDirectory()

val stopWordsFiles = conf.getString("lucene.stopWordsFiles")
log.debug(s"lucene.stopWordsFiles: $stopWordsFiles")
// The stop-word list filters useless stop words out of the segmentation results
val stopWords = stopWordsFiles.split(",").flatMap(f => Try {
  if (f.trim.nonEmpty) Source.fromFile(f).getLines() else Iterator.empty
} match {
  case Success(x) => x
  case Failure(e) =>
    log.warn(s"error in loading stop words file: $f", e)
    Iterator.empty
}).toList
log.debug(s"stop words size: ${stopWords.length}")

val smartcn = new SmartChineseAnalyzer(new CharArraySet(stopWords, true))
val iwConf = new IndexWriterConfig(smartcn)
// In RAM it doesn't matter whether the open mode is "create" or "append"
iwConf.setOpenMode(OpenMode.CREATE)
val indexWriter = new IndexWriter(idxDir, iwConf)

// $lucene.wikiIdTitle is the path to the document collection file, which in this example reads:
/*
1832186 Courage and ambition
5376724 Yongsan-dong
5420049 Underground lover
5431949 Lively
5455483 Changlong
5463308 Albert Bridge
5470979 Okada
5511092 Shaw Dickey station
5544906 Mononga Sheila (Disambiguation)
5553846 Penglai cave
5553849 Nam Shan Tung
5566592 Boiling water
5566629 Antimony oxide
*/
val pgIdTtls = Source.fromFile(conf.getString("lucene.wikiIdTitle"))
  .getLines()
  .filter(ln => ln.nonEmpty)
  .map(ln => {
    val idTtl = ln.split("\t")
    (idTtl(0), idTtl(1))
  })
// pgIdTtls is a sequence of pairs (two-element tuples). Java programmers can simply
// picture two columns: the first is the unique id, the second is the text.

pgIdTtls.foreach(e => {
  // Create a Document
  val ldoc = new lucene.document.Document()
  // The first column is the document's unique id: it does not need to be indexed, but it must
  // be stored so that the id field can be "seen" in the retrieved results. If it is not stored
  // (.setStored(false)), the field is invisible in the results even though it was added to the document.
  val pageIdFieldType = new FieldType()
  pageIdFieldType.setStored(true) // Store it, because we want this value in the results
  pageIdFieldType.setIndexOptions(lucene.index.IndexOptions.NONE) // Don't index the id field
  // For performance, pageIdFieldType could be moved out of the loop; it is declared here
  // only to make the role of FieldType more apparent.
  // Create a field named pageId (the name is arbitrary, but you must remember it when extracting
  // search results) whose value is the first column, i.e. the unique number of the document.
  val piFld = new lucene.document.Field("pageId", e._1, pageIdFieldType)
  ldoc.add(piFld)
  // Add the text content of the document to the title field and store it (Field.Store.YES)
  ldoc.add(new lucene.document.TextField("title", e._2, Field.Store.YES))
  indexWriter.addDocument(ldoc) // Write to the index
})
// indexWriter.close() // If you write to disk (FSDirectory.open), remember to close the writer

val searcher = new IndexSearcher(DirectoryReader.open(indexWriter))
// To read an existing index from a directory, use instead:
// val searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(java.nio.file.Paths.get("/path/to/index/"))))

// Search in the title field
val queryParser = new QueryParser("title", smartcn)
// Search for "Nam Shan Tung"
val query = queryParser.parse("Nam Shan Tung")
val hitPageIdTitles = searcher
  .search(query, 30) // Return at most 30 results; like LIMIT 30 in SQL
  .scoreDocs
  .map(sd => searcher.doc(sd.doc))
  .map(d => (d.get("pageId"), d.get("title")))
// The last Scala statement may take a Java programmer some effort to read; translated into Java:
/*
ScoreDoc[] scoreDocs = searcher.search(query, 30).scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
  // Get the retrieved Document
  org.apache.lucene.document.Document d = searcher.doc(scoreDoc.doc);
  // Obtain the document number and text content from the Document
  System.out.println(d.get("pageId") + ", " + d.get("title"));
}
*/
// Two results were retrieved: (5553849, Nam Shan Tung) and (5376724, Yongsan-dong)
- Write an Analyzer subclass that performs word segmentation with a custom segmentation tool
//TODO
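As a sketch of the idea, here is what such a subclass could look like on Lucene 6.4.2. The Segmenter trait is a hypothetical stand-in for whatever segmentation tool you use; it is not a Lucene API:

import org.apache.lucene.analysis.{Analyzer, Tokenizer}
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents
import org.apache.lucene.analysis.tokenattributes.{CharTermAttribute, OffsetAttribute}

// Hypothetical interface standing in for your word segmentation tool of choice
trait Segmenter {
  def segment(text: String): Seq[String]
}

// A Tokenizer that reads the entire input, passes it to the segmenter,
// and emits the resulting words one at a time.
class SegmenterTokenizer(seg: Segmenter) extends Tokenizer {
  private val termAttr = addAttribute(classOf[CharTermAttribute])
  private val offsetAttr = addAttribute(classOf[OffsetAttribute])
  private var words: Iterator[String] = Iterator.empty
  private var pos = 0 // running character offset into the original text

  override def reset(): Unit = {
    super.reset()
    // Read the whole input up front; fine for short texts such as titles
    val sb = new StringBuilder
    val buf = new Array[Char](1024)
    var n = input.read(buf)
    while (n != -1) { sb.appendAll(buf, 0, n); n = input.read(buf) }
    words = seg.segment(sb.toString).iterator
    pos = 0
  }

  override def incrementToken(): Boolean = {
    clearAttributes()
    if (!words.hasNext) return false
    val w = words.next()
    termAttr.setEmpty().append(w)
    // Offsets assume the segmenter returns the text's characters in order without gaps
    offsetAttr.setOffset(correctOffset(pos), correctOffset(pos + w.length))
    pos += w.length
    true
  }

  override def end(): Unit = {
    super.end()
    offsetAttr.setOffset(correctOffset(pos), correctOffset(pos))
  }
}

// The Analyzer subclass itself is a one-liner: wire the Tokenizer into createComponents
class SegmenterAnalyzer(seg: Segmenter) extends Analyzer {
  override protected def createComponents(fieldName: String): TokenStreamComponents =
    new TokenStreamComponents(new SegmenterTokenizer(seg))
}

An instance of SegmenterAnalyzer can then be passed to IndexWriterConfig and QueryParser exactly as smartcn was above.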
Solr
Solr is an application server built on top of Lucene; it is a full-text retrieval tool that provides its services over HTTP.
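Once a Solr core is up, a query is just an HTTP request. A tiny sketch (the host, port, and core name wiki are assumptions made for this illustration; q and wt are standard parameters of Solr's select handler):

import scala.io.Source

// Query a hypothetical core named "wiki" on a local Solr instance
val url = "http://localhost:8983/solr/wiki/select?q=title:lucene&wt=json"
val response = Source.fromURL(url).mkString
println(response) // JSON whose "response" element lists the matching documents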