Recently, we needed keyword search over a large body of text. The goal: when a user enters a word or even a single character, every record containing it should be returned, much like a MySQL LIKE query.
In this scenario, the first approach that came to mind was to store the data in MySQL and query it directly, which fully meets the functional requirements. After trying it, however, the queries became unacceptably slow once the data volume grew.
So we had to consider other options. Since this is a search problem, we turned to Solr, which we had used before. After a series of setbacks, the final plan was as follows.
- When designing the schema, define a custom field type that uses the solr.NGramTokenizerFactory tokenizer, with both minGramSize and maxGramSize set to 1, so every single character of the input becomes its own token.
<fieldType name="text_ng1" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="1"/>
  </analyzer>
</fieldType>
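To make the field type concrete, here is a minimal sketch in plain Python (not Solr code) of what NGramTokenizerFactory produces when minGramSize and maxGramSize are both 1: each character of the input becomes its own token.

```python
def one_gram_tokens(text):
    """Split text into 1-grams: one token per character,
    mimicking NGramTokenizerFactory with min=max=1."""
    return list(text)

print(one_gram_tokens("space"))  # ['s', 'p', 'a', 'c', 'e']
```

Because the same analyzer runs at both index and query time, the indexed characters and the query characters line up token for token.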
- Set the field to be queried to a custom type.
<field name="question_content_list" type="text_ng1" indexed="true" stored="true" required="false" multiValued="true" />
- In the query, wrap the keyword in double quotation marks so Solr treats it as a phrase: the single-character tokens must then match consecutively and in order, which is exactly the substring (LIKE '%keyword%') behavior we want.
/select?q=question_content_list%3A"space"&start=0&rows=1&wt=json&indent=true
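The request above can be built programmatically; this sketch uses Python's standard urllib.parse (the host, port, and core name are assumptions, substitute your own deployment).

```python
from urllib.parse import urlencode, quote

# Assumed Solr endpoint -- replace host/port/core with your own.
base = "http://localhost:8983/solr/mycore/select"

params = {
    # Double quotes make this a phrase query over the 1-gram tokens.
    "q": 'question_content_list:"space"',
    "start": 0,
    "rows": 1,
    "wt": "json",
    "indent": "true",
}

# quote (rather than the default quote_plus) percent-encodes ':' and '"',
# matching the %3A / %22 escaping shown in the example URL.
url = base + "?" + urlencode(params, quote_via=quote)
print(url)
```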
- A sample response for the query above:
{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "q": "question_content_list:\"space\"",
      "indent": "true",
      "start": "0",
      "rows": "1",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 62,
    "start": 0,
    "docs": [
      {
        "question_content_list": [
          "18088",
          "qq space",
          "http://photo.qq.com"
        ],
        "question_id": "9DSOPHKVYF4G",
        "_version_": 1661304065051590660
      }
    ]
  }
}
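Consuming the response is straightforward JSON handling; a small sketch, using an abbreviated copy of the sample payload above:

```python
import json

# Abbreviated version of the sample response shown above.
raw = """
{"responseHeader": {"status": 0, "QTime": 1},
 "response": {"numFound": 62, "start": 0,
              "docs": [{"question_content_list":
                            ["18088", "qq space", "http://photo.qq.com"],
                        "question_id": "9DSOPHKVYF4G"}]}}
"""

data = json.loads(raw)

# Total number of matching documents, and the page of docs returned.
print(data["response"]["numFound"])  # 62
for doc in data["response"]["docs"]:
    print(doc["question_id"], doc["question_content_list"])
```

Note that numFound is the total hit count (62 here), while docs holds only the rows requested (rows=1), so paging is done with the start and rows parameters.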