ES [7.6.x] learning notes: the IK Chinese analyzer

Keywords: Java, GitHub, Elasticsearch

In the last section we introduced the ES analyzer, and I trust the full-text search of ES left an impression on you. An analyzer consists of three parts: character filters, a tokenizer, and token filters. You may have noticed that the examples in the previous section were all in English. That's because ES was written by foreigners; if China wants to catch up in this area, it still depends on the friends in front of the screen~

In English, we can split a sentence or an article on spaces, filter the resulting terms, and finally keep the meaningful words. But how do we split Chinese? A Chinese sentence contains no spaces, so we need a strong Chinese vocabulary behind the tokenizer: when a dictionary word appears in your content, it gets extracted as a token. There is no need to reinvent the wheel here. Thanks to the efforts of our predecessors, such a Chinese tokenizer has already been built: it is the IK Chinese analyzer introduced in this section.
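For contrast, ES's built-in standard analyzer handles Chinese by breaking it into individual characters, which is exactly why a dictionary-based tokenizer is needed. A quick demonstration, using 中华人民共和国国歌 ("National Anthem of the People's Republic of China", the same text as the examples below):

POST _analyze
{
  "analyzer": "standard",
  "text":     "中华人民共和国国歌"
}

Every token that comes back is a single character (中, 华, 人, ...), so dictionary words like 国歌 are never produced.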

Installing the IK Chinese analyzer

ES does not ship with an IK Chinese analyzer by default, so we install it into ES as a plug-in. The installation steps are simple:

  1. Download the IK release that matches your ES version from GitHub: https://github.com/medcl/elasticsearch-analysis-ik/releases.
  2. Create an ik directory under the ES plug-in directory (${ES_HOME}/plugins):

    mkdir ik
  3. Unzip the downloaded IK package into the ik directory (the unzip command works well here; a combined sketch follows this list).
  4. Restart every ES node.
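Put together, the whole installation might look like the following. This is a minimal sketch that assumes ES 7.6.2 installed under /usr/local/elasticsearch; adjust the version and path to match your environment, and make sure the plug-in version matches your ES version exactly.

cd /usr/local/elasticsearch/plugins
mkdir ik && cd ik
# assumed release URL pattern for the medcl/elasticsearch-analysis-ik repo
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.2/elasticsearch-analysis-ik-7.6.2.zip
unzip elasticsearch-analysis-ik-7.6.2.zip
rm elasticsearch-analysis-ik-7.6.2.zip
# restart the node so the plug-in is loaded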

At this point, our IK Chinese analyzer is installed.
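To verify that the plug-in was actually picked up after the restart, you can list installed plug-ins with the cat API; the output should include analysis-ik on every node:

GET _cat/plugins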

IK Chinese word segmentation in action

In the previous section we called ES's analyze API, specifying an analyzer and some text, to see the tokenization result. Now that we have installed the IK Chinese analyzer, we naturally want to see it in action. Before looking at the effect, note that the IK plug-in provides us with two analyzers:

  • ik_max_word: splits the text at the finest granularity
  • ik_smart: splits the text at the coarsest granularity

Let's take a look at the analysis produced by ik_max_word:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text":     "National Anthem of the people's Republic of China"
}

We specified the analyzer ik_max_word and the text 中华人民共和国国歌 (National Anthem of the People's Republic of China). Here is the tokenization result:

{
    "tokens": [
        {
            "token": "The People's Republic of China",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "The Chinese people",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "The Chinese people",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "Chinese",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "People's Republic",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "the people",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "republic",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "Republic",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "country",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 8
        },
        {
            "token": "national anthem",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 9
        }
    ]
}

We can see the segmentation is very fine-grained: a search for any of these tokens will match the text 中华人民共和国国歌. Now let's take a look at the other analyzer, ik_smart:

POST _analyze
{
  "analyzer": "ik_smart",
  "text":     "National Anthem of the people's Republic of China"
}

The text is again 中华人民共和国国歌. Here is the tokenization result:

{
    "tokens": [
        {
            "token": "The People's Republic of China",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "national anthem",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

For the same text, ik_smart produces only two tokens, far fewer than ik_max_word. That is the difference between the two: ik_max_word favors recall, ik_smart favors precision, and both can segment Chinese.
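A common way to combine the two (not covered in this post's examples, so treat it as a sketch) is to index with ik_max_word for recall and analyze queries with ik_smart for precision, via the standard search_analyzer mapping parameter. The index name ik_both_index here is made up for illustration:

PUT ik_both_index
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
    }
}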

Specifying an IK analyzer when creating an index

Now that the IK plug-in is installed, we can specify an IK analyzer for fields of type text when creating an index. Take a look at the following example:

PUT ik_index
{
    "mappings": {
        "properties": {
            "id": {
                "type": "long"
            },
            "title": {
                "type": "text",
                "analyzer": "ik_max_word"
            }
        }
    }
}

We created the index ik_index and assigned the ik_max_word analyzer to the title field. Execute it, and once it is created successfully we can inspect the index's mapping with a GET request.

GET ik_index/_mapping

The results returned are as follows:

{
    "ik_index": {
        "mappings": {
            "properties": {
                "id": {
                    "type": "long"
                },
                "title": {
                    "type": "text",
                    "analyzer": "ik_max_word"
                }
            }
        }
    }
}

We can see that the analyzer for the title field is ik_max_word.
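To double-check that the mapping takes effect at analysis time, you can also run the analyze API against the index and field, instead of naming an analyzer directly:

POST ik_index/_analyze
{
  "field": "title",
  "text":  "中华人民共和国国歌"
}

This analyzes the text with whatever analyzer the title field is mapped to, so the output should match the ik_max_word result above.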

Assigning a default IK analyzer to an index

In the previous section we covered how to specify a default analyzer for an index. Here we simply make that default analyzer IK. One caveat: PUT against an index that already exists fails, so ik_index must be dropped first.
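A quick cleanup step (assuming you created ik_index in the previous example):

DELETE ik_index

Then re-create the index with a default analyzer, as follows: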

PUT ik_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_max_word"
        }
      }
    }
  }
}

This way, we don't need to declare every field in the index: dynamic field mapping will map string fields to the text type, and they will be analyzed with ik_max_word as the default. Let's try adding a record to the ik_index index.

POST ik_index/_doc/1
{
    "id": 1,
    "title": "Watermelon in Panggezhuang, Daxing",
    "desc": "The watermelons in Panggezhuang, Daxing are delicious. They are crispy, Sandy and sweet"
}
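The document is indexed successfully. And since desc was never declared in any mapping, dynamic mapping created it for us: both title and desc are mapped as text fields analyzed with the default ik_max_word (the analyzer itself lives in the index settings, so it won't show up on each field). You can inspect what dynamic mapping generated:

GET ik_index/_mapping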

Now let's run a search, as follows:

POST ik_index/_search
{
  "query": { "match": { "title": "watermelon" } }
}

We search the title field for 西瓜 (watermelon); the execution result is as follows:

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "ik_index",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "id": 1,
                    "title": "Watermelon in Panggezhuang, Daxing",
                    "desc": "The watermelons in Panggezhuang, Daxing are delicious. They are crispy, Sandy and sweet"
                }
            }
        ]
    }
}

We can see the record we just inserted was found, so our IK Chinese analyzer works and the search result meets our expectations. Next, let's see whether searching for the single character 西 ("west", the first character of 西瓜) finds anything:

POST ik_index/_search
{
  "query": { "match": { "title": "west" } }
}

The results are as follows:

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}

There is no result, which shows that 西瓜 was kept as a whole word during tokenization rather than split into individual characters. This is also in line with our expectations.
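To see why, you can run the title text through ik_max_word yourself; 西瓜 should appear in the token list, but the bare character 西 should not:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text":     "大兴庞各庄的西瓜"
}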

OK~ that's all for the IK Chinese analyzer in this section~~
