Implementing a LIKE-style query with Elasticsearch

Keywords: Elasticsearch, MySQL

Problem

The elastic search query needs to achieve the like query effect similar to mysql. For example, the record with the value of hello China 233 can be queried either through China or through llo.

However, the query of elastic search is based on word segmentation. The default word segmentation of hello China 233 is hello, China, country and 233. This record can be matched when using a hello query, but not when using a llo query.

Solution

Because the granularity of the result of recording content segmentation is not fine enough, the query of segmentation cannot match the record, so the solution is to segment the recording content with each character. That is to say, the word "hello China 233" is divided into h, e, l, o, China, Guo, 2, 3.

Elasticsearch does not ship a ready-made analyzer with this behavior, but a custom analyzer can provide it: a pattern_replace character filter appends a space after every character of the string, and the whitespace tokenizer then splits the string into single characters.

Results

Default analyzer

PUT /like_search
{
  "mappings": {
    "like_search_type": {
      "properties": {
        "name": {
          "type": "text"
        }
      }
    }
  }
}

PUT /like_search/like_search_type/1
{
  "name": "hello China 233"
}

Word segmentation effect

GET /like_search/_analyze
{
  "text": [
    "hello China 233"
    ]
}
{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "中",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "233",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<NUM>",
      "position": 3
    }
  ]
}

By default, elasticsearch uses the standard word breaker. As follows, you can't find the record of hello China 233 through llo.

GET /like_search/_search
{
  "query": {
    "match_phrase": {
      "name": "llo"
    }
  }
}

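Custom analyzer

The index created above must be deleted first (DELETE /like_search), because an index that already exists cannot be created again and the analyzer of an existing field cannot be changed in place.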

PUT /like_search
{
  "settings": {
    "analysis": {
      "analyzer": {
        "char_analyzer": {
          "char_filter": [
            "split_by_whitespace_filter"
          ],
          "tokenizer": "whitespace"
        }
      },
      "char_filter": {
        "split_by_whitespace_filter": {
          "type": "pattern_replace",
          "pattern": "(.+?)",
          "replacement": "$1 "
        }
      }
    }
  },
  "mappings": {
    "like_search_type": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "char_analyzer"
        }
      }
    }
  }
}

PUT /like_search/like_search_type/1
{
  "name": "hello China 233"
}

Word segmentation effect

GET /like_search/_analyze
{
  "analyzer": "char_analyzer", 
  "text": [
    "hello China 233"
    ]
}
{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "e",
      "start_offset": 1,
      "end_offset": 1,
      "type": "word",
      "position": 1
    },
    {
      "token": "l",
      "start_offset": 2,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "l",
      "start_offset": 3,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "o",
      "start_offset": 4,
      "end_offset": 4,
      "type": "word",
      "position": 4
    },
    {
      "token": "中",
      "start_offset": 5,
      "end_offset": 5,
      "type": "word",
      "position": 5
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 6,
      "type": "word",
      "position": 6
    },
    {
      "token": "2",
      "start_offset": 7,
      "end_offset": 7,
      "type": "word",
      "position": 7
    },
    {
      "token": "3",
      "start_offset": 8,
      "end_offset": 8,
      "type": "word",
      "position": 8
    },
    {
      "token": "3",
      "start_offset": 9,
      "end_offset": 9,
      "type": "word",
      "position": 9
    }
  ]
}

Using the custom word breaker, you can find the record of hello China 233 through llo.

GET /like_search/_search
{
  "query": {
    "match_phrase": {
      "name": "llo"
    }
  }
}

Posted by VDarkAzN on Wed, 13 Nov 2019 10:03:24 -0800