Search (10) - elastic4s multi_match: multi-field full-text search

Keywords: Scala

In full-text search, we often need to match the same query text against several fields, or different query text against different fields. For example:

GET /books/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "Peace and war" }},
        { "match": { "author": "Tostei"   }}
      ] 
    }
  } 
}

We can combine query clauses with boolQuery. Full-text search produces a relevance score, and boolQuery scores a document higher the more should clauses it matches. When results are sorted by score, the most likely match comes first. A boolQuery can also contain another boolQuery, as follows:

GET /books/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "Peace and war" }},
        { "match": { "author": "Tostei"   }},
        { "bool": {
            "should": [
              { "match": { "translator": "Chen" }},
              { "match": { "translator": "Wang" }}
            ]
        }}
      ] 
    }
  } 
}

The added clauses mean that a book whose translator's surname is Chen or Wang scores higher. However, nesting a boolQuery inside another one affects the score of the outer boolQuery: the nested boolQuery is just one of three should clauses, so the two translator conditions together account for only about a third of the total score. We can use boost to rebalance the weights, as follows:

GET /books/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": 
          { 
            "title":  {
              "query": "Peace and war",
              "boost": 2
            }
          }
        },
        { "match": { "author": "Tostei"   }},
        { "bool": {
            "should": [
              { "match": { "translator": "Chen" }},
              { "match": { "translator": "Wang" }}
            ]
        }}
      ] 
    }
  } 
}
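
To make the effect of nesting and boost on relative weights concrete, here is a minimal Scala sketch. It is a deliberately simplified model, not Elasticsearch's actual scoring: it assumes every matching clause would contribute the same raw score, so sibling should clauses split their parent's share in proportion to their boosts. The names ShouldWeights, Leaf, BoolClause and shares are hypothetical.

```scala
object ShouldWeights {
  // A clause is either a leaf (a match on one field) or a nested bool of should clauses.
  sealed trait Clause
  final case class Leaf(name: String, boost: Double = 1.0) extends Clause
  final case class BoolClause(should: List[Clause], boost: Double = 1.0) extends Clause

  private def boostOf(c: Clause): Double = c match {
    case Leaf(_, b)       => b
    case BoolClause(_, b) => b
  }

  // Split `share` among sibling clauses in proportion to their boosts,
  // recursing into nested bool clauses. ASSUMPTION: equal raw scores per clause.
  def shares(clauses: List[Clause], share: Double = 1.0): Map[String, Double] = {
    val total = clauses.map(boostOf).sum
    clauses.flatMap {
      case Leaf(name, b)        => List(name -> share * b / total)
      case BoolClause(inner, b) => shares(inner, share * b / total).toList
    }.toMap
  }
}
```

Under that assumption, the unboosted query gives title and author one third each and the two translator clauses one sixth each. With a boost of 2 on title, the shares become 0.5 for title, 0.25 for author and 0.125 for each translator clause.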

As these examples show, boolQuery is a typical multi-field, multi-condition query, and the user must state explicitly which condition applies to which field. But people tend to express conditions for several fields in a single phrase, or don't want to think about fields at all and expect one phrase to return what they want. In that case boolQuery is not a good fit.

First, we can try matching one combined query string, such as "Peace and war Tostei", against multiple fields. There are three options:

1. Best fields: the query is matched against each field separately, producing one score per field, and the document takes only the best of those scores

2. Most fields: this one is somewhat more involved. At index time the same text is indexed into several fields that are analyzed (tokenized) differently; at query time the scores of all matching fields are combined, so documents that match in the most fields rank highest

3. Cross fields: treat all the fields involved as one large field, and match the query terms against that combined field. This option fits our requirement best
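
The difference between the three options can be sketched in plain Scala. This is a toy model under stated assumptions, not how Elasticsearch computes relevance: a field's score is simply the number of distinct query terms it contains, and the names MultiMatchTypes, fieldScore, bestFields, mostFields and crossFields are hypothetical.

```scala
object MultiMatchTypes {
  type Doc = Map[String, String]

  // Lowercase and split on whitespace; a stand-in for a real analyzer.
  private def tokens(text: String): List[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toList

  // Toy per-field score: how many distinct query terms occur in the field.
  def fieldScore(doc: Doc, field: String, query: String): Int = {
    val fieldTerms = tokens(doc.getOrElse(field, "")).toSet
    tokens(query).distinct.count(fieldTerms.contains)
  }

  // best_fields: keep only the highest per-field score.
  def bestFields(doc: Doc, fields: List[String], query: String): Int =
    fields.map(fieldScore(doc, _, query)).foldLeft(0)((a, b) => math.max(a, b))

  // most_fields: add up the scores of all fields.
  def mostFields(doc: Doc, fields: List[String], query: String): Int =
    fields.map(fieldScore(doc, _, query)).sum

  // cross_fields: pool the fields into one large field, then score that field.
  def crossFields(doc: Doc, fields: List[String], query: String): Int = {
    val combined = fields.map(doc.getOrElse(_, "")).mkString(" ")
    fieldScore(Map("all" -> combined), "all", query)
  }
}
```

For a document with title "Peace and war", author "Tostei" and publisher "Peace Press", the query "Peace and war Tostei" scores 3 with bestFields (the title alone), 5 with mostFields (3 + 1 + 1) and 4 with crossFields (all four terms found in the pooled field).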

Consider a concrete scenario: someone looking for a book on a website may type the title, the author and the publisher into a single input box, so one query string can contain information about all three. The first version of the search request is then:

GET /books/_search
{
   "query": {
     "multi_match": {
       "query": "Peace and war",
       "type": "cross_fields", 
       "fields": ["title","author","publisher"]
     }
   }
}

In principle, the title should carry more weight than the author and the publisher, so we boost the title:

GET /books/_search
{
   "query": {
     "multi_match": {
       "query": "Peace and war",
       "type": "cross_fields", 
       "fields": ["title^2","author","publisher"]
     }
   }
}

To filter more precisely, require every term to match by setting the operator to and:

GET /books/_search
{
   "query": {
     "multi_match": {
       "query": "Peace and war",
       "type": "cross_fields", 
       "fields": ["title","author","publisher"],
       "operator": "and"
     }
   }
}
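
Why and works well with cross_fields in particular can be sketched as follows. With best_fields or most_fields, the and operator is applied per field, so a single field has to contain every term; with cross_fields, each term only has to appear in at least one of the fields. The object and function names below are hypothetical, and the whitespace tokenizer is a stand-in for a real analyzer.

```scala
object AndOperator {
  type Doc = Map[String, String]

  // Lowercase and split on whitespace; a stand-in for a real analyzer.
  private def tokens(text: String): Set[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet

  // Field-centric "and" (best_fields / most_fields):
  // some single field must contain every query term.
  def fieldCentricAnd(doc: Doc, fields: List[String], query: String): Boolean = {
    val terms = tokens(query)
    fields.exists(f => terms.subsetOf(tokens(doc.getOrElse(f, ""))))
  }

  // Term-centric "and" (cross_fields):
  // every query term must appear in at least one of the fields.
  def termCentricAnd(doc: Doc, fields: List[String], query: String): Boolean = {
    val terms = tokens(query)
    terms.forall(t => fields.exists(f => tokens(doc.getOrElse(f, "")).contains(t)))
  }
}
```

For a book with title "Peace and war" and author "Tostei", the query "Peace and war Tostei" fails the field-centric check, because no single field holds all four terms, but passes the term-centric one.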

With and, the result set shrinks considerably; the user can drop some terms to widen it again. We can also search the content of the book:

GET /books/_search
{
   "query": {
     "multi_match": {
       "query": "Peace and war",
       "type": "cross_fields", 
       "fields": ["title^3","author^2","publisher^2","toc","intro"],
       "operator": "and"
     }
   }
}

The toc (table of contents) and intro fields are added, with the lowest weight.

The elastic4s equivalent is as follows:
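
The field^boost notation can be read as follows. This is a small illustrative sketch: the names FieldBoost, BoostedField, parse and relativeWeights are hypothetical, a field without a caret is given the default boost of 1, and the relative weights only hold if every field would otherwise score equally, which real scoring does not guarantee. Malformed specs such as "title^" are not handled.

```scala
object FieldBoost {
  final case class BoostedField(name: String, boost: Double)

  // Split "name^boost" at the caret; a spec without a caret gets boost 1.0.
  def parse(spec: String): BoostedField =
    spec.split("\\^", 2) match {
      case Array(name, boost) => BoostedField(name, boost.toDouble)
      case _                  => BoostedField(spec, 1.0)
    }

  // Each field's share of the total boost (ASSUMPTION: equal raw scores).
  def relativeWeights(specs: List[String]): Map[String, Double] = {
    val parsed = specs.map(parse)
    val total  = parsed.map(_.boost).sum
    parsed.map(f => f.name -> f.boost / total).toMap
  }
}
```

For the field list above the boosts add up to 9, so title gets 3/9 of the weight, author and publisher 2/9 each, and toc and intro 1/9 each.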

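  // DSL import for elastic4s 7.x (in 6.x the DSL lives under com.sksamuel.elastic4s.http)
  import com.sksamuel.elastic4s.ElasticDsl._

  val qMultiMatch = search("books").query(
    multiMatchQuery("Peace and war")
      .matchType("cross_fields")  // treat the listed fields as one combined field
      .operator("and")            // every term must match
      .fields(
        "title^3",
        "author^2",
        "publisher^2",
        "toc",
        "intro"
      )
  ).sourceInclude("ISBN","title","publisher","price","author")  // return only these _source fields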

Posted by xgab on Sat, 09 May 2020 00:45:12 -0700