How does Elasticsearch handle data that has an association?

Keywords: Hadoop ElasticSearch Database Fragment

Three main paradigms of relational databases

What is a paradigm? A paradigm is the rule of data modeling.

First paradigm: Ensure that each column maintains atomicity.
All fields in a database table are indivisible atomic values.
Second paradigm: Ensure that each column in the table is related to the primary key.
A database table can only hold one kind of data, not many kinds of data in the same database table. For example, information about orders will design three tables: order table, order item table, commodity table.
Third, make sure that each column is directly related to the primary key, not indirectly.
For example, an order form only needs to save the userId, not the entire user information.

The three main paradigms of relational databases simplify the write operation, but the read operation performance is not high (join operation is very performance-intensive), and the scalability is poor. However, the anti-normalized design keeps redundant data in the document, does not need to process join operation, and the data read performance is good, but the anti-normalized design is not suitable for scenarios where data is frequently modified.

Elasticsearch handles data with associated relationships

Elasticsearch uses a non-relational data storage engine, which is an anti-normalized design. How does Elasticsearch deal with data that is related? There are three ways, three data types.

Object Type
Nested Type
Join Type (Join)

Object Type

Use Object data types to store movie and actor information in a doc.

(1) Define Mapping

PUT /my_movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "actors": {
        "properties": {
          "first_name": {
            "type": "keyword"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

(2) Add data

PUT /my_movies/_doc/1
{
  "title": "Speed",
  "actors": [
    {
      "first_name": "Keanu",
      "last_name": "Reeves"
    },
    {
      "first_name": "Dennis",
      "last_name": "Hopper"
    }
  ]
}

(3) Search

GET /my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "actors.first_name": "Keanu"
          }
        },
        {
          "match": {
            "actors.last_name": "Hopper"
          }
        }
      ]
    }
  }
}

Result:

"hits" : [
  {
    "_index" : "my_movies",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.723315,
    "_source" : {
      "title" : "Speed",
      "actors" : [
        {
          "first_name" : "Keanu",
          "last_name" : "Reeves"
        },
        {
          "first_name" : "Dennis",
          "last_name" : "Hopper"
        }
      ]
    }
  }
]

We want the search result to be empty, but Elasticsearch returned a result. Why? This is because the object array is processed into a flat key-value pair structure:

"title":"Speed"
"actors.first_name":["Keanu","Dennis"]
"actors.last_name":["Reeves","Hopper"]

So we can't return what we want when searching. That is, the object type is not appropriate for dealing with associations.

Nested Type

From the example above, we know that object arrays are not independent when they are indexed upside down, which ultimately results in inaccurate results. Nested data types indexed object arrays indexed by each object indexed by nested query to get the desired results.

(1) Define Maping

PUT /my_movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "actors": {
        "type": "nested",
        "properties": {
          "first_name": {
            "type": "keyword"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

(2) Add data

PUT /my_movies/_doc/1
{
  "title": "Speed",
  "actors": [
    {
      "first_name": "Keanu",
      "last_name": "Reeves"
    },
    {
      "first_name": "Dennis",
      "last_name": "Hopper"
    }
  ]
}

(3) Search

GET /my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "actors",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "actors.first_name": "Keanu"
                    }
                  },
                  {
                    "match": {
                      "actors.last_name": "Hopper"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

Result:

"hits" : {
	"total" : {
	  "value" : 0,
	  "relation" : "eq"
	},
	"max_score" : null,
	"hits" : [ ]
}

Join Type (Join)

Nested types have a limitation in dealing with association relationships, that is, each update requires a re-index of the entire object, including the root and nested objects.

Elasticsearch provides an implementation of Join in a relational database similar to Join data types. Join data types define parent-child relationships between documents, separating two objects.

Parent and child documents are two separate documents.
Updating parent documents does not require reindexing child documents.
Subdocuments are added, updated or deleted without affecting parent and other subdocuments.

Let's take a look at an example of blogs and comments.

(1) Define Mapping

PUT /my_blogs
{
  "settings": {
    "number_of_shards": 2
  }, 
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword"
      },
      "content": {
        "type": "text"
      },
      "comment": {
        "type": "text"
      },
      "username": {
        "type": "keyword"
      },
      "blog_comments_relation" : {
        "type": "join",
        "relations": {
          "blog": "comment"
        }
      }
    }
  }
}

Notice here that the number of main fragments is defined as 2, and there is a parent-child relationship between blog and comment.

(2) Add data

a. Add blog data

PUT /my_blogs/_doc/blog1
{
  "title": "Learning Elasticsearch",
  "content": "learning ELK @ tyshawn",
  "blog_comments_relation": {
    "name": "blog"
  }
}

PUT /my_blogs/_doc/blog2
{
  "title": "Learning Hadoop",
  "content": "learning Hadoop @ tyshawn",
  "blog_comments_relation": {
    "name": "blog"
  }
}

blog1 and blog2 are _ids, note that _ids are not necessarily numbers.

b. Add commentary data

PUT /my_blogs/_doc/comment1?routing=blog1
{
  "comment": "I am learning ELK",
  "username": "Jack",
  "blog_comments_relation": {
    "name": "comment",
    "parent": "blog1"
  }
}

PUT /my_blogs/_doc/comment2?routing=blog2
{
  "comment": "I like Hadoop!!!!!",
  "username": "Jack",
  "blog_comments_relation": {
    "name": "comment",
    "parent": "blog2"
  }
}

Routing is specified when adding comments to ensure that both parent and child documents are indexed into the same fragment. The purpose is to ensure the performance of join queries.

(3) Query

Join type specific queries:

parent_id
Returns all related subdocuments by querying the parent document id.
has_child
Query the sub-documents and return the parent document with the related sub-documents. The parent and child documents are on the same slice, so Join is efficient.
has_parent
Query the parent document and return all related subdocuments.

a. parent_id

GET /my_blogs/_search
{
  "query": {
    "parent_id": {
      "type": "comment",
      "id": "blog2"
    }
  }
}

Result:

"hits" : [
  {
    "_index" : "my_blogs",
    "_type" : "_doc",
    "_id" : "comment2",
    "_score" : 0.6931472,
    "_routing" : "blog2",
    "_source" : {
      "comment" : "I like Hadoop!!!!!",
      "username" : "Jack",
      "blog_comments_relation" : {
        "name" : "comment",
        "parent" : "blog2"
      }
    }
  }
]

b. has_child

GET /my_blogs/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": {
        "match": {
          "username": "Jack"
        }
      }
    }
  }
}

Result:

"hits" : [
  {
    "_index" : "my_blogs",
    "_type" : "_doc",
    "_id" : "blog1",
    "_score" : 1.0,
    "_source" : {
      "title" : "Learning Elasticsearch",
      "content" : "learning ELK @ tyshawn",
      "blog_comments_relation" : {
        "name" : "blog"
      }
    }
  },
  {
    "_index" : "my_blogs",
    "_type" : "_doc",
    "_id" : "blog2",
    "_score" : 1.0,
    "_source" : {
      "title" : "Learning Hadoop",
      "content" : "learning Hadoop @ tyshawn",
      "blog_comments_relation" : {
        "name" : "blog"
      }
    }
  }
]

c. has_parent

GET /my_blogs/_search
{
  "query": {
    "has_parent": {
      "parent_type": "blog",
      "query": {
        "match": {
          "title": "Learning Hadoop"
        }
      }
    }
  }
}

Result:

"hits" : [
  {
    "_index" : "my_blogs",
    "_type" : "_doc",
    "_id" : "comment2",
    "_score" : 1.0,
    "_routing" : "blog2",
    "_source" : {
      "comment" : "I like Hadoop!!!!!",
      "username" : "Jack",
      "blog_comments_relation" : {
        "name" : "comment",
        "parent" : "blog2"
      }
    }
  }
]

(4) Update subdocuments

Updating a child document does not affect the parent document.

POST /my_blogs/_update/comment2?routing=blog2
{
  "doc": {
    "comment": "Hello Hadoop??"
  }
}

Query by id and routing

GET /my_blogs/_doc/comment2?routing=blog2

Result:

{
  "_index" : "my_blogs",
  "_type" : "_doc",
  "_id" : "comment2",
  "_version" : 2,
  "_seq_no" : 4,
  "_primary_term" : 1,
  "_routing" : "blog2",
  "found" : true,
  "_source" : {
    "comment" : "Hello Hadoop??",
    "username" : "Jack",
    "blog_comments_relation" : {
      "name" : "comment",
      "parent" : "blog2"
    }
  }
}

Nested versus Join

Object data types are not suitable for dealing with data with associations, so what scenarios do Nested and Join types apply to? Let's look at the comparison.

Contrast	Nested	Join
Advantage	Document storage, read performance	Parent-child documents can be updated independently
shortcoming	When updating nested subdocuments, the entire document needs to be updated	Require additional memory maintenance relationship, read performance is relatively poor
Scenarios applicable	Subdocuments are updated occasionally, mainly by queries	Subdocuments are frequently updated

Other Processing

In practice, we can also use Nested and Join types to process data with related relationships. We can directly establish a one-to-one relationship between database tables and ES indexes, and then process the related relationships on the application side after querying the data through ES. Or we can merge the related data tables to create an ES index, which is the simplest way to do so.

Coconut Tyshawn

324 original articles were published. 572 were approved. 560,000 visits+

His message board follow

Posted by brmcdani on Tue, 14 Jan 2020 16:43:11 -0800

Programmer Group

How does Elasticsearch handle data that has an association?

Three main paradigms of relational databases

Elasticsearch handles data with associated relationships

Hot Keywords