Three main paradigms of relational databases
What is a paradigm? A paradigm is the rule of data modeling.
- First paradigm: Ensure that each column maintains atomicity.
All fields in a database table are indivisible atomic values. - Second paradigm: Ensure that each column in the table is related to the primary key.
A database table can only hold one kind of data, not many kinds of data in the same database table. For example, information about orders will design three tables: order table, order item table, commodity table. - Third, make sure that each column is directly related to the primary key, not indirectly.
For example, an order form only needs to save the userId, not the entire user information.
The three main paradigms of relational databases simplify the write operation, but the read operation performance is not high (join operation is very performance-intensive), and the scalability is poor. However, the anti-normalized design keeps redundant data in the document, does not need to process join operation, and the data read performance is good, but the anti-normalized design is not suitable for scenarios where data is frequently modified.
Elasticsearch handles data with associated relationships
Elasticsearch uses a non-relational data storage engine, which is an anti-normalized design. How does Elasticsearch deal with data that is related? There are three ways, three data types.
- Object Type
- Nested Type
- Join Type (Join)
Object Type
Use Object data types to store movie and actor information in a doc.
(1) Define Mapping
PUT /my_movies { "mappings": { "properties": { "title": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "actors": { "properties": { "first_name": { "type": "keyword" }, "last_name": { "type": "keyword" } } } } } }
(2) Add data
PUT /my_movies/_doc/1 { "title": "Speed", "actors": [ { "first_name": "Keanu", "last_name": "Reeves" }, { "first_name": "Dennis", "last_name": "Hopper" } ] }
(3) Search
GET /my_movies/_search { "query": { "bool": { "must": [ { "match": { "actors.first_name": "Keanu" } }, { "match": { "actors.last_name": "Hopper" } } ] } } }
Result:
"hits" : [ { "_index" : "my_movies", "_type" : "_doc", "_id" : "1", "_score" : 0.723315, "_source" : { "title" : "Speed", "actors" : [ { "first_name" : "Keanu", "last_name" : "Reeves" }, { "first_name" : "Dennis", "last_name" : "Hopper" } ] } } ]
We want the search result to be empty, but Elasticsearch returned a result. Why? This is because the object array is processed into a flat key-value pair structure:
"title":"Speed" "actors.first_name":["Keanu","Dennis"] "actors.last_name":["Reeves","Hopper"]
So we can't return what we want when searching. That is, the object type is not appropriate for dealing with associations.
Nested Type
From the example above, we know that object arrays are not independent when they are indexed upside down, which ultimately results in inaccurate results. Nested data types indexed object arrays indexed by each object indexed by nested query to get the desired results.
(1) Define Maping
PUT /my_movies { "mappings": { "properties": { "title": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "actors": { "type": "nested", "properties": { "first_name": { "type": "keyword" }, "last_name": { "type": "keyword" } } } } } }
(2) Add data
PUT /my_movies/_doc/1 { "title": "Speed", "actors": [ { "first_name": "Keanu", "last_name": "Reeves" }, { "first_name": "Dennis", "last_name": "Hopper" } ] }
(3) Search
GET /my_movies/_search { "query": { "bool": { "must": [ { "nested": { "path": "actors", "query": { "bool": { "must": [ { "match": { "actors.first_name": "Keanu" } }, { "match": { "actors.last_name": "Hopper" } } ] } } } } ] } } }
Result:
"hits" : { "total" : { "value" : 0, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }
Join Type (Join)
Nested types have a limitation in dealing with association relationships, that is, each update requires a re-index of the entire object, including the root and nested objects.
Elasticsearch provides an implementation of Join in a relational database similar to Join data types. Join data types define parent-child relationships between documents, separating two objects.
- Parent and child documents are two separate documents.
- Updating parent documents does not require reindexing child documents.
- Subdocuments are added, updated or deleted without affecting parent and other subdocuments.
Let's take a look at an example of blogs and comments.
(1) Define Mapping
PUT /my_blogs { "settings": { "number_of_shards": 2 }, "mappings": { "properties": { "title": { "type": "keyword" }, "content": { "type": "text" }, "comment": { "type": "text" }, "username": { "type": "keyword" }, "blog_comments_relation" : { "type": "join", "relations": { "blog": "comment" } } } } }
Notice here that the number of main fragments is defined as 2, and there is a parent-child relationship between blog and comment.
(2) Add data
a. Add blog data
PUT /my_blogs/_doc/blog1 { "title": "Learning Elasticsearch", "content": "learning ELK @ tyshawn", "blog_comments_relation": { "name": "blog" } } PUT /my_blogs/_doc/blog2 { "title": "Learning Hadoop", "content": "learning Hadoop @ tyshawn", "blog_comments_relation": { "name": "blog" } }
blog1 and blog2 are _ids, note that _ids are not necessarily numbers.
b. Add commentary data
PUT /my_blogs/_doc/comment1?routing=blog1 { "comment": "I am learning ELK", "username": "Jack", "blog_comments_relation": { "name": "comment", "parent": "blog1" } } PUT /my_blogs/_doc/comment2?routing=blog2 { "comment": "I like Hadoop!!!!!", "username": "Jack", "blog_comments_relation": { "name": "comment", "parent": "blog2" } }
Routing is specified when adding comments to ensure that both parent and child documents are indexed into the same fragment. The purpose is to ensure the performance of join queries.
(3) Query
Join type specific queries:
- parent_id
Returns all related subdocuments by querying the parent document id. - has_child
Query the sub-documents and return the parent document with the related sub-documents. The parent and child documents are on the same slice, so Join is efficient. - has_parent
Query the parent document and return all related subdocuments.
a. parent_id
GET /my_blogs/_search { "query": { "parent_id": { "type": "comment", "id": "blog2" } } }
Result:
"hits" : [ { "_index" : "my_blogs", "_type" : "_doc", "_id" : "comment2", "_score" : 0.6931472, "_routing" : "blog2", "_source" : { "comment" : "I like Hadoop!!!!!", "username" : "Jack", "blog_comments_relation" : { "name" : "comment", "parent" : "blog2" } } } ]
b. has_child
GET /my_blogs/_search { "query": { "has_child": { "type": "comment", "query": { "match": { "username": "Jack" } } } } }
Result:
"hits" : [ { "_index" : "my_blogs", "_type" : "_doc", "_id" : "blog1", "_score" : 1.0, "_source" : { "title" : "Learning Elasticsearch", "content" : "learning ELK @ tyshawn", "blog_comments_relation" : { "name" : "blog" } } }, { "_index" : "my_blogs", "_type" : "_doc", "_id" : "blog2", "_score" : 1.0, "_source" : { "title" : "Learning Hadoop", "content" : "learning Hadoop @ tyshawn", "blog_comments_relation" : { "name" : "blog" } } } ]
c. has_parent
GET /my_blogs/_search { "query": { "has_parent": { "parent_type": "blog", "query": { "match": { "title": "Learning Hadoop" } } } } }
Result:
"hits" : [ { "_index" : "my_blogs", "_type" : "_doc", "_id" : "comment2", "_score" : 1.0, "_routing" : "blog2", "_source" : { "comment" : "I like Hadoop!!!!!", "username" : "Jack", "blog_comments_relation" : { "name" : "comment", "parent" : "blog2" } } } ]
(4) Update subdocuments
Updating a child document does not affect the parent document.
POST /my_blogs/_update/comment2?routing=blog2 { "doc": { "comment": "Hello Hadoop??" } }
Query by id and routing
GET /my_blogs/_doc/comment2?routing=blog2
Result:
{ "_index" : "my_blogs", "_type" : "_doc", "_id" : "comment2", "_version" : 2, "_seq_no" : 4, "_primary_term" : 1, "_routing" : "blog2", "found" : true, "_source" : { "comment" : "Hello Hadoop??", "username" : "Jack", "blog_comments_relation" : { "name" : "comment", "parent" : "blog2" } } }
Nested versus Join
Object data types are not suitable for dealing with data with associations, so what scenarios do Nested and Join types apply to? Let's look at the comparison.
Contrast | Nested | Join |
---|---|---|
Advantage | Document storage, read performance | Parent-child documents can be updated independently |
shortcoming | When updating nested subdocuments, the entire document needs to be updated | Require additional memory maintenance relationship, read performance is relatively poor |
Scenarios applicable | Subdocuments are updated occasionally, mainly by queries | Subdocuments are frequently updated |
Other Processing
In practice, we can also use Nested and Join types to process data with related relationships. We can directly establish a one-to-one relationship between database tables and ES indexes, and then process the related relationships on the application side after querying the data through ES. Or we can merge the related data tables to create an ES index, which is the simplest way to do so.