Drill down into Elasticsearch metric aggregation (1)
In this article, we focus on two types of metric aggregation: single-value aggregations and multi-value aggregations. Single-value metric aggregations mainly include average, weighted average, minimum, maximum and cardinality. Multi-value aggregations include the stats and extended stats aggregations.
1. Environment preparation
To demonstrate these metric aggregations, we need to create a sports index and store some documents. Readers can download the sample data here; for bulk insertion of documents, please refer to the preceding article. The index mapping is as follows:
PUT /sports
{
  "mappings": {
    "properties": {
      "birthdate": { "type": "date", "format": "dateOptionalTime" },
      "location": { "type": "geo_point" },
      "name": { "type": "keyword" },
      "rating": { "type": "integer" },
      "sport": { "type": "keyword" },
      "age": { "type": "integer" },
      "goals": { "type": "integer" },
      "role": { "type": "keyword" },
      "score_weight": { "type": "float" }
    }
  }
}
With the environment ready, let's start with the most commonly used single-value aggregations, beginning with the average.
2. Single-value metric aggregation
2.1. Average aggregation
The avg aggregation calculates the arithmetic mean of a numeric field across documents. As with other metric aggregations, it requires either a numeric field or a script to generate the values. This article mainly covers the first scenario; readers can learn more about the script-based approach here.
Let's calculate the average age of all athletes:
GET /sports/_search?size=0 { "aggs" : { "avg_age" : { "avg" : { "field" : "age" } } } }
We specify age as the field and avg as the aggregation type. The output is:
{ "took" : 5104, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 22, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "avg_age" : { "value" : 27.318181818181817 } } }
The aggregation result is returned inside the aggregations object: the average age of the athletes is about 27.318.
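As mentioned above, the values can also be produced by a script instead of being read from a field. That scenario is not covered further in this article, but for reference, a minimal sketch (assuming a cluster version that still accepts inline scripts in metric aggregations) might look like this:

GET /sports/_search?size=0
{
  "aggs": {
    "avg_age": {
      "avg": {
        "script": {
          "source": "doc['age'].value"
        }
      }
    }
  }
}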
Let's consider something a little more complicated and calculate the average age per sport: football, basketball, hockey and handball. This requires combining the average aggregation with a bucket (group) aggregation. The bucket aggregation groups documents by some condition, and the average is then calculated within each group.
We use the terms bucket aggregation, which creates a bucket for each distinct value. There are four sports in the sample documents, so four buckets will be generated.
GET /sports/_search?size=0 { "aggs": { "sport_type": { "terms": { "field": "sport" }, "aggs": { "avg_age": { "avg": { "field": "age" } } } } } }
We specify the sport field as the grouping condition and the age field for the average metric. The response is:
"aggregations" : { "sport_type" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "Football", "doc_count" : 9, "avg_age" : { "value" : 26.444444444444443 } }, { "key" : "Basketball", "doc_count" : 5, "avg_age" : { "value" : 28.6 } }, { "key" : "Hockey", "doc_count" : 5, "avg_age" : { "value" : 27.4 } }, { "key" : "Handball", "doc_count" : 3, "avg_age" : { "value" : 27.666666666666668 } } ] } }
As expected, four buckets are generated. Each bucket object includes the bucket name (key), the number of documents in the bucket (doc_count) and the average age of that bucket. The highest average age, 28.6, is in the basketball group.
2.2. Default value
Sometimes the target field may be missing from a document. The default behavior of metric aggregations is simply to ignore such documents, but we can change this with the missing setting, which supplies a default value for missing fields.
GET /sports/_search?size=0 { "aggs" : { "avg_grade" : { "avg" : { "field" : "grade" , "missing": 20 } } } }
When the grade field has no value, the default value of 20 is used.
2.3. Weighted average aggregation
The weighted average aggregation was introduced in version 6.4. To use it, you first need to understand the difference between a regular average and a weighted average. In an arithmetic mean, all values are weighted equally. In a weighted average, each value has its own weight, and the result is calculated as ∑(value × weight) / ∑(weight).
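For example (with purely illustrative numbers), if one player scores 30 goals with a weight of 2 and another scores 10 goals with a weight of 4, the weighted average is (30 × 2 + 10 × 4) / (2 + 4) = 100 / 6 ≈ 16.7, whereas the plain arithmetic mean of the two values is 20.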
Let's see why the sports index needs a weighted average instead of a plain arithmetic mean. First, the total number of goals scored by top scorers varies greatly between sports: on average, hockey players score more than football players, and basketball players score more than hockey players.
Second, the scoring frequency usually depends on the player's position on the field: forwards score more than midfielders, and midfielders score more than defenders. If we ignore these differences when calculating the average, the result will be biased towards high-scoring sports and high-scoring positions such as forwards.
We can address the first problem by calculating the average per sport. The second can be solved by assigning different weights to different positions: the highest weight goes to defenders, because they score least (so when they do score, it counts for more), while the lowest weight goes to forwards, because scoring is what they are expected to do. We implemented this idea in the score_weight field, which holds weights of 2 (forward), 3 (midfielder) and 4 (defender). These weights ensure that the final result reflects the average score more fairly. Relative to these weights, the plain average can be seen as a special case of the weighted average in which every value has an implicit weight of 1.
Note that these values are arbitrary and do not represent the actual scoring frequency of each position; they are set only to show how the weighted average aggregation works.
GET /sports/_search?size=0 { "aggs" : { "scoring_weighted_average": { "terms": { "field": "sport" }, "aggs": { "weighted_goals_in_sport": { "weighted_avg": { "value": { "field": "goals" }, "weight": { "field": "score_weight" } } } } } } }
The response is:
"aggregations" : { "scoring_weighted_average" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "Football", "doc_count" : 9, "weighted_goals_in_sport" : { "value" : 53.214285714285715 } }, { "key" : "Basketball", "doc_count" : 5, "weighted_goals_in_sport" : { "value" : 1147.090909090909 } }, { "key" : "Hockey", "doc_count" : 5, "weighted_goals_in_sport" : { "value" : 134.30769230769232 } }, { "key" : "Handball", "doc_count" : 3, "weighted_goals_in_sport" : { "value" : 212.77777777777777 } } ] } }
Let's compare this with the plain average:
GET /sports/_search?size=0 { "aggs": { "sports":{ "terms" : { "field" : "sport" }, "aggs": { "avg_goals":{ "avg": {"field":"goals"} } } } } }
The response is:
"aggregations" : { "sports" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "Football", "doc_count" : 9, "avg_goals" : { "value" : 54.888888888888886 } }, { "key" : "Basketball", "doc_count" : 5, "avg_goals" : { "value" : 1177.0 } }, { "key" : "Hockey", "doc_count" : 5, "avg_goals" : { "value" : 139.2 } }, { "key" : "Handball", "doc_count" : 3, "avg_goals" : { "value" : 245.33333333333334 } } ] } }
It can be seen that for handball (245.3 vs. 212.8), basketball (1177 vs. 1147.1), hockey (139.2 vs. 134.3) and football (54.9 vs. 53.2), the plain average is noticeably higher than the weighted average. If the weights capture a real pattern in the underlying values, the weighted average is usually more accurate than the plain average.
2.4. Cardinality aggregation
The cardinality aggregation counts the distinct values of a specific field across documents. Let's apply it to the sport field:
GET /sports/_search?size=0 { "aggs": { "sports":{ "cardinality" : { "field" : "sport" } } } }
The response is:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 22, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "sports" : { "value" : 4 } } }
Computing the cardinality of the sport field takes little time and little memory, because there are only four sports in our index. However, when a field has many unique values, the cardinality aggregation can consume considerably more memory. For example, computing the cardinality of the age field over our 22 documents already uses more resources.
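A query along these lines, with the aggregation again named sports but targeting the age field, produces the response shown below:

GET /sports/_search?size=0
{
  "aggs": {
    "sports": {
      "cardinality": { "field": "age" }
    }
  }
}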
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 22, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "sports" : { "value" : 16 } } }
If the index holds thousands of documents, the cardinality aggregation can consume a lot of memory. Computing an exact cardinality requires loading all values into a hash set and returning its size. This approach does not scale well on high-cardinality sets, because it needs more memory and causes high latency in a distributed cluster.
How does Elasticsearch solve this problem? Under the hood, Elasticsearch computes cardinality aggregations with the HyperLogLog++ algorithm, which has the following characteristics:
- Configurable precision, which determines how much memory to trade for accuracy;
- Excellent accuracy on low-cardinality sets;
- Fixed memory usage: no matter how many documents are in the index, memory usage depends only on the configured precision.
In other words, for a low-cardinality field like the one in the example above, the result is exact. If the cardinality of the data set is very high, you can set precision_threshold to trade memory for accuracy. This setting defines the count below which results are expected to be close to exact; counts above it may become less accurate. The maximum value is 40000.
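As a sketch (the field and threshold value here are purely illustrative), the threshold is set directly on the cardinality aggregation:

GET /sports/_search?size=0
{
  "aggs": {
    "age_count": {
      "cardinality": {
        "field": "age",
        "precision_threshold": 100
      }
    }
  }
}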
2.5. Minimum and maximum aggregation
The min and max aggregations are simple single-value aggregations that compute the minimum and maximum values of a numeric field across documents.
GET /sports/_search?size=0 { "aggs": { "max_age":{ "max" : { "field" : "age" } } } }
The response shows that the maximum age is 41:
{ "took" : 51, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 22, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "max_age" : { "value" : 41.0 } } }
Minimum age aggregation:
GET /sports/_search?size=0 { "aggs": { "min_age":{ "min": { "field" : "age" } } } }
The response shows a minimum age of 18:
{ "took" : 41, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 22, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "min_age" : { "value" : 18.0 } } }
As in the earlier terms example, we can also get the maximum and minimum ages for each sport:
GET /sports/_search?size=0 { "aggs": { "sports":{ "terms": {"field":"sport"}, "aggs": { "max_age":{ "max": {"field":"age"} }, "min_age":{ "min": {"field":"age"} } } } } }
The response is as follows:
"aggregations" : { "sports" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "Football", "doc_count" : 9, "max_age" : { "value" : 35.0 }, "min_age" : { "value" : 19.0 } }, { "key" : "Basketball", "doc_count" : 5, "max_age" : { "value" : 36.0 }, "min_age" : { "value" : 18.0 } }, { "key" : "Hockey", "doc_count" : 5, "max_age" : { "value" : 41.0 }, "min_age" : { "value" : 18.0 } }, { "key" : "Handball", "doc_count" : 3, "max_age" : { "value" : 29.0 }, "min_age" : { "value" : 25.0 } } ] } }
Both the maximum and minimum values are returned per bucket. Now let's look at multi-value metric aggregations.
3. Multi-value metric aggregation
The previous section covered single-value aggregations. Elasticsearch also provides multi-value aggregations, the stats and extended stats aggregations, which operate on numeric fields and return several statistical measures in a single object, such as minimum, maximum, average, sum, count, standard deviation, variance and sum of squares. The extended stats aggregation is a convenient way to get all of these measures at once.
GET /sports/_search?size=0 { "aggs": { "age_stats":{ "extended_stats": {"field":"age"} } } }
The above aggregation computes the age statistics for all documents. The extended_stats aggregation works on numeric fields, and the response is as follows:
"aggregations" : { "age_stats" : { "count" : 22, "min" : 18.0, "max" : 41.0, "avg" : 27.318181818181817, "sum" : 601.0, "sum_of_squares" : 17181.0, "variance" : 34.67148760330581, "std_deviation" : 5.888249961007584, "std_deviation_bounds" : { "upper" : 39.09468174019698, "lower" : 15.541681896166649 } } }
The most interesting part of the extended stats aggregation is the standard deviation, a key statistical indicator that measures how spread out a set of values is. A low standard deviation means the data is close to the mean, while a high standard deviation means the data is spread over a wider range.
In addition to the plain standard deviation, the extended stats aggregation also returns an object named std_deviation_bounds, which gives the interval of plus or minus two standard deviations around the mean (here, 27.318 ± 2 × 5.888, i.e. roughly 15.54 to 39.09). This metric is useful for visualizing the spread of the data. If you want different bounds, for example three standard deviations, you can set the sigma parameter in the request:
GET /sports/_search?size=0 { "aggs": { "age_stats":{ "extended_stats": { "field":"age", "sigma":3 } } } }
The sigma parameter controls how many standard deviations are added to or subtracted from the mean. Note that for the standard deviation bounds to be meaningful, the data should be approximately normally distributed.
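For comparison, if you only need the basic measures (count, min, max, avg, sum) without the standard deviation fields, the plain stats aggregation is sufficient; a minimal sketch on the same field:

GET /sports/_search?size=0
{
  "aggs": {
    "age_stats": {
      "stats": { "field": "age" }
    }
  }
}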
4. Summary
This article has covered several Elasticsearch metric aggregations. Used on their own, metric aggregations reveal many useful insights from the data; combined with bucket aggregations, they let you measure each category of data in greater depth.
Next time, we will discuss other metric aggregations, such as geo bounds, geo centroid, percentiles and percentile ranks. Stay tuned.