Elasticsearch aggregations: A beginner's guide
Created just over 10 years ago, Elasticsearch is today’s most popular enterprise search engine and one of the 10 most popular database management systems.
Whether you are a software developer or a website owner looking for a great user experience and the latest trends in search, this article will be of great help. Here we will talk about what Elasticsearch and Elasticsearch aggregations are and how you can use them to improve your business and user experience.
What is Elasticsearch?
Elasticsearch is a distributed, free, and open-source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. It’s built on Apache Lucene and developed in Java. Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack (Elasticsearch, Kibana, Beats, and Logstash), a set of free and open-source tools for data ingestion, enrichment, storage, analysis, and visualization.
Elasticsearch supports a variety of languages and official clients are available for:
- Java
- JavaScript (Node.js)
- Go
- .NET (C#)
- PHP
- Perl
- Python
- Ruby
What is Elasticsearch used for?
Elasticsearch allows you to store, search, and analyze huge volumes of data quickly and in near real time, returning answers in milliseconds. Moreover, the speed and scalability of Elasticsearch and its ability to index many types of content mean that it can be used for multiple use cases:
- Application search - For applications that rely heavily on a search platform for the access, retrieval, and reporting of data.
- Website search - Websites that store a lot of content find Elasticsearch an especially useful tool for effective and accurate searching. It’s no surprise that Elasticsearch is increasingly gaining ground in the site search domain.
- Logging and log analytics - Elasticsearch is commonly used for ingesting and analysing log data in a scalable manner in near-real-time. It also provides important operational insights on log metrics to drive actions.
- Infrastructure metrics and performance monitoring - Many companies use the Elastic stack to analyse various metrics. This may involve gathering data across several performance parameters that vary by use case.
- Security analytics - Another major application of Elasticsearch is security analysis. Access logs and similar logs concerning system security can be analysed with the ELK stack, providing a more complete picture of what’s going on across your systems in real-time.
- Business analytics - Many of the built-in features available within the Elastic stack make it a good option as a business analytics tool. However, there is a steep learning curve for implementing this product in most organisations. This is especially true in cases where companies have multiple data sources besides Elasticsearch since Kibana only works with Elasticsearch data.
- Enterprise search - Elasticsearch allows enterprise-wide search that includes document search, e-commerce product search, blog search, people search, and any form of search you can think of. In fact, it has steadily penetrated and replaced the search solutions of many of the popular websites we use daily. From a more enterprise-specific perspective, Elasticsearch is used to great success in company intranets. In this article, we will focus further on enterprise search.
Modern search interfaces are generally expected to have some sort of faceted navigation, a place where users can get a quick understanding of the distribution of the search results. For example, how many books are of a particular author, in a certain price range, or with a certain rating? These are implemented using aggregations in Elasticsearch, and they come in many forms. You can aggregate terms, numerical ranges, date ranges, geo distance, and more.
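To make the idea concrete, here is a minimal sketch of how such facets could be requested in a single query. It assumes a hypothetical books index with author.keyword, price, and rating fields; adjust the names to your own mapping:
# Hypothetical "books" index; the field names below are assumptions
GET /books/_search
{
  "size": 0,
  "aggs": {
    "by_author": {
      "terms": { "field": "author.keyword" }
    },
    "by_price": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 20 },
          { "from": 20, "to": 50 },
          { "from": 50 }
        ]
      }
    },
    "by_rating": {
      "terms": { "field": "rating" }
    }
  }
}
Each facet in the navigation then simply renders the buckets returned for its aggregation, together with their document counts.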
Benefits of Elasticsearch
- Elasticsearch is fast. Elasticsearch is built on top of Lucene and excels at full-text search. It is also a near real-time search platform, meaning the latency from the time a document is indexed until it becomes searchable is very short — typically one second. As a result, Elasticsearch is well suited for time-sensitive use cases such as security analytics and infrastructure monitoring.
- Elasticsearch is distributed. The documents stored in Elasticsearch are distributed across different containers known as shards, which are duplicated to provide redundant copies of the data in case of hardware failure. The distributed nature of Elasticsearch allows it to scale out to hundreds (or even thousands) of servers and handle petabytes of data.
- Elasticsearch comes with a wide set of features. In addition to its speed, scalability, and resiliency, Elasticsearch has a number of powerful built-in features that make storing and searching data even more efficient.
- Elasticsearch allows data visualization. Kibana lets you search, view, and visualize data indexed in Elasticsearch and analyze it through bar charts, pie charts, tables, histograms, and maps. Kibana’s dashboard view combines these visual elements, which can then be shared via a browser to provide real-time analytical views into large data volumes.
Elasticsearch aggregations
Elasticsearch aggregations provide you with the ability to group and perform calculations and statistics (such as sums and averages) on your data by using a simple search query.
This article will describe the different types of aggregations and how to run them. Additionally, it will provide a few practical examples of aggregations, illustrating how useful they can be.
Here are the examples:
- You’re running an online clothing business and want to know the average price of all the products in your catalogue. The Average aggregation will calculate this number for you.
- You want to check how many products fall within the “up to $100” price range and the “$100 to $200” price range. In this case, you can use the Range aggregation (a sketch of both aggregations follows this list).
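As a rough sketch of what those two queries could look like, here is one request that combines an avg and a range aggregation. It assumes the Kibana eCommerce sample data set introduced later in this article and its products.price field:
# Assumes the Kibana eCommerce sample data set and its products.price field
GET /kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "average_product_price": {
      "avg": { "field": "products.price" }
    },
    "products_by_price_range": {
      "range": {
        "field": "products.price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 200 }
        ]
      }
    }
  }
}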
To make things clearer, I will give you a real-life example from a project we developed at Accedia: an implementation of Elasticsearch aggregations for a client offering an online collaboration environment that forms the basis of a social intranet.
In that application, users can search for a colleague using various facets/filters and immediately see how many colleagues match each filter. Every selection refreshes both the facets and the results, which makes the feature useful in a wide range of scenarios.
Getting started with Elasticsearch aggregations
In order to start using aggregations, you should have a working setup of the Elastic Stack. The Elastic Stack can be installed using a variety of methods and on a wide array of operating systems and environments. Since we can’t cover every scenario, I will give you two ways to do it: either download and install Elasticsearch and Kibana locally on Windows, or run them with Docker by following Elastic’s quick start or step-by-step guide.
You will also need some data/schema in your Elasticsearch index. In this article, I am using sample eCommerce order data and sample web logs provided by Kibana. To get this sample data, visit your Kibana homepage and click on “Load a data set and a Kibana dashboard.”
The aggregations syntax
It is important to be familiar with the basic building blocks used to define an aggregation. The following syntax will help you to understand how it works:
"aggs": {
  "name_of_aggregation": {
    "type_of_aggregation": {
      "field": "document_field_name"
    }
  }
}
aggs — Indicates that you are defining an aggregation.
name_of_aggregation — The name of the aggregation, chosen by the user.
type_of_aggregation — The type of aggregation being used.
field — The keyword that specifies which document field to aggregate on.
document_field_name — The name of the document field being targeted.
The following example returns the total count of values in the “clientip” field of the index “kibana_sample_data_logs.”
The code below is executed in the Dev Tools console of Kibana.
GET /kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "ip_count": {
      "value_count": {
        "field": "clientip"
      }
    }
  }
}
Aggregation categories
Elasticsearch organizes aggregations into three categories:
Bucket aggregation - A method of grouping documents. Bucket aggregations group documents into buckets based on an existing field, customized filters, ranges, and so on.
Metric aggregation - An aggregation that calculates metrics, such as a sum or average, from field values.
Pipeline aggregation - As the name suggests, an aggregation that takes the output of other aggregations as its input (a short sketch follows this list).
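As a rough illustration of a pipeline aggregation, the sketch below nests a sum inside a terms aggregation and then uses a max_bucket pipeline to pick out the bucket with the highest total. It assumes the Kibana eCommerce sample data and its user and taxful_total_price fields:
# Assumes the Kibana eCommerce sample data set
GET /kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "spend_per_user": {
      "terms": { "field": "user" },
      "aggs": {
        "total_spend": {
          "sum": { "field": "taxful_total_price" }
        }
      }
    },
    "highest_spender": {
      "max_bucket": {
        "buckets_path": "spend_per_user>total_spend"
      }
    }
  }
}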
Key aggregation types
All of the above aggregations can be further classified. Here are the five most important types, with an example of each:
- Cardinality aggregation
- Stats aggregation
- Filter aggregation
- Terms aggregation
- Nested aggregation
Cardinality aggregation
Needing to find the number of unique values for a particular field is a common requirement. The cardinality aggregation can be used to determine the number of unique elements.
Let’s see how many unique SKUs can be found in our e-commerce data.
GET /kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "unique_skus": {
      "cardinality": {
        "field": "sku"
      }
    }
  }
}
Output
Stats aggregation
When your aggregation covers a large number of documents, you often need several statistics at once. The stats aggregation returns the min, max, sum, avg, and count of a field in a single request. Its structure is similar to that of the other aggregations.
Let’s check the stats of field “total_quantity” in our data.
GET /kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "quantity_stats": {
      "stats": {
        "field": "total_quantity"
      }
    }
  }
}
Output
Filter aggregation
As its name suggests, the filter aggregation helps you filter documents into a single bucket. Within that bucket, you can calculate metrics.
In the example below, we are filtering the documents based on the username “eddie” and calculating the average price of the products he purchased.
GET /kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "User_based_filter": {
      "filter": {
        "term": { "user": "eddie" }
      },
      "aggs": {
        "avg_price": {
          "avg": { "field": "products.price" }
        }
      }
    }
  }
}
Output
Terms aggregation
The terms aggregation generates buckets from field values. For the field you choose, it creates one bucket per unique value and places each matching document in the corresponding bucket.
In our example, we run the terms aggregation on the field “user”, which holds user names. In return, we get a bucket for each user with its document count.
GET /kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "Terms_Aggregation": {
      "terms": {
        "field": "user"
      }
    }
  }
}
Output
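By default, the terms aggregation returns the ten most frequent values. If you only need the top few buckets, you can limit and order them explicitly, as in this optional variation of the query above (same sample index):
GET /kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "Top_Users": {
      "terms": {
        "field": "user",
        "size": 5,
        "order": { "_count": "desc" }
      }
    }
  }
}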
Nested aggregation
This is one of the most important types of bucket aggregations. A nested aggregation allows you to aggregate on fields inside nested documents, that is, a field whose values are objects with their own sub-fields.
The field must be mapped as “nested” in the index mapping if you intend to apply a nested aggregation to it.
The sample eCommerce data we have used up to this point doesn’t contain a field of type “nested,” so we will create a new index with an “Employee” field mapped as “nested.”
Run the code below in DevTools to create a new index “nested_aggregation” and set the mapping as “nested” for the field “Employee.”
PUT nested_aggregation
{
  "mappings": {
    "properties": {
      "Employee": {
        "type": "nested",
        "properties": {
          "first": { "type": "text" },
          "last": { "type": "text" },
          "salary": { "type": "double" }
        }
      }
    }
  }
}
Afterwards, execute the code below in DevTools to insert some sample data into the index you have just created.
PUT nested_aggregation/_doc/1
{
  "group": "Logz",
  "Employee": [
    { "first": "Ana", "last": "Roy", "salary": "70000" },
    { "first": "Jospeh", "last": "Lein", "salary": "64000" },
    { "first": "Chris", "last": "Gayle", "salary": "82000" },
    { "first": "Brendon", "last": "Maculum", "salary": "58000" },
    { "first": "Vinod", "last": "Kambli", "salary": "63000" },
    { "first": "DJ", "last": "Bravo", "salary": "71000" },
    { "first": "Jaques", "last": "Kallis", "salary": "75000" }
  ]
}
Now the sample data is in our index “nested_aggregation.” Execute the following code to see how a nested aggregation works:
GET /nested_aggregation/_search
{
  "aggs": {
    "Nested_Aggregation": {
      "nested": {
        "path": "Employee"
      },
      "aggs": {
        "Min_Salary": {
          "min": {
            "field": "Employee.salary"
          }
        }
      }
    }
  }
}
Output
As you can see, we have successfully aggregated over the nested fields of the “Employee” field and retrieved the minimum salary.
Some final words
This article has detailed several techniques for taking advantage of aggregations. Additionally, there are other versions of aggregations that you might find useful as well. Some of these include:
- Date histogram aggregation — used with date values (a short sketch follows this list)
- Scripted metric aggregation — calculates metrics using custom scripts
- Top hits aggregation — used with the top matching documents
- Range aggregation — used with a set of range values
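For instance, a date histogram aggregation could bucket requests by day. The sketch below assumes the “timestamp” field of the kibana_sample_data_logs index used earlier:
# Assumes the Kibana sample web logs and their "timestamp" field
GET /kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "requests_per_day": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      }
    }
  }
}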
As a next step, consider immersing yourself in these aggregations to find out how they might help you meet your needs. You can also visit Elastic’s official page on aggregations.
Meanwhile, if you need advice on using Elasticsearch aggregations, let me know, and I’ll be more than happy to answer any of your questions.