Integrate. Transform. Explore. Operationalize.
Prepare and Explore Data at Scale
Empower data engineers, analysts, and scientists with one platform.

Elasticsearch is a feature-rich, open-source search engine built on top of Apache Lucene, one of the most important full-text search libraries on the market.
Elasticsearch is best known for its expansive and versatile REST API, including efficient wrappers for full-text search, sorting, and aggregation, which makes it much easier to add such capabilities to existing backends without complex re-engineering.
Since its introduction in 2010, Elasticsearch has gained significant traction in the software engineering domain, and by 2016 it had become the most popular enterprise search-engine software stack according to the DBMS knowledge base DB-Engines, surpassing the industry-standard Apache Solr (which is also built on top of Lucene).
One of the things that makes Elasticsearch so popular is the ecosystem it generated. Engineers across the world developed open-source Elasticsearch integrations and extensions, and many of these projects were absorbed by Elastic (the company behind the Elasticsearch project) as part of their stack.
Among these projects were Logstash (a data-processing pipeline, commonly used for parsing text-based files) and Kibana (a visualization layer built on top of Elasticsearch), which together led to the now widely adopted ELK (Elasticsearch, Logstash, Kibana) stack.
The ELK stack quickly gained recognition due to its impressive range of applications across both emerging and established tech domains, such as DevOps, Site Reliability Engineering, and, most recently, Data Analytics.
If you're a data scientist reading this article and Elasticsearch is part of your employer's tech stack, chances are you've had some trouble trying to use everything Elasticsearch provides for data analysis, or even for simple machine learning tasks.
Data scientists are generally not used to NoSQL database engines for common tasks, or to relying on complex REST APIs for analysis. Dealing with large amounts of data through Elasticsearch's low-level Python clients, for example, is not that intuitive either, and has a somewhat steep learning curve for someone coming from a field outside software engineering.
Although Elastic has made significant efforts to enhance the ELK stack for analytics and data science use cases, it long lacked an easy interface to the existing data science ecosystem (pandas, NumPy, scikit-learn, PyTorch, and other popular libraries).
In 2017, Elastic took its first step toward the data science field and, as an answer to the growing popularity of machine learning and predictive technologies in the software industry, released their first ML-capable X-Pack (extension pack) for the ELK stack, adding Anomaly Detection and other unsupervised ML tasks to its features. Not long after that, Regression and Classification models were also added to the set of ML tasks available in the ELK stack.
Last month, Elastic took another step toward widespread adoption in the data science industry with the release of Eland, a brand-new Python Elasticsearch client and toolkit with a powerful (and familiar) pandas-like API for analysis, ETL, and machine learning.
Eland enables data scientists to efficiently use the already robust Elasticsearch analysis and ML capabilities without requiring a deep knowledge of Elasticsearch and its many intricacies.
Features and concepts from Elasticsearch were translated into a much more recognizable setting. For instance, an Elasticsearch index, with its documents, mappings, and fields, becomes a dataframe, with rows and columns, much like we are used to seeing when using pandas.
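To make the analogy concrete, here is a minimal sketch (with made-up documents, not the demo index) of how a list of Elasticsearch documents maps onto the rows of a dataframe:

```python
import pandas as pd

# Each Elasticsearch document is a JSON object; a list of documents
# maps naturally onto dataframe rows, with the mapping's fields
# becoming columns. (Hypothetical documents, for illustration only.)
documents = [
    {"_id": "1", "customer": "alice", "total": 19.99},
    {"_id": "2", "customer": "bob", "total": 5.50},
]

# The document _id plays the role of the dataframe index.
df = pd.DataFrame.from_records(documents, index="_id")
print(df.columns.tolist())  # ['customer', 'total']
print(len(df))              # 2
```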
For this demo, make sure you have an Elasticsearch and a Kibana node running, and enable the "Sample eCommerce orders" dataset from the Kibana home page.
Let's start by importing the libraries we need and reading our data from Elasticsearch with the help of Python, pandas, and Eland.
All code snippets are available as a Jupyter notebook on GitHub, along with other, more advanced examples.
# Importing Eland
# Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
# https://eland.readthedocs.io/en/latest/
# https://github.com/elastic/eland
import eland as ed
# https://elasticsearch-dsl.readthedocs.io/en/latest/
# https://github.com/elastic/elasticsearch-dsl-py
# High level Python client for Elasticsearch
from elasticsearch_dsl import Search, Q
# https://elasticsearch-py.readthedocs.io/en/latest/
# https://github.com/elastic/elasticsearch-py
# Official Python low-level client for Elasticsearch
from elasticsearch import Elasticsearch
# Import pandas and numpy for data wrangling
import pandas as pd
import numpy as np
# Function for pretty-printing JSON responses
def json(raw):
    import json
    print(json.dumps(raw, indent=2, sort_keys=True))
# Connect to an Elasticsearch instance
# here we use the official Elastic Python client
# check it on https://github.com/elastic/elasticsearch-py
es = Elasticsearch(
    ['http://localhost:9200'],
    http_auth=("es_kbn", "changeme")
)
# print the connection object info (same as visiting http://localhost:9200)
# make sure your elasticsearch node/cluster respond to requests
json(es.info())
{
  "cluster_name": "churn",
  "cluster_uuid": "K3nB4fp_QcyjpY-e2XVUbA",
  "name": "node-01",
  "tagline": "You Know, for Search",
  "version": {
    "build_date": "2020-07-22T19:31:37.655268Z",
    "build_flavor": "default",
    "build_hash": "bbbd2282a6668869c41efc5713ad8214d44c0ad1",
    "build_snapshot": true,
    "build_type": "zip",
    "lucene_version": "8.6.0",
    "minimum_index_compatibility_version": "7.0.0",
    "minimum_wire_compatibility_version": "7.10.0",
    "number": "8.0.0-SNAPSHOT"
  }
}
Common data science use cases such as reading an entire Elasticsearch index into a pandas dataframe for Exploratory Data Analysis or training an ML model would usually require some not-so-efficient shortcuts.
# name of the index we want to query
index_name = 'kibana_sample_data_ecommerce'
# defining the search statement to get all records in an index
search = Search(using=es, index=index_name).query("match_all")
# retrieving the documents from the search
documents = [hit.to_dict() for hit in search.scan()]
# converting the list of hit dictionaries into a pandas dataframe:
df_ecommerce = pd.DataFrame.from_records(documents)
# visualizing the dataframe with the results:
df_ecommerce.head()['geoip']
0 {'country_iso_code': 'EG', 'location': {'lon':...
1 {'country_iso_code': 'AE', 'location': {'lon':...
2 {'country_iso_code': 'US', 'location': {'lon':...
3 {'country_iso_code': 'GB', 'location': {'lon':...
4 {'country_iso_code': 'EG', 'location': {'lon':...
Name: geoip, dtype: object
# retrieving a summary of the columns in the dataset:
df_ecommerce.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675 entries, 0 to 4674
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category 4675 non-null object
1 currency 4675 non-null object
2 customer_first_name 4675 non-null object
3 customer_full_name 4675 non-null object
4 customer_gender 4675 non-null object
5 customer_id 4675 non-null int64
6 customer_last_name 4675 non-null object
7 customer_phone 4675 non-null object
8 day_of_week 4675 non-null object
9 day_of_week_i 4675 non-null int64
10 email 4675 non-null object
11 manufacturer 4675 non-null object
12 order_date 4675 non-null object
13 order_id 4675 non-null int64
14 products 4675 non-null object
15 sku 4675 non-null object
16 taxful_total_price 4675 non-null float64
17 taxless_total_price 4675 non-null float64
18 total_quantity 4675 non-null int64
19 total_unique_products 4675 non-null int64
20 type 4675 non-null object
21 user 4675 non-null object
22 geoip 4675 non-null object
23 event 4675 non-null object
dtypes: float64(2), int64(5), object(17)
memory usage: 876.7+ KB
# getting descriptive statistics from the dataframe
df_ecommerce[['taxful_total_price', 'taxless_total_price', 'total_quantity', 'total_unique_products']].describe()
The procedure described above would lead us to get all the documents in the index as a list of dictionaries, only to load them into a pandas dataframe. That means having both the documents themselves and the resulting dataframe in memory at some point in the process. For Big Data applications, this procedure would not always be feasible, and exploring the dataset in the Jupyter Notebook environment could become very complicated, very fast.
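One common workaround is to stream the hits and aggregate incrementally instead of materializing every document. A minimal sketch of the idea, using a plain generator in place of a live `search.scan()`:

```python
# Simulate a scan over an index; against a live cluster this would be
# `(hit.to_dict() for hit in search.scan())`. Hypothetical values.
def scan_hits():
    for i in range(1000):
        yield {"taxful_total_price": float(i)}

# Aggregate incrementally: only one document is in memory at a time,
# so memory use stays constant regardless of index size.
count, total = 0, 0.0
for hit in scan_hits():
    count += 1
    total += hit["taxful_total_price"]

mean_price = total / count
print(count, mean_price)  # 1000 499.5
```

This avoids holding the full document list and the dataframe at once, but it still pulls every document over the network, which is exactly the cost Eland avoids by pushing the computation into the cluster.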
Eland enables us to perform operations very similar to the ones described above, without any of the friction involved in adapting them to the Elasticsearch context, while still leveraging Elasticsearch's aggregation speed and search features.
# loading the Sample eCommerce data from Kibana into an Eland dataframe:
ed_ecommerce = ed.read_es(es, index_name)
# visualizing the results:
ed_ecommerce[['customer_id', 'category', 'customer_first_name', 'customer_full_name']].head()
As an added feature that would require a bit more wrangling on the pandas side, the geoip field (a nested JSON object in the index) was seamlessly parsed into separate columns in our dataframe. We can see that by calling the .info() method on the Eland dataframe.
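In plain pandas, flattening a nested field like geoip takes an explicit pd.json_normalize call; the sketch below uses hypothetical hits shaped like the index documents. Eland performs this flattening for us automatically:

```python
import pandas as pd

# Hypothetical hits with a nested geoip object, like the index documents.
documents = [
    {"customer_id": "1", "geoip": {"city_name": "Cairo", "country_iso_code": "EG"}},
    {"customer_id": "2", "geoip": {"city_name": "Dubai", "country_iso_code": "AE"}},
]

# json_normalize flattens nested objects into dotted column names,
# matching the geoip.city_name / geoip.country_iso_code columns above.
flat = pd.json_normalize(documents)
print(flat.columns.tolist())
# ['customer_id', 'geoip.city_name', 'geoip.country_iso_code']
```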
# retrieving a summary of the columns in the dataframe:
ed_ecommerce.info()
<class 'eland.dataframe.DataFrame'>
Index: 4675 entries, Fy_X0nMBxx5L21Ced97s to XC_X0nMBxx5L21CejPB3
Data columns (total 46 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category 4675 non-null object
1 currency 4675 non-null object
2 customer_birth_date 0 non-null datetime64[ns]
3 customer_first_name 4675 non-null object
4 customer_full_name 4675 non-null object
5 customer_gender 4675 non-null object
6 customer_id 4675 non-null object
7 customer_last_name 4675 non-null object
8 customer_phone 4675 non-null object
9 day_of_week 4675 non-null object
10 day_of_week_i 4675 non-null int64
11 email 4675 non-null object
12 event.dataset 4675 non-null object
13 geoip.city_name 4094 non-null object
14 geoip.continent_name 4675 non-null object
15 geoip.country_iso_code 4675 non-null object
16 geoip.location 4675 non-null object
17 geoip.region_name 3924 non-null object
18 manufacturer 4675 non-null object
19 order_date 4675 non-null datetime64[ns]
20 order_id 4675 non-null object
21 products._id 4675 non-null object
22 products.base_price 4675 non-null float64
23 products.base_unit_price 4675 non-null float64
24 products.category 4675 non-null object
25 products.created_on 4675 non-null datetime64[ns]
26 products.discount_amount 4675 non-null float64
27 products.discount_percentage 4675 non-null float64
28 products.manufacturer 4675 non-null object
29 products.min_price 4675 non-null float64
30 products.price 4675 non-null float64
31 products.product_id 4675 non-null int64
32 products.product_name 4675 non-null object
33 products.quantity 4675 non-null int64
34 products.sku 4675 non-null object
35 products.tax_amount 4675 non-null float64
36 products.taxful_price 4675 non-null float64
37 products.taxless_price 4675 non-null float64
38 products.unit_discount_amount 4675 non-null float64
39 sku 4675 non-null object
40 taxful_total_price 4675 non-null float64
41 taxless_total_price 4675 non-null float64
42 total_quantity 4675 non-null int64
43 total_unique_products 4675 non-null int64
44 type 4675 non-null object
45 user 4675 non-null object
dtypes: datetime64[ns](3), float64(12), int64(5), object(26)
memory usage: 64.0 bytes
We can also see that memory usage dropped from roughly 877 KB for the pandas dataframe to only 64 bytes for the Eland dataframe. This is because we don't need to hold the entire dataset in memory to retrieve the information we need from the index; most of the workload stays in the Elasticsearch cluster (as aggregations or specific queries).
For such a small dataset, this is not that important, but as we scale to Gigabytes of data, the benefits of not holding everything in memory for simple computations and analysis are much more noticeable.
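Under the hood, a pandas-style operation like ed_ecommerce['taxful_total_price'].mean() is translated into an Elasticsearch aggregation rather than a client-side scan. The request body looks roughly like this (a sketch of the general shape, not Eland's exact output):

```python
# The pandas-style mean() becomes a server-side "avg" aggregation:
# Elasticsearch computes the result and returns a single number,
# so no documents ever travel to the client.
agg_body = {
    "size": 0,  # we want only the aggregation result, no hits
    "aggs": {
        "taxful_total_price_mean": {
            "avg": {"field": "taxful_total_price"}
        }
    },
}

# With the low-level client this would be sent as:
# es.search(index=index_name, body=agg_body)
print(agg_body["aggs"]["taxful_total_price_mean"])
```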
Eland abstracts many of Elasticsearch's existing APIs, so data scientists don't need to learn Elasticsearch-specific syntax. For example, it is possible to get the mapping of an index (equivalent to retrieving the dtypes attribute of a pandas DataFrame) directly from Elasticsearch, but it is not immediately obvious how to do it. With the Eland DataFrame object, we can simply retrieve the dtypes attribute as we would on a regular pandas DataFrame.
# getting the dtypes from pandas dataframe:
df_ecommerce.dtypes
category object
currency object
customer_first_name object
customer_full_name object
customer_gender object
customer_id int64
customer_last_name object
customer_phone object
day_of_week object
day_of_week_i int64
email object
manufacturer object
order_date object
order_id int64
products object
sku object
taxful_total_price float64
taxless_total_price float64
total_quantity int64
total_unique_products int64
type object
user object
geoip object
event object
dtype: object
# retrieving the data types for the index directly would require the following Elasticsearch query:
mapping = es.indices.get_mapping(index_name)
# which by itself is an abstraction of the GET request for mapping retrieval
json(mapping)
{
  "kibana_sample_data_ecommerce": {
    "mappings": {
      "properties": {
        "category": {
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "currency": {
          "type": "keyword"
        },
        "customer_birth_date": {
          "type": "date"
        },
        "customer_first_name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "customer_full_name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "customer_gender": {
          "type": "keyword"
        },
        "customer_id": {
          "type": "keyword"
        },
        "customer_last_name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "customer_phone": {
          "type": "keyword"
        },
        "day_of_week": {
          "type": "keyword"
        },
        "day_of_week_i": {
          "type": "integer"
        },
        "email": {
          "type": "keyword"
        },
        "event": {
          "properties": {
            "dataset": {
              "type": "keyword"
            }
          }
        },
        "geoip": {
          "properties": {
            "city_name": {
              "type": "keyword"
            },
            "continent_name": {
              "type": "keyword"
            },
            "country_iso_code": {
              "type": "keyword"
            },
            "location": {
              "type": "geo_point"
            },
            "region_name": {
              "type": "keyword"
            }
          }
        },
        "manufacturer": {
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "order_date": {
          "type": "date"
        },
        "order_id": {
          "type": "keyword"
        },
        "products": {
          "properties": {
            "_id": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "base_price": {
              "type": "half_float"
            },
            "base_unit_price": {
              "type": "half_float"
            },
            "category": {
              "fields": {
                "keyword": {
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "created_on": {
              "type": "date"
            },
            "discount_amount": {
              "type": "half_float"
            },
            "discount_percentage": {
              "type": "half_float"
            },
            "manufacturer": {
              "fields": {
                "keyword": {
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "min_price": {
              "type": "half_float"
            },
            "price": {
              "type": "half_float"
            },
            "product_id": {
              "type": "long"
            },
            "product_name": {
              "analyzer": "english",
              "fields": {
                "keyword": {
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "quantity": {
              "type": "integer"
            },
            "sku": {
              "type": "keyword"
            },
            "tax_amount": {
              "type": "half_float"
            },
            "taxful_price": {
              "type": "half_float"
            },
            "taxless_price": {
              "type": "half_float"
            },
            "unit_discount_amount": {
              "type": "half_float"
            }
          }
        },
        "sku": {
          "type": "keyword"
        },
        "taxful_total_price": {
          "type": "half_float"
        },
        "taxless_total_price": {
          "type": "half_float"
        },
        "total_quantity": {
          "type": "integer"
        },
        "total_unique_products": {
          "type": "integer"
        },
        "type": {
          "type": "keyword"
        },
        "user": {
          "type": "keyword"
        }
      }
    }
  }
}
With these abstractions in place, Eland allows us to use core Elasticsearch features that are not part of pandas (or at least are not as efficient), such as full-text search, Elasticsearch’s biggest use case.
# defining the full-text query we need: Retrieving records for either Elitelligence or Primemaster manufacturer
query = {
"query_string" : {
"fields" : ["manufacturer"],
"query" : "Elitelligence OR Primemaster"
}
}
# using full-text search capabilities with Eland:
text_search_df = ed_ecommerce.es_query(query)
# visualizing price of products for each manufacturer using pandas column syntax:
text_search_df[['manufacturer','products.price']]
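For comparison, the closest plain-pandas equivalent is a client-side exact match (sketched below on a small hypothetical frame standing in for df_ecommerce), which requires every row to already be in memory and offers none of Elasticsearch's analyzers or relevance scoring:

```python
import pandas as pd

# Hypothetical in-memory frame standing in for df_ecommerce.
df = pd.DataFrame({
    "manufacturer": ["Elitelligence", "Primemaster", "Oceanavigations"],
    "products.price": [11.99, 24.99, 7.99],
})

# A client-side filter: no tokenization, no scoring, and the whole
# dataset must be loaded before we can filter a single row.
mask = df["manufacturer"].isin(["Elitelligence", "Primemaster"])
print(df[mask]["manufacturer"].tolist())  # ['Elitelligence', 'Primemaster']
```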
This article only scratches the surface of the possibilities Eland opens up for data scientists and other data professionals who use Elasticsearch in day-to-day operations.
Especially in DevOps and AIOps contexts, where ML-based tools are not very mature yet, data professionals can benefit from Python’s existing machine learning ecosystem to analyze large amounts of Observability and Metrics data, which will be a topic for another article.
Eland is certainly a big step toward Elasticsearch adoption in the data science community, and I look forward to what future versions of the ELK stack will bring to the table.
This article was inspired by the work of Mateus Picanco.