Elasticsearch for Data Science just got way easier

Eland is a brand new Python package that bridges the gap between Elasticsearch and the Data Science ecosystem.

Elasticsearch is a feature-rich, open-source search engine built on top of Apache Lucene, one of the most widely used full-text search libraries.

Elasticsearch is best known for the expansive and versatile REST API experience it provides, including efficient wrappers for full-text search, sorting and aggregation tasks, making it a lot easier to implement such capabilities in existing backends without the need for complex re-engineering.

Since its introduction in 2010, Elasticsearch has gained a lot of traction in the software engineering domain, and by 2016 it had become the most popular enterprise search engine according to the DBMS knowledge base DB-Engines, surpassing the industry-standard Apache Solr (which is also built on top of Lucene).

One of the things that makes Elasticsearch so popular is the ecosystem it generated. Engineers across the world developed open-source Elasticsearch integrations and extensions, and many of these projects were absorbed by Elastic (the company behind the Elasticsearch project) as part of their stack.

Among these projects were Logstash (a data processing pipeline, commonly used for parsing text-based files) and Kibana (a visualization layer built on top of Elasticsearch), which led to the now widely adopted ELK (Elasticsearch, Logstash, Kibana) stack.

The ELK stack quickly gained popularity due to its impressive range of applications across both emerging and established tech domains, such as DevOps, Site-Reliability Engineering, and, most recently, Data Analytics.

But what about Data Science?

Chances are that if you’re a data scientist reading this article and have Elasticsearch as part of your employer’s tech stack, you might have had some problems trying to use all the features Elasticsearch provides for data analysis and even for simple machine learning tasks.

Data scientists are generally not used to working with NoSQL database engines for common tasks, or to relying on complex REST APIs for analysis. Dealing with large amounts of data using Elasticsearch’s low-level Python clients, for example, is not very intuitive either, and it has a fairly steep learning curve for someone coming from outside software engineering.

Although Elastic made significant efforts to enhance the ELK stack for Analytics and Data Science use cases, it still lacked an easy interface to the existing Data Science ecosystem (pandas, NumPy, scikit-learn, PyTorch, and other popular libraries).

In 2017, Elastic took its first step towards the data science field and, as an answer to the growing popularity of Machine Learning and predictive technologies in the software industry, released its first ML-capable X-Pack (extension pack) for the ELK stack, adding Anomaly Detection and other unsupervised ML tasks to its features. Not long after that, Regression and Classification models were also added to the set of ML tasks available in the ELK stack.

Last month, Elastic took another step towards widespread adoption of Elasticsearch in the data science industry with the release of Eland, a brand new Python Elasticsearch client and toolkit with a powerful (and familiar) pandas-like API for analysis, ETL and Machine Learning.

Eland: Elastic and Data

Eland enables data scientists to efficiently use the already robust Elasticsearch analysis and ML capabilities without requiring a deep knowledge of Elasticsearch and its many intricacies.

Features and concepts from Elasticsearch were translated into a much more recognizable setting. For instance, an Elasticsearch index, with its documents, mappings, and fields, becomes a dataframe, with rows and columns, much like we are used to seeing when using pandas.

For demo purposes, make sure you are running an Elasticsearch node and a Kibana node, then enable the “Sample eCommerce orders” data set on the Kibana home page:

Sample data, visualizations, and dashboards for tracking eCommerce orders.

Let’s start by importing the libraries we need and reading our data from Elasticsearch with the help of Python, pandas, and Eland:

All code snippets are available as a Jupyter notebook on GitHub, along with other, more advanced examples.

# Importing Eland
# Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
# https://eland.readthedocs.io/en/latest/
# https://github.com/elastic/eland
import eland as ed

# https://elasticsearch-dsl.readthedocs.io/en/latest/
# https://github.com/elastic/elasticsearch-dsl-py
# High level Python client for Elasticsearch
from elasticsearch_dsl import Search, Q

# https://elasticsearch-py.readthedocs.io/en/latest/
# https://github.com/elastic/elasticsearch-py
# Official Python low-level client for Elasticsearch
from elasticsearch import Elasticsearch

# Import pandas and numpy for data wrangling
import pandas as pd
import numpy as np

# Helper for pretty-printing JSON responses
# (the name shadows the stdlib json module, so the import is kept local to the function)
def json(raw):
    import json
    print(json.dumps(raw, indent=2, sort_keys=True))

# Connect to an Elasticsearch instance
# here we use the official Elastic Python client
# check it on https://github.com/elastic/elasticsearch-py
es = Elasticsearch(
  ['http://localhost:9200'],
  http_auth=("es_kbn", "changeme")
)
# print the cluster info (the same data returned by visiting http://localhost:9200)
# to make sure your Elasticsearch node/cluster responds to requests
json(es.info())

{
  "cluster_name": "churn",
  "cluster_uuid": "K3nB4fp_QcyjpY-e2XVUbA",
  "name": "node-01",
  "tagline": "You Know, for Search",
  "version": {
    "build_date": "2020-07-22T19:31:37.655268Z",
    "build_flavor": "default",
    "build_hash": "bbbd2282a6668869c41efc5713ad8214d44c0ad1",
    "build_snapshot": true,
    "build_type": "zip",
    "lucene_version": "8.6.0",
    "minimum_index_compatibility_version": "7.0.0",
    "minimum_wire_compatibility_version": "7.10.0",
    "number": "8.0.0-SNAPSHOT"
  }
}

Common data science use cases such as reading an entire Elasticsearch index into a pandas dataframe for Exploratory Data Analysis or training an ML model would usually require some not-so-efficient shortcuts.

# name of the index we want to query
index_name = 'kibana_sample_data_ecommerce' 

# defining the search statement to get all records in an index
search = Search(using=es, index=index_name).query("match_all") 

# retrieving the documents from the search
documents = [hit.to_dict() for hit in search.scan()] 

# converting the list of hit dictionaries into a pandas dataframe:
df_ecommerce = pd.DataFrame.from_records(documents)

# visualizing the dataframe with the results:
df_ecommerce.head()['geoip']

0    {'country_iso_code': 'EG', 'location': {'lon':...
1    {'country_iso_code': 'AE', 'location': {'lon':...
2    {'country_iso_code': 'US', 'location': {'lon':...
3    {'country_iso_code': 'GB', 'location': {'lon':...
4    {'country_iso_code': 'EG', 'location': {'lon':...
Name: geoip, dtype: object


# retrieving a summary of the columns in the dataset:
df_ecommerce.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675 entries, 0 to 4674
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   category               4675 non-null   object 
 1   currency               4675 non-null   object 
 2   customer_first_name    4675 non-null   object 
 3   customer_full_name     4675 non-null   object 
 4   customer_gender        4675 non-null   object 
 5   customer_id            4675 non-null   int64  
 6   customer_last_name     4675 non-null   object 
 7   customer_phone         4675 non-null   object 
 8   day_of_week            4675 non-null   object 
 9   day_of_week_i          4675 non-null   int64  
 10  email                  4675 non-null   object 
 11  manufacturer           4675 non-null   object 
 12  order_date             4675 non-null   object 
 13  order_id               4675 non-null   int64  
 14  products               4675 non-null   object 
 15  sku                    4675 non-null   object 
 16  taxful_total_price     4675 non-null   float64
 17  taxless_total_price    4675 non-null   float64
 18  total_quantity         4675 non-null   int64  
 19  total_unique_products  4675 non-null   int64  
 20  type                   4675 non-null   object 
 21  user                   4675 non-null   object 
 22  geoip                  4675 non-null   object 
 23  event                  4675 non-null   object 
dtypes: float64(2), int64(5), object(17)
memory usage: 876.7+ KB

# getting descriptive statistics from the dataframe
df_ecommerce[['taxful_total_price', 'taxless_total_price', 'total_quantity', 'total_unique_products']].describe()

The procedure described above would lead us to get all the documents in the index as a list of dictionaries, only to load them into a pandas dataframe. That means having both the documents themselves and the resulting dataframe in memory at some point in the process. For Big Data applications, this procedure would not always be feasible, and exploring the dataset in the Jupyter Notebook environment could become very complicated, very fast.
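The memory concern can be pictured with a small sketch. Instead of collecting every hit into a list, documents can be folded into running statistics as they stream by, so only one document is held at a time. The generator `fake_scan` below is a hypothetical stand-in for `search.scan()`, and its prices are made-up values rather than figures from the sample index; this illustrates the idea and is not a description of how Eland works internally.

```python
import math
from typing import Dict, Iterable, Iterator


def fake_scan() -> Iterator[Dict[str, float]]:
    """Hypothetical stand-in for search.scan(): yields one document at a time."""
    prices = [36.98, 53.98, 199.98, 174.98, 80.98]  # made-up values
    for price in prices:
        yield {"taxful_total_price": price}


def running_stats(docs: Iterable[Dict[str, float]], field: str) -> Dict[str, float]:
    """Fold documents into count/mean/min/max without keeping them in memory."""
    count = 0
    total = 0.0
    lowest = math.inf
    highest = -math.inf
    for doc in docs:
        value = doc[field]
        count += 1
        total += value
        lowest = min(lowest, value)
        highest = max(highest, value)
    mean = total / count if count else math.nan
    return {"count": count, "mean": mean, "min": lowest, "max": highest}


stats = running_stats(fake_scan(), "taxful_total_price")
print(stats)
```

This keeps memory flat but still pulls every document over the network, which is the part that pushing the computation into the cluster avoids.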

Eland enables us to perform very similar operations to the ones described above, without any of the friction involved in adapting them to the Elasticsearch context, while still using Elasticsearch’s aggregation speed and search features.

# loading the data from the Sample Ecommerce data from Kibana into Eland dataframe:
ed_ecommerce = ed.read_es(es, index_name)
# visualizing the results:
ed_ecommerce[['customer_id', 'category', 'customer_first_name', 'customer_full_name']].head()

As an added feature that would require a bit more wrangling on the pandas side, the field geoip (which is a nested JSON object in the index) was seamlessly parsed into columns in our dataframe. We can see that by calling the .info() method on the Eland dataframe.

# retrieving a summary of the columns in the dataframe:
ed_ecommerce.info()

<class 'eland.dataframe.DataFrame'>
Index: 4675 entries, Fy_X0nMBxx5L21Ced97s to XC_X0nMBxx5L21CejPB3
Data columns (total 46 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   category                       4675 non-null   object        
 1   currency                       4675 non-null   object        
 2   customer_birth_date            0 non-null      datetime64[ns]
 3   customer_first_name            4675 non-null   object        
 4   customer_full_name             4675 non-null   object        
 5   customer_gender                4675 non-null   object        
 6   customer_id                    4675 non-null   object        
 7   customer_last_name             4675 non-null   object        
 8   customer_phone                 4675 non-null   object        
 9   day_of_week                    4675 non-null   object        
 10  day_of_week_i                  4675 non-null   int64         
 11  email                          4675 non-null   object        
 12  event.dataset                  4675 non-null   object        
 13  geoip.city_name                4094 non-null   object        
 14  geoip.continent_name           4675 non-null   object        
 15  geoip.country_iso_code         4675 non-null   object        
 16  geoip.location                 4675 non-null   object        
 17  geoip.region_name              3924 non-null   object        
 18  manufacturer                   4675 non-null   object        
 19  order_date                     4675 non-null   datetime64[ns]
 20  order_id                       4675 non-null   object        
 21  products._id                   4675 non-null   object        
 22  products.base_price            4675 non-null   float64       
 23  products.base_unit_price       4675 non-null   float64       
 24  products.category              4675 non-null   object        
 25  products.created_on            4675 non-null   datetime64[ns]
 26  products.discount_amount       4675 non-null   float64       
 27  products.discount_percentage   4675 non-null   float64       
 28  products.manufacturer          4675 non-null   object        
 29  products.min_price             4675 non-null   float64       
 30  products.price                 4675 non-null   float64       
 31  products.product_id            4675 non-null   int64         
 32  products.product_name          4675 non-null   object        
 33  products.quantity              4675 non-null   int64         
 34  products.sku                   4675 non-null   object        
 35  products.tax_amount            4675 non-null   float64       
 36  products.taxful_price          4675 non-null   float64       
 37  products.taxless_price         4675 non-null   float64       
 38  products.unit_discount_amount  4675 non-null   float64       
 39  sku                            4675 non-null   object        
 40  taxful_total_price             4675 non-null   float64       
 41  taxless_total_price            4675 non-null   float64       
 42  total_quantity                 4675 non-null   int64         
 43  total_unique_products          4675 non-null   int64         
 44  type                           4675 non-null   object        
 45  user                           4675 non-null   object        
dtypes: datetime64[ns](3), float64(12), int64(5), object(26)
memory usage: 64.0 bytes
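The dotted column names above (geoip.city_name, geoip.region_name, and so on) come from nested objects in the documents. On the pandas side, getting a similar layout takes an explicit flattening step; pandas ships `pd.json_normalize` for this, and the idea can be sketched in plain Python as below. The function name `flatten` and the sample document are made up for illustration.

```python
from typing import Any, Dict


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested dictionaries into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # descend into the nested object, carrying the dotted prefix along
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Made-up document shaped like the sample eCommerce records
doc = {
    "customer_id": 38,
    "geoip": {
        "country_iso_code": "EG",
        "city_name": "Cairo",
        "location": {"lon": 31.3, "lat": 30.1},
    },
}

print(flatten(doc))
```

Note one difference from the Eland output: this naive version descends into every dictionary, so it splits geoip.location into lon and lat, whereas the Eland dataframe keeps geoip.location as a single column.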

We can also notice that the memory usage went from around 876 KB in the pandas dataframe to only 64 bytes in the Eland dataframe. This happens because we don’t need to hold the entire dataset in memory to retrieve the information we require from the index, and most of the workload remains in the Elasticsearch cluster (as aggregations or specific queries).

For such a small dataset, this is not that important, but as we scale to gigabytes of data, the benefits of not holding everything in memory for simple computations and analysis are much more noticeable.
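To make "the workload remains in the cluster" more concrete, a summary in the spirit of describe() can be expressed as an aggregation request, so that only a handful of numbers travel back to the client. The sketch below builds such a request body with Elasticsearch's `stats` aggregation. The helper name `stats_request` is hypothetical, and this is not necessarily the request Eland itself sends.

```python
import json
from typing import Any, Dict, List


def stats_request(fields: List[str]) -> Dict[str, Any]:
    """Build a search body that asks for summary statistics only."""
    return {
        "size": 0,  # return no documents, only aggregation results
        "aggs": {
            # one stats aggregation (count, min, max, avg, sum) per field
            f"{field}_stats": {"stats": {"field": field}}
            for field in fields
        },
    }


body = stats_request(["taxful_total_price", "total_quantity"])
print(json.dumps(body, indent=2))
```

Sending this body to the index's `_search` endpoint would return the statistics for each field without transferring any of the documents.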

Elasticsearch capabilities with DataFrames

Eland abstracts a lot of the already existing APIs in Elasticsearch, so data scientists do not need to learn Elasticsearch’s specific syntax. For example, it is possible to get the mapping of an index (the equivalent of the dtypes attribute of a pandas DataFrame) directly from Elasticsearch, but it is not immediately obvious how to do it. With the Eland DataFrame object, we can simply retrieve the dtypes attribute as we would on a regular pandas DataFrame.

# getting the dtypes from pandas dataframe:
df_ecommerce.dtypes

category                  object
currency                  object
customer_first_name       object
customer_full_name        object
customer_gender           object
customer_id                int64
customer_last_name        object
customer_phone            object
day_of_week               object
day_of_week_i              int64
email                     object
manufacturer              object
order_date                object
order_id                   int64
products                  object
sku                       object
taxful_total_price       float64
taxless_total_price      float64
total_quantity             int64
total_unique_products      int64
type                      object
user                      object
geoip                     object
event                     object
dtype: object


# retrieving the data types for the index would normally require us to perform the following Elasticsearch query
# (the index is passed by keyword, which recent versions of the client require):
mapping = es.indices.get_mapping(index=index_name)
# which is itself an abstraction of the GET request for mapping retrieval
json(mapping)


{
  "kibana_sample_data_ecommerce": {
    "mappings": {
      "properties": {
        "category": {
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "currency": {
          "type": "keyword"
        },
        "customer_birth_date": {
          "type": "date"
        },
        "customer_first_name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "customer_full_name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "customer_gender": {
          "type": "keyword"
        },
        "customer_id": {
          "type": "keyword"
        },
        "customer_last_name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "customer_phone": {
          "type": "keyword"
        },
        "day_of_week": {
          "type": "keyword"
        },
        "day_of_week_i": {
          "type": "integer"
        },
        "email": {
          "type": "keyword"
        },
        "event": {
          "properties": {
            "dataset": {
              "type": "keyword"
            }
          }
        },
        "geoip": {
          "properties": {
            "city_name": {
              "type": "keyword"
            },
            "continent_name": {
              "type": "keyword"
            },
            "country_iso_code": {
              "type": "keyword"
            },
            "location": {
              "type": "geo_point"
            },
            "region_name": {
              "type": "keyword"
            }
          }
        },
        "manufacturer": {
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "order_date": {
          "type": "date"
        },
        "order_id": {
          "type": "keyword"
        },
        "products": {
          "properties": {
            "_id": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "base_price": {
              "type": "half_float"
            },
            "base_unit_price": {
              "type": "half_float"
            },
            "category": {
              "fields": {
                "keyword": {
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "created_on": {
              "type": "date"
            },
            "discount_amount": {
              "type": "half_float"
            },
            "discount_percentage": {
              "type": "half_float"
            },
            "manufacturer": {
              "fields": {
                "keyword": {
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "min_price": {
              "type": "half_float"
            },
            "price": {
              "type": "half_float"
            },
            "product_id": {
              "type": "long"
            },
            "product_name": {
              "analyzer": "english",
              "fields": {
                "keyword": {
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "quantity": {
              "type": "integer"
            },
            "sku": {
              "type": "keyword"
            },
            "tax_amount": {
              "type": "half_float"
            },
            "taxful_price": {
              "type": "half_float"
            },
            "taxless_price": {
              "type": "half_float"
            },
            "unit_discount_amount": {
              "type": "half_float"
            }
          }
        },
        "sku": {
          "type": "keyword"
        },
        "taxful_total_price": {
          "type": "half_float"
        },
        "taxless_total_price": {
          "type": "half_float"
        },
        "total_quantity": {
          "type": "integer"
        },
        "total_unique_products": {
          "type": "integer"
        },
        "type": {
          "type": "keyword"
        },
        "user": {
          "type": "keyword"
        }
      }
    }
  }
}
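Reading dtypes off a mapping like the one above amounts to walking its "properties" tree and translating each Elasticsearch type. The sketch below does this for a small excerpt of the mapping. The type table is an assumption inferred from the outputs shown in this article (for example, half_float fields appear as float64 and date fields as datetime64[ns]), and `mapping_dtypes` is a hypothetical helper, not part of Eland.

```python
from typing import Any, Dict

# Assumed type table, inferred from the outputs shown in this article
ES_TO_PANDAS = {
    "keyword": "object",
    "text": "object",
    "geo_point": "object",
    "integer": "int64",
    "long": "int64",
    "half_float": "float64",
    "date": "datetime64[ns]",
}


def mapping_dtypes(properties: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Walk a mapping's "properties" tree and return {dotted_field: dtype}."""
    dtypes: Dict[str, str] = {}
    for name, spec in properties.items():
        field = f"{prefix}.{name}" if prefix else name
        if "properties" in spec:  # object field: recurse into its sub-fields
            dtypes.update(mapping_dtypes(spec["properties"], field))
        else:
            # unknown types fall back to object in this sketch
            dtypes[field] = ES_TO_PANDAS.get(spec["type"], "object")
    return dtypes


# A small excerpt of the mapping printed above
properties = {
    "currency": {"type": "keyword"},
    "order_date": {"type": "date"},
    "taxful_total_price": {"type": "half_float"},
    "total_quantity": {"type": "integer"},
    "geoip": {
        "properties": {
            "city_name": {"type": "keyword"},
            "location": {"type": "geo_point"},
        }
    },
}

for field, dtype in mapping_dtypes(properties).items():
    print(f"{field:22} {dtype}")
```

With the full response, the same function could be called on `mapping["kibana_sample_data_ecommerce"]["mappings"]["properties"]`.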

With these abstractions in place, Eland allows us to use core Elasticsearch features that are not part of pandas (or at least are not as efficient), such as full-text search, Elasticsearch’s biggest use case.

# defining the full-text query we need: retrieving records for either the Elitelligence or the Primemaster manufacturer
query = {
    "query_string": {
        "fields": ["manufacturer"],
        "query": "Elitelligence OR Primemaster"
    }
}
# using full-text search capabilities with Eland:
text_search_df = ed_ecommerce.es_query(query)

# visualizing price of products for each manufacturer using pandas column syntax:
text_search_df[['manufacturer','products.price']]
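When the list of manufacturers comes from code rather than being typed by hand, the same query body can be assembled programmatically. The helper below is a minimal sketch with a hypothetical name (`any_of`). It assumes the terms are single words with no characters that are reserved in the query_string syntax; anything else would need quoting or escaping first.

```python
from typing import Any, Dict, List


def any_of(field: str, terms: List[str]) -> Dict[str, Any]:
    """Build a query_string clause matching any of the given terms in one field."""
    return {
        "query_string": {
            "fields": [field],
            "query": " OR ".join(terms),
        }
    }


query = any_of("manufacturer", ["Elitelligence", "Primemaster"])
print(query)
```

The resulting dictionary is identical to the hand-written query above, so it can be passed to `ed_ecommerce.es_query(query)` in the same way.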

More possibilities for integrations

This article only scratches the surface of the possibilities Eland opens up for data scientists and other data professionals who use Elasticsearch in their day-to-day operations.

Especially in DevOps and AIOps contexts, where ML-based tools are not very mature yet, data professionals can benefit from Python’s existing machine learning ecosystem to analyze large amounts of Observability and Metrics data, which will be a topic for another article.

Eland is certainly a big step forward for Elasticsearch, and I look forward to what future versions of the ELK stack will bring to the table.

This article was inspired by Mateus Picanco

Posted on August 9, 2020 by Yassine, LASRI