From ES|QL to native Pandas dataframes in Python

Since Elasticsearch 8.15 or with Elasticsearch Serverless, ES|QL responses support the Apache Arrow streaming format. This blog post will show you how to take advantage of it in Python. In an earlier blog post, I demonstrated how to convert ES|QL queries to Pandas dataframes using CSV as an intermediate representation. Unfortunately, CSV requires explicit type declarations, is slow (especially for larger datasets) and does not handle nested arrays and objects. Apache Arrow lifts all these limitations.

ES|QL to Pandas dataframes in Python

Importing test data

First, let's import some test data. As before, we will be using the employees sample data and mappings. The easiest way to load this dataset is to run these two Elasticsearch API requests in the Kibana Console.

Converting dataset to a Pandas DataFrame object

OK, with that out of the way, let's convert the full employees dataset to a Pandas DataFrame object using the ES|QL Arrow export:

Even though this dataset only contains 100 records, we use a LIMIT command to avoid ES|QL warning us about potentially missing records. This prints the following dataframe:

OK, so what actually happened here?

Given format="arrow", Elasticsearch returns binary Arrow streaming data
The Elasticsearch Python client looks at the Content-Type header and creates a PyArrow object
Finally, PyArrow's Pandas integration converts the PyArrow object to a Pandas dataframe.

Note that the types_mapper=pd.ArrowDtype parameter asks Pandas to use a PyArrow backend instead of a NumPy backend, since the source data is PyArrow. While this backend is not enabled by default for compatibility reasons, it has many advantages: it handles missing values, is faster, more interopable and supports more types. (This is not a zero copy conversion, however.)

For this example to work, the Pandas and PyArrow optional dependencies need to be installed. If you want to use another dataframe library such as Polars instead, you don't need Pandas and can directly use polars.from_arrow to create a Polars DataFrame from the PyArrow table returned by the Elasticsearch client.

One limitation is that Elasticsearch does not currently handle multi-valued fields, which is why we had to drop the is_rehired, job_positions and salary_change columns. This limitation will be lifted in a future version of Elasticsearch.

Anyway, you now have a Pandas dataframe that you can use to analyze your data further. But you can also continue massaging the data using ES|QL, which is particularly useful when queries return more than 10,000 rows, the current maximum number of rows that ES|QL queries can return.

More complex queries

In the next example, we're counting how many employees are speaking a given language by using STATS ... BY (not unlike GROUP BY in SQL). And then we sort the result with the languages column using SORT:

Unlike with CSV, we did not have to specify any types, as Arrow data already includes types. Here's the result:

21 employees speak 5 languages, wow! And 10 employees did not declare any spoken language. The missing value is denoted by <NA>, which is consistently used for missing data with the PyArrow backend. If we had used the NumPy backend instead, this column would have been converted to floats and the missing value would have been a confusing NaN, as NumPy integers don't have any sentinel value for missing data.

Queries with parameters

Finally, suppose that you want to expand the query from the previous section to only consider employees that speak N or more languages, with N being a variable parameter. For this we can use ES|QL's built-in support for parameters, which eliminates the risk of an injection attack associated with manually assembling queries with variable parts:

which prints the following:

Conclusion

As we saw, ES|QL's native Arrow support makes working with Pandas and other DataFrame libraries even nicer than using CSV and it will continue to improve over time, with the multi-value support coming in a future version of Elasticsearch.

Additional resources

If you want to learn more about ES|QL, the ES|QL documentation is the best place to start. You can also check out this other Python example using Boston Celtics data. To know more about the Python Elasticsearch client itself, you can refer to the documentation, ask a question on Discuss with the language-clients tag or open a new issue if you found a bug or have a feature request. Thank you!

Want to get Elastic certified? Find out when the next Elasticsearch Engineer training is running!

Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.

Report an issue