Elasticsearch for Apache Hadoop 2.2.0 and 2.1.3 released
The release bonanza continues…
Today, we announce new versions of the entire Elastic Stack, including a tighter integration of Shield with Kibana and an updated version of ES-Hadoop. Detailed blogs for product releases are available in the releases category of the blog. And yes, the blog has categories -- you know, for searchability.
I am pleased to announce that ES-Hadoop is joining the release bonanza through the GA release of Elasticsearch for Apache Hadoop (ES-Hadoop) 2.2.0 and the bug fix release of ES-Hadoop 2.1.3.
As always, the artifacts are available from the downloads page or through Maven.
Highlights in ES-Hadoop 2.2
Bug-fixes aside, ES-Hadoop 2.2 introduces a series of new features:
GA release compatible with ES 2.x
ES-Hadoop 2.2 is officially compatible with Elasticsearch 2.x while maintaining backwards compatibility with Elasticsearch 1.x (though we really recommend upgrading). ES-Hadoop automatically detects the target Elasticsearch version and acts accordingly without any user intervention.
Overhauled geo support
Similar to Elasticsearch, geo support has been overhauled in ES-Hadoop 2.2: not only are the geo_point and geo_shape types properly detected, but their schema is also inferred (despite there being over a dozen data formats across both types).
Network improvements
ES-Hadoop 2.2 introduces support for WAN/cloud/gated Elasticsearch environments where access is done only through one central point. This extends the number of topologies that ES-Hadoop works with, alongside client-node only and direct connection. The latter scenario has also been optimized by specifically routing traffic only to data nodes and filtering out master nodes. The configuration options have been improved to allow configuration of the JVM HTTPS proxy along with resolving of hostnames to IPs (useful when using Elasticsearch with network publishing enabled).
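As a rough illustration of the three topologies, here is a minimal sketch that builds the matching ES-Hadoop settings as a plain dictionary. The setting names (es.nodes.wan.only, es.nodes.client.only, es.nodes.data.only) are quoted from memory of the ES-Hadoop configuration reference and should be checked against the documentation for your version; the helper name and the host names are hypothetical.

```python
def es_hadoop_network_conf(nodes, topology="direct", port=9200):
    """Build ES-Hadoop connection settings for one access topology.

    topology is one of:
      "direct" - talk to the cluster directly, routing only to data nodes
      "client" - route all requests through client nodes
      "wan"    - WAN/cloud/gated setup, only the declared nodes are used
    """
    if topology not in ("direct", "client", "wan"):
        raise ValueError("unknown topology: %r" % (topology,))

    # Start with every mode switched off, then enable exactly one.
    conf = {
        "es.nodes": ",".join(nodes),
        "es.port": str(port),
        "es.nodes.wan.only": "false",
        "es.nodes.client.only": "false",
        "es.nodes.data.only": "false",
    }
    switch = {
        "direct": "es.nodes.data.only",
        "client": "es.nodes.client.only",
        "wan": "es.nodes.wan.only",
    }[topology]
    conf[switch] = "true"
    return conf


if __name__ == "__main__":
    # Hypothetical gateway host in front of a hosted cluster.
    for key, value in sorted(es_hadoop_network_conf(["gateway.example.org"], "wan").items()):
        print(key, "=", value)
```

The resulting dictionary can be handed to whatever carries the job configuration (a Hadoop Configuration, a SparkConf, and so on). Keeping the three switches mutually exclusive mirrors the idea that a job uses a single topology at a time.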
Better runtime diagnostics
To prevent user error and misconfigurations, ES-Hadoop 2.2 introduces classpath checks to make sure only one version is used at a given time; this catches cases where different versions of the project are deployed side by side, which is not supported. Furthermore, incorrect usage of libraries (such as saving a DataFrame without the Spark SQL support) is also reported.
Apache Spark 1.5 and 1.6 support
ES-Hadoop 2.2 has tracked the releases of all its libraries, in particular Apache Spark 1.5 and 1.6, in both cases leveraging the new features added: null-safe equality and the simplified "es" DataSource declaration (both available in Spark 1.5), and the elimination of double filtering (Spark 1.6). All this while still maintaining backwards compatibility with previous versions of Spark.
Such features provide not just richer constructs for the user but also improve performance by pushing down to Elasticsearch more and more of Spark SQL.
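To make the simplified "es" declaration concrete, here is a minimal PySpark-flavoured sketch. It only uses the standard DataFrameReader calls (format, option, load); the helper name, the index/type resource and the "pushdown" option shown in the usage note are illustrative assumptions, and the ES-Hadoop jar is assumed to be on the Spark classpath.

```python
def read_es_dataframe(sql_context, resource, es_options=None):
    """Load an Elasticsearch resource ("index/type") as a Spark DataFrame.

    sql_context is an existing SQLContext created by the application.
    With Spark 1.5+ the short "es" alias can be used as the format name
    instead of the full data source package name.
    """
    reader = sql_context.read.format("es")
    # Apply the options in a stable order so runs are reproducible.
    for key, value in sorted((es_options or {}).items()):
        reader = reader.option(key, value)
    return reader.load(resource)
```

A call could look like read_es_dataframe(sqlContext, "logs/event", {"pushdown": "true"}), after which filters applied to the returned DataFrame are candidates for being translated into Elasticsearch queries rather than evaluated in Spark.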
Extended configuration options
The support for multi-dimensional fields (arrays) has been enhanced: one can now specify upfront the dimensions of a given field (whether nested or not), which is quite useful in strictly typed environments (like Spark SQL), especially when the data does not conform exactly to its declaration. Additionally, options to include or exclude certain fields, as well as to control the number of documents being read, have been added.
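The read-side options described above can be sketched as a small helper that assembles the settings. The setting names (es.read.field.as.array.include with its name:depth form, es.read.field.include, es.read.field.exclude, es.scroll.limit) are given as I recall them from the ES-Hadoop configuration reference and should be verified there; the helper and the field names are hypothetical.

```python
def es_read_conf(array_fields=None, include=(), exclude=(), limit=None):
    """Assemble ES-Hadoop read settings as a dict of strings.

    array_fields maps a field name to its array depth, e.g. {"matrix": 2}.
    include / exclude are field name patterns to keep or drop.
    limit caps the number of documents read; None leaves the default.
    """
    conf = {}
    if array_fields:
        entries = []
        for name, depth in sorted(array_fields.items()):
            if depth < 1:
                raise ValueError("array depth must be >= 1 for %r" % (name,))
            # Depth 1 is a plain array; deeper nesting uses "name:depth".
            entries.append(name if depth == 1 else "%s:%d" % (name, depth))
        conf["es.read.field.as.array.include"] = ",".join(entries)
    if include:
        conf["es.read.field.include"] = ",".join(include)
    if exclude:
        conf["es.read.field.exclude"] = ",".join(exclude)
    if limit is not None:
        conf["es.scroll.limit"] = str(limit)
    return conf


if __name__ == "__main__":
    settings = es_read_conf({"tags": 1, "matrix": 2},
                            include=["user.*"],
                            exclude=["user.password"],
                            limit=1000)
    for key, value in sorted(settings.items()):
        print(key, "=", value)
```

Declaring the depth upfront matters mostly for Spark SQL, where the schema has to be fixed before any document is read.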
YARN enhancements
A batch of updates was made to the YARN module: it has been upgraded to Elasticsearch 2.2.x and now offers the option to pass JVM system properties directly to the child containers.
Repository HDFS is moving soon
The HDFS snapshot and restore plugin (repository HDFS) has been ported to Elasticsearch master and is undergoing a significant security overhaul. Shout out to Robert for his support in making this happen. It has been quite an effort considering that Hadoop is not compatible with the Java Security Manager: it requests a plethora of permissions, many of them far too dangerous (such as execute on all permissions during a basic startup). The current plan is for the plugin to become an official Elasticsearch plugin in an upcoming release. Until that happens, it remains available as part of the ES-Hadoop project.
More about it in a future blog post.
Improved reliability
While not something tangible to the user, behind the scenes ES-Hadoop 2.2 has grown its test suite by 50% (!), bringing the total to over 4900 tests. The plan is for the next major release to pass the 5K threshold.
Last 2.1.X release
Alongside 2.2, ES-Hadoop 2.1.3 is released as the last planned maintenance release in the 2.1.X line. It contains a series of backported bug fixes for those with conservative upgrade paths. However, even if you are on ES 1.x, upgrading to ES-Hadoop 2.2 is highly recommended.
Feedback
Looking forward to hearing your feedback on ES-Hadoop 2.2! You can find us on GitHub, Twitter (@elastic) or the forums. IRC works too.