This Week in Elasticsearch and Apache Lucene - Core Changes
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
What would life be like without #Elasticsearch? #Elasticon attendees answer: https://t.co/46K48dul4Y pic.twitter.com/3dDINVzDec
— elastic (@elastic) February 19, 2016
Elasticsearch Core
Changes in 2.2:
- Snapshot/restore now verifies that the index being restored is compatible with the version of the node doing the restore.
- The bulk API no longer broadcasts deletes to all shards, and will fail if custom routing is enabled and no routing value is specified.
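The reason a delete cannot be targeted without a routing value is that Elasticsearch picks a shard by hashing the routing value (which defaults to the document ID). A minimal sketch of that idea, using crc32 as a stand-in for the real hash function:

```python
import zlib

def shard_for(routing: str, number_of_shards: int) -> int:
    # Elasticsearch hashes the routing value (the document _id by default)
    # to pick a shard; crc32 stands in for the real hash here.
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards

# With default routing, a delete for a given _id maps to exactly one shard.
shard = shard_for("42", 5)

# With custom routing, the _id alone no longer determines the shard, so a
# delete without a routing value would have to be broadcast to every
# shard -- the behaviour this change removes in favour of failing fast.
```

The sketch is conceptual: the actual hash and routing formula in Elasticsearch differ, but the consequence is the same, so requiring the routing value turns a cluster-wide broadcast into a single-shard operation.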
Changes in 2.3:
- Nodes will only accept transport requests once they are fully initialized.
- Groovy accepted our pull request, which means that the suppressAccessChecks permission is no longer required.
Changes in master:
- Document IDs now have a hard limit of 512 bytes.
- The HTTP address and port are now available in cat-nodes and cat-nodeattrs.
- The Painless scripting language is now a module, which means that it will ship by default.
- Log4J is now the only supported logger wrapper and may yet be removed in favour of java.util.logging.
- Using a custom network.host setting as a proxy for "running in production" allows us to upgrade soft warnings (emitted in dev mode) to hard exceptions. This change has proved controversial because configuring the maximum number of open file handles on OS X is overly complex.
- Elasticsearch now checks on startup that all data paths are writable.
- G1GC is buggy on early versions of HotSpot v25.
- Some hot methods have been refactored so that they can be inlined.
- Various unused/unneeded settings have been removed: es.max-open-files, es.netty.gathering, es.useLinkedTransferQueue, line.separator, action.search.optimize_single_shard
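One subtlety of the new document-ID limit is that it is measured in bytes, not characters, so IDs containing multi-byte UTF-8 characters hit it sooner. A minimal sketch of such a check (the function name and error message are illustrative, not Elasticsearch's actual code):

```python
MAX_ID_BYTES = 512  # the new hard limit in master

def check_id(doc_id: str) -> None:
    # The limit applies to the UTF-8 encoded length, not the character
    # count, so non-ASCII IDs reach it with fewer characters.
    encoded = doc_id.encode("utf-8")
    if len(encoded) > MAX_ID_BYTES:
        raise ValueError(
            f"id is too long, must be no longer than {MAX_ID_BYTES} "
            f"bytes but was: {len(encoded)}")

check_id("a" * 512)       # exactly at the limit: accepted
check_id("\u00e9" * 256)  # 256 characters, but 512 UTF-8 bytes: accepted
```

A 513-character ASCII ID, or a 257-character ID of two-byte characters, would be rejected.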
Ongoing changes:
- Tasks now carry start timestamps, so you can see how long they have been running.
- Task IDs are now represented as single strings instead of tuples of node ID and task ID.
- Index names will no longer be tied to the name of the index folder on disk.
- Dangling indices will no longer be imported if the cluster UUID of the index is the same as the current cluster UUID (which indicates that the index was deleted while a node was incommunicado).
- The segments API will be able to return disk usage broken down by Lucene file type.
- Work continues on trying to allow dots in field names.
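The task-ID change above collapses a (node ID, task ID) tuple into one opaque string. A minimal sketch of what such an encoding could look like (the ":" separator and helper names are assumptions for illustration, not the actual Elasticsearch format):

```python
def format_task_id(node_id: str, task_id: int) -> str:
    # One opaque string instead of a (node ID, task ID) tuple.
    # The ":" separator is an assumption for illustration.
    return f"{node_id}:{task_id}"

def parse_task_id(s: str) -> tuple:
    # rpartition splits on the last ":", so node IDs containing ":"
    # would still round-trip.
    node_id, _, task_id = s.rpartition(":")
    return node_id, int(task_id)

tid = format_task_id("oTUltX4IQMOUUVeiohTt8A", 123)
assert parse_task_id(tid) == ("oTUltX4IQMOUUVeiohTt8A", 123)
```

The practical benefit is that a single string can be passed around APIs (and URLs) without having to carry two separate fields.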
Apache Lucene
- Lucene 5.5.0 was officially released on February 22nd, but whether there will be a 5.6.0 release once we switch to stable 6.x releases has proven strangely contentious
- The Lucene 6.0.0 release process will begin early this week with the cutting of the 6.x branch
- The CheckIndex tool would sometimes hit an exception-during-exception (an exception while trying to throw another exception due to index corruption), because BytesRefBuilder.toString is not allowed
- Lots of scrutiny and many improvements to the new points queries in preparation for the 6.0.0 release:
  - support for BigInteger and InetAddress (v4 and v6!)
  - better validation of the incoming arguments
  - a new PointInSetQuery, matching any documents that have any of the values in the set of points
  - javadocs improvements
  - improving the geo3d APIs
  - removing sandbox's PointInRectQuery in favor of the faster core PointRangeQuery
  - moving all encode/decode methods onto the XXXPoint classes
  - a cleaner API, where the XXXPoint classes have static factory methods to generate their matching queries, and additional API improvements
- Even more verbosity for a non-reproducible test failure that only fails on OS X, rarely
- Another fix in the long tail of our switch from Subversion to git
- The silly things we must do to silence our overly naggy Java compiler
- More improvements to MMapDirectory in preparation for Java 9, but we continue to uncover new Java 9 bugs, like this serious bug in method handles, though progress is being made towards a fix
- The Java 9 bug Lucene's tests uncovered last week has been resolved as a duplicate of another (already fixed but not yet released) bug
- Creating a hashCode that does not accidentally cause a high collision rate is not easy!
- 800+ new top-level domains have been created since we last fixed StandardTokenizer to detect them!
- Heavy delete-by-query use in Lucene is costly
- Lucene's range faceting can't yet handle multi-valued fields (patches welcome!)
- The legacy spatial code will move to a new spatial-extras module soon
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!