07 March 2016

This Week in Elasticsearch and Apache Lucene - 2016-03-07


Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Top News

Wondering why queries don't always work? @gmoskovicz dives into the details of phrase-matching in #Elasticsearch: https://t.co/b5N96J246P
— elastic (@elastic) March 7, 2016

Elasticsearch Core

Changes in 2.x:

Debian's init script was not waiting for the pidfile.
GCE Discovery plugin was missing permissions and tests.
Index deletions missed by disconnected nodes will no longer be re-imported when the node rejoins.
Terms queries are now considered costly, which means they will be cached more eagerly.
A has_parent query on non-parent types no longer causes an NPE.
Update mapping should update the metadata for all affected types.
Sped up the shard allocator when using include/exclude shard allocation rules.
Fixed a bug with empty buckets in the stats aggregator.
Azure Storage client upgraded to 4.0.0.
Deprecation logs added for:
- Use of old script/template syntax
- Use of multicast-plugin
- Use of _source transform mapping
- Use of deprecated queries

Changes in master:

Upgrade to Lucene 6 snapshot.
Reindex API has landed, and now supports ingest pipelines.
The index stats API now supports the include_segment_file_sizes parameter, which reports how much disk space is used by each Lucene file.
Ingest nodes and available processors are reported by nodes info and by the _cat/nodes API, and the ingest_took time is available in bulk requests.
Ingest metadata now uses cluster state diffs for lighter weight updates.
Usage of guice has been reduced by removing DiscoveryService.
Cygwin is neither tested nor supported, so the Cygwin block in bin/elasticsearch has been removed.
Replacing string fields with text/keyword fields:
- Doc values no longer controlled by fielddata parameter
- String fields are deprecated in favour of text/keyword fields
The mapper attachment plugin has been deprecated in favour of the ingest attachment plugin.
Client nodes are no longer special, and will be connected to (and report stats) like all other nodes.
Bootstrap checks are now in their own class and are enforced if networking is configured. The file handles check uses a lower limit on OS X because the limit is difficult to raise there and OS X is unlikely to be used in production. Checks have also been added for max processes and for whether mlockall succeeded.
The _optimize end point has been removed in favour of _forcemerge.
Index-time field boosting is now applied as a query time boost, and payloads for per-field boosts in the _all field now use 1 byte instead of 4.
The shard writeLockTimeout is no longer required.
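
To make the string to text/keyword change in the list above more concrete, here is a minimal sketch of how a legacy mapping might be translated. It assumes the usual correspondence (an analyzed string becomes text, a not_analyzed string becomes keyword) and deliberately refuses anything else. The helper name convert_string_field and the sample field names are hypothetical, not part of Elasticsearch or any client library.

```python
def convert_string_field(field):
    """Translate a legacy `string` field definition into `text` or `keyword`.

    Hypothetical helper for illustration only; it is not an Elasticsearch API.
    """
    if field.get("type") != "string":
        return dict(field)  # not a string field: return an unchanged copy

    # Assumption: a legacy string field without an explicit `index` option
    # behaves as "analyzed".
    index = field.get("index", "analyzed")
    if index == "analyzed":
        new_type = "text"      # intended for full-text search
    elif index == "not_analyzed":
        new_type = "keyword"   # intended for exact values, sorting, aggregations
    else:
        # Other options are out of scope for this sketch.
        raise ValueError("unhandled index option: %r" % (index,))

    # Carry over every other setting, dropping the legacy `type` and `index`.
    converted = {k: v for k, v in field.items() if k not in ("type", "index")}
    converted["type"] = new_type
    return converted


# Sample (made-up) legacy properties and their translated form.
legacy = {
    "title": {"type": "string"},
    "tag": {"type": "string", "index": "not_analyzed"},
    "views": {"type": "long"},
}
migrated = {name: convert_string_field(spec) for name, spec in legacy.items()}
```

Other mapping parameters may need more care than a straight copy, so treat this as a starting point rather than a migration tool.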

Ongoing:

Rewrite range queries to match_all/match_none where the range covers all or none of the docs in a shard for better result caching.
You shouldn't be able to delete or close an index while it is being restored.
Keyword fields should support limited analysis.
Removing node.client setting in favour of setting other node roles to false.
Add ingest stats to node stats API.
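
The range query rewrite mentioned in the list above can be sketched roughly as follows. This is only an illustration of the idea, not the Elasticsearch implementation: the function name, its parameters, and the assumption that the shard's minimum and maximum values for the field are known up front are all hypothetical, and only inclusive bounds are handled.

```python
def rewrite_range(field, gte, lte, shard_min, shard_max, every_doc_has_value):
    """Rewrite an inclusive range query against one shard's value bounds.

    Hypothetical sketch. `gte`/`lte` may be None for an open-ended range.
    `shard_min`/`shard_max` are None when no document in the shard has a
    value for `field`.
    """
    # No values at all in this shard: nothing can match.
    if shard_min is None or shard_max is None:
        return {"match_none": {}}

    # The requested range lies entirely outside the shard's values.
    if gte is not None and gte > shard_max:
        return {"match_none": {}}
    if lte is not None and lte < shard_min:
        return {"match_none": {}}

    # The requested range covers every value in the shard. This is only a
    # match_all if every document actually has a value for the field.
    covers_low = gte is None or gte <= shard_min
    covers_high = lte is None or lte >= shard_max
    if covers_low and covers_high and every_doc_has_value:
        return {"match_all": {}}

    # Otherwise keep the original range query.
    bounds = {}
    if gte is not None:
        bounds["gte"] = gte
    if lte is not None:
        bounds["lte"] = lte
    return {"range": {field: bounds}}
```

The benefit for caching is that many different range queries collapse to the same match_all or match_none on a given shard, so their results can share a cache entry.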

Apache Lucene

Both 6.x and 6.0.x branches are now cut, requiring fun changes to switch Lucene's master branch to a 7.x world, including the rare but exciting time when TestBackwardsCompatibility has no indices to test!
Point values finally support earth surface distance queries, with a delightfully simple and accurate implementation that allows exact results in testing (no fuzz!) apart from quantization error. Its performance is already on par with GeoPointsDistanceQuery, and potent optimizations may still be possible in the future if we can make the 2D geo math more accurate.
Merging 2D point values across segments is suddenly 21% faster
MultiPhraseQuery is now immutable
Clean up the overlapping methods in NumericUtils vs LegacyNumericUtils, and bring back lost test cases
Add missing getters for various queries
Optimize point range queries that match all documents, likely a common case in time-based indices
Point values now expose size and docCount statistics per field, for example letting us compute whether a point field is multi-valued, in addition to the existing per-dimension global min and max values
The spatial4j dependency for the spatial-extras module is now upgraded to version 0.6
The semantics of point values intersect API is now sharper: in the 1D case, all points are visited in order
The new (in 6.0) point queries get a simpler API
Duplicate code from NumericUtils is removed
Uwe tweaks TSTLookup to dodge an old javac compiler bug
The useful checkReader, used in many Lucene tests, was failing to check points
LatLonPoint API becomes simpler
Sometimes randomized tests are a bit too evil
RandomCodec now also randomizes the points format
The legacy spatial code, with optional external spatial4j dependency, has moved to a new spatial-extras module, but required some javadocs hacks since the same package name appears in two modules now
Improve randomized testing for the new point distance query
Don't try to estimate match count while collecting: it's inaccurate in multi-valued cases, and doesn't seem to help performance
The sometimes costly TermsQuery and point queries are now cached more aggressively
Make it easier to understand why your environment prevents Lucene's MMapDirectory unmap hack from working
Lucene now always sorts in Unicode order, allowing us to consolidate and remove some of the numerous BytesRef comparator APIs
The debate rages on about how to refactor the spatial3d module
Global ordinals query time join does not explain itself very well
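
As a small illustration of the per-field point statistics mentioned in the list above: if size is the total number of indexed points for a field and docCount is the number of documents with at least one point, then the field is multi-valued exactly when size exceeds docCount. The sketch below is plain Python with a hypothetical function name, not the Lucene API.

```python
def is_multi_valued(size, doc_count):
    """Return True if at least one document holds more than one point.

    Hypothetical sketch: `size` is the total number of points indexed for a
    field, `doc_count` is the number of documents with at least one point.
    """
    if size < 0 or doc_count < 0:
        raise ValueError("statistics must be non-negative")
    if doc_count > size:
        # Every counted document contributes at least one point.
        raise ValueError("doc_count cannot exceed size")
    return size > doc_count
```

For example, 1,000 points spread over 1,000 documents means the field is single-valued, while 1,200 points over the same 1,000 documents means some documents carry several values.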

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!
