This Week in Elasticsearch and Apache Lucene - 2016-03-07
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Wondering why queries don't always work? @gmoskovicz dives into the details of phrase-matching in #Elasticsearch: https://t.co/b5N96J246P
— elastic (@elastic) March 7, 2016
Elasticsearch Core
Changes in 2.x:
- Debian's init script was not waiting for the pidfile.
- GCE Discovery plugin was missing permissions and tests.
- Index deletions missed by disconnected nodes will no longer be re-imported when the node rejoins.
- Terms queries are now considered costly, which means they will be cached more eagerly.
- A has_parent query on non-parent types no longer causes an NPE.
- Update mapping should update the metadata for all affected types.
- Sped up the shard allocator when using include/exclude shard allocation rules.
- Fixed a bug with empty buckets in the stats aggregator.
- Azure Storage client upgraded to 4.0.0.
- Deprecation logs added for:
  - Use of old script/template syntax
  - Use of multicast-plugin
  - Use of _source transform mapping
  - Use of deprecated queries
Changes in master:
- Upgrade to Lucene 6 snapshot.
- Reindex API has landed, and now supports ingest pipelines.
- The index stats API now supports the include_segment_file_sizes parameter to report how much disk space is used by each Lucene file.
- Ingest nodes and available processors are reported by nodes info and by the _cat/nodes API, and the ingest_took time is available in bulk requests.
- Ingest metadata now uses cluster state diffs for lighter weight updates.
- Usage of guice has been reduced by removing DiscoveryService.
- Cygwin is not tested and not supported, so a cygwin block in bin/elasticsearch has been removed.
- Replacing string fields with text/keyword fields:
  - Doc values no longer controlled by fielddata parameter
  - String fields are deprecated in favour of text/keyword fields
- The mapper attachment plugin has been deprecated in favour of the ingest attachment plugin.
- Client nodes are no longer special, and will be connected to (and report stats) like all other nodes.
- Bootstrap checks are now in their own class and are enforced if networking is configured. The check for file handles is lower on OS X because of the difficulty of setting it and the low likelihood of using OS X in production. Checks have been added for max processes and to confirm that mlockall was successful.
- The _optimize end point has been removed in favour of _forcemerge.
- Index-time field boosting is now applied as a query time boost, and payloads for per-field boosts in the _all field now use 1 byte instead of 4.
- The shard writeLockTimeout is no longer required.
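As a rough illustration of the string to text/keyword split mentioned in the list above, here is a minimal Python sketch. It assumes the commonly described rule that an analyzed string becomes a text field and a not_analyzed string becomes a keyword field; the helper name `migrate_string_field` is hypothetical and not part of Elasticsearch.

```python
def migrate_string_field(field):
    """Return a copy of a legacy field mapping with type "string" replaced.

    Assumption (not from the post): "index": "not_analyzed" maps to keyword,
    and the default "analyzed" maps to text. Other values are out of scope.
    """
    if field.get("type") != "string":
        return dict(field)  # not a string field: return an unchanged copy

    # Carry over every setting except the two that the new types replace.
    migrated = {k: v for k, v in field.items() if k not in ("type", "index")}

    index = field.get("index", "analyzed")
    if index == "not_analyzed":
        migrated["type"] = "keyword"
    elif index == "analyzed":
        migrated["type"] = "text"
    else:
        raise ValueError("unsupported index setting in this sketch: %r" % index)
    return migrated


legacy = {"type": "string", "index": "not_analyzed"}
print(migrate_string_field(legacy))
```

The sketch only rewrites a single field definition; a real migration would also have to walk nested properties and multi-fields.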
Ongoing:
- Rewrite range queries to match_all/match_none where the range covers all or none of the docs in a shard for better result caching.
- You shouldn't be able to delete or close an index while it is being restored.
- Keyword fields should support limited analysis.
- Removing node.client setting in favour of setting other node roles to false.
- Add ingest stats to node stats API.
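The range query rewrite in the list above can be pictured with a small Python sketch. It is only an illustration of the idea, not the Elasticsearch implementation: the function name `rewrite_range` and its arguments are hypothetical, bounds are treated as inclusive, and it assumes every document in the shard has a value for the field.

```python
def rewrite_range(query_min, query_max, shard_min, shard_max):
    """Decide how a range query could be rewritten for one shard.

    shard_min/shard_max are the smallest and largest values of the field
    in the shard, or None when the shard holds no values for the field.
    Returns "match_all", "match_none", or "range" (keep the query as-is).
    """
    if shard_min is None or shard_max is None:
        return "match_none"  # nothing indexed, so nothing can match
    if query_max < shard_min or query_min > shard_max:
        return "match_none"  # query range and shard values are disjoint
    if query_min <= shard_min and query_max >= shard_max:
        return "match_all"   # query range covers every value in the shard
    return "range"           # partial overlap: the real range is needed


# A shard of a time-based index holding values 100..200:
print(rewrite_range(0, 1000, 100, 200))
print(rewrite_range(300, 400, 100, 200))
print(rewrite_range(150, 400, 100, 200))
```

Because match_all and match_none do not depend on the exact bounds of the request, differing requests that rewrite to the same query on a shard can share a cache entry, which is the caching benefit the item describes.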
Apache Lucene
- Both 6.x and 6.0.x branches are now cut, requiring fun changes to switch Lucene's master branch to a 7.x world, including the rare but exciting time when TestBackwardsCompatibility has no indices to test!
- Point values finally support earth surface distance queries, with a delightfully simple and accurate implementation, allowing for exact accuracy for testing (no fuzz!) besides quantization error. It also performs on par with GeoPointsDistanceQuery, and potent optimizations are still possible if we can make the 2D geo math more accurate.
- Merging 2D point values across segments is suddenly 21% faster
- MultiPhraseQuery is now immutable
- Clean up the overlapping methods in NumericUtils vs LegacyNumericUtils, and bring back lost test cases
- Add missing getters for various queries
- Optimize point range queries that match all documents, likely a common case in time-based indices
- Point values now expose size and docCount statistics per field, for example letting us compute whether a point field is multi-valued, in addition to the existing per-dimension global min and max values
- The spatial4j dependency for the spatial-extras module is now upgraded to version 0.6
- The semantics of the point values intersect API are now sharper: in the 1D case, all points are visited in order
- The new (in 6.0) point queries get a simpler API
- Duplicate code from NumericUtils is removed
- Uwe tweaks TSTLookup to dodge an old javac compiler bug
- The useful checkReader, used in many Lucene tests, was failing to check points
- LatLonPoint API becomes simpler
- Sometimes randomized tests are a bit too evil
- RandomCodec now also randomizes the points format
- The legacy spatial code, with optional external spatial4j dependency, has moved to a new spatial-extras module, but required some javadocs hacks since the same package name appears in two modules now
- Improve randomized testing for the new point distance query
- Don't try to estimate match count while collecting: it's inaccurate in multi-valued cases, and doesn't seem to help performance
- The sometimes costly TermsQuery and point queries are now cached more aggressively
- Make it easier to understand why your environment prevents Lucene's MMapDirectory unmap hack from working
- Lucene now always sorts in Unicode order, allowing us to consolidate and remove some of the numerous BytesRef comparator APIs
- The debate rages on about how to refactor the spatial3d module
- Global ordinals query time join does not explain itself very well
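To make the size and docCount item in the list above concrete, here is a minimal Python sketch of the multi-valued check. It assumes size is the total number of points indexed for a field and docCount is the number of documents with at least one point. The function name `is_multi_valued` is hypothetical, and the sketch is plain arithmetic rather than a call into Lucene.

```python
def is_multi_valued(size, doc_count):
    """Return True if at least one document holds more than one point.

    Assumption: size counts every indexed point for the field, and
    doc_count counts documents that have at least one point, so
    size >= doc_count always holds for valid statistics.
    """
    if doc_count < 0 or size < doc_count:
        raise ValueError("inconsistent statistics: size=%d, doc_count=%d"
                         % (size, doc_count))
    # More points than documents means some document has several points.
    return size > doc_count


print(is_multi_valued(10, 10))  # one point per document
print(is_multi_valued(12, 10))  # some document has more than one point
```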
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!