This Week in Elasticsearch and Apache Lucene - 2016-03-14
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
How does @Zymergen build, test, & analyze DNA mods to microbes at scale? https://t.co/7OUmnHavz8 #Elasticsearch pic.twitter.com/dfPZCPmPwc
— elastic (@elastic) March 8, 2016
Elasticsearch Core
Changes in 2.x:
- The Tribe node now passes an explicit whitelist of settings through to the client nodes which connect to each cluster. Later, plugins will have an extension point for adding plugin-specific settings to the whitelist.
- Any deprecated parameters parsed by ParseFieldMatcher now get deprecation logging for free.
- Trying to close or delete an index while it is being restored will now fail the close/delete request.
- The `lat_lon` and `precision_step` parameters to `geo_point` fields are deprecated as they are no longer configurable with the new geo-point format. The `validate` and `normalize` parameters now have deprecation logging.
- The geo distance and geo range distance queries no longer support the `.geohash` suffix as it is not needed and makes the query ambiguous.
- The `has_child` query now respects the configured similarity.
- Multi-index expressions starting with `*` were ignoring exclude expressions.
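To make the last item concrete, here is a minimal sketch of how a multi-index expression with an exclusion is expected to resolve. It is a simplified model written for illustration, not Elasticsearch's actual resolver: the function names and the index names are made up, only `*` is treated as a wildcard, and parts are simply applied left to right.

```python
import re


def _matches(pattern, name):
    # Only '*' is a wildcard in this sketch; everything else is literal.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, name) is not None


def resolve_indices(expression, all_indices):
    """Resolve a comma-separated index expression to concrete index names.

    Assumed semantics (simplified): parts are applied left to right,
    a leading '-' removes matching indices, an optional '+' adds them.
    """
    selected = []
    for part in expression.split(","):
        exclude = part.startswith("-")
        pattern = part[1:] if part.startswith(("+", "-")) else part
        matched = [name for name in all_indices if _matches(pattern, name)]
        if exclude:
            selected = [name for name in selected if name not in matched]
        else:
            selected.extend(name for name in matched if name not in selected)
    return selected


# Hypothetical index names, for illustration only.
indices = ["logs-2016.03.12", "logs-2016.03.13", "users"]
print(resolve_indices("*,-logs-2016.03.13", indices))
```

With the fix, an expression that starts with `*` should honour the exclusion that follows it, which is the behaviour this sketch models.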
Changes in master:
- Index lookups now use the index UUID instead of the index name, and index names are resolved to UUIDs as early as possible.
- `string` fields will be replaced by `text` and `keyword` fields in 5.0, with the following bwc layer:
  - String mappings in old indices will not be upgraded.
  - Text/Keyword mappings can be added to old and new indices.
  - String mappings on new indices will be upgraded automatically to text/keyword mappings, if possible, with deprecation logging.
  - If it is not possible to automatically upgrade, an exception will be thrown.
- Norms can no longer be lazily loaded. Lazy loading is unnecessary now that norms are not loaded into memory. The `norms` setting now takes a boolean. Index-time boosts are no longer stored as norms.
- Command line settings can no longer use the `--` style. Instead, they should be specified with a `-E` prefix.
- Trying to close or delete an index while it is being snapshotted will now fail the close/delete request.
- Scripting engines no longer try to compile hidden files in the script directory.
- The `-XX:+AlwaysPreTouch` JVM flag means all heap memory pages are now committed at startup.
- The deprecated `ignore_unmapped` parameter has been removed from sorting.
- Queries deprecated in 2.0 have now been removed.
- The `multi_field` field datatype, deprecated in 1.0, has been removed.
- The generic thread pool is now bounded at 4x the number of processors.
- The `collect_payloads` parameter to `span_near` is deprecated. Payloads are now loaded when needed.
- The cat-recovery API now supports the raw values `bytes_recovered` and `files_recovered`, and the `translog` and `translog_ops` columns have been renamed to be more explicit.
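The `string` to `text`/`keyword` compatibility rules listed above can be sketched as a small decision function. This is only an illustration under stated assumptions: the choice between `text` and `keyword` (here based on `index: not_analyzed`), the `SUPPORTED_PARAMS` set, and every function name are hypothetical and not taken from the Elasticsearch code base.

```python
import logging

deprecation_logger = logging.getLogger("deprecation")

# Hypothetical: parameters this sketch knows how to carry over unchanged.
SUPPORTED_PARAMS = {"type", "index", "store", "copy_to"}


def upgrade_string_mapping(mapping, created_on_new_version):
    """Return the field mapping after applying the bwc rules from the list above."""
    if mapping.get("type") != "string":
        return dict(mapping)
    if not created_on_new_version:
        # String mappings in old indices are not upgraded.
        return dict(mapping)
    unsupported = set(mapping) - SUPPORTED_PARAMS
    if unsupported:
        # No automatic upgrade is possible: fail loudly.
        raise ValueError("cannot upgrade string field: %s" % sorted(unsupported))
    deprecation_logger.warning("string fields are deprecated, use text or keyword")
    # Assumption: not_analyzed strings become keyword, everything else text.
    new_type = "keyword" if mapping.get("index") == "not_analyzed" else "text"
    upgraded = {k: v for k, v in mapping.items() if k not in ("type", "index")}
    upgraded["type"] = new_type
    return upgraded


print(upgrade_string_mapping({"type": "string", "index": "not_analyzed"}, True))
```

The point of the sketch is the shape of the decision: leave old indices alone, upgrade with a deprecation log entry where that is safe, and throw rather than guess when it is not.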
Ongoing changes:
- Dynamic field addition now happens at the end of doc parsing, in preparation for supporting dots in field names.
- The search refactoring is nearing its end with only suggesters, sort, and inner hits outstanding.
- The percolator API will be deprecated in favour of a percolator query, which will deliver a number of requested features to the percolator.
- Once "primary terms" have been added to master, we will be able to enable the acked indexing test.
- The reindex API will support throttling.
- Index data folders will be named according to the index UUID, rather than the index name.
- Storing the cluster UUID in index metadata will allow Elasticsearch to no longer import dangling indices which were deleted while a node was disconnected from the cluster.
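As an illustration of what throttling a batch-oriented operation such as reindex can look like, here is a minimal sketch that computes how long to pause after each batch so the average rate stays at or below a target. The function and parameter names are hypothetical and say nothing about how the reindex API will expose or implement throttling.

```python
def throttle_delay(batch_size, docs_per_second, elapsed_seconds):
    """Seconds to wait after a batch to stay at or below the target rate.

    batch_size: documents processed in the batch that just finished
    docs_per_second: target average throughput (hypothetical setting)
    elapsed_seconds: time the batch itself took
    """
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    # Time the batch should have taken at the target rate.
    budget = batch_size / docs_per_second
    # Never wait a negative amount if the batch was already slow enough.
    return max(0.0, budget - elapsed_seconds)


# A 1000-document batch that took 0.5s, with a target of 500 docs/s.
print(throttle_delay(1000, 500, 0.5))
```

Sleeping for the returned delay between batches keeps the long-run rate near the target without having to slow down individual requests.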
Apache Lucene
- The new dimensional points feature for 6.0.0 is getting intense pre-release scrutiny, which is uncovering a number of fun bugs and API usability issues, causing us to delay the first 6.0.0 release candidate:
  - `PointRangeQuery`'s `equals` method was broken, returning false when queries were in fact the same
  - Sparse points fields were not always handled correctly on merge; a dedicated sparse points test should help uncover any other sparse points issues
  - The copy constructor for `FieldType` completely ignored points!
  - A newly added `Test2BPointValues`, to ensure you can index more than 2.1 billion points in a single segment, uncovered an int overflow bug after running for 22 hours
  - The default codec's points implementation was missing some `checkIntegrity` calls
  - The legacy `SlowCompositeReaderWrapper`, an awful class that inefficiently tries to pretend you have only one segment in your index, does not support points, and is now moved out of Lucene's core
  - The `SimpleText` codec falsely failed its `CheckIndex` if a points field had zero points
  - The `MIGRATE.txt` and `CHANGES.txt` descriptions of dimensional points are better now
  - The `newSetQuery` API now also conveniently accepts a `Collection` of boxed values in addition to the existing varargs of each primitive type
  - `CheckIndex` forgot to tell you it was in fact checking points
  - Dead code is being removed
- Cutover existing users from legacy numeric fields to the new dimensional points:
  - The legacy uninverting `FieldCache` can now un-invert single-valued points fields
  - Both the flexible and XML query parsers now support points
  - `UninvertingReader` still needs to support multi-valued points
  - The `join` module still needs to support points
  - The `spatial-extras` module still needs to switch to points
  - `MemoryIndex` does not yet support points
- Lucene's default codec will now also use prefix compression on fixed-width doc values data (e.g. derived from `InetAddress` or `BigInteger`)
- A new `LatLonPoint.nearest` method finds the nearest indexed point to a query point, something KD trees excel at, but the latest patch is still vulnerable to adversaries
- Spatial3d now exposes only the WGS84 planet model, which is the most accurate one it supports
- `OfflineSorter` will be faster in 6.1.0, by reducing unnecessary byte copying
- Group search hits by hamming distance
- All queries should be immutable since they can be used as cache keys
- `ant precommit` now fails on code comparing already identical values, and also on useless assignments
- Join `TopDocs` by docs while keeping the result ranks
- A rare test bug, which only happens when we randomly generate exactly the same bytes already in an index file, is fixed
- `NRTCachingDirectory` optionally logs to `stdout`
- Codec level encryption remains controversial
- Add doc values support to `MemoryIndex`
- 800+ new top-level-domains have been created since we last fixed `StandardTokenizer` to detect them!
- The nightly smoke tester was confused by newly old back compat indices
- This test was too evil, taking more than 2 hours to run with just the right seed
- `PointRangeQuery` now optimizes the likely common case when all documents will match
- A few missing `s`'s touched a lot of source files
- Split out the geo3d math-only APIs under a `geom` sub-package
- A troublesome facets test has been removed since it tested floats (and struggled with 1 ulp differences) when in fact facets only supports doubles
- `MemoryIndex` now also accepts `Iterable` over `IndexDocument` instead of a document
- Most `FilterX` classes are now abstract
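For readers unfamiliar with the prefix compression mentioned above for fixed-width doc values, here is a minimal sketch of the general idea: sorted values are stored as the number of leading bytes they share with the previous value, plus the remaining suffix. It is not Lucene's on-disk format, the function names are made up, and the 4-byte IPv4 addresses are just a convenient stand-in for fixed-width binary values.

```python
import ipaddress


def prefix_compress(sorted_values):
    """Encode each value as (shared_prefix_length, suffix) against the previous value."""
    encoded = []
    previous = b""
    for value in sorted_values:
        shared = 0
        limit = min(len(previous), len(value))
        while shared < limit and previous[shared] == value[shared]:
            shared += 1
        encoded.append((shared, value[shared:]))
        previous = value
    return encoded


def prefix_decompress(encoded):
    """Rebuild the original values from (shared_prefix_length, suffix) pairs."""
    values = []
    previous = b""
    for shared, suffix in encoded:
        value = previous[:shared] + suffix
        values.append(value)
        previous = value
    return values


# Hypothetical data: three IPv4 addresses as fixed-width 4-byte values.
addresses = sorted(
    ipaddress.ip_address(a).packed for a in ["10.0.0.1", "10.0.0.7", "10.0.1.3"]
)
compressed = prefix_compress(addresses)
suffix_bytes = sum(len(suffix) for _, suffix in compressed)
print(suffix_bytes, "suffix bytes instead of", sum(len(a) for a in addresses))
```

Because sorted fixed-width values such as addresses in the same subnet tend to share long prefixes, only the differing tail of each value needs to be stored.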
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!