This Week in Elasticsearch and Apache Lucene - 2019-06-07
Elasticsearch Highlights
Transport Client
We have removed the transport client from the codebase, meaning Elasticsearch 8.0.0 will not include the Java TransportClient. Client jar support has also been removed from the build; client jars were jar files that needed to be published so that a component (e.g. reindex, percolator, mustache) could be used with the transport client. We have also removed the transport client from our documentation.
Snapshot Restore UI
We started work on the delete snapshots functionality and opened a PR that fixes some i18n issues.
Better storage of _source
We revived an old issue about alternative ways to store our _source field in Lucene indices and wrote a quick proof of concept that stores each top-level field of the _source document in a separate stored field. The main benefit of this approach is that Lucene assigns a numeric identifier to each field, so we can identify fields by these numbers rather than repeating field names over and over again. This gave a 10-20% reduction of disk usage for stored fields on geonames, depending on the codec. This is one of several ideas currently being explored that could help reduce disk usage. Others include dropping the _id field on documents that are below the global checkpoint for append-only use cases, and possibly not indexing the @timestamp field and instead using index sorting plus binary search on doc values to search on it.
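To make the intuition concrete, here is a toy sketch in Python comparing a single serialized blob, which repeats every field name in every document, with a per-field layout that writes a small field number instead. The byte layout, the helper names and the sample document are all made up for illustration; this is not Lucene's actual stored fields format.

```python
import json

# Made-up sample documents; the values are illustrative only.
docs = [
    {"name": "Springfield", "country_code": "XX", "population": 12345},
    {"name": "Shelbyville", "country_code": "XX", "population": 6789},
]

def encode_as_single_blob(doc):
    """Serialize the whole document, repeating every field name."""
    return json.dumps(doc, separators=(",", ":")).encode("utf-8")

def build_field_numbers(documents):
    """Assign a small integer to each distinct top-level field name."""
    numbers = {}
    for doc in documents:
        for name in doc:
            numbers.setdefault(name, len(numbers))
    return numbers

def encode_per_field(doc, field_numbers):
    """Write (field number, length, value) triples instead of field names.

    Toy layout: one byte each for the field number and the value length,
    so it only works for fewer than 256 fields and short values.
    """
    out = bytearray()
    for name, value in doc.items():
        payload = json.dumps(value).encode("utf-8")
        out.append(field_numbers[name])
        out.append(len(payload))
        out += payload
    return bytes(out)

field_numbers = build_field_numbers(docs)
blob_size = sum(len(encode_as_single_blob(d)) for d in docs)
per_field_size = sum(len(encode_per_field(d, field_numbers)) for d in docs)
print(blob_size, per_field_size)
```

The field-name-to-number mapping is stored once per index rather than once per document, which is where the saving comes from in this toy model.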
Faster range query on sorted index
We are looking into optimizing range queries when the index is sorted (https://issues.apache.org/jira/browse/LUCENE-7714). We wrote a first prototype that shows an improvement over standard range queries. Another interesting aspect of this optimization is the fact that it enables searching on a numeric field without using the KD tree. So our users who take disk usage seriously could be interested in this feature regardless of whether this performs faster or slower than range queries on KD trees.
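As a rough sketch of the idea, assume the index is sorted on the numeric field, so that reading the field's doc values in doc ID order yields a non-decreasing sequence. A range query then reduces to two binary searches that find the first and last matching doc IDs. The function name and the in-memory list standing in for doc values are assumptions for illustration, not the prototype's code.

```python
from bisect import bisect_left, bisect_right

def range_doc_ids(sorted_values, lower, upper):
    """Return the doc IDs whose value lies in [lower, upper].

    sorted_values[doc_id] is the field value of that document; because the
    index is sorted on this field, the list is in non-decreasing order and
    the matching documents form one contiguous block of doc IDs.
    """
    first = bisect_left(sorted_values, lower)   # first doc with value >= lower
    last = bisect_right(sorted_values, upper)   # first doc with value > upper
    return range(first, last)

values = [10, 20, 20, 30, 40, 50]
print(list(range_doc_ids(values, 20, 40)))  # → [1, 2, 3, 4]
```

No KD tree is consulted here: the only structure needed is the per-document values, which is why the approach can save disk space if the field is not indexed with points.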
Encrypted Snapshots
As part of our ongoing plan to make it easier and safer to snapshot the security index we want to have a well supported option for storing encrypted snapshots. We currently support server-side encryption at rest (when the cloud provider can do so transparently) but we don’t offer anything client-side, and we don’t have a solution for file system based repositories.
We’re a good way through our thinking about how this would look, but we need to make sure it will be a good fit for our users' needs, and that we can build something that will last.
Rollup GA
We opened an issue detailing what we need for Rollup GA. GA work has commenced on one of the more straightforward items: allowing the DeleteJob API to also delete the rolled-up data. Today this requires manual user involvement and is not user-friendly: a delete-by-query followed by updating the _meta mapping field. This will instead be handled by the DeleteJob API, with an optional flag to also delete the data when the job is deleted. The rest of the Rollup GA items are of a similar nature: improvements to management ergonomics.
Snapshot Resiliency - it's all about hygiene
The snapshot resiliency effort currently focuses on the problem of left-over snapshot files not being cleaned up, which results in large amounts of left-over files in repositories that have been actively used over long periods of time. Left-over files can be caused both by snapshot deletions that failed midway and by snapshots that did not complete successfully. There can be plenty of reasons for these operations to fail, from those that are under Elasticsearch's control, such as coordinating repository access among the nodes in the cluster, to those that are outside of ES's control, e.g. machines crashing or connectivity issues with the service where snapshots are stored.
Many of the resiliency-related issues that were under Elasticsearch's control have been fixed in recent versions (6.4+). ES will also have to do clean-ups for cases where things went wrong outside of ES's control, as well as do clean-ups for historical ES versions that might have accumulated a lot of left-over files. We are focusing on adding clean-up functionality of left-over files that will be done automatically by future ES versions during regular snapshot operations, and working on a tool that can be periodically run by our Cloud team to clean up repositories that are actively being snapshotted to by older ES versions without the automatic clean-up support.
Last week's snapshot resiliency sync focused on the performance characteristics of the auto-clean-up functionality that will be coming in newer ES versions, given that existing repositories might have accumulated a large number of left-over files. We did some preliminary benchmarking of the functionality, and it appeared that the main issue would be with Azure, since the Azure blob store does not allow deleting multiple blobs in a single request. Given that Azure does not provide a native API for bulk deletes (unlike S3 and GCS), we decided to approach the problem by adding a thread pool to the Azure plugin to parallelize delete operations, which we implemented in #42783 (Enable Parallel Deletes in Azure Repository).
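The general pattern of parallelizing single-blob deletes with a thread pool can be sketched as follows. This is a minimal Python illustration, not the Azure plugin's implementation: delete_blob stands in for whatever single-blob delete call the storage client offers, and the function name and default pool size are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def delete_blobs_in_parallel(delete_blob, blob_names, max_workers=8):
    """Issue one delete request per blob, several at a time.

    delete_blob is a caller-supplied callable (hypothetical here) that
    deletes a single blob by name. Returns a dict mapping each blob name
    whose delete raised to the exception, so the caller can retry or report.
    """
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(delete_blob, name): name for name in blob_names}
        for future, name in futures.items():
            error = future.exception()  # blocks until this delete has finished
            if error is not None:
                failures[name] = error
    return failures
```

Collecting failures instead of aborting on the first error lets the remaining deletes proceed, which matters when the goal is to clear a large backlog of left-over files.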
We also added support for listing subdirectories on all our repository implementations, which is a prerequisite for efficiently finding left-over index folders. While blob stores, in contrast to file systems, generally do not provide a hierarchical view of files, most implementations (S3, GCS, Azure) provide virtual directory support by allowing certain operations to treat files that share common prefixes with configurable delimiters as belonging to the same virtual directory. We have also beefed up the testing for this, which we need to do against the actual Cloud service in order to be confident that these operations are implemented correctly.
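The virtual directory idea can be illustrated without any cloud SDK. The Python sketch below works on a flat list of blob names and derives the direct children of a prefix by splitting on a delimiter, which is roughly what prefix-plus-delimiter listing gives you on the server side. The function name and the blob names are made up for illustration.

```python
def list_children(blob_names, prefix, delimiter="/"):
    """Return (files, subdirectories) found directly under prefix.

    Blob stores keep a flat namespace; a "directory" is just the set of
    names that share a prefix up to the next delimiter.
    """
    files, subdirs = set(), set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            subdirs.add(head)   # something deeper exists below prefix + head
        elif rest:
            files.add(rest)     # a blob stored directly under the prefix
    return sorted(files), sorted(subdirs)

# Hypothetical repository contents.
blobs = [
    "indices/abc/0/__1",
    "indices/abc/meta.dat",
    "indices/def/0/__2",
    "index-5",
]
print(list_children(blobs, "indices/"))  # → ([], ['abc', 'def'])
```

Listing only the immediate subdirectories avoids enumerating every blob below them, which is what makes finding left-over index folders efficient.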
We prepared a document that outlines the options for implementing a tool for removing left-over snapshot files on Cloud. In contrast to the automated clean-up that is being implemented in Elasticsearch and that will run as part of the regular snapshot lifecycle, the Cloud clean-up tool can make no assumption about concurrently running snapshots, as that would interfere with existing snapshot operations on Cloud and require complex coordination between the snapshot process triggered by clusters on Cloud and the clean-up tool, which will run on different infrastructure. This means, however, that the tool needs to take a different algorithmic approach to detecting left-over files, as it might otherwise fail to distinguish between files that are left over and files that are in the process of being written by a concurrently running snapshot. We've explored two options. The first relies on file timestamps being reasonably accurate, which is an option when limiting ourselves to the three Cloud services (S3, GCS, Azure). The second is a multi-pass approach, where the tool requires a second run after a certain period of time has passed and accounts for the time elapsed between the runs. We are looking at the operational impact of both approaches, the testing aspect, and how to package the clean-up tool so that it can be easily run on Cloud infrastructure.
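One possible reading of the two options, sketched in Python under our own assumptions: the function names, the grace-period parameters and the rule of intersecting the two passes are illustrative guesses, not the tool's actual design.

```python
def stale_by_timestamp(blobs, referenced, now, min_age_seconds):
    """Option 1 (sketch): trust last-modified timestamps.

    blobs maps blob name -> last-modified time in epoch seconds.
    A blob is a deletion candidate only if no snapshot references it AND it
    is old enough that a concurrently running snapshot cannot still be
    writing it.
    """
    return sorted(
        name
        for name, modified in blobs.items()
        if name not in referenced and now - modified >= min_age_seconds
    )

def stale_by_two_passes(first_pass_unreferenced, first_pass_time,
                        second_pass_unreferenced, now, min_gap_seconds):
    """Option 2 (sketch): no timestamps, two runs separated in time.

    A blob is a deletion candidate only if it was unreferenced in the first
    run and is still unreferenced in a second run that happens at least
    min_gap_seconds later.
    """
    if now - first_pass_time < min_gap_seconds:
        return []  # too early: an in-flight snapshot may not have finished
    return sorted(set(first_pass_unreferenced) & set(second_pass_unreferenced))
```

In both variants the grace period has to exceed the longest time a snapshot can take to reference the files it has written; otherwise files from an in-flight snapshot could be mistaken for left-overs.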
Apache Lucene
Change of score for fuzzy queries
Fuzzy queries are currently rewritten as a disjunction, which has the downside that documents that contain multiple terms from the rewritten query will score better than documents that contain multiple occurrences of one term. For instance, foobar~1 could be rewritten as fobar OR foobaz, and a document that has one occurrence of fobar and another one of foobaz will likely get a better score than a document that has 2 occurrences of fobar. We argue that a better way to score fuzzy queries would be to use a SynonymQuery, which would sum up term frequency across all matched terms in order to compute the score.
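The difference between the two scoring strategies can be shown with a deliberately simplified model in Python. It assumes a saturating term-frequency function in the style of BM25 and ignores IDF and length normalization entirely; the constant K1 and the function names are our own choices for illustration, not Lucene's implementation.

```python
K1 = 1.2  # saturation constant, playing the same role as BM25's k1 (assumed value)

def saturate(tf):
    """Term-frequency contribution that grows more slowly as tf increases."""
    return tf / (tf + K1)

def disjunction_score(doc_term_freqs, query_terms):
    """Score each matched term separately, then add the scores up."""
    return sum(saturate(doc_term_freqs.get(t, 0)) for t in query_terms)

def synonym_score(doc_term_freqs, query_terms):
    """Sum the frequencies of all matched terms first, then score once."""
    return saturate(sum(doc_term_freqs.get(t, 0) for t in query_terms))

terms = ["fobar", "foobaz"]          # possible rewrite of foobar~1
doc_a = {"fobar": 1, "foobaz": 1}    # one occurrence of each term
doc_b = {"fobar": 2}                 # two occurrences of the same term

print(disjunction_score(doc_a, terms) > disjunction_score(doc_b, terms))  # → True
print(synonym_score(doc_a, terms) == synonym_score(doc_b, terms))         # → True
```

Because saturation is applied per term in the disjunction, two different terms with one occurrence each outscore one term with two occurrences; summing frequencies first removes that bias.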
Other
- We iterated on two-phase support for disjunctions. This change would help disjunctions run faster when they include phrase queries for instance.
- We are discussing how query visitors can be used in order to count the number of clauses of a query in order to prevent running queries that have too many clauses.
- We are discussing a better way to enable preloading on MMapDirectory. Currently enabling preloading on a subset of the index files requires using a FileSwitchDirectory. Maybe we could make this easier to use, and also automate based on the I/O context so that users would not have to be aware of the extensions that are used by codec formats.
- We are fixing FileSwitchDirectory so that wrapping two directories that have the same content (but behave differently, e.g. mmap vs. nio) doesn't cause issues with pending deletes.
- Should the new collector-based rescorer expose a way to parallelize rescoring across multiple threads?
- We found a bug in the tessellator, which is due to support of Steiner points, a feature that we don't use.
- We are looking into a different way to expose the ability to find intervals that don't overlap since the current implementation proved buggy.
- There is potential to seek the terms dictionary of doc values more efficiently if two consecutive seeks would go to the same block. Would it be useful to some of our workloads?
- We made tessellation more robust.
- We fixed how LatLonShapeBoundingBoxQuery#hashCode is computed. It used to include the system hashcode, which would in turn make almost every object have a different hashcode.
- A change that allows sorting by a FeatureField was merged, and it was followed up with a new change that allows creating a DoubleValuesSource over feature fields. A benefit to Elasticsearch users is that they would be able to sort and aggregate by rank_feature(s) fields without having to introduce a float sub field.
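The hashCode item in the list above is easier to appreciate with a small example. The Python sketch below uses two made-up query classes, not Lucene code: one mixes the object's identity into its hash, the other hashes only the fields that take part in equality.

```python
class BrokenQuery:
    """Equal queries compare equal, but the hash mixes in object identity."""

    def __init__(self, field, bbox):
        self.field, self.bbox = field, tuple(bbox)

    def __eq__(self, other):
        return (isinstance(other, BrokenQuery)
                and (self.field, self.bbox) == (other.field, other.bbox))

    def __hash__(self):
        # Bug: id(self) differs for every live object, so equal queries
        # end up with different hashes.
        return hash((self.field, self.bbox, id(self)))


class FixedQuery:
    """Hash is derived only from the fields used by __eq__."""

    def __init__(self, field, bbox):
        self.field, self.bbox = field, tuple(bbox)

    def __eq__(self, other):
        return (isinstance(other, FixedQuery)
                and (self.field, self.bbox) == (other.field, other.bbox))

    def __hash__(self):
        return hash((self.field, self.bbox))


# A dict stands in for any hash-based cache keyed by query.
cache = {FixedQuery("location", (0, 0, 1, 1)): "cached result"}
print(cache.get(FixedQuery("location", (0, 0, 1, 1))))  # → cached result
```

With the broken variant, an equal query built a second time would almost never find the cached entry, which is why an identity-based hash is harmful for anything used as a key in a hash-based cache.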
Changes in Elasticsearch
Changes in 8.0:
- Remove the transport client #42538
- Skip shadow jar logic for javadoc and sources jars #42904
- BREAKING: Removes type from TermVectors APIs #42198
- Make high level rest client a fat jar #42771
- BREAKING: RollupStart endpoint should return OK if job already started #41502
Changes in 7.3:
- Reindex max_docs parameter name #41894
- Deprecation info for joda-java migration on 7.x #42659
- Add a merge policy that prunes ID postings for soft-deleted but retained documents #40741
- Omit JDK sources archive from bundled JDK #42821
- Add custom metadata to snapshots #41281
- Fix Infinite Loops in ExceptionsHelper#unwrap #42716
- Add Ability to List Child Containers to BlobContainer #42653
- Enable console audit logs for docker #42671
- Enable Parallel Deletes in Azure Repository #42783
- Deduplicate alias and concrete fields in query field expansion #42328
- Permit API Keys on Basic License #42787
- Replicate aliases in cross-cluster replication #41815
- Eclipse libs projects setup fix #42852
- Remove unnecessary usage of Gradle dependency substitution rules #42773
- Fix error with test conventions on tasks that require Docker #42719
- Remove "template" field in IndexTemplateMetaData #42099
- Read the default pipeline for bulk upsert through an alias #41963
Changes in 7.2:
- Skip installation of pre-bundled integ-test modules #42900
- Fix NPE when rejecting bulk updates #42923
- Use reader attributes to control term dict memory useage #42838
- Avoid clobbering shared testcluster JAR files when installing modules #42879
- NullPointerException when creating a watch with Jira action (#41922) #42081
- Don't require TLS for single node clusters #42826
Changes in 6.8:
- Fix concurrent search and index delete #42621
- Wire query cache into sorting nested-filter computation #42906
- Enable testing against JDK 13 EA builds #40829
- Fixes a bug in AnalyzeRequest.toXContent() #42795
Changes in Elasticsearch Management UI
Changes in 7.3:
- Add repository-azure autocompletion settings #37935
Changes in Rally
Changes in 1.2.0: