07 June 2019

This Week in Elasticsearch and Apache Lucene - 2019-06-07

Adrien Grand

Elasticsearch Highlights

Transport Client

We have removed the transport client from the codebase, meaning Elasticsearch 8.0.0 will not include the Java TransportClient. Client jar support has also been removed from the build; client jars were jar files that needed to be published so that a component (e.g. reindex, percolator, mustache) could be used with the transport client. We have also removed the transport client from our documentation.

Snapshot Restore UI

We started work on the delete snapshots functionality and opened a PR that fixes some i18n issues.

Better storage of _source

We revived an old issue about alternative ways to store our _source field in Lucene indices and wrote a quick proof of concept that stores each top-level field of the _source document in a separate stored field. The main benefit of this approach is that Lucene gives an identifier to each field, so we can identify fields by these numbers rather than having to repeat field names over and over again. This gave a 10-20% reduction of disk usage for stored fields on geonames, depending on the codec. This is one of several ideas currently being explored that could help reduce disk usage, alongside dropping the _id field on documents that are below the global checkpoint for append-only use-cases, and possibly not indexing the @timestamp field and instead using index sorting plus binary search on doc values to search on it.
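To make the intuition concrete, here is a toy sketch of why referring to fields by number can save space. It is not Lucene's stored fields format: the encoding, the one-byte field number, and the sample documents are all assumptions made for illustration.

```python
import json

def size_with_repeated_names(docs):
    """Bytes used when every document spells out its field names (compact JSON)."""
    return sum(len(json.dumps(doc, separators=(",", ":")).encode("utf-8")) for doc in docs)

def size_with_field_numbers(docs):
    """Bytes used when each field name is stored once and documents refer to it by number."""
    field_numbers = {}
    total = 0
    for doc in docs:
        for name, value in doc.items():
            # Assign the next free number the first time a field name is seen.
            field_numbers.setdefault(name, len(field_numbers))
            # Assumption: one byte per field number (fewer than 256 fields) plus the encoded value.
            total += 1 + len(json.dumps(value).encode("utf-8"))
    # The table of field names is written only once.
    total += sum(len(name.encode("utf-8")) for name in field_numbers)
    return total

# Hypothetical geonames-like documents.
docs = [
    {"name": "Paris", "country_code": "FR", "population": 2148000},
    {"name": "Lyon", "country_code": "FR", "population": 513000},
]

print(size_with_repeated_names(docs), size_with_field_numbers(docs))
```

The gap grows with the number of documents, since the name table is a fixed cost while repeated names are paid per document.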

Faster range query on sorted index

We are looking into optimizing range queries when the index is sorted (https://issues.apache.org/jira/browse/LUCENE-7714). We wrote a first prototype that shows an improvement over standard range queries. Another interesting aspect of this optimization is the fact that it enables searching on a numeric field without using the KD tree. So our users who take disk usage seriously could be interested in this feature regardless of whether this performs faster or slower than range queries on KD trees.
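The idea can be sketched in a few lines, assuming the index is sorted on the queried field so that values appear in non-decreasing doc ID order. The function name and sample data are made up for illustration; the actual prototype works on Lucene doc values.

```python
from bisect import bisect_left, bisect_right

def range_query_on_sorted_values(sorted_values, lower, upper):
    """Return the doc ID range whose values fall in [lower, upper].

    sorted_values[doc_id] is the field value of doc_id; because the index is
    sorted on this field, two binary searches find the matching doc ID range
    without consulting any tree structure.
    """
    start = bisect_left(sorted_values, lower)   # first doc with value >= lower
    end = bisect_right(sorted_values, upper)    # first doc with value > upper
    return range(start, end)

# Hypothetical timestamps, already in index sort order.
timestamps = [100, 105, 105, 110, 120, 130, 130, 145]
matches = range_query_on_sorted_values(timestamps, 105, 130)
print(list(matches))  # [1, 2, 3, 4, 5, 6]
```

Since the matches form a contiguous doc ID range, the result can be represented by just its two endpoints.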

Encrypted Snapshots

As part of our ongoing plan to make it easier and safer to snapshot the security index, we want to have a well-supported option for storing encrypted snapshots. We currently support server-side encryption at rest (when the cloud provider can do so transparently) but we don’t offer anything client-side, and we don’t have a solution for file system based repositories.

We’re a good way through our thinking about how this would look, but we need to make sure it will be a good fit for our users' needs, and that we can build something that will last.

Rollup GA

We opened an issue detailing what we need for Rollup GA. GA work has commenced with one of the more straightforward items: allowing the DeleteJob API to also delete the rolled-up data. Today this requires manual user involvement and is not user-friendly: a delete-by-query followed by an update of the _meta mapping field. This will instead be handled by the DeleteJob API, with an optional flag to also delete the data when the job is deleted. The rest of the Rollup GA items are of a similar nature: improvements to management ergonomics.

Snapshot Resiliency - it's all about hygiene

The snapshot resiliency effort currently focuses on the problem of left-over snapshot files not being cleaned up, resulting in large amounts of left-over files for repositories that have been actively used over longer periods of time. Left-over files can be caused both by snapshot deletions that failed mid-way through and by snapshots that did not complete successfully. There can be plenty of reasons for these operations to fail, from those that are under Elasticsearch's control, such as coordinating repository access among the nodes in the cluster, to those that are outside of ES’s control, e.g. machines crashing or connectivity issues with the service where snapshots are stored.

Many of the resiliency-related issues that were under Elasticsearch's control have been fixed in recent versions (6.4+). ES will also have to do clean-ups for cases where things went wrong outside of ES's control, as well as for historical ES versions that might have accumulated a lot of left-over files. We are focusing on adding automatic clean-up of left-over files, which future ES versions will perform during regular snapshot operations, and on a tool that can be periodically run by our Cloud team to clean up repositories that are actively being snapshotted to by older ES versions without the automatic clean-up support.

Last week's snapshot resiliency sync focused on the performance characteristics of the auto-clean-up functionality that will be coming in newer ES versions, given that existing repositories might have accumulated a large number of left-over files. We did some preliminary benchmarking of the functionality, and it appeared that the main issue would be with Azure, since the Azure blob store does not allow for bulk deleting multiple blobs in a single request. Given that Azure does not provide a native API for bulk deletes (unlike S3 and GCS), we decided to approach the problem by adding a thread pool to the Azure plugin to parallelize delete operations on Azure, which we have now implemented.
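As a rough illustration of the approach (not the Azure plugin's code, which is Java), parallelizing single-blob deletes with a thread pool could look like the sketch below. The `delete_blob` callable and the blob names are placeholders; a real client call would go in its place.

```python
from concurrent.futures import ThreadPoolExecutor

def delete_blobs_in_parallel(delete_blob, blob_names, max_workers=4):
    """Call delete_blob(name) for every blob on a thread pool.

    Returns the names whose deletion raised, so the caller can retry or
    report them. delete_blob stands in for a single-blob delete request.
    """
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one delete per blob; the requests overlap instead of running serially.
        futures = {name: pool.submit(delete_blob, name) for name in blob_names}
        for name, future in futures.items():
            # exception() waits for the task to finish and returns None on success.
            if future.exception() is not None:
                failed.append(name)
    return failed

# An in-memory dict plays the role of the blob store in this example.
store = {"snap-1.dat": b"a", "snap-2.dat": b"b", "snap-3.dat": b"c"}
failed = delete_blobs_in_parallel(store.pop, ["snap-1.dat", "snap-3.dat", "missing.dat"])
print(failed, sorted(store))  # ['missing.dat'] ['snap-2.dat']
```

Because each delete is dominated by network latency rather than CPU, even a small pool can cut the wall-clock time of a large clean-up substantially.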

We also added support for listing subdirectories on all our repository implementations, which is a prerequisite for efficiently finding left-over index folders. While blob stores, in contrast to file systems, generally do not provide a hierarchical view of files, most implementations (S3, GCS, Azure) provide virtual directory support by allowing certain operations to treat files that share common prefixes with configurable delimiters as belonging to the same virtual directory. We have also beefed up the testing for this, which we need to do against the actual Cloud service in order to be confident that these operations are implemented correctly.
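The prefix-and-delimiter convention can be sketched over a flat list of blob names as follows. This emulates client-side what the blob stores do server-side; the function name and the blob names are illustrative assumptions, not the BlobContainer API.

```python
def list_children(blob_names, prefix, delimiter="/"):
    """Return the immediate virtual subdirectories under prefix.

    A blob belongs to a virtual subdirectory when, after stripping the
    prefix, its name still contains the delimiter; the part before the
    first delimiter is the subdirectory name.
    """
    children = set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:  # no delimiter left means a plain blob, not a directory
            children.add(head)
    return sorted(children)

# Hypothetical repository layout stored as flat blob names.
blobs = [
    "indices/abc/0/__1",
    "indices/abc/1/__2",
    "indices/def/0/__3",
    "indices/index-7",
    "snap-x.dat",
]
print(list_children(blobs, "indices/"))  # ['abc', 'def']
```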

We prepared a document that outlines the options for implementing a tool for removing left-over snapshot files on Cloud. In contrast to the automated clean-up that is being implemented in Elasticsearch and that will run as part of the regular snapshot lifecycle, the Cloud clean-up tool can make no assumption about concurrently running snapshots, as that would interfere with existing snapshot operations on Cloud and require complex coordination between the snapshot process triggered by clusters on Cloud and the clean-up tool, which will run on different infrastructure. This means, however, that the tool needs to take a different algorithmic approach for detecting left-over files, as it might otherwise not correctly distinguish between files that are left over and those that are in the process of being written by a concurrently running snapshot. We've explored two options. The first one relies on file timestamps being somewhat accurate, which is an option when limiting ourselves to the three Cloud services (S3, GCS, Azure). The second option requires a multi-pass approach, where the tool needs a second run after a certain period of time has passed and accounts for the time elapsed between the runs. We are looking at the operational impact of both approaches, the testing aspect, and how to package the clean-up tool so that it can be easily run on Cloud infrastructure.
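The design document is not reproduced here, so the following is only one possible reading of the multi-pass option: a file is deleted only if it looked unreferenced in two passes that are far enough apart for any in-flight snapshot to have finished. The data layout, the function name, and the waiting period are assumptions.

```python
def files_safe_to_delete(first_pass, second_pass, min_wait_seconds):
    """Return blobs that were unreferenced in both passes.

    Each pass is a dict with 'time' (epoch seconds) and 'unreferenced'
    (set of blob names not referenced by any snapshot at that time).
    """
    elapsed = second_pass["time"] - first_pass["time"]
    if elapsed < min_wait_seconds:
        # Too early: a snapshot running during the first pass may still be in flight.
        return set()
    # Blobs that became referenced in the meantime drop out of the intersection;
    # blobs first seen in the second pass wait for a later run.
    return first_pass["unreferenced"] & second_pass["unreferenced"]

first = {"time": 1_000, "unreferenced": {"__a", "__b", "__c"}}
second = {"time": 1_000 + 7_200, "unreferenced": {"__a", "__c", "__d"}}
print(sorted(files_safe_to_delete(first, second, min_wait_seconds=3_600)))  # ['__a', '__c']
```

Here `__b` was being written by a snapshot that completed between the passes, so it is kept, while `__d` is too recent to judge.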

Apache Lucene

Change of score for fuzzy queries

Fuzzy queries are currently rewritten as a disjunction, which has the downside that documents that contain multiple terms from the rewritten query will score better than documents that contain multiple occurrences of one term. For instance, foobar~1 could be rewritten as fobar OR foobaz, and a document that has one occurrence of fobar and another one of foobaz will likely get a better score than a document that has 2 occurrences of fobar. We argue that a better way to score fuzzy queries would be to use a SynonymQuery, which would sum up term frequency across all matched terms in order to compute the score.
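The difference can be illustrated with a simplified score that keeps only the term-frequency saturation of BM25, tf / (tf + k1), and ignores IDF and length normalization. This is a sketch of the scoring argument, not Lucene's implementation; the function names and the k1 value are assumptions.

```python
K1 = 1.2  # assumed saturation parameter

def saturate(tf):
    """BM25-style term-frequency saturation, without IDF or length normalization."""
    return tf / (tf + K1)

def disjunction_score(term_freqs, matched_terms):
    """Score each matched term separately and add up the scores."""
    return sum(saturate(term_freqs.get(term, 0)) for term in matched_terms)

def synonym_score(term_freqs, matched_terms):
    """Sum term frequencies across matched terms first, then score once."""
    return saturate(sum(term_freqs.get(term, 0) for term in matched_terms))

matched = ["fobar", "foobaz"]          # possible rewrite of foobar~1
doc_a = {"fobar": 1, "foobaz": 1}      # one occurrence of each term
doc_b = {"fobar": 2}                   # two occurrences of one term

print(disjunction_score(doc_a, matched), disjunction_score(doc_b, matched))
print(synonym_score(doc_a, matched), synonym_score(doc_b, matched))
```

With the disjunction, doc_a scores higher than doc_b because saturation is applied per term; with the synonym-style score, both documents have a combined frequency of 2 and score the same.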

Other

We iterated on two-phase support for disjunctions. This change would help disjunctions run faster when they include phrase queries for instance.
We are discussing how query visitors can be used to count the number of clauses of a query, so that we can prevent running queries that have too many clauses.
We are discussing a better way to enable preloading on MMapDirectory. Currently enabling preloading on a subset of the index files requires using a FileSwitchDirectory. Maybe we could make this easier to use, and also automate based on the I/O context so that users would not have to be aware of the extensions that are used by codec formats.
We are fixing FileSwitchDirectory so that wrapping two directories that have the same content (but behave differently, e.g. mmap vs. nio) doesn't cause issues with pending deletes.
Should the new collector-based rescorer expose a way to parallelize rescoring across multiple threads?
We found a bug in the tessellator, which is due to support of Steiner points, a feature that we don't use.
We are looking into a different way to expose the ability to find intervals that don't overlap since the current implementation proved buggy.
There is potential to seek the terms dictionary of doc values more efficiently if two consecutive seeks would go to the same block. Would it be useful to some of our workloads?
We made tessellation more robust.
We fixed how LatLonShapeBoundingBoxQuery#hashCode is computed. It used to include the system hash code, which in turn made almost every object have a different hash code.
A change that allows sorting by a FeatureField was merged, and it was followed up with a new change that allows creating a DoubleValuesSource over feature fields. A benefit to Elasticsearch users is that they would be able to sort and aggregate by rank_feature(s) fields without having to introduce a float sub field.
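For the query-visitor item in the list above, the clause-counting idea can be sketched on a toy query tree. This does not use Lucene's visitor API: queries are modelled as nested lists (compound queries) and strings (leaf clauses), and the exception and function names are made up.

```python
class TooManyClauses(Exception):
    """Raised when a query exceeds the allowed number of leaf clauses."""

def count_clauses(query):
    """Visit the query tree and count its leaf clauses."""
    if isinstance(query, str):   # a leaf clause such as a term query
        return 1
    # A compound query: visit every sub-query and add up the counts.
    return sum(count_clauses(sub_query) for sub_query in query)

def check_clause_limit(query, max_clauses):
    """Return the clause count, or raise before the query would be run."""
    count = count_clauses(query)
    if count > max_clauses:
        raise TooManyClauses(f"{count} clauses, limit is {max_clauses}")
    return count

# Hypothetical nested boolean query with six leaf clauses.
query = ["title:foo", ["body:bar", "body:baz"], [["tag:a", "tag:b"], "tag:c"]]
print(check_clause_limit(query, max_clauses=1024))  # 6
```

Counting across the whole tree, rather than per boolean query, catches deeply nested queries whose individual levels each stay under the limit.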

Changes in Elasticsearch

Changes in 8.0:

Remove the transport client #42538
Skip shadow jar logic for javadoc and sources jars #42904
BREAKING: Removes type from TermVectors APIs #42198
Make high level rest client a fat jar #42771
BREAKING: RollupStart endpoint should return OK if job already started #41502

Changes in 7.3:

Reindex max_docs parameter name #41894
Deprecation info for joda-java migration on 7.x #42659
Add a merge policy that prunes ID postings for soft-deleted but retained documents #40741
Omit JDK sources archive from bundled JDK #42821
Add custom metadata to snapshots #41281
Fix Infinite Loops in ExceptionsHelper#unwrap #42716
Add Ability to List Child Containers to BlobContainer #42653
Enable console audit logs for docker #42671
Enable Parallel Deletes in Azure Repository #42783
Deduplicate alias and concrete fields in query field expansion #42328
Permit API Keys on Basic License #42787
Replicate aliases in cross-cluster replication #41815
Eclipse libs projects setup fix #42852
Remove unnecessary usage of Gradle dependency substitution rules #42773
Fix error with test conventions on tasks that require Docker #42719
Remove "template" field in IndexTemplateMetaData #42099
Read the default pipeline for bulk upsert through an alias #41963

Changes in 7.2:

Skip installation of pre-bundled integ-test modules #42900
Fix NPE when rejecting bulk updates #42923
Use reader attributes to control term dict memory useage #42838
Avoid clobbering shared testcluster JAR files when installing modules #42879
NullPointerException when creating a watch with Jira action (#41922) #42081
Don't require TLS for single node clusters #42826

Changes in 6.8:

Fix concurrent search and index delete #42621
Wire query cache into sorting nested-filter computation #42906
Enable testing against JDK 13 EA builds #40829
Fixes a bug in AnalyzeRequest.toXContent() #42795

Changes in Elasticsearch Management UI

Changes in 7.3:

Add repository-azure autocompletion settings #37935

Changes in Rally

Changes in 1.2.0:

Add Rally Docker image to release process #702
Add download subcommand #704
Provide default for datastore.secure in all cases #705
