Interval queries: why they are true positional queries, and how to transition from Span

Span queries have long been a tool for ordered and proximity search. They are especially useful in specific domains, such as legal or patent search. But the relatively new Interval queries fit this job much better. Unlike Span queries, Interval queries are true positional queries that score documents based only on positional proximity (expanded upon below).

Starting from Elasticsearch v8.16, we have brought Interval queries into parity with Span queries. Specifically:

Interval queries now support "range" and "regexp" rules.
Like Span queries, Interval rules that expand to multiple terms can now expand up to indices.query.bool.max_clause_count terms, instead of the previous limit of 128.

Our future plan is to deprecate Span queries in favor of Interval queries, which cover the same functionality but do so in a more user-friendly way.

Advantages of Interval queries over Span queries

Interval queries rank documents based on the order and proximity of matching terms. Some advantages of Interval queries:

True positional queries
Grounded in academic research: based on the minimal interval semantics paper, with proven algorithms that scale linearly with the number of positions
Simpler syntax
Slightly faster (no need for score calculations based on corpus statistics)
Ability to use scripts for specialized use cases

Interval queries are true positional queries and consider only positional information when scoring documents (scores are inversely proportional to the interval's length). This is unlike Span queries, which also consider standard metrics like TF-IDF. Below is an example that illustrates how Interval queries can produce a better ranking.

We want to find documents where the term "she" is near the term "sells". The desired ranking would return the first document followed by the second document, as these terms occur closer to each other in the first document than in the second.

But if we run a Span query, we will get a different ranking: [doc2, doc1]. In addition to proximity calculations, Span queries also incorporate corpus statistics such as TF and IDF, which distort a ranking that should be based purely on proximity.

In contrast, Interval queries calculate scores based on proximity alone and don't consider corpus statistics or document length. We will get the desired ranking: [doc1, doc2].
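
To make the idea of proximity-only ranking concrete, here is a minimal Python sketch. It is a toy model, not Lucene's actual scoring function: it scores each document as the inverse of the shortest window that contains both terms. The two documents are hypothetical and are not the documents from the example above.

```python
def min_interval_width(tokens, first, second):
    """Return the width, in positions, of the shortest window that
    contains both terms, or None if either term is missing."""
    last_first = last_second = None
    best = None
    for pos, tok in enumerate(tokens):
        if tok == first:
            last_first = pos
        if tok == second:
            last_second = pos
        if last_first is not None and last_second is not None:
            width = abs(last_first - last_second) + 1
            if best is None or width < best:
                best = width
    return best


# Hypothetical documents, chosen only for illustration.
docs = {
    "doc1": "she sells seashells by the seashore",
    "doc2": "she said that he sells seashells",
}


def toy_score(text):
    # Toy proximity score: shorter windows score higher.
    width = min_interval_width(text.split(), "she", "sells")
    return 0.0 if width is None else 1.0 / width


ranking = sorted(docs, key=lambda d: toy_score(docs[d]), reverse=True)
print(ranking)
```

Because no term frequency, document frequency, or document length enters the toy score, only the distance between the two terms decides the order.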

This makes Interval queries an ideal choice for true proximity queries.

Interval queries let us extract the proximity score as a separate signal for the overall relevance score. They are optimised to be mixed with other relevance signals, such as BM25.
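
One way to mix the two signals might look like the sketch below, which builds the request body as a Python dictionary. The field name my_text, the query text, and the max_gaps value are assumptions made for illustration: a match query supplies the BM25 score, and an optional intervals clause adds a proximity score on top of it.

```python
import json

FIELD = "my_text"  # hypothetical field name

request = {
    "query": {
        "bool": {
            # BM25 signal: decides which documents match.
            "must": [{"match": {FIELD: "she sells seashells"}}],
            # Proximity signal: optional, only adds to the score.
            "should": [
                {
                    "intervals": {
                        FIELD: {"match": {"query": "she sells", "max_gaps": 5}}
                    }
                }
            ],
        }
    }
}

print(json.dumps(request, indent=2))
```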

Note that this could also be applied to rescoring: we can make the first pass with BM25 alone and then add a rescorer that combines BM25 with Intervals.
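
A rescoring setup along these lines could be sketched as follows. The field name, window size, and weights are hypothetical values, not recommendations. With the default score mode of the query rescorer, the final score of a document in the window is the weighted sum of the first-pass BM25 score and the intervals score.

```python
import json

FIELD = "my_text"  # hypothetical field name

request = {
    # First pass: BM25 alone.
    "query": {"match": {FIELD: "she sells seashells"}},
    # Second pass: add the proximity score for the top hits only.
    "rescore": {
        "window_size": 100,
        "query": {
            "rescore_query": {
                "intervals": {FIELD: {"match": {"query": "she sells"}}}
            },
            "query_weight": 1.0,
            "rescore_query_weight": 1.0,
        },
    },
}

print(json.dumps(request, indent=2))
```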

Note that if we need to model the behaviour of Span queries, which match and score by both BM25 and proximity, we can do so by combining Interval queries with BM25 queries as must clauses in a boolean query, with appropriate boosts set.
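
A rough sketch of such a combination is shown below. The helper span_like, the field name, and the boost and max_gaps values are all hypothetical. In this sketch the relative weight of the two signals is tuned through the boost parameter of the match query only.

```python
import json


def span_like(field, text, bm25_boost=1.0, max_gaps=0):
    """Hypothetical helper: require positional matching via intervals and
    add a BM25 score via a match query, both as must clauses."""
    return {
        "bool": {
            "must": [
                # Positional matching and proximity score.
                {
                    "intervals": {
                        field: {
                            "match": {
                                "query": text,
                                "max_gaps": max_gaps,
                                "ordered": True,
                            }
                        }
                    }
                },
                # BM25 score, weighted relative to the proximity score.
                {"match": {field: {"query": text, "boost": bm25_boost}}},
            ]
        }
    }


query = span_like("my_text", "she sells", bm25_boost=0.5, max_gaps=2)
print(json.dumps({"query": query}, indent=2))
```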

Transition guide

Below we show ways to transition from the following Span queries to the equivalent Interval queries:

span_containing
span_field_masking
span_first
span_multi
span_near
span_not
span_or
span_term
span_within

SPAN NEAR
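
A possible translation is sketched below as Python dictionaries. The field name and terms are hypothetical. The slop of span_near corresponds roughly to max_gaps, and in_order to ordered; note that span_term takes terms that are already analyzed, while the match rule analyzes its query text.

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: one span_term clause per term.
span_query = {
    "span_near": {
        "clauses": [
            {"span_term": {FIELD: "my"}},
            {"span_term": {FIELD: "favorite"}},
            {"span_term": {FIELD: "food"}},
        ],
        "slop": 0,
        "in_order": True,
    }
}

# Intervals form: a single match rule covers all the terms.
intervals_query = {
    "intervals": {
        FIELD: {
            "match": {"query": "my favorite food", "max_gaps": 0, "ordered": True}
        }
    }
}

print(json.dumps({"query": intervals_query}, indent=2))
```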

SPAN FIRST
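
Intervals have no dedicated rule for this case, but a script filter on the interval's position can express it. The sketch below assumes a hypothetical field and term, and assumes that a span's end position is exclusive while interval.end is inclusive, so that "end": 3 becomes interval.end < 3. It is worth verifying this boundary on your own data.

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: the term must occur within the first 3 positions.
span_query = {"span_first": {"match": {"span_term": {FIELD: "hoya"}}, "end": 3}}

# Intervals form: filter matches with a script on the interval's end.
intervals_query = {
    "intervals": {
        FIELD: {
            "match": {
                "query": "hoya",
                "filter": {"script": {"source": "interval.end < 3"}},
            }
        }
    }
}

print(json.dumps({"query": intervals_query}, indent=2))
```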

SPAN OR
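
The any_of rule plays the role of span_or. A minimal sketch with a hypothetical field and terms:

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: match any of the clauses.
span_query = {
    "span_or": {
        "clauses": [
            {"span_term": {FIELD: "hot"}},
            {"span_term": {FIELD: "cold"}},
        ]
    }
}

# Intervals form: any_of over the corresponding rules.
intervals_query = {
    "intervals": {
        FIELD: {
            "any_of": {
                "intervals": [
                    {"match": {"query": "hot"}},
                    {"match": {"query": "cold"}},
                ]
            }
        }
    }
}

print(json.dumps({"query": intervals_query}, indent=2))
```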

SPAN CONTAINING
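
span_containing returns the big spans that enclose a little span. With Intervals this might be written as the rule for the big interval plus a containing filter, as in the sketch below (hypothetical field and terms).

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: "big" spans that contain a "little" span.
span_query = {
    "span_containing": {
        "little": {"span_term": {FIELD: "favorite"}},
        "big": {
            "span_near": {
                "clauses": [
                    {"span_term": {FIELD: "my"}},
                    {"span_term": {FIELD: "food"}},
                ],
                "slop": 5,
                "in_order": True,
            }
        },
    }
}

# Intervals form: the big rule, filtered to intervals containing the little rule.
intervals_query = {
    "intervals": {
        FIELD: {
            "match": {
                "query": "my food",
                "max_gaps": 5,
                "ordered": True,
                "filter": {"containing": {"match": {"query": "favorite"}}},
            }
        }
    }
}

print(json.dumps({"query": intervals_query}, indent=2))
```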

SPAN WITHIN
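
span_within is the mirror image of span_containing: it returns the little spans that lie inside a big span. A sketch of the Intervals counterpart, using the contained_by filter (hypothetical field and terms):

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: "little" spans enclosed by a "big" span.
span_query = {
    "span_within": {
        "little": {"span_term": {FIELD: "favorite"}},
        "big": {
            "span_near": {
                "clauses": [
                    {"span_term": {FIELD: "my"}},
                    {"span_term": {FIELD: "food"}},
                ],
                "slop": 5,
                "in_order": True,
            }
        },
    }
}

# Intervals form: the little rule, filtered to intervals contained by the big rule.
intervals_query = {
    "intervals": {
        FIELD: {
            "match": {
                "query": "favorite",
                "filter": {
                    "contained_by": {
                        "match": {"query": "my food", "max_gaps": 5, "ordered": True}
                    }
                },
            }
        }
    }
}

print(json.dumps({"query": intervals_query}, indent=2))
```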

SPAN NOT
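
span_not removes include spans that overlap an exclude span. The closest Intervals counterpart is probably the not_overlapping filter, as sketched below with a hypothetical field and terms. The pre, post, and dist options of span_not are not covered by this sketch.

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: "hoya" that is not part of the phrase "la hoya".
span_query = {
    "span_not": {
        "include": {"span_term": {FIELD: "hoya"}},
        "exclude": {
            "span_near": {
                "clauses": [
                    {"span_term": {FIELD: "la"}},
                    {"span_term": {FIELD: "hoya"}},
                ],
                "slop": 0,
                "in_order": True,
            }
        },
    }
}

# Intervals form: the include rule, filtered by not_overlapping the exclude rule.
intervals_query = {
    "intervals": {
        FIELD: {
            "match": {
                "query": "hoya",
                "filter": {
                    "not_overlapping": {
                        "match": {"query": "la hoya", "max_gaps": 0, "ordered": True}
                    }
                },
            }
        }
    }
}

print(json.dumps({"query": intervals_query}, indent=2))
```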

SPAN MULTI

wildcard
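
A span_multi query wrapping a wildcard query maps to the wildcard rule. A sketch with a hypothetical field and pattern:

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: wildcard query wrapped in span_multi.
span_query = {"span_multi": {"match": {"wildcard": {FIELD: {"value": "ho*a"}}}}}

# Intervals form: the wildcard rule takes the pattern directly.
intervals_query = {"intervals": {FIELD: {"wildcard": {"pattern": "ho*a"}}}}

print(json.dumps({"query": intervals_query}, indent=2))
```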

fuzzy
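
For fuzzy matching, the fuzzy rule takes the term and the allowed fuzziness. The field, term, and fuzziness value below are hypothetical.

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: fuzzy query wrapped in span_multi.
span_query = {
    "span_multi": {
        "match": {"fuzzy": {FIELD: {"value": "hoya", "fuzziness": "AUTO"}}}
    }
}

# Intervals form: the fuzzy rule with the same term and fuzziness.
intervals_query = {
    "intervals": {FIELD: {"fuzzy": {"term": "hoya", "fuzziness": "AUTO"}}}
}

print(json.dumps({"query": intervals_query}, indent=2))
```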

prefix
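
A prefix query inside span_multi becomes the prefix rule. A sketch with a hypothetical field and prefix:

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: prefix query wrapped in span_multi.
span_query = {"span_multi": {"match": {"prefix": {FIELD: {"value": "ho"}}}}}

# Intervals form: the prefix rule.
intervals_query = {"intervals": {FIELD: {"prefix": {"prefix": "ho"}}}}

print(json.dumps({"query": intervals_query}, indent=2))
```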

regexp
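
The regexp rule, available from 8.16, takes the regular expression as its pattern. The field and pattern here are hypothetical.

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: regexp query wrapped in span_multi.
span_query = {"span_multi": {"match": {"regexp": {FIELD: {"value": "ho.*a"}}}}}

# Intervals form: the regexp rule (requires 8.16 or later).
intervals_query = {"intervals": {FIELD: {"regexp": {"pattern": "ho.*a"}}}}

print(json.dumps({"query": intervals_query}, indent=2))
```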

range
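
The range rule, also available from 8.16, accepts the same kind of bounds as a range query. The field and bounds below are hypothetical.

```python
import json

FIELD = "my_text"  # hypothetical field name

# Span form: range query wrapped in span_multi.
span_query = {
    "span_multi": {"match": {"range": {FIELD: {"gte": "ha", "lte": "hz"}}}}
}

# Intervals form: the range rule (requires 8.16 or later).
intervals_query = {"intervals": {FIELD: {"range": {"gte": "ha", "lte": "hz"}}}}

print(json.dumps({"query": intervals_query}, indent=2))
```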

SPAN FIELD MASKING

Use the use_field parameter of Interval rules.
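
As a sketch, assume a hypothetical field text with a subfield text.stems that is analyzed with a stemmer. Where span_field_masking makes a span on text.stems look like a span on text, use_field tells an Interval rule to take its positions from another field:

```python
import json

# Span form: mask the span on "text.stems" so it can combine with "text".
span_query = {
    "span_near": {
        "clauses": [
            {"span_term": {"text": "quick"}},
            {
                "span_field_masking": {
                    "query": {"span_term": {"text.stems": "fox"}},
                    "field": "text",
                }
            },
        ],
        "slop": 5,
        "in_order": False,
    }
}

# Intervals form: the second rule reads positions from "text.stems".
intervals_query = {
    "intervals": {
        "text": {
            "all_of": {
                "intervals": [
                    {"match": {"query": "quick"}},
                    {"match": {"query": "fox", "use_field": "text.stems"}},
                ],
                "max_gaps": 5,
                "ordered": False,
            }
        }
    }
}

print(json.dumps({"query": intervals_query}, indent=2))
```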

Conclusion

Interval queries are a powerful tool for true positional search. Try them out with the expanded functionality available from the 8.16 release.
