Knn query

Finds the k nearest vectors to a query vector, as measured by a similarity metric. The knn query finds nearest vectors through approximate search on indexed dense_vector fields. The preferred way to do approximate kNN search is through the top-level knn section of a search request. The knn query is reserved for expert cases where it must be combined with other queries.

Example request

PUT my-image-index
{
  "mappings": {
    "properties": {
      "image-vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      },
      "file-type": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      }
    }
  }
}
  1. Index your data.

    POST my-image-index/_bulk?refresh=true
    { "index": { "_id": "1" } }
    { "image-vector": [1, 5, -20], "file-type": "jpg", "title": "mountain lake" }
    { "index": { "_id": "2" } }
    { "image-vector": [42, 8, -15], "file-type": "png", "title": "frozen lake"}
    { "index": { "_id": "3" } }
    { "image-vector": [15, 11, 23], "file-type": "jpg", "title": "mountain lake lodge" }
  2. Run the search using the knn query, asking for the top 3 nearest vectors.

    response = client.search(
      index: 'my-image-index',
      body: {
        size: 3,
        query: {
          knn: {
            field: 'image-vector',
            query_vector: [
              -5,
              9,
              -12
            ],
            num_candidates: 10
          }
        }
      }
    )
    puts response

    POST my-image-index/_search
    {
      "size" : 3,
      "query" : {
        "knn": {
          "field": "image-vector",
          "query_vector": [-5, 9, -12],
          "num_candidates": 10
        }
      }
    }

The knn query doesn’t have a separate k parameter. As with other queries, k is defined by the size parameter of the search request. The knn query collects num_candidates results from each shard, then merges them to get the top size results.
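As a rough illustration of that collect-and-merge behaviour, the Ruby sketch below ranks the three sample documents by exact squared Euclidean distance. It is only a brute-force stand-in: real shards search an approximate index, and the two-shard layout and the helpers `squared_l2`, `shard_candidates`, and `knn_merge` are hypothetical names introduced for this example.

```ruby
# Brute-force stand-in for the collect-and-merge step of the knn query.
# The shard layout and helper names are assumptions made for illustration.

# Squared Euclidean distance between two vectors of equal length.
def squared_l2(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# One shard returns its num_candidates nearest documents as [id, distance] pairs.
def shard_candidates(shard_docs, query_vector, num_candidates)
  shard_docs
    .map { |id, vector| [id, squared_l2(vector, query_vector)] }
    .min_by(num_candidates) { |_id, distance| distance }
end

# Merge the per-shard candidates and keep the ids of the `size` nearest.
def knn_merge(shards, query_vector, num_candidates:, size:)
  shards
    .flat_map { |shard_docs| shard_candidates(shard_docs, query_vector, num_candidates) }
    .min_by(size) { |_id, distance| distance }
    .map(&:first)
end

# Hypothetical layout: documents 1 and 2 on one shard, document 3 on another.
shards = [
  { '1' => [1, 5, -20], '2' => [42, 8, -15] },
  { '3' => [15, 11, 23] }
]
nearest = knn_merge(shards, [-5, 9, -12], num_candidates: 10, size: 3)
puts nearest.inspect # ["1", "3", "2"]
```

Because size is 3 and only three documents exist, every document is returned; lowering size to 1 in this sketch would keep only document 1.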

Top-level parameters for knn

field

(Required, string) The name of the vector field to search against. Must be a dense_vector field with indexing enabled.

query_vector

(Required, array of floats) Query vector. Must have the same number of dimensions as the vector field you are searching against.

num_candidates

(Required, integer) The number of nearest neighbor candidates to consider per shard. Cannot exceed 10,000. Elasticsearch collects num_candidates results from each shard, then merges them to find the top results. Increasing num_candidates tends to improve the accuracy of the final results.

filter

(Optional, query object) Query to filter the documents that can match. The kNN search will return the top documents that also match this filter. The value can be a single query or a list of queries. If filter is not provided, all documents are allowed to match.

The filter is a pre-filter, meaning that it is applied during the approximate kNN search to ensure that num_candidates matching documents are returned.

similarity

(Optional, float) The minimum similarity required for a document to be considered a match. The similarity value is calculated from the raw similarity metric in use, not from the document score. Matched documents are then scored according to similarity, and the provided boost is applied.

boost

(Optional, float) Floating point number used to multiply the scores of matched documents. This value cannot be negative. Defaults to 1.0.

_name

(Optional, string) Name field to identify the query.
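To show how the optional parameters fit alongside the required ones, here is a sketch of a request body written as a Ruby hash. The similarity threshold, the boost, and the `image-knn` name are arbitrary placeholder values chosen for illustration, not recommendations.

```ruby
require 'json'

# Request body only; the threshold, boost, and name are placeholder values.
body = {
  size: 3,
  query: {
    knn: {
      field: 'image-vector',
      query_vector: [-5, 9, -12],
      num_candidates: 10,
      similarity: 50,     # minimum raw similarity, not a document score
      boost: 2.0,         # multiplies the scores of matched documents
      _name: 'image-knn'  # label for identifying this query
    }
  }
}
puts JSON.pretty_generate(body)
```

The hash can be passed as the body argument of client.search, in the same way as the Ruby example above.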

Pre-filters and post-filters in knn query

There are two ways to filter documents that match a kNN query:

  1. pre-filtering – the filter is applied during the approximate kNN search to ensure that k matching documents are returned.
  2. post-filtering – the filter is applied after the approximate kNN search completes, which can result in fewer than k results, even when there are enough matching documents.

Pre-filtering is supported through the filter parameter of the knn query. Filters from aliases are also applied as pre-filters.
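A pre-filtered request could be sketched as follows, with the term filter placed inside the knn query itself rather than next to it. The body is shown as a Ruby hash and is not sent anywhere; the choice of a png filter is only an example.

```ruby
require 'json'

# Request body only; the filter sits inside the knn query, so it is a pre-filter.
body = {
  size: 3,
  query: {
    knn: {
      field: 'image-vector',
      query_vector: [-5, 9, -12],
      num_candidates: 10,
      filter: { term: { "file-type": 'png' } }
    }
  }
}
puts JSON.pretty_generate(body)
```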

All other filters found in the Query DSL tree are applied as post-filters. For example, in the request below the knn query finds the top 3 documents with the nearest vectors (num_candidates=3), and the term filter is then applied to them as a post-filter. The final set contains only the single document that passes the post-filter.

response = client.search(
  index: 'my-image-index',
  body: {
    size: 10,
    query: {
      bool: {
        must: {
          knn: {
            field: 'image-vector',
            query_vector: [
              -5,
              9,
              -12
            ],
            num_candidates: 3
          }
        },
        filter: {
          term: {
            "file-type": 'png'
          }
        }
      }
    }
  }
)
puts response

POST my-image-index/_search
{
  "size" : 10,
  "query" : {
    "bool" : {
      "must" : {
        "knn": {
          "field": "image-vector",
          "query_vector": [-5, 9, -12],
          "num_candidates": 3
        }
      },
      "filter" : {
        "term" : { "file-type" : "png" }
      }
    }
  }
}
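The difference between the two filtering modes can be sketched with a single-shard, brute-force simulation over the sample documents. This is an illustration under simplifying assumptions (exact distances, one shard); the lambdas `post_filter` and `pre_filter` are hypothetical helpers, not part of any client API.

```ruby
# Single-shard, brute-force sketch of post-filtering versus pre-filtering.
# Helper names are hypothetical; real kNN search is approximate.
docs = {
  '1' => { vector: [1, 5, -20], file_type: 'jpg' },
  '2' => { vector: [42, 8, -15], file_type: 'png' },
  '3' => { vector: [15, 11, 23], file_type: 'jpg' }
}
query_vector = [-5, 9, -12]

# Squared Euclidean distance from a document's vector to the query vector.
distance = lambda do |id|
  docs.fetch(id)[:vector].zip(query_vector).sum { |x, y| (x - y)**2 }
end
is_png = ->(id) { docs.fetch(id)[:file_type] == 'png' }

# Post-filter: pick the nearest candidates first, then drop non-matching ones.
post_filter = ->(num_candidates) { docs.keys.min_by(num_candidates, &distance).select(&is_png) }
# Pre-filter: restrict to matching documents, then pick the nearest candidates.
pre_filter = ->(num_candidates) { docs.keys.select(&is_png).min_by(num_candidates, &distance) }

puts post_filter.call(3).inspect # ["2"]: one png among the three candidates
puts post_filter.call(1).inspect # []: the single nearest candidate is a jpg
puts pre_filter.call(1).inspect  # ["2"]: the nearest png document is still found
```

With num_candidates set to 1, the post-filter returns nothing even though a matching png document exists, which is the shortfall described for post-filtering.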

Hybrid search with knn query

The knn query can be used as part of a hybrid search, where it is combined with lexical queries. For example, the query below finds documents whose title matches mountain lake and combines them with the top 10 documents that have the closest image vectors to the query_vector. The combined documents are then scored, and the 3 highest-scoring documents are returned.

POST my-image-index/_search
{
  "size" : 3,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "mountain lake",
              "boost": 1
            }
          }
        },
        {
          "knn": {
            "field": "image-vector",
            "query_vector": [-5, 9, -12],
            "num_candidates": 10,
            "boost": 2
          }
        }
      ]
    }
  }
}

Knn query inside a nested query

The knn query can be used inside a nested query. The behaviour here is similar to top-level nested kNN search:

  • kNN search over nested dense_vectors diversifies the top results over the top-level document
  • filter over the top-level document metadata is supported and acts as a post-filter
  • filter over nested field metadata is not supported

A sample query might look like this:

{
  "query" : {
    "nested" : {
      "path" : "paragraph",
      "query" : {
        "knn": {
          "query_vector": [0.45, 45],
          "field": "paragraph.vector",
          "num_candidates": 2
        }
      }
    }
  }
}

Knn query with aggregations

The knn query calculates aggregations on the num_candidates documents from each shard. Thus, the aggregation results are computed over up to num_candidates * number_of_shards documents. This is different from the top-level knn section, where aggregations are calculated on the global top k nearest documents.
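The effect on aggregations can be sketched with the sample documents and a count per file-type. The sketch below assumes a hypothetical two-shard layout and exact distances, so it only approximates what a real cluster would do; `per_shard`, `global_top_k`, and `count_types` are names made up for this example.

```ruby
# Sketch: which documents feed a per-file-type count in each mode.
# The shard layout and helper names are hypothetical.
docs = {
  '1' => { vector: [1, 5, -20], file_type: 'jpg' },
  '2' => { vector: [42, 8, -15], file_type: 'png' },
  '3' => { vector: [15, 11, 23], file_type: 'jpg' }
}
shards = [%w[1 2], %w[3]]
query_vector = [-5, 9, -12]

# Squared Euclidean distance from a document's vector to the query vector.
distance = lambda do |id|
  docs.fetch(id)[:vector].zip(query_vector).sum { |x, y| (x - y)**2 }
end

# knn query: every shard contributes its own num_candidates nearest documents.
num_candidates = 1
per_shard = shards.flat_map { |ids| ids.min_by(num_candidates, &distance) }

# Top-level knn section: only the global top k documents are aggregated.
k = 1
global_top_k = docs.keys.min_by(k, &distance)

# Count documents per file type, similar in spirit to a terms aggregation.
count_types = lambda do |ids|
  ids.each_with_object(Hash.new(0)) { |id, counts| counts[docs.fetch(id)[:file_type]] += 1 }
end

puts count_types.call(per_shard).inspect    # {"jpg"=>2}
puts count_types.call(global_top_k).inspect # {"jpg"=>1}
```

With two shards and num_candidates set to 1, the knn query path aggregates over two documents, while the top-level path with k set to 1 aggregates over one.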