Sampler Aggregation

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

A filtering aggregation that restricts the processing of any sub-aggregations to a sample of the top-scoring documents.

Example use cases:

  • Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
  • Reducing the running cost of aggregations that can produce useful results from a sample alone, for example significant_terms

Example:

{
    "query": {
        "match": {
            "text": "iphone"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "text"
                    }
                }
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "sample": {
            "doc_count": 1000,
            "keywords": {
                "doc_count": 1000,
                "buckets": [
                    ...
                    {
                        "key": "bend",
                        "doc_count": 58,
                        "score": 37.982536582524276,
                        "bg_count": 103
                    },
                    ...
}

A total of 1000 documents were sampled because the request asked for at most 200 documents per shard from an index with 5 shards. The cost of running the nested significant_terms aggregation was therefore limited rather than unbounded.
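The bound is simple arithmetic: each shard contributes at most shard_size documents, so the sample can never exceed shard_size multiplied by the number of shards. A minimal Python sketch of that calculation follows; max_sampled_docs is a hypothetical helper written for illustration and is not part of any Elasticsearch client.

```python
def max_sampled_docs(shard_size: int, num_shards: int) -> int:
    """Upper bound on the documents a sampler passes to its sub-aggregations.

    Hypothetical helper for illustration only; not an Elasticsearch API.
    """
    if shard_size < 0 or num_shards < 1:
        raise ValueError("shard_size must be >= 0 and num_shards must be >= 1")
    # Every shard collects at most shard_size top-scoring documents.
    return shard_size * num_shards


# The example above: shard_size of 200 on an index with 5 shards.
print(max_sampled_docs(200, 5))
```

This is only an upper bound. A shard that holds fewer than shard_size matching documents contributes fewer, so the reported doc_count can be lower.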

shard_size

The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
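As a rough illustration, the request body can be assembled as a plain Python dictionary. The sketch below uses a hypothetical helper, sampler_agg, which is not part of any client library. It leaves shard_size out of the body when no value is given, on the assumption that the server then applies the default of 100.

```python
import json
from typing import Any, Dict, Optional


def sampler_agg(sub_aggs: Dict[str, Any],
                shard_size: Optional[int] = None) -> Dict[str, Any]:
    """Build a sampler aggregation wrapping the given sub-aggregations.

    Hypothetical helper for illustration only. When shard_size is None the
    parameter is omitted, so the server-side default is assumed to apply.
    """
    body: Dict[str, Any] = {"sampler": {}, "aggs": sub_aggs}
    if shard_size is not None:
        body["sampler"]["shard_size"] = shard_size
    return body


# Rebuild the example request from this page with an explicit shard_size.
request_body = {
    "query": {"match": {"text": "iphone"}},
    "aggs": {
        "sample": sampler_agg(
            {"keywords": {"significant_terms": {"field": "text"}}},
            shard_size=200,
        )
    },
}

print(json.dumps(request_body, indent=4))
```

Keeping the parameter optional in the helper mirrors the documented behaviour: callers set shard_size only when the default sample is too small or too large for the analysis.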

Limitations

Cannot be nested under breadth_first aggregations

Because it is a quality-based filter, the sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation whose collect_mode has been switched from the default depth_first mode to breadth_first, because that mode discards scores. In this situation an error is thrown.
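A request body can be checked for this nesting on the client side before it is sent. The sketch below is a hypothetical pre-flight check, not an Elasticsearch API. It assumes that sub-aggregations live under the "aggs" key, and that the restriction applies at any depth below a breadth_first terms aggregation. The field name "tags" is made up for the example.

```python
from typing import Any, Dict, List, Tuple


def find_invalid_samplers(aggs: Dict[str, Any],
                          under_breadth_first: bool = False,
                          path: Tuple[str, ...] = ()) -> List[str]:
    """Return the paths of sampler aggregations nested under a terms
    aggregation that uses collect_mode breadth_first.

    Hypothetical client-side check for illustration only.
    """
    problems: List[str] = []
    for name, body in aggs.items():
        here = path + (name,)
        if "sampler" in body and under_breadth_first:
            problems.append(" > ".join(here))
        # A breadth_first terms aggregation discards scores for everything below it.
        terms = body.get("terms", {})
        child_flag = under_breadth_first or terms.get("collect_mode") == "breadth_first"
        problems.extend(find_invalid_samplers(body.get("aggs", {}), child_flag, here))
    return problems


rejected = {
    "by_tag": {
        "terms": {"field": "tags", "collect_mode": "breadth_first"},
        "aggs": {"sample": {"sampler": {"shard_size": 200}}},
    }
}
accepted = {
    "by_tag": {
        "terms": {"field": "tags"},  # default depth_first keeps scores
        "aggs": {"sample": {"sampler": {"shard_size": 200}}},
    }
}

print(find_invalid_samplers(rejected))
print(find_invalid_samplers(accepted))
```

The server remains the authority on what it rejects. A check like this only surfaces the problem earlier and names the offending aggregation.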