Histogram field type

A field to store pre-aggregated numerical data representing a histogram. This data is defined using two paired arrays:

  • A values array of double numbers, representing the buckets for the histogram. These values must be provided in ascending order.
  • A corresponding counts array of long numbers, representing how many values fall into each bucket. These numbers must be positive or zero.

Because each element in the values array corresponds to the element at the same position in the counts array, the two arrays must have the same length.

  • A histogram field can only store a single pair of values and counts arrays per document. Nested arrays are not supported.
  • histogram fields do not support sorting.
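The constraints above can be checked on the client before a document is sent. The following is a minimal sketch, not part of any Elasticsearch client: histogram_errors is a hypothetical helper, and treating equal neighbouring values as acceptable is an assumption.

```ruby
# Hypothetical helper: checks a histogram hash against the rules stated above.
# Returns an array of error messages (empty when the input looks valid).
def histogram_errors(histogram)
  values = histogram[:values]
  counts = histogram[:counts]

  unless values.is_a?(Array) && counts.is_a?(Array)
    return ['values and counts must both be arrays']
  end

  errors = []
  # Nested arrays are not supported: every element must be a plain number.
  errors << 'values must contain only numbers' unless values.all? { |v| v.is_a?(Numeric) }
  errors << 'counts must contain only integers' unless counts.all? { |c| c.is_a?(Integer) }
  return errors unless errors.empty?

  errors << 'values and counts must have the same length' if values.length != counts.length
  # Assumption: equal neighbouring values are tolerated (non-decreasing order).
  errors << 'values must be in ascending order' unless values.each_cons(2).all? { |a, b| a <= b }
  errors << 'counts must be positive or zero' if counts.any?(&:negative?)
  errors
end

p histogram_errors({ values: [0.1, 0.2, 0.3], counts: [3, 7, 23] })
p histogram_errors({ values: [0.3, 0.1], counts: [1, -2, 4] })
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue with a document at once.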

Uses

histogram fields are primarily intended for use with aggregations. To make the data more readily accessible for aggregations, histogram field data is stored as binary doc values and is not indexed. Its size in bytes is at most 13 * numValues, where numValues is the length of the provided arrays.
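As a quick worked example of the size bound, the sketch below evaluates 13 * numValues for two array lengths. max_histogram_bytes is a hypothetical name used only for illustration; the figure is an upper bound, not the exact stored size.

```ruby
# Upper bound on the stored size, using the 13 * numValues figure quoted above.
def max_histogram_bytes(num_values)
  13 * num_values
end

puts max_histogram_bytes(5)    # a histogram with 5 buckets: at most 65 bytes
puts max_histogram_bytes(100)  # a histogram with 100 buckets: at most 1300 bytes
```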

Because the data is not indexed, you can only use histogram fields for the following aggregations and queries:

Building a histogram

When using a histogram as part of an aggregation, the accuracy of the results will depend on how the histogram was constructed. It is important to consider the percentiles aggregation mode that will be used to build it. Some possibilities include:

  • For the T-Digest mode, the values array represents the mean centroid positions and the counts array represents the number of values that are attributed to each centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is carried over in the histogram.
  • For the High Dynamic Range (HDR) histogram mode, the values array represents fixed upper limits of each bucket interval, and the counts array represents the number of values that are attributed to each interval. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits), so the value used when generating the histogram is the maximum accuracy you can achieve at aggregation time.
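To make the interval layout concrete, the sketch below groups raw samples into buckets identified by their upper limit and produces the paired values and counts arrays. It is a simplified illustration with hand-picked limits, not the HDRHistogram algorithm; bucket_by_upper_limit and the sample data are hypothetical.

```ruby
# Hypothetical helper: groups raw samples into buckets identified by their
# upper limit and returns the paired arrays a histogram field expects.
def bucket_by_upper_limit(samples, upper_limits)
  limits = upper_limits.sort
  counts = Array.new(limits.length, 0)

  samples.each do |sample|
    # Index of the first bucket whose upper limit is >= the sample.
    index = limits.bsearch_index { |limit| limit >= sample }
    raise ArgumentError, "sample #{sample} exceeds the largest limit" if index.nil?

    counts[index] += 1
  end

  { values: limits, counts: counts }
end

samples = [0.05, 0.12, 0.18, 0.31, 0.33, 0.49]
histogram = bucket_by_upper_limit(samples, [0.1, 0.2, 0.3, 0.4, 0.5])
p histogram[:values]
p histogram[:counts]
```

For the T-Digest layout, the values array would instead hold the mean of each group of samples rather than a fixed limit.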

The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice you should choose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure the best accuracy.

Synthetic _source

Synthetic _source is Generally Available only for TSDB indices (indices that have index.mode set to time_series). For other indices synthetic _source is in technical preview. Features in technical preview may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

histogram fields support synthetic _source in their default configuration. Synthetic _source cannot be used together with ignore_malformed or copy_to.

To save space, zero-count buckets are not stored in the histogram doc values. As a result, if you index a histogram that includes zero-count buckets into an index with synthetic _source enabled, those buckets will be missing when you fetch the histogram back.
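The effect on a round trip can be sketched as follows. This only mimics the behaviour described above on the client side; drop_zero_count_buckets is a hypothetical helper and does not call Elasticsearch.

```ruby
# Hypothetical helper: mimics the described behaviour by removing buckets
# whose count is zero, as would be seen when reading back synthetic _source.
def drop_zero_count_buckets(histogram)
  kept = histogram[:values].zip(histogram[:counts]).reject { |_value, count| count.zero? }
  { values: kept.map(&:first), counts: kept.map(&:last) }
end

indexed = { values: [0.1, 0.2, 0.3, 0.4], counts: [3, 0, 5, 0] }
fetched = drop_zero_count_buckets(indexed)
p fetched[:values]
p fetched[:counts]
```

If the zero-count buckets matter to a consumer of the document, one option is to keep the full list of bucket boundaries in a separate field or in application configuration.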

Examples

The following create index API request creates a new index with two field mappings:

  • my_histogram, a histogram field used to store percentile data
  • my_text, a keyword field used to store a title for the histogram

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    mappings: {
      properties: {
        my_histogram: {
          type: 'histogram'
        },
        my_text: {
          type: 'keyword'
        }
      }
    }
  }
)
puts response

PUT my-index-000001
{
  "mappings" : {
    "properties" : {
      "my_histogram" : {
        "type" : "histogram"
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

The following index API requests store pre-aggregated data for two histograms: histogram_1 and histogram_2.

response = client.index(
  index: 'my-index-000001',
  id: 1,
  body: {
    my_text: 'histogram_1',
    my_histogram: {
      values: [
        0.1,
        0.2,
        0.3,
        0.4,
        0.5
      ],
      counts: [
        3,
        7,
        23,
        12,
        6
      ]
    }
  }
)
puts response

response = client.index(
  index: 'my-index-000001',
  id: 2,
  body: {
    my_text: 'histogram_2',
    my_histogram: {
      values: [
        0.1,
        0.25,
        0.35,
        0.4,
        0.45,
        0.5
      ],
      counts: [
        8,
        17,
        8,
        7,
        6,
        2
      ]
    }
  }
)
puts response

PUT my-index-000001/_doc/1
{
  "my_text" : "histogram_1",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 0.5], 
      "counts" : [3, 7, 23, 12, 6] 
   }
}

PUT my-index-000001/_doc/2
{
  "my_text" : "histogram_2",
  "my_histogram" : {
      "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5], 
      "counts" : [8, 17, 8, 7, 6, 2] 
   }
}

  • values: Values for each bucket. Values in the array are treated as doubles and must be given in increasing order. For T-Digest histograms, each value represents the mean value. For HDR histograms, each value represents the value iterated to.
  • counts: Count for each bucket. Values in the array are treated as long integers and must be positive or zero. Negative values are rejected. The relation between a bucket and a count is given by the position in the array.
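As a sanity check on the positional pairing, the sketch below derives a total count, sum, and average from histogram_1. It assumes each entry in values is treated as occurring as many times as the matching entry in counts; histogram_summary is a hypothetical helper, not an Elasticsearch API, and real aggregation results should be taken from the server.

```ruby
# Hypothetical helper: summary statistics implied by a histogram, assuming
# each entry in values occurs as many times as the matching entry in counts.
def histogram_summary(histogram)
  pairs = histogram[:values].zip(histogram[:counts])
  total_count = pairs.sum { |_value, count| count }
  weighted_sum = pairs.sum { |value, count| value * count }
  {
    value_count: total_count,
    sum: weighted_sum,
    avg: total_count.zero? ? nil : weighted_sum / total_count
  }
end

histogram_1 = { values: [0.1, 0.2, 0.3, 0.4, 0.5], counts: [3, 7, 23, 12, 6] }
summary = histogram_summary(histogram_1)
puts summary[:value_count]   # total number of observations: 51
puts summary[:avg].round(4)  # count-weighted mean of the bucket values
```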