Understanding Groups

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

To preserve flexibility, Rollup Jobs are defined based on how future queries may need to use the data. Traditionally, systems force the admin to decide which metrics to roll up and on what interval, for example the average of cpu_time on an hourly basis. This is limiting: if, at a future date, the admin wishes to see the average of cpu_time on an hourly basis and partitioned by `host_name`, they are out of luck.

Of course, the admin can decide to roll up the [hour, host] tuple on an hourly basis, but as the number of grouping keys grows, so does the number of tuples the admin needs to configure. Furthermore, these [hour, host] tuples are only useful for hourly rollups; daily, weekly, or monthly rollups all require new configurations.
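
To make that growth concrete, here is a minimal Python sketch that counts the fixed tuple configurations a traditional system might need. It assumes the admin wants every non-empty combination of grouping keys at every interval, which is a simplification. The helper name `count_tuple_configs` and the example key list are hypothetical and not part of any Elasticsearch API.

```python
from itertools import combinations


def count_tuple_configs(keys, intervals):
    """Count fixed (interval, key-combination) rollups to configure.

    Assumption: every non-empty combination of grouping keys is wanted
    at every interval.
    """
    combos = [
        combo
        for size in range(1, len(keys) + 1)
        for combo in combinations(keys, size)
    ]
    return len(combos) * len(intervals)


intervals = ["hourly", "daily", "weekly", "monthly"]

# One grouping key: one combination per interval.
print(count_tuple_configs(["host"], intervals))  # 4

# Three grouping keys: seven combinations per interval.
print(count_tuple_configs(["host", "datacenter", "load"], intervals))  # 28
```

Under this assumption the count roughly doubles with each added key, which is the configuration burden that group-based rollup jobs are meant to avoid.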

Rather than force the admin to decide ahead of time which individual tuples should be rolled up, Elasticsearch’s Rollup jobs are configured based on which groups are potentially useful to future queries. For example, this configuration:

"groups" : {
  "date_histogram": {
    "field": "timestamp",
    "interval": "1h",
    "delay": "7d"
  },
  "terms": {
    "fields": ["hostname", "datacenter"]
  },
  "histogram": {
    "fields": ["load", "net_in", "net_out"],
    "interval": 5
  }
}

allows date_histogram aggregations to be used on the "timestamp" field, terms aggregations to be used on the "hostname" and "datacenter" fields, and histograms to be used on any of the "load", "net_in", and "net_out" fields.
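
One way to read a groups configuration is as a set of permitted (aggregation type, field) pairs. The Python sketch below derives that set from the configuration above. It is only an illustrative model: the function `allowed_aggregations` is hypothetical and does not reflect how Elasticsearch stores or checks rollup capabilities internally.

```python
def allowed_aggregations(groups):
    """Return the set of (aggregation type, field) pairs a groups config permits."""
    allowed = set()
    for agg_type, settings in groups.items():
        # date_histogram names a single "field"; terms and histogram list "fields".
        fields = settings.get("fields") or [settings["field"]]
        allowed.update((agg_type, field) for field in fields)
    return allowed


# The groups configuration from the example above, as a Python dict.
groups = {
    "date_histogram": {"field": "timestamp", "interval": "1h", "delay": "7d"},
    "terms": {"fields": ["hostname", "datacenter"]},
    "histogram": {"fields": ["load", "net_in", "net_out"], "interval": 5},
}

for agg_type, field in sorted(allowed_aggregations(groups)):
    print(agg_type, field)
```

This prints six pairs: one for the date_histogram, two for terms, and three for histogram.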

Importantly, these aggs/fields can be used in any combination. This aggregation:

"aggs" : {
  "hourly": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "1h"
    },
    "aggs": {
      "host_names": {
        "terms": {
          "field": "hostname"
        }
      }
    }
  }
}

is just as valid as this aggregation:

"aggs" : {
  "hourly": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "1h"
    },
    "aggs": {
      "data_center": {
        "terms": {
          "field": "datacenter"
        },
        "aggs": {
          "host_names": {
            "terms": {
              "field": "hostname"
            },
            "aggs": {
              "load_values": {
                "histogram": {
                  "field": "load",
                  "interval": 5
                }
              }
            }
          }
        }
      }
    }
  }
}

You’ll notice that the second aggregation is not only substantially larger, it also swapped the position of the terms aggregation on "hostname", illustrating how the order of aggregations does not matter to rollups. Similarly, while the date_histogram is required for rolling up data, it isn’t required while querying (although often used). For example, this is a valid aggregation for Rollup Search to execute:

"aggs" : {
  "host_names": {
    "terms": {
      "field": "hostname"
    }
  }
}
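
The "any order, any combination" rule can be sketched as a small recursive check over an aggregation tree. The Python below is a simplified model written for illustration, not Elasticsearch's actual validation: the `ALLOWED` set is hand-written from the groups configuration shown earlier, the function name `uses_only_grouped_fields` and the "user" field are hypothetical, and metric aggregations and interval compatibility are deliberately left out.

```python
# (aggregation type, field) pairs permitted by the groups config shown earlier.
ALLOWED = {
    ("date_histogram", "timestamp"),
    ("terms", "hostname"),
    ("terms", "datacenter"),
    ("histogram", "load"),
    ("histogram", "net_in"),
    ("histogram", "net_out"),
}


def uses_only_grouped_fields(aggs, allowed=ALLOWED):
    """Return True if every grouping aggregation in the tree is an allowed pair.

    Nesting depth and order are ignored on purpose: only the
    (aggregation type, field) pairs are checked.
    """
    for body in aggs.values():
        for key, value in body.items():
            if key == "aggs":
                # Recurse into sub-aggregations.
                if not uses_only_grouped_fields(value, allowed):
                    return False
            elif (key, value.get("field")) not in allowed:
                return False
    return True


terms_only = {"host_names": {"terms": {"field": "hostname"}}}
nested = {
    "hourly": {
        "date_histogram": {"field": "timestamp", "interval": "1h"},
        "aggs": {"host_names": {"terms": {"field": "hostname"}}},
    }
}
# "user" is a hypothetical field that was not included in the groups config.
not_grouped = {"by_user": {"terms": {"field": "user"}}}

print(uses_only_grouped_fields(terms_only))   # True
print(uses_only_grouped_fields(nested))       # True
print(uses_only_grouped_fields(not_grouped))  # False
```

The check never looks at which aggregation is the parent, which mirrors the point that a terms aggregation without a surrounding date_histogram is still acceptable.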

Ultimately, when configuring groups for a job, think in terms of how you might wish to partition data in a query at a future date, then include those fields in the config. Because Rollup Search allows any order or combination of the grouped fields, you just need to decide whether a field is useful for aggregating later, and how you might wish to use it (terms, histogram, etc.).