Common Issues

This page describes how to resolve common problems you might encounter with Alerting.

Rules with small check intervals run late

Problem

Rules with a small check interval, such as every two seconds, run later than scheduled.

Solution

Rules run as background tasks at a cadence defined by their check interval. When a rule's check interval is smaller than the Task Manager poll_interval, the rule runs late.

Either tweak the Kibana Task Manager settings or increase the check interval of the rules in question.

For more details, see Tasks with small schedule intervals run late.
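For example, a minimal kibana.yml sketch of the Task Manager setting involved; the 1000 ms value is only an illustration, not a recommendation:

# Lowering the poll interval lets Task Manager pick up frequently scheduled tasks sooner,
# at the cost of more load on Elasticsearch. The default is 3000 ms.
xpack.task_manager.poll_interval: 1000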

Rules run at an inconsistent cadence

Problem

Scheduled rules run at an inconsistent cadence, often running late.

Actions run long after the status of a rule changes, sending a notification of the change too late.

Solution

Rules and actions are run as background tasks by each Kibana instance at a default rate of ten tasks every three seconds. When diagnosing issues related to alerting, focus on the tasks that begin with alerting: and actions:.

Alerting tasks always begin with alerting:. For example, the alerting:.index-threshold tasks back the index threshold stack rule. Action tasks always begin with actions:. For example, the actions:.index tasks back the index action.

For more details on monitoring and diagnosing tasks in Task Manager, refer to Health monitoring.
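For a quick look at how these tasks are keeping up, you can query the Task Manager health API. A hedged example using curl (assuming jq is available); the Kibana URL and credentials are placeholders, and the exact response fields can vary by version:

# Inspect Task Manager runtime stats; look for task types starting with "alerting:" and "actions:".
curl -s -u elastic:changeme "http://localhost:5601/api/task_manager/_health" | jq '.stats.runtime'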

Connectors have TLS errors when running actions

Problem

A connector gets a TLS socket error when connecting to the server to run an action.

Solution

Configuration options are available to specialize connections to TLS servers, including ignoring server certificate validation and providing certificate authority data to verify servers using custom certificates. For more details, see Action settings.
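For example, a hedged kibana.yml sketch of the global verification mode setting; check the Action settings reference for your version before relying on it, and note that per-host overrides are also available via xpack.actions.customHostSettings:

# "full" is the default; "certificate" verifies the certificate but not the hostname; "none" disables validation.
xpack.actions.ssl.verificationMode: 'certificate'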

Rules take a long time to run

Problem

Rules are taking a long time to run and are impacting the overall health of your deployment.

To diagnose long-running rules, you can query the Kibana event log, as shown in the Solution below. By default, only users with the superuser role can query the Kibana event log (a technical preview feature) because it is a system index. To enable additional users to run this query, assign read privileges to the .kibana-event-log* index.
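For example, a hedged sketch of a role that grants that read privilege; the role name is an example, and because the event log is a system index your cluster may additionally require allow_restricted_indices:

# The role name is an example; adjust the privileges to your security policy.
PUT /_security/role/kibana_event_log_reader
{
  "indices": [
    {
      "names": [ ".kibana-event-log*" ],
      "privileges": [ "read" ]
    }
  ]
}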

Solution

By default, rules have a 5m timeout. Rules that run longer than this timeout are automatically canceled to prevent them from consuming too much of Kibana’s resources. Alerts and actions that may have been scheduled before the rule timed out are discarded. When a rule times out, you will see this error in the Kibana logs:

[2022-03-28T13:14:04.062-04:00][WARN ][plugins.taskManager] Cancelling task alerting:.index-threshold "a6ea0070-aec0-11ec-9985-dd576a3fe205" as it expired at 2022-03-28T17:14:03.980Z after running for 05m 10s (with timeout set at 5m).

and on the rule details page:

[Image: Rule details page with timeout error]

If you want your rules to run longer, update the xpack.alerting.rules.run.timeout configuration in your Alerting settings. You can also target a specific rule type by using xpack.alerting.rules.run.ruleTypeOverrides.
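For example, a hedged kibana.yml sketch; the timeout values and the .index-threshold rule type id are examples only:

# Raise the global rule timeout, and override it for one rule type.
xpack.alerting.rules.run:
  timeout: '10m'
  ruleTypeOverrides:
    - id: '.index-threshold'
      timeout: '15m'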

Rules that consistently run longer than their check interval may produce unexpected results. If the average run duration, visible on the details page, is greater than the check interval, consider increasing the check interval.

To get all long-running rules, you can query for a list of rule ids, bucketed by their run times:

GET /.kibana-event-log*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1d", 
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "event.action": {
              "value": "execute"
            }
          }
        },
        {
          "term": {
            "event.provider": {
              "value": "alerting" 
            }
          }
        }
      ]
    }
  },
  "runtime_mappings": { 
    "event.duration_in_seconds": {
      "type": "double",
      "script": {
        "source": "emit(doc['event.duration'].value / 1E9)"
      }
    }
  },
  "aggs": {
    "ruleIdsByExecutionDuration": {
      "histogram": {
        "field": "event.duration_in_seconds",
        "min_doc_count": 1,
        "interval": 1 
      },
      "aggs": {
        "ruleId": {
          "nested": {
            "path": "kibana.saved_objects"
          },
          "aggs": {
            "ruleId": {
              "terms": {
                "field": "kibana.saved_objects.id",
                "size": 10 
              }
            }
          }
        }
      }
    }
  }
}

This query returns rules that ran in the last day. Update the gte and lte values in the range filter to query over a different time range.

To query for long-running actions instead, use event.provider: actions.

Run durations are stored in event.duration as nanoseconds. The event.duration_in_seconds runtime field converts that duration into seconds.

The histogram interval of 1 buckets the event.duration_in_seconds runtime field into one-second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure the histogram targets event.duration instead and specify the interval in nanoseconds.

The terms aggregation size of 10 retrieves the top 10 rule ids for each duration bucket. Update this value to retrieve more rule ids.

This query returns the following:

{
  "took" : 322,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 326,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "ruleIdsByExecutionDuration" : {
      "buckets" : [
        {
          "key" : 0.0, 
          "doc_count" : 320,
          "ruleId" : {
            "doc_count" : 320,
            "ruleId" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "1923ada0-a8f3-11eb-a04b-13d723cdfdc5",
                  "doc_count" : 140
                },
                {
                  "key" : "15415ecf-cdb0-4fef-950a-f824bd277fe4",
                  "doc_count" : 130
                },
                {
                  "key" : "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2",
                  "doc_count" : 50
                }
              ]
            }
          }
        },
        {
          "key" : 30.0, 
          "doc_count" : 6,
          "ruleId" : {
            "doc_count" : 6,
            "ruleId" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "41893910-6bca-11eb-9e0d-85d233e3ee35",
                  "doc_count" : 6
                }
              ]
            }
          }
        }
      ]
    }
  }
}

Most run durations fall within the first bucket (0 - 1 seconds).

A single rule with id 41893910-6bca-11eb-9e0d-85d233e3ee35 took between 30 and 31 seconds to run.

Use the get rule API to retrieve additional information about rules that take a long time to run.
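For example, a hedged curl sketch that fetches the slow rule identified above; the Kibana URL and credentials are placeholders:

# Retrieve the configuration of the rule that took 30+ seconds to run.
curl -s -u elastic:changeme \
  "http://localhost:5601/api/alerting/rule/41893910-6bca-11eb-9e0d-85d233e3ee35"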

Rule cannot decrypt API key

Problem

The rule fails to run and has an Unable to decrypt attribute "apiKey" error.

Solution

This error happens when the xpack.encryptedSavedObjects.encryptionKey value used to create the rule does not match the value used when the rule runs. Depending on the scenario, there are different ways to solve this problem:

If the value in xpack.encryptedSavedObjects.encryptionKey was manually changed, and the previous encryption key is still known.

Ensure any previously used encryption key is included in the decryption-only keys setting (xpack.encryptedSavedObjects.keyRotation.decryptionOnlyKeys).
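A hedged kibana.yml sketch of this setup; the key strings are placeholders:

# The current key is used for encryption and decryption; previous keys can still decrypt existing rules.
xpack.encryptedSavedObjects:
  encryptionKey: "<current key, at least 32 characters>"
  keyRotation:
    decryptionOnlyKeys: ["<previous key, at least 32 characters>"]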

If another Kibana instance with a different encryption key connects to the cluster.

The other Kibana instance might be trying to run the rule with a different encryption key than the one the rule was created with. Ensure the encryption key is the same across all Kibana instances, and set decryption-only keys for any previously used encryption keys.

If other scenarios don’t apply.

Generate a new API key for the rule by disabling and then re-enabling the rule.
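If you prefer the API, a hedged sketch using curl; the Kibana URL, credentials, and rule id are placeholders, and the kbn-xsrf header is required by Kibana HTTP APIs:

# Disabling and re-enabling the rule invalidates the old API key and generates a new one.
curl -s -u elastic:changeme -H "kbn-xsrf: true" -X POST \
  "http://localhost:5601/api/alerting/rule/<rule-id>/_disable"
curl -s -u elastic:changeme -H "kbn-xsrf: true" -X POST \
  "http://localhost:5601/api/alerting/rule/<rule-id>/_enable"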

Rules stop running after upgrade

Problem

Alerting rules that were created or edited in 8.2 stop running after you upgrade to 8.3.0 or 8.3.1. The following error occurs:

<rule-type>:<UUID>: execution failed - security_exception: [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges], caused by: ""

Solution

Upgrade to 8.3.2 or later releases to avoid the problem. To fix failing rules, go to Stack Management > Rules and Connectors and generate new API keys by selecting Update API key from the actions menu. For more details about API key authorization, refer to Authorization.
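If your Kibana version exposes an update API key endpoint for rules (an assumption to verify against your version's alerting APIs), you can also rotate the key over HTTP. A hedged curl sketch; the URL, credentials, and rule id are placeholders:

# Intended as the API equivalent of selecting Update API key in the UI.
curl -s -u elastic:changeme -H "kbn-xsrf: true" -X POST \
  "http://localhost:5601/api/alerting/rule/<rule-id>/_update_api_key"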