Common problems
Operator crashes on startup with OOMKilled
On very large Kubernetes clusters with many hundreds of resources (pods, secrets, config maps, and so on), the operator may fail to start, with its pod terminated for an OOMKilled reason:
kubectl -n elastic-system get pods -o=jsonpath='{.items[].status.containerStatuses}' | jq
[
  {
    "containerID": "containerd://...",
    "image": "docker.elastic.co/eck/eck-operator:2.5.0",
    "imageID": "docker.elastic.co/eck/eck-operator@sha256:...",
    "lastState": {
      "terminated": {
        "containerID": "containerd://...",
        "exitCode": 137,
        "finishedAt": "2022-07-04T09:47:02Z",
        "reason": "OOMKilled",
        "startedAt": "2022-07-04T09:46:43Z"
      }
    },
    "name": "manager",
    "ready": false,
    "restartCount": 2,
    "started": false,
    "state": {
      "waiting": {
        "message": "back-off 20s restarting failed container=manager pod=elastic-operator-0_elastic-system(57de3efd-57e0-4c1e-8151-72b0ac4d6b14)",
        "reason": "CrashLoopBackOff"
      }
    }
  }
]
This is an issue with the controller-runtime framework on top of which the operator is built. Even though the operator is only interested in the resources created by itself, the framework code needs to gather information about all relevant resources in the Kubernetes cluster in order to provide the filtered view of cluster state required by the operator. On very large clusters, this information gathering can use up a lot of memory and exceed the default resource limit defined for the operator pod.
The default memory limit for the operator pod is set to 1 Gi. You can increase (or decrease) this limit to a value suited to your cluster as follows:
kubectl patch sts elastic-operator -n elastic-system -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager", "resources":{"limits":{"memory":"2Gi"}}}]}}}}'
Set limits (spec.containers[].resources.limits) that match requests (spec.containers[].resources.requests) to prevent the operator’s Pod from being terminated during node-pressure eviction.
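As a minimal sketch of the advice above, the following shell functions build a patch that sets the memory request and the memory limit of the manager container to the same value. The function names are made up for this example, 2Gi is only an illustrative value, and CPU settings are left untouched.

```shell
# Emit a strategic-merge patch that sets the memory request and the memory
# limit of the "manager" container to the same value.
build_memory_patch() {
  # $1: memory quantity, for example 2Gi (an illustrative value)
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"requests":{"memory":"%s"},"limits":{"memory":"%s"}}}]}}}}\n' "$1" "$1"
}

# Apply the patch to the operator StatefulSet in the elastic-system namespace.
patch_operator_memory() {
  kubectl patch sts elastic-operator -n elastic-system -p "$(build_memory_patch "$1")"
}
```

Running build_memory_patch 2Gi on its own prints the patch so that it can be reviewed first; patch_operator_memory 2Gi applies it.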
Timeout when submitting a resource manifest
When submitting an ECK resource manifest, you may encounter an error message similar to the following:
Error from server (Timeout): error when creating "elasticsearch.yaml": Timeout: request did not complete within requested timeout 30s
This error is usually an indication of a problem communicating with the validating webhook. If you are running ECK on a private Google Kubernetes Engine (GKE) cluster, you may need to add a firewall rule allowing port 9443 from the API server. Another possible cause for failure is if a strict network policy is in effect. Refer to the webhook troubleshooting documentation for more details and workarounds.
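For the private GKE case, a firewall rule of the following shape might be what is needed. This is a sketch only: the function name is made up, and the rule name, network, control plane CIDR, and node tag are all placeholders that depend on your cluster. The webhook troubleshooting documentation remains the reference.

```shell
# Create an ingress firewall rule that lets the GKE control plane reach the
# webhook port on the nodes. All four arguments are placeholders to fill in.
allow_webhook_port() {
  # $1: rule name, $2: VPC network, $3: control plane CIDR, $4: node network tag
  gcloud compute firewall-rules create "$1" \
    --network "$2" \
    --direction INGRESS \
    --source-ranges "$3" \
    --target-tags "$4" \
    --allow tcp:9443
}
```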
Copying secrets with Owner References
Copying the Elasticsearch Secrets generated by ECK (for instance, the certificate authority or the elastic user) into another namespace wholesale can trigger a Kubernetes bug that can delete all of the Elasticsearch-related resources, including the data volumes.
Since ECK 1.3.1, OwnerReference was removed both from Elasticsearch Secrets containing public certificates and the Secret holding the elastic user credentials. These secrets are likely to be copied.
If Secrets were copied to other namespaces before ECK 1.3.1, make sure you manually remove the OwnerReference, as these Secrets might still be affected, even if ECK has been upgraded.
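One possible way to remove the reference from a Secret that has already been copied is a JSON patch, sketched below. The function name is made up for this example, and the namespace and Secret name are whatever applies in your cluster.

```shell
# JSON Patch that removes the whole ownerReferences list from the metadata.
# The patch fails if the Secret has no ownerReferences field, which is harmless.
OWNER_REF_PATCH='[{"op":"remove","path":"/metadata/ownerReferences"}]'

remove_owner_references() {
  # $1: namespace the Secret was copied into, $2: name of the copied Secret
  kubectl patch secret "$2" -n "$1" --type=json -p "$OWNER_REF_PATCH"
}
```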
For example, a source secret might be:
kubectl get secret quickstart-es-elastic-user -o yaml
apiVersion: v1
data:
  elastic: NGw2Q2REMjgwajZrMVRRS0hxUDVUUTU0
kind: Secret
metadata:
  creationTimestamp: "2020-06-09T19:11:41Z"
  labels:
    common.k8s.elastic.co/type: elasticsearch
    eck.k8s.elastic.co/credentials: "true"
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
  name: quickstart-es-elastic-user
  namespace: default
  ownerReferences:
  - apiVersion: elasticsearch.k8s.elastic.co/v1
    blockOwnerDeletion: true
    controller: true
    kind: Elasticsearch
    name: quickstart
    uid: c7a9b436-aa07-4341-a2cc-b33b3dfcbe29
  resourceVersion: "13048277"
  selfLink: /api/v1/namespaces/default/secrets/quickstart-es-elastic-user
  uid: 04cdf334-77d3-4de6-a2e8-7a2b23366a27
type: Opaque
To copy it to a different namespace, strip the metadata.ownerReferences field as well as the object-specific data:
apiVersion: v1
data:
  elastic: NGw2Q2REMjgwajZrMVRRS0hxUDVUUTU0
kind: Secret
metadata:
  labels:
    common.k8s.elastic.co/type: elasticsearch
    eck.k8s.elastic.co/credentials: "true"
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
  name: quickstart-es-elastic-user
  namespace: default
type: Opaque
Failure to do so can cause data loss.
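The stripping step could be scripted along the following lines. This is a sketch that assumes jq is installed; the function name and the jq filter are illustrative, and the filter also sets the namespace to the target passed as the third argument.

```shell
# jq filter: drop the owner reference and the object-specific metadata, then
# point the Secret at the target namespace passed in as $ns.
STRIP_FILTER='del(.metadata.ownerReferences, .metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.selfLink) | .metadata.namespace = $ns'

copy_secret() {
  # $1: Secret name, $2: source namespace, $3: target namespace
  kubectl get secret "$1" -n "$2" -o json \
    | jq --arg ns "$3" "$STRIP_FILTER" \
    | kubectl apply -f -
}
```

Replacing the final kubectl apply with a plain redirect to a file is a reasonable way to inspect the result before creating anything.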
Scale down of Elasticsearch master-eligible Pods seems stuck
If a master-eligible Elasticsearch Pod was never successfully scheduled and the Elasticsearch cluster is running version 7.8 or earlier, ECK may fail to scale down the Pod. To find out whether you are affected, check if the Pod in question is pending:
> kubectl get pods
pod/<cluster-name>-es-<nodeset>-1   0/1   Pending   0   26m   <none>   <none>
Check the operator logs for an error similar to:
"unable to add to voting_config_exclusions: 400 Bad Request: add voting config exclusions request for [<cluster-name>-es-<nodeset>-1] matched no master-eligible nodes",
To work around this issue, scale down the underlying StatefulSet manually. First, identify the affected StatefulSet and the number of Pods that are ready (symbolized by m in this example):
> kubectl get sts -l elasticsearch.k8s.elastic.co/cluster-name=<cluster-name>
NAME                          READY   AGE
<cluster-name>-es-<nodeset>   m/n     44h
Then, scale down the StatefulSet to the right size m, removing the pending Pod:
> kubectl scale --replicas=m sts/<cluster-name>-es-<nodeset>
Do not use this method to scale down Pods that have already joined the Elasticsearch cluster, as additional data loss protection that ECK applies is sidestepped.
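The two manual steps can be combined in a small helper, sketched here under the assumption that the ready count reported by the StatefulSet is the size m you want. The function name is made up, and the warning above applies unchanged: check that the Pods being removed never joined the cluster.

```shell
scale_to_ready() {
  # $1: name of the affected StatefulSet
  ready=$(kubectl get sts "$1" -o jsonpath='{.status.readyReplicas}')
  # The field is omitted when no Pod is ready, so stop instead of guessing.
  if [ -z "$ready" ]; then
    echo "no ready replicas reported for $1; not scaling" >&2
    return 1
  fi
  kubectl scale --replicas="$ready" "sts/$1"
}
```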
Pods are not replaced after a configuration update
The update of an existing Elasticsearch cluster configuration can fail because the operator is unable to apply the changes required while replacing the pods of a given Elasticsearch cluster.
A key indicator is the Phase of the Elasticsearch resource remaining in the ApplyingChanges state for too long:
kubectl get es
NAME                   HEALTH   NODES   VERSION   PHASE             AGE
elasticsearch-sample   yellow   2       7.9.2     ApplyingChanges   36m
Possible causes include:
- The Elasticsearch cluster is not healthy
kubectl get elasticsearch
NAME                                                              HEALTH   NODES   VERSION   PHASE   AGE
elasticsearch.elasticsearch.k8s.elastic.co/elasticsearch-sample   yellow   1       7.9.2     Ready   3m50s
In this case, you have to check and fix your shard allocations. The cluster health, cat shards, and get Elasticsearch APIs can assist in tracking the shard recovery process.
- Scheduling issues
The scheduling fails with the following message:
kubectl get events --sort-by='{.lastTimestamp}' | tail
LAST SEEN   TYPE      REASON             OBJECT                        MESSAGE
10s         Warning   FailedScheduling   pod/quickstart-es-default-2   0/3 nodes are available: 3 Insufficient memory.
As an alternative, to get more specific information about a given pod, you can use the following command:
kubectl get pod elasticsearch-sample-es-default-2 -o go-template="{{.status}}"
map[conditions:[map[lastProbeTime:<nil> lastTransitionTime:2020-12-07T09:31:06Z message:0/3 nodes are available: 3 Insufficient cpu. reason:Unschedulable status:False type:PodScheduled]] phase:Pending qosClass:Guaranteed]
- The operator is not able to restart some nodes
kubectl -n elastic-system logs statefulset.apps/elastic-operator | tail
{"log.level":"info","@timestamp":"2020-11-19T17:34:48.769Z","log.logger":"driver","message":"Cannot restart some nodes for upgrade at this time","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"default","es_name":"quickstart","failed_predicates":{"do_not_restart_healthy_node_if_MaxUnavailable_reached":["quickstart-es-default-1","quickstart-es-default-0"]}}
A pod is stuck in a Pending status:
kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
quickstart-es-default-0   1/1     Running   0          146m
quickstart-es-default-1   1/1     Running   0          146m
quickstart-es-default-2   0/1     Pending   0          134m
In this case, you have to add more Kubernetes nodes or free up resources.
For more information, check Troubleshooting methods.
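For the scheduling case in the list above, the two helpers below are one way to narrow things down: the first lists only Pods that are still Pending, the second prints the scheduler's message for a single Pod instead of the whole status map. The function names are made up for this sketch.

```shell
# List Pods that are still Pending in the current namespace.
pending_pods() {
  kubectl get pods --field-selector=status.phase=Pending
}

# Print the scheduler's explanation for a single Pod, taken from its
# PodScheduled condition.
scheduling_message() {
  # $1: Pod name
  kubectl get pod "$1" -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'
}
```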
ECK operator upgrade stays pending when using OLM
When using Operator Lifecycle Manager (OLM) to install and upgrade the ECK operator, an upgrade of ECK will not complete on older versions of OLM. This is due to an issue in OLM itself that is fixed in version 0.16.0 or later. OLM is also used behind the scenes when you install ECK as a Red Hat Certified Operator on OpenShift or as a community operator through operatorhub.io.
> oc get csv
NAME                       DISPLAY                        VERSION   REPLACES                   PHASE
elastic-cloud-eck.v1.3.1   Elasticsearch (ECK) Operator   1.3.1     elastic-cloud-eck.v1.3.0   Replacing
elastic-cloud-eck.v1.4.0   Elasticsearch (ECK) Operator   1.4.0     elastic-cloud-eck.v1.3.1   Pending
If you are using one of the affected versions of OLM and upgrading OLM to a newer version is not possible, ECK can still be upgraded by uninstalling and reinstalling it. To do so, remove the Subscription and both ClusterServiceVersion resources, then add them again.
On OpenShift the same workaround can be performed in the UI by clicking on "Uninstall Operator" and then reinstalling it through OperatorHub.
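On the command line, the removal half of the workaround might look like the sketch below. The function name is made up, and the namespace, Subscription name, and ClusterServiceVersion names are placeholders: take the real values from your own cluster, for example from the oc get csv output. Reinstalling afterwards depends on how ECK was installed in the first place.

```shell
# Remove the Subscription and both ClusterServiceVersions so that ECK can be
# installed again. Every argument is a placeholder to fill in.
uninstall_eck_olm() {
  # $1: namespace, $2: Subscription name, $3: old CSV name, $4: new CSV name
  oc delete subscriptions.operators.coreos.com "$2" -n "$1"
  oc delete csv "$3" "$4" -n "$1"
}
```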
If you upgraded Elasticsearch to the wrong version
If you accidentally upgrade one of your Elasticsearch clusters to a version that does not exist, or to a version that cannot be reached by a direct upgrade from your currently deployed version, a validation will prevent you from going back to the previous version. ECK does not allow downgrades because Elasticsearch does not support them: once the data directory of Elasticsearch has been upgraded, there is no way back to the old version without a snapshot restore.
These two upgrade scenarios, however, are exceptions because Elasticsearch never started up successfully. If you annotate the Elasticsearch resource with eck.k8s.elastic.co/disable-downgrade-validation=true, ECK allows you to go back to the old version at your own risk. If you also attempted an upgrade of other related Elastic Stack applications at the same time, you can use the same annotation to go back. Remove the annotation afterwards to prevent accidental downgrades and reduced availability.
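Setting and later removing the annotation could be done with kubectl annotate, as in this sketch. The function names are made up; the resource kind and name are arguments because the same annotation may be applied to other Elastic Stack resources.

```shell
ANNOTATION='eck.k8s.elastic.co/disable-downgrade-validation'

# Disable the downgrade validation on one resource.
allow_downgrade() {
  # $1: resource kind, for example elasticsearch; $2: resource name
  kubectl annotate "$1" "$2" "${ANNOTATION}=true"
}

# Re-enable the validation once the version has been corrected.
restore_downgrade_validation() {
  # A trailing dash tells kubectl to remove the annotation.
  kubectl annotate "$1" "$2" "${ANNOTATION}-"
}
```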