Common problems

edit

Operator crashes on startup with OOMKilled

edit

On very large Kubernetes clusters with many hundreds of resources (pods, secrets, config maps, and so on), the operator may fail to start with its pod getting killed with a OOMKilled message. This is an issue with the controller-runtime framework on top of which the operator is built. Even though the operator is only interested in the resources created by itself, the framework code needs to gather information about all relevant resources in the Kubernetes cluster in order to provide the filtered view of cluster state required by the operator. On very large clusters, this information gathering can use up a lot of memory and exceed the default resource limit defined for the operator pod.

The default memory limit for the operator pod is set to 512 MiB. You can increase (or decrease) this limit to a value suited to your cluster as follows:

kubectl patch sts elastic-operator -n elastic-system -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager", "resources":{"limits":{"memory":"768Mi"}}}]}}}}'

Timeout when submitting a resource manifest

edit

When submitting a ECK resource manifest, you may encounter an error message similar to the following:

Error from server (Timeout): error when creating "elasticsearch.yaml": Timeout: request did not complete within requested timeout 30s

This error is usually an indication of a problem communicating with the validating webhook. If you are running ECK on a private Google Kubernetes Engine (GKE) cluster, you may need to add a firewall rule allowing port 9443 from the API server. Another possible cause for failure is if a strict network policy is in effect. Refer to the webhook troubleshooting documentation for more details and workarounds.

Copying secrets with Owner References

edit
Copying the Elasticsearch Secrets generated by ECK (for instance, the certificate authority or the elastic user) into another namespace wholesale can trigger a link:https://github.com/kubernetes/kubernetes/issues/65200[Kubernetes bug] which can delete all of the Elasticsearch-related resources, for example, the data volumes.
Since ECK 1.3.1, `OwnerReference` was removed both from Elasticsearch Secrets containing public certificates and the Secret holding the elastic user credentials. These secrets are likely to be copied.
If Secrets were copied in other namespaces before ECK 1.3.1, make sure you manually remove the `OwnerReference`, as these Secrets might still be affected, even if ECK has been upgraded.

For example, a source secret might be:

$ kubectl get secret quickstart-es-elastic-user -o yaml
apiVersion: v1
data:
  elastic: NGw2Q2REMjgwajZrMVRRS0hxUDVUUTU0
kind: Secret
metadata:
  creationTimestamp: "2020-06-09T19:11:41Z"
  labels:
    common.k8s.elastic.co/type: elasticsearch
    eck.k8s.elastic.co/credentials: "true"
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
  name: quickstart-es-elastic-user
  namespace: default
  ownerReferences:
  - apiVersion: elasticsearch.k8s.elastic.co/v1
    blockOwnerDeletion: true
    controller: true
    kind: Elasticsearch
    name: quickstart
    uid: c7a9b436-aa07-4341-a2cc-b33b3dfcbe29
  resourceVersion: "13048277"
  selfLink: /api/v1/namespaces/default/secrets/quickstart-es-elastic-user
  uid: 04cdf334-77d3-4de6-a2e8-7a2b23366a27
type: Opaque

To copy it to a different namespace, strip the metadata.ownerReferences field as well as the object-specific data:

apiVersion: v1
data:
  elastic: NGw2Q2REMjgwajZrMVRRS0hxUDVUUTU0
kind: Secret
metadata:
  labels:
    common.k8s.elastic.co/type: elasticsearch
    eck.k8s.elastic.co/credentials: "true"
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
  name: quickstart-es-elastic-user
  namespace: default
type: Opaque

Failure to do so can cause data loss.