# kubediag — Diagnostic Rules
This file is auto-generated by `go run ./hack/docgen`. Do not edit by hand.
28 rules across 10 categories.
## Summary
| Rule ID | Title | Severity | Confidence | Scopes |
|---|---|---|---|---|
| TRG-ACCESS-INSUFFICIENT-READ | kubediag has insufficient RBAC permissions; diagnosis is incomplete | info | — | Pod, Deployment, Namespace, Cluster |
| TRG-CLUSTER-APISERVER-LATENCY | API server latency or availability events detected | high | — | Cluster, Namespace |
| TRG-CLUSTER-NODE-NOT-READY | One or more cluster nodes are NotReady | critical | — | Cluster, Namespace |
| TRG-CLUSTER-NODE-PRESSURE | One or more nodes have Memory, Disk, or PID pressure | high | — | Cluster, Namespace |
| TRG-CLUSTER-QUOTA-EXHAUSTED | Namespace ResourceQuota is exhausted or nearly exhausted | high | — | Namespace, Cluster |
| TRG-DEPLOY-ROLLOUT-STUCK | Deployment rollout has exceeded its progress deadline | critical | — | Deployment |
| TRG-DEPLOY-UNAVAILABLE-REPLICAS | Deployment has unavailable replicas | high | — | Deployment |
| TRG-NS-WARNING-EVENTS | Namespace has recent Warning events | medium | — | Namespace |
| TRG-POD-BAD-ENV-REF | Pod env var references a missing key in a ConfigMap or Secret | high | — | Pod |
| TRG-POD-CRASHLOOPBACKOFF | Container is in CrashLoopBackOff | critical | — | Pod |
| TRG-POD-EXIT-IMMEDIATE | Container exits immediately — exec format error or missing binary | critical | — | Pod |
| TRG-POD-IMAGE-AUTH | Container image pull failed due to authentication/authorisation error | high | — | Pod |
| TRG-POD-IMAGE-NOT-FOUND | Container image tag or repository does not exist | high | — | Pod |
| TRG-POD-IMAGEPULLBACKOFF | Container image pull is failing (ImagePullBackOff / ErrImagePull) | high | — | Pod |
| TRG-POD-INIT-FAILED | Pod init container failed and is not in CrashLoopBackOff | high | — | Pod |
| TRG-POD-LIVENESS-FAILING | Container liveness probe is failing (causing container restarts) | high | — | Pod |
| TRG-POD-MISSING-CONFIGMAP | Pod references a ConfigMap that does not exist | high | — | Pod |
| TRG-POD-MISSING-SECRET | Pod references a Secret that does not exist | high | — | Pod |
| TRG-POD-OOMKILLED | Container was OOMKilled | high | — | Pod |
| TRG-POD-PENDING-INSUFFICIENT-RESOURCES | Pod is Pending due to insufficient CPU or memory on all nodes | high | — | Pod |
| TRG-POD-PENDING-PVC-UNBOUND | Pod is Pending because a referenced PVC is not bound | high | — | Pod |
| TRG-POD-PENDING-SELECTOR-MISMATCH | Pod is Pending due to node selector or affinity mismatch | medium | — | Pod |
| TRG-POD-PENDING-TAINT-MISMATCH | Pod is Pending due to untolerated node taints | medium | — | Pod |
| TRG-POD-READINESS-FAILING | Container readiness probe is failing | medium | — | Pod |
| TRG-POD-STARTUP-FAILING | Container startup probe is failing | high | — | Pod |
| TRG-SVC-NO-ENDPOINTS | Service has no endpoints (no pods are selected) | high | — | Pod, Namespace |
| TRG-SVC-PORT-MISMATCH | Service targetPort is not exposed by any selected pod | high | — | Pod, Namespace |
| TRG-SVC-SELECTOR-MISMATCH | Service selector does not match any pod labels in the namespace | high | — | Pod, Namespace |
## Access

### TRG-ACCESS-INSUFFICIENT-READ

kubediag has insufficient RBAC permissions; diagnosis is incomplete
- Severity: info
- Scopes: Pod, Deployment, Namespace, Cluster
- Docs: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
kubediag was denied read access to one or more Kubernetes resources needed for a complete diagnosis. Results may be incomplete.
This is not necessarily a problem with your workloads — it is a signal that the user or service account running kubediag does not have the required RBAC permissions to perform a full inspection.
To diagnose with full fidelity, grant read access (`get`/`list`) on: `pods`, `events`, `deployments`, `replicasets`, `services`, `endpoints`, `configmaps`, `secrets`, `persistentvolumeclaims`, `nodes`.
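The required checks can be scripted. A sketch that emits one `kubectl auth can-i` probe per resource and verb (the function name and output handling are illustrative, not part of kubediag):

```python
# Sketch: generate the "kubectl auth can-i" checks kubediag needs.
# The resource list mirrors the rule text; get/list are the standard
# RBAC read verbs named above.
RESOURCES = [
    "pods", "events", "deployments", "replicasets", "services",
    "endpoints", "configmaps", "secrets", "persistentvolumeclaims", "nodes",
]

def can_i_commands(namespace: str = "default") -> list[str]:
    """Build one 'kubectl auth can-i' invocation per resource and verb."""
    return [
        f"kubectl auth can-i {verb} {resource} -n {namespace}"
        for resource in RESOURCES
        for verb in ("get", "list")
    ]

for cmd in can_i_commands("prod"):
    print(cmd)
```

Running each command and looking for `no` answers shows exactly which reads are missing.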
## Configuration

### TRG-POD-BAD-ENV-REF

Pod env var references a missing key in a ConfigMap or Secret
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#define-container-environment-variables-using-configmap-data
An env var in the pod spec uses valueFrom.configMapKeyRef or valueFrom.secretKeyRef to reference a specific key inside a ConfigMap or Secret. The ConfigMap/Secret exists, but the referenced key does not.
Kubernetes will refuse to start the container and report `CreateContainerConfigError` with a message like:

- `couldn't find key KEY in ConfigMap NS/NAME`
- `couldn't find key KEY in Secret NS/NAME`
This is distinct from TRG-POD-MISSING-CONFIGMAP / TRG-POD-MISSING-SECRET, which fire when the whole ConfigMap or Secret is absent.
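The check this rule performs can be sketched as follows. The dict shapes loosely mirror the pod spec and ConfigMap data; only `configMapKeyRef` is covered here (`secretKeyRef` is analogous):

```python
# Sketch: detect env vars whose configMapKeyRef names a key the
# ConfigMap lacks. Input shapes are illustrative simplifications.
def missing_env_keys(env, configmaps):
    """env: list of {"name": ..., "valueFrom": {"configMapKeyRef":
           {"name": ..., "key": ...}}} entries from the pod spec.
    configmaps: {configmap_name: {key: value}} for ConfigMaps that exist."""
    missing = []
    for var in env:
        ref = (var.get("valueFrom") or {}).get("configMapKeyRef")
        if not ref:
            continue
        cm = configmaps.get(ref["name"])
        # Only flag a missing key when the ConfigMap itself exists;
        # a wholly absent ConfigMap is TRG-POD-MISSING-CONFIGMAP's job.
        if cm is not None and ref["key"] not in cm:
            missing.append(var["name"])
    return missing
```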
### TRG-POD-MISSING-CONFIGMAP

Pod references a ConfigMap that does not exist
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/configuration/configmap/
The pod spec references a ConfigMap (in a volume, envFrom, or env valueFrom) that does not exist in the same namespace. The pod will stay in ContainerCreating or Pending until the ConfigMap is created.
### TRG-POD-MISSING-SECRET

Pod references a Secret that does not exist
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/configuration/secret/
The pod spec references a Secret (in a volume, envFrom, env valueFrom, or imagePullSecrets) that does not exist in the same namespace. The pod will remain in ContainerCreating or Pending until the Secret is created.
## Image

### TRG-POD-IMAGE-AUTH

Container image pull failed due to authentication/authorisation error
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
The registry rejected the pull because credentials are missing, expired, or incorrect. This can also mean the service account's imagePullSecret was not configured, or the secret's credentials have rotated.
### TRG-POD-IMAGE-NOT-FOUND

Container image tag or repository does not exist
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/containers/images/
The registry responded that the specified image or tag was not found. This is almost always a typo in the image reference (wrong tag, wrong repo path, deleted image) rather than a network or auth issue.
### TRG-POD-IMAGEPULLBACKOFF

Container image pull is failing (ImagePullBackOff / ErrImagePull)
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/containers/images/
The kubelet cannot pull the container image. Kubernetes backs off retries exponentially. The most common sub-causes are: wrong image name/tag, image not found, or authentication failure against a private registry.
See TRG-POD-IMAGE-NOT-FOUND and TRG-POD-IMAGE-AUTH for specialised rules that fire when the event message is more specific.
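The relationship between these three rules can be sketched as a message classifier. The matched substrings are illustrative of common registry/containerd error text, not an exact or exhaustive transcription of kubelet output:

```python
# Sketch: route a pull-failure event message to the most specific rule.
# Falls back to the generic backoff rule when nothing more precise matches.
def classify_pull_failure(message: str) -> str:
    m = message.lower()
    if "not found" in m or "manifest unknown" in m:
        return "TRG-POD-IMAGE-NOT-FOUND"
    if "unauthorized" in m or "authentication required" in m or "denied" in m:
        return "TRG-POD-IMAGE-AUTH"
    return "TRG-POD-IMAGEPULLBACKOFF"
```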
## Networking

### TRG-SVC-NO-ENDPOINTS

Service has no endpoints (no pods are selected)
- Severity: high
- Scopes: Pod, Namespace
- Docs: https://kubernetes.io/docs/concepts/services-networking/service/#defining-a-service
The Service's selector does not match any Running+Ready pod, so its Endpoints object is empty. Traffic to this Service will be dropped or return connection refused.
Common causes:
- Label mismatch between the Service selector and the pod labels.
- All pods are failing their readiness probes.
- No pods exist with the selected labels.
- The pods are in a different namespace.
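Selector matching is a subset test: every selector label must appear on the pod with the same value, and only Ready pods become endpoints. A sketch, with pods reduced to illustrative `labels`/`ready` summaries:

```python
# Sketch: a Service selector matches a pod when every selector label is
# present on the pod with the same value (extra pod labels are fine).
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    return all(pod_labels.get(k) == v for k, v in selector.items())

def has_endpoints(selector: dict, pods: list) -> bool:
    """pods: list of {"labels": {...}, "ready": bool} summaries.
    Only matching AND ready pods produce endpoints."""
    return any(selector_matches(selector, p["labels"]) and p["ready"]
               for p in pods)
```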
### TRG-SVC-PORT-MISMATCH

Service targetPort is not exposed by any selected pod
- Severity: high
- Scopes: Pod, Namespace
- Docs: https://kubernetes.io/docs/concepts/services-networking/service/#defining-a-service
A Service's targetPort (the port traffic is forwarded to inside the pod) does not match any containerPort declared by the pods the Service selects.
Note: Kubernetes does not require containerPorts to be declared for traffic to flow — kube-proxy uses iptables/ipvs rules regardless. However, when containerPorts *are* declared and none match the targetPort, this almost always indicates a misconfiguration: a typo, a renamed port, or a changed application port that was not reflected in the Service spec.
This rule only fires when:
- The Service has a non-empty selector (not a headless/external service).
- At least one pod matches the selector.
- Those pods declare containerPorts.
- None of the declared containerPorts match the Service's targetPort.
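The firing conditions can be sketched directly; the field names on the Service and pod summaries are illustrative, not the real API shapes:

```python
# Sketch: the four firing conditions, over minimal dict summaries.
def port_mismatch_fires(service: dict, selected_pods: list) -> bool:
    if not service.get("selector"):   # headless/external service: skip
        return False
    if not selected_pods:             # no pod matches the selector: skip
        return False
    declared = {port for pod in selected_pods
                for port in pod.get("container_ports", [])}
    if not declared:                  # pods declare no containerPorts: skip
        return False
    # Fire only when every declared port misses the targetPort.
    return service["target_port"] not in declared
```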
### TRG-SVC-SELECTOR-MISMATCH

Service selector does not match any pod labels in the namespace
- Severity: high
- Scopes: Pod, Namespace
- Docs: https://kubernetes.io/docs/concepts/services-networking/service/
The Service's selector labels do not appear on any pod in the namespace, regardless of pod health or readiness. This is often a misconfiguration — either the pod labels were changed or the Service selector was mistyped.
## Probes

### TRG-POD-LIVENESS-FAILING

Container liveness probe is failing (causing container restarts)
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
The liveness probe is returning failure. When the probe fails enough times (failureThreshold), Kubernetes kills and restarts the container. Repeated liveness failures appear as increasing restart counts, sometimes leading to CrashLoopBackOff.
A misconfigured liveness probe (wrong path, too-short timeout) is a very common cause of unexpected pod restarts.
### TRG-POD-READINESS-FAILING

Container readiness probe is failing
- Severity: medium
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
The container's readiness probe is not returning success. Kubernetes removes the pod from Service Endpoints while readiness fails, so traffic is not routed to it. The pod stays Running but receives no traffic.
This is distinct from a liveness failure: failing readiness does not restart the container; it only removes the pod from load balancing.
### TRG-POD-STARTUP-FAILING

Container startup probe is failing
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
The startup probe is not passing within its timeout window. Until the startup probe succeeds, both readiness and liveness probes are disabled. If the probe never passes (failureThreshold × periodSeconds elapsed), the container is killed.
The most common misconfiguration is a startup window (failureThreshold × periodSeconds) that is shorter than the application's worst-case cold-start time.
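The startup window arithmetic is simple enough to sketch:

```python
# Sketch: the startup probe must succeed once within
# failureThreshold * periodSeconds, or the container is killed.
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    return failure_threshold * period_seconds

def budget_covers(cold_start_seconds: float,
                  failure_threshold: int, period_seconds: int) -> bool:
    """True when the probe window accommodates the worst-case cold start."""
    return cold_start_seconds <= startup_budget_seconds(failure_threshold,
                                                        period_seconds)
```

For example, `failureThreshold: 30` with `periodSeconds: 10` gives a 300-second window.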
## ResourcePressure

### TRG-CLUSTER-APISERVER-LATENCY

API server latency or availability events detected
- Severity: high
- Scopes: Cluster, Namespace
- Docs: https://kubernetes.io/docs/tasks/debug/debug-cluster/, https://kubernetes.io/docs/concepts/cluster-administration/logging/
Warning events have been detected that indicate the Kubernetes API server is experiencing elevated latency or transient unavailability. These events may originate from:
- The API server itself (SlowReadResponse, SlowWriteResponse)
- The etcd backend (timeout, leader election)
- Controllers failing to reach the API (FailedToCreateEndpoint, context deadline exceeded)
API server latency causes cascading failures: controllers fall behind, pod readiness state becomes stale, and deployments stall. It is usually caused by:
- etcd disk I/O saturation
- Control-plane node resource exhaustion (CPU/memory)
- Large object counts (too many secrets/configmaps/pods in cluster)
- Network issues between control-plane and worker nodes
### TRG-CLUSTER-NODE-NOT-READY

One or more cluster nodes are NotReady
- Severity: critical
- Scopes: Cluster, Namespace
- Docs: https://kubernetes.io/docs/concepts/architecture/nodes/#node-status
At least one node has condition Ready=False or Ready=Unknown, meaning the kubelet on that node is not communicating with the control plane. Pods scheduled to that node will not start, and existing pods on it may become stuck in Terminating.
Common causes:
- Node lost network connectivity.
- kubelet process crashed or is OOMKilled on the node.
- Disk is full on the node (log or container image partition).
- Node was forcibly shut down (spot instance reclaimed, maintenance).
### TRG-CLUSTER-NODE-PRESSURE

One or more nodes have Memory, Disk, or PID pressure
- Severity: high
- Scopes: Cluster, Namespace
- Docs: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
One or more nodes report pressure conditions: MemoryPressure, DiskPressure, or PIDPressure. Under pressure, the kubelet will evict pods (starting with BestEffort, then Burstable) until pressure is relieved.
Pressure conditions are early warnings — address them before pods start being evicted.
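Eviction order depends on each pod's QoS class. A condensed sketch of the classification (the real rules examine every container and both CPU and memory, and default unset requests from limits; this simplifies to one container with requests assumed already defaulted):

```python
# Sketch: simplified QoS classification, which drives eviction order
# under node pressure (BestEffort evicted first, then Burstable;
# Guaranteed last).
def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    # Guaranteed: cpu and memory limits are both set and equal to requests.
    if all(r in limits and requests.get(r) == limits[r]
           for r in ("cpu", "memory")):
        return "Guaranteed"
    return "Burstable"
```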
### TRG-CLUSTER-QUOTA-EXHAUSTED

Namespace ResourceQuota is exhausted or nearly exhausted
- Severity: high
- Scopes: Namespace, Cluster
- Docs: https://kubernetes.io/docs/concepts/policy/resource-quotas/
A ResourceQuota in the namespace has one or more tracked resources at or near its hard limit. When a quota is fully exhausted, the API server will reject new object creation (pods, services, PVCs, etc.) with an "exceeded quota" error.
Thresholds used:
- ≥100% used → Critical: quota fully consumed, new workloads will be rejected.
- ≥95% used → High: quota nearly consumed, plan to increase or clean up.
Common causes:
- Too many replicas scaled up without increasing quota.
- Stale completed/failed pods not cleaned up (they still count toward pod quota).
- The namespace was given too small a quota at creation and the workload has since grown.
### TRG-POD-OOMKILLED

Container was OOMKilled
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
The container's process was killed by the kernel's out-of-memory (OOM) killer because the container exceeded its memory limit. This shows up as `lastState.terminated.reason: "OOMKilled"` in the container status.
Common causes:
- The memory limit is set too low for the application's actual working set.
- A memory leak in the application.
- A bursty workload that briefly needs more memory than the limit allows.
Remediation involves either increasing the memory limit or reducing the application's memory footprint.
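Detecting the signal from a container status can be sketched as follows; the nested dict mirrors `.status.containerStatuses[].lastState.terminated`:

```python
# Sketch: check a container status summary for the OOMKilled reason.
def was_oom_killed(container_status: dict) -> bool:
    terminated = container_status.get("lastState", {}).get("terminated")
    return bool(terminated) and terminated.get("reason") == "OOMKilled"
```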
## Rollout

### TRG-DEPLOY-ROLLOUT-STUCK

Deployment rollout has exceeded its progress deadline
- Severity: critical
- Scopes: Deployment
- Docs: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment
The Deployment's progressDeadlineSeconds was exceeded. Kubernetes marks the Deployment with condition Progressing=False / ProgressDeadlineExceeded. The rollout is stuck — new pods are not becoming ready within the deadline.
Common causes:
- New pods crash (CrashLoopBackOff) and never become ready.
- New pods are unschedulable (insufficient resources, taint mismatch).
- The readiness probe on the new pods is never passing.
- A PVC or Secret is missing that the new pods need.
### TRG-DEPLOY-UNAVAILABLE-REPLICAS

Deployment has unavailable replicas
- Severity: high
- Scopes: Deployment
- Docs: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
The Deployment has fewer ready replicas than desired. Some pods are either not running, crashing, or failing their readiness probe. This reduces the available capacity and may impact traffic.
## Runtime

### TRG-NS-WARNING-EVENTS

Namespace has recent Warning events
- Severity: medium
- Scopes: Namespace
- Docs: https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources
The namespace has recent Kubernetes Warning events across its workloads. This rule aggregates events by reason and provides a summary — it does not diagnose individual pods, but gives a namespace-level picture of what is failing.
Use this as a quick scan before running more specific `kubediag pod` commands.
### TRG-POD-CRASHLOOPBACKOFF

Container is in CrashLoopBackOff
- Severity: critical
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-states
The container has crashed repeatedly and Kubernetes is backing off before restarting it again. The root cause is almost always in the container process itself: the command exits non-zero on every start.
Common causes:
- The application panics or exits immediately due to a missing configuration file or env var.
- The entrypoint binary does not exist in the image.
- An OOMKilled container gets re-labelled as CrashLoopBackOff after several OOM events.
- A health check fails so fast that the container never stabilises.
Check the previous container logs for the actual exit reason.
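The back-off schedule can be sketched with the kubelet's historical defaults: a 10-second base delay that doubles after each crash, capped at 5 minutes. These values are kubelet-internal and not configurable per pod:

```python
# Sketch: the kubelet's default crash restart back-off
# (10s base, doubling per crash, capped at 300s).
def backoff_seconds(crash_count: int, base: int = 10, cap: int = 300) -> int:
    return min(base * 2 ** crash_count, cap)

# Delays after successive crashes: 10s, 20s, 40s, 80s, 160s, 300s, 300s, ...
```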
### TRG-POD-EXIT-IMMEDIATE

Container exits immediately — exec format error or missing binary
- Severity: critical
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/, https://docs.docker.com/build/building/multi-platform/
The container process exited within seconds of starting with a non-zero exit code that indicates the entrypoint binary could not be executed at all:
- Exit code 126: binary found but not executable (wrong permissions or file type)
- Exit code 127: binary not found in PATH (missing from image, wrong entrypoint)
- "exec format error": image built for a different CPU architecture (e.g. amd64 image on arm64 node)
- "no such file or directory": entrypoint path does not exist inside the container
These are distinct from application crashes: the process never started, because the kernel or container runtime rejected the binary before any user code ran.
### TRG-POD-INIT-FAILED

Pod init container failed and is not in CrashLoopBackOff
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
An init container exited with a non-zero exit code, preventing the main containers from starting. Unlike CrashLoopBackOff (repeated restarts), this catches the case where the init container has failed but has not been retried yet (or is blocked for other reasons).
Common causes:
- Init container command references a script/binary that doesn't exist in the image.
- Init container depends on a service (database, API) that is not yet ready.
- Network policy blocks the init container from reaching an external service.
## Scheduling

### TRG-POD-PENDING-INSUFFICIENT-RESOURCES

Pod is Pending due to insufficient CPU or memory on all nodes
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
No node has enough allocatable CPU or memory to satisfy the pod's resource requests. The pod will remain Pending until a node with sufficient capacity becomes available.
Common causes:
- Resource requests are set too high relative to cluster capacity.
- All nodes are at capacity; scale up the cluster or reduce requests.
- Requests were written with the wrong unit (e.g. CPU `1000` instead of `1000m`), making them unexpectedly large.
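The basic fit check can be sketched with quantities already parsed to integers (CPU in millicores, memory in bytes). Real quantity parsing of `500m` or `128Mi` is omitted, and a real scheduler checks against allocatable capacity minus what existing pods already request:

```python
# Sketch: does any node have enough allocatable CPU and memory for the
# pod's requests? Field names are illustrative.
def fits(pod_requests: dict, node_allocatable: dict) -> bool:
    return (pod_requests.get("cpu_m", 0) <= node_allocatable.get("cpu_m", 0)
            and pod_requests.get("memory", 0) <= node_allocatable.get("memory", 0))

def schedulable(pod_requests: dict, nodes: list) -> bool:
    """True when at least one node can satisfy the requests."""
    return any(fits(pod_requests, n) for n in nodes)
```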
### TRG-POD-PENDING-SELECTOR-MISMATCH

Pod is Pending due to node selector or affinity mismatch
- Severity: medium
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
No node satisfies the pod's nodeSelector, nodeAffinity, or podAffinity rules. The pod will wait indefinitely until a matching node is available or the scheduling constraints are relaxed.
### TRG-POD-PENDING-TAINT-MISMATCH

Pod is Pending due to untolerated node taints
- Severity: medium
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
All nodes carry taints that this pod does not tolerate. The pod must declare tolerations for all required taints to be scheduled.
Common in: dedicated node groups, spot instances, GPU nodes, or nodes marked with NoSchedule for maintenance.
## Storage

### TRG-POD-PENDING-PVC-UNBOUND

Pod is Pending because a referenced PVC is not bound
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
A PersistentVolumeClaim (PVC) referenced by this pod is in Pending state — it has not been bound to a PersistentVolume. The pod cannot start until the PVC is bound.
Common causes:
- No PersistentVolume matches the PVC's storageClass, accessMode, or capacity.
- The StorageClass does not exist.
- Dynamic provisioning failed (CSI driver error, quota, permissions).