# kubediag — Diagnostic Rules
This file is auto-generated by `go run ./hack/docgen`. Do not edit by hand.
28 rules across 10 categories.
## Summary
| Rule ID | Title | Severity | Confidence | Scopes |
|---|---|---|---|---|
| TRG-ACCESS-INSUFFICIENT-READ | kubediag has insufficient RBAC permissions; diagnosis is incomplete | info | — | Pod, Deployment, Namespace, Cluster |
| TRG-CLUSTER-APISERVER-LATENCY | API server latency or availability events detected | high | — | Cluster, Namespace |
| TRG-CLUSTER-NODE-NOT-READY | One or more cluster nodes are NotReady | critical | — | Cluster, Namespace |
| TRG-CLUSTER-NODE-PRESSURE | One or more nodes have Memory, Disk, or PID pressure | high | — | Cluster, Namespace |
| TRG-CLUSTER-QUOTA-EXHAUSTED | Namespace ResourceQuota is exhausted or nearly exhausted | high | — | Namespace, Cluster |
| TRG-DEPLOY-ROLLOUT-STUCK | Deployment rollout has exceeded its progress deadline | critical | — | Deployment |
| TRG-DEPLOY-UNAVAILABLE-REPLICAS | Deployment has unavailable replicas | high | — | Deployment |
| TRG-NS-WARNING-EVENTS | Namespace has recent Warning events | medium | — | Namespace |
| TRG-POD-BAD-ENV-REF | Pod env var references a missing key in a ConfigMap or Secret | high | — | Pod |
| TRG-POD-CRASHLOOPBACKOFF | Container is in CrashLoopBackOff | critical | — | Pod |
| TRG-POD-EXIT-IMMEDIATE | Container exits immediately — exec format error or missing binary | critical | — | Pod |
| TRG-POD-IMAGE-AUTH | Container image pull failed due to authentication/authorisation error | high | — | Pod |
| TRG-POD-IMAGE-NOT-FOUND | Container image tag or repository does not exist | high | — | Pod |
| TRG-POD-IMAGEPULLBACKOFF | Container image pull is failing (ImagePullBackOff / ErrImagePull) | high | — | Pod |
| TRG-POD-INIT-FAILED | Pod init container failed and is not in CrashLoopBackOff | high | — | Pod |
| TRG-POD-LIVENESS-FAILING | Container liveness probe is failing (causing container restarts) | high | — | Pod |
| TRG-POD-MISSING-CONFIGMAP | Pod references a ConfigMap that does not exist | high | — | Pod |
| TRG-POD-MISSING-SECRET | Pod references a Secret that does not exist | high | — | Pod |
| TRG-POD-OOMKILLED | Container was OOMKilled | high | — | Pod |
| TRG-POD-PENDING-INSUFFICIENT-RESOURCES | Pod is Pending due to insufficient CPU or memory on all nodes | high | — | Pod |
| TRG-POD-PENDING-PVC-UNBOUND | Pod is Pending because a referenced PVC is not bound | high | — | Pod |
| TRG-POD-PENDING-SELECTOR-MISMATCH | Pod is Pending due to node selector or affinity mismatch | medium | — | Pod |
| TRG-POD-PENDING-TAINT-MISMATCH | Pod is Pending due to untolerated node taints | medium | — | Pod |
| TRG-POD-READINESS-FAILING | Container readiness probe is failing | medium | — | Pod |
| TRG-POD-STARTUP-FAILING | Container startup probe is failing | high | — | Pod |
| TRG-SVC-NO-ENDPOINTS | Service has no endpoints (no pods are selected) | high | — | Pod, Namespace |
| TRG-SVC-PORT-MISMATCH | Service targetPort is not exposed by any selected pod | high | — | Pod, Namespace |
| TRG-SVC-SELECTOR-MISMATCH | Service selector does not match any pod labels in the namespace | high | — | Pod, Namespace |
## Access

### TRG-ACCESS-INSUFFICIENT-READ

kubediag has insufficient RBAC permissions; diagnosis is incomplete
- Severity: info
- Scopes: Pod, Deployment, Namespace, Cluster
- Docs: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
kubediag was denied read access to one or more Kubernetes resources needed for a complete diagnosis. Results may be incomplete.
This is not necessarily a problem with your workloads — it is a signal that the user or service account running kubediag does not have the required RBAC permissions to perform a full inspection.
To diagnose with full fidelity, grant read access (`get`/`list`) on: `pods`, `events`, `deployments`, `replicasets`, `services`, `endpoints`, `configmaps`, `secrets`, `persistentvolumeclaims`, `nodes`.
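The required checks can be scripted. A sketch that emits one `kubectl auth can-i` probe per resource and verb (the function name and output handling are illustrative, not part of kubediag):

```python
# Sketch: generate the "kubectl auth can-i" checks kubediag needs.
# The resource list mirrors the rule text; get/list are the standard
# RBAC read verbs named above.
RESOURCES = [
    "pods", "events", "deployments", "replicasets", "services",
    "endpoints", "configmaps", "secrets", "persistentvolumeclaims", "nodes",
]

def can_i_commands(namespace: str = "default") -> list[str]:
    """Build one 'kubectl auth can-i' invocation per resource and verb."""
    return [
        f"kubectl auth can-i {verb} {resource} -n {namespace}"
        for resource in RESOURCES
        for verb in ("get", "list")
    ]

for cmd in can_i_commands("prod"):
    print(cmd)
```

Running each command and looking for `no` answers shows exactly which reads are missing.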
## Configuration

### TRG-POD-BAD-ENV-REF

Pod env var references a missing key in a ConfigMap or Secret
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#define-container-environment-variables-using-configmap-data
An env var in the pod spec uses valueFrom.configMapKeyRef or valueFrom.secretKeyRef to reference a specific key inside a ConfigMap or Secret. The ConfigMap/Secret exists, but the referenced key does not.
Kubernetes will refuse to start the container and report `CreateContainerConfigError` with a message like:

- `couldn't find key KEY in ConfigMap NS/NAME`
- `couldn't find key KEY in Secret NS/NAME`
This is distinct from TRG-POD-MISSING-CONFIGMAP / TRG-POD-MISSING-SECRET, which fire when the whole ConfigMap or Secret is absent.
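The check this rule performs can be sketched as follows. The dict shapes loosely mirror the pod spec and ConfigMap data; only `configMapKeyRef` is covered here (`secretKeyRef` is analogous):

```python
# Sketch: detect env vars whose configMapKeyRef names a key the
# ConfigMap lacks. Input shapes are illustrative simplifications.
def missing_env_keys(env, configmaps):
    """env: list of {"name": ..., "valueFrom": {"configMapKeyRef":
           {"name": ..., "key": ...}}} entries from the pod spec.
    configmaps: {configmap_name: {key: value}} for ConfigMaps that exist."""
    missing = []
    for var in env:
        ref = (var.get("valueFrom") or {}).get("configMapKeyRef")
        if not ref:
            continue
        cm = configmaps.get(ref["name"])
        # Only flag a missing key when the ConfigMap itself exists;
        # a wholly absent ConfigMap is TRG-POD-MISSING-CONFIGMAP's job.
        if cm is not None and ref["key"] not in cm:
            missing.append(var["name"])
    return missing
```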
### TRG-POD-MISSING-CONFIGMAP

Pod references a ConfigMap that does not exist
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/configuration/configmap/
The pod spec references a ConfigMap (in a volume, envFrom, or env valueFrom) that does not exist in the same namespace. The pod will stay in ContainerCreating or Pending until the ConfigMap is created.
### TRG-POD-MISSING-SECRET

Pod references a Secret that does not exist
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/configuration/secret/
The pod spec references a Secret (in a volume, envFrom, env valueFrom, or imagePullSecrets) that does not exist in the same namespace. The pod will remain in ContainerCreating or Pending until the Secret is created.
## Image

### TRG-POD-IMAGE-AUTH

Container image pull failed due to authentication/authorisation error
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
The registry rejected the pull because credentials are missing, expired, or incorrect. This can also mean the service account's imagePullSecret was not configured, or the secret's credentials have rotated.
### TRG-POD-IMAGE-NOT-FOUND

Container image tag or repository does not exist
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/containers/images/
The registry responded that the specified image or tag was not found. This is almost always a typo in the image reference (wrong tag, wrong repo path, deleted image) rather than a network or auth issue.
### TRG-POD-IMAGEPULLBACKOFF

Container image pull is failing (ImagePullBackOff / ErrImagePull)
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/containers/images/
The kubelet cannot pull the container image. Kubernetes backs off retries exponentially. The most common sub-causes are: wrong image name/tag, image not found, or authentication failure against a private registry.
See TRG-POD-IMAGE-NOT-FOUND and TRG-POD-IMAGE-AUTH for specialised rules that fire when the event message is more specific.
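The relationship between these three rules can be sketched as a message classifier. The matched substrings are illustrative of common registry/containerd error text, not an exact or exhaustive transcription of kubelet output:

```python
# Sketch: route a pull-failure event message to the most specific rule.
# Falls back to the generic backoff rule when nothing more precise matches.
def classify_pull_failure(message: str) -> str:
    m = message.lower()
    if "not found" in m or "manifest unknown" in m:
        return "TRG-POD-IMAGE-NOT-FOUND"
    if "unauthorized" in m or "authentication required" in m or "denied" in m:
        return "TRG-POD-IMAGE-AUTH"
    return "TRG-POD-IMAGEPULLBACKOFF"
```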
## Networking

### TRG-SVC-NO-ENDPOINTS

Service has no endpoints (no pods are selected)
- Severity: high
- Scopes: Pod, Namespace
- Docs: https://kubernetes.io/docs/concepts/services-networking/service/#defining-a-service
The Service's selector does not match any Running+Ready pod, so its Endpoints object is empty. Traffic to this Service will be dropped or return connection refused.
Common causes:
- Label mismatch between the Service selector and the pod labels.
- All pods are failing their readiness probes.
- No pods exist with the selected labels.
- The pods are in a different namespace.
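Selector matching is a subset test: every selector label must appear on the pod with the same value, and only Ready pods become endpoints. A sketch, with pods reduced to illustrative `labels`/`ready` summaries:

```python
# Sketch: a Service selector matches a pod when every selector label is
# present on the pod with the same value (extra pod labels are fine).
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    return all(pod_labels.get(k) == v for k, v in selector.items())

def has_endpoints(selector: dict, pods: list) -> bool:
    """pods: list of {"labels": {...}, "ready": bool} summaries.
    Only matching AND ready pods produce endpoints."""
    return any(selector_matches(selector, p["labels"]) and p["ready"]
               for p in pods)
```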
### TRG-SVC-PORT-MISMATCH

Service targetPort is not exposed by any selected pod
- Severity: high
- Scopes: Pod, Namespace
- Docs: https://kubernetes.io/docs/concepts/services-networking/service/#defining-a-service
A Service's targetPort (the port traffic is forwarded to inside the pod) does not match any containerPort declared by the pods the Service selects.
Note: Kubernetes does not require containerPorts to be declared for traffic to flow — kube-proxy uses iptables/ipvs rules regardless. However, when containerPorts *are* declared and none match the targetPort, this almost always indicates a misconfiguration: a typo, a renamed port, or a changed application port that was not reflected in the Service spec.
This rule only fires when:
- The Service has a non-empty selector (not a headless/external service).
- At least one pod matches the selector.
- Those pods declare containerPorts.
- None of the declared containerPorts match the Service's targetPort.
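The firing conditions can be sketched directly; the field names on the Service and pod summaries are illustrative, not the real API shapes:

```python
# Sketch: the four firing conditions, over minimal dict summaries.
def port_mismatch_fires(service: dict, selected_pods: list) -> bool:
    if not service.get("selector"):   # headless/external service: skip
        return False
    if not selected_pods:             # no pod matches the selector: skip
        return False
    declared = {port for pod in selected_pods
                for port in pod.get("container_ports", [])}
    if not declared:                  # pods declare no containerPorts: skip
        return False
    # Fire only when every declared port misses the targetPort.
    return service["target_port"] not in declared
```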
### TRG-SVC-SELECTOR-MISMATCH

Service selector does not match any pod labels in the namespace
- Severity: high
- Scopes: Pod, Namespace
- Docs: https://kubernetes.io/docs/concepts/services-networking/service/
The Service's selector labels do not appear on any pod in the namespace, regardless of pod health or readiness. This is often a misconfiguration — either the pod labels were changed or the Service selector was mistyped.
## Probes

### TRG-POD-LIVENESS-FAILING

Container liveness probe is failing (causing container restarts)
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
The liveness probe is returning failure. When the probe fails enough times (failureThreshold), Kubernetes kills and restarts the container. Repeated liveness failures appear as increasing restart counts, sometimes leading to CrashLoopBackOff.
A misconfigured liveness probe (wrong path, too-short timeout) is a very common cause of unexpected pod restarts.
### TRG-POD-READINESS-FAILING

Container readiness probe is failing
- Severity: medium
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
The container's readiness probe is not returning success. Kubernetes removes the pod from Service Endpoints while readiness fails, so traffic is not routed to it. The pod stays Running but receives no traffic.
This is distinct from a liveness failure: failing readiness does not restart the container; it only removes the pod from load balancing.
### TRG-POD-STARTUP-FAILING

Container startup probe is failing
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
The startup probe is not passing within its timeout window. Until the startup probe succeeds, both readiness and liveness probes are disabled. If the probe never passes (failureThreshold × periodSeconds elapsed), the container is killed.
The most common misconfiguration is a startup window (failureThreshold × periodSeconds) that is shorter than the application's worst-case cold-start time.
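The startup window arithmetic is simple enough to sketch:

```python
# Sketch: the startup probe must succeed once within
# failureThreshold * periodSeconds, or the container is killed.
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    return failure_threshold * period_seconds

def budget_covers(cold_start_seconds: float,
                  failure_threshold: int, period_seconds: int) -> bool:
    """True when the probe window accommodates the worst-case cold start."""
    return cold_start_seconds <= startup_budget_seconds(failure_threshold,
                                                        period_seconds)
```

For example, `failureThreshold: 30` with `periodSeconds: 10` gives a 300-second window.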
## ResourcePressure

### TRG-CLUSTER-APISERVER-LATENCY

API server latency or availability events detected
- Severity: high
- Scopes: Cluster, Namespace
- Docs: https://kubernetes.io/docs/tasks/debug/debug-cluster/, https://kubernetes.io/docs/concepts/cluster-administration/logging/
Warning events have been detected that indicate the Kubernetes API server is experiencing elevated latency or transient unavailability. These events may originate from:
- The API server itself (SlowReadResponse, SlowWriteResponse)
- The etcd backend (timeout, leader election)
- Controllers failing to reach the API (FailedToCreateEndpoint, context deadline exceeded)
API server latency causes cascading failures: controllers fall behind, pod readiness state becomes stale, and deployments stall. It is usually caused by:
- etcd disk I/O saturation
- Control-plane node resource exhaustion (CPU/memory)
- Large object counts (too many secrets/configmaps/pods in cluster)
- Network issues between control-plane and worker nodes
### TRG-CLUSTER-NODE-NOT-READY

One or more cluster nodes are NotReady
- Severity: critical
- Scopes: Cluster, Namespace
- Docs: https://kubernetes.io/docs/concepts/architecture/nodes/#node-status
At least one node has condition Ready=False or Ready=Unknown, meaning the kubelet on that node is not communicating with the control plane. Pods scheduled to that node will not start, and existing pods on it may become stuck in Terminating.
Common causes:
- Node lost network connectivity.
- kubelet process crashed or is OOMKilled on the node.
- Disk is full on the node (log or container image partition).
- Node was forcibly shut down (spot instance reclaimed, maintenance).
### TRG-CLUSTER-NODE-PRESSURE

One or more nodes have Memory, Disk, or PID pressure
- Severity: high
- Scopes: Cluster, Namespace
- Docs: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
One or more nodes report pressure conditions: MemoryPressure, DiskPressure, or PIDPressure. Under pressure, the kubelet will evict pods (starting with BestEffort, then Burstable) until pressure is relieved.
Pressure conditions are early warnings — address them before pods start being evicted.
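Eviction order depends on each pod's QoS class. A condensed sketch of the classification (the real rules examine every container and both CPU and memory, and default unset requests from limits; this simplifies to one container with requests assumed already defaulted):

```python
# Sketch: simplified QoS classification, which drives eviction order
# under node pressure (BestEffort evicted first, then Burstable;
# Guaranteed last).
def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    # Guaranteed: cpu and memory limits are both set and equal to requests.
    if all(r in limits and requests.get(r) == limits[r]
           for r in ("cpu", "memory")):
        return "Guaranteed"
    return "Burstable"
```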
### TRG-CLUSTER-QUOTA-EXHAUSTED

Namespace ResourceQuota is exhausted or nearly exhausted
- Severity: high
- Scopes: Namespace, Cluster
- Docs: https://kubernetes.io/docs/concepts/policy/resource-quotas/
A ResourceQuota in the namespace has one or more tracked resources at or near its hard limit. When a quota is fully exhausted, the API server will reject new object creation (pods, services, PVCs, etc.) with an "exceeded quota" error.
Thresholds used:
- ≥100% used → Critical: quota fully consumed, new workloads will be rejected.
- ≥95% used → High: quota nearly consumed, plan to increase or clean up.
Common causes:
- Too many replicas scaled up without increasing quota.
- Stale completed/failed pods not cleaned up (they still count toward pod quota).
- The namespace was given too small a quota at creation and the workload has since grown.
### TRG-POD-OOMKILLED

Container was OOMKilled
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
The container's process was killed by the kernel's out-of-memory (OOM) killer because the container exceeded its memory limit. This shows up as `lastState.terminated.reason: "OOMKilled"` in the container status.
Common causes:
- The memory limit is set too low for the application's actual working set.
- A memory leak in the application.
- A bursty workload that briefly needs more memory than the limit allows.
Remediation involves either increasing the memory limit or reducing the application's memory footprint.
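Detecting the signal from a container status can be sketched as follows; the nested dict mirrors `.status.containerStatuses[].lastState.terminated`:

```python
# Sketch: check a container status summary for the OOMKilled reason.
def was_oom_killed(container_status: dict) -> bool:
    terminated = container_status.get("lastState", {}).get("terminated")
    return bool(terminated) and terminated.get("reason") == "OOMKilled"
```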
## Rollout

### TRG-DEPLOY-ROLLOUT-STUCK

Deployment rollout has exceeded its progress deadline
- Severity: critical
- Scopes: Deployment
- Docs: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment
The Deployment's progressDeadlineSeconds was exceeded. Kubernetes marks the Deployment with condition Progressing=False / ProgressDeadlineExceeded. The rollout is stuck — new pods are not becoming ready within the deadline.
Common causes:
- New pods crash (CrashLoopBackOff) and never become ready.
- New pods are unschedulable (insufficient resources, taint mismatch).
- The readiness probe on the new pods is never passing.
- A PVC or Secret is missing that the new pods need.
### TRG-DEPLOY-UNAVAILABLE-REPLICAS

Deployment has unavailable replicas
- Severity: high
- Scopes: Deployment
- Docs: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
The Deployment has fewer ready replicas than desired. Some pods are either not running, crashing, or failing their readiness probe. This reduces the available capacity and may impact traffic.
## Runtime

### TRG-NS-WARNING-EVENTS

Namespace has recent Warning events
- Severity: medium
- Scopes: Namespace
- Docs: https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources
The namespace has recent Kubernetes Warning events across its workloads. This rule aggregates events by reason and provides a summary — it does not diagnose individual pods, but gives a namespace-level picture of what is failing.
Use this as a quick scan before running more specific `kubediag pod` commands.
### TRG-POD-CRASHLOOPBACKOFF

Container is in CrashLoopBackOff
- Severity: critical
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-states
The container has crashed repeatedly and Kubernetes is backing off before restarting it again. The root cause is almost always in the container process itself: the command exits non-zero on every start.
Common causes:
- The application panics or exits immediately due to a missing configuration file or env var.
- The entrypoint binary does not exist in the image.
- An OOMKilled container gets re-labelled as CrashLoopBackOff after several OOM events.
- A health check fails so fast that the container never stabilises.
Check the previous container logs for the actual exit reason.
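The back-off schedule can be sketched with the kubelet's historical defaults: a 10-second base delay that doubles after each crash, capped at 5 minutes. These values are kubelet-internal and not configurable per pod:

```python
# Sketch: the kubelet's default crash restart back-off
# (10s base, doubling per crash, capped at 300s).
def backoff_seconds(crash_count: int, base: int = 10, cap: int = 300) -> int:
    return min(base * 2 ** crash_count, cap)

# Delays after successive crashes: 10s, 20s, 40s, 80s, 160s, 300s, 300s, ...
```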
### TRG-POD-EXIT-IMMEDIATE

Container exits immediately — exec format error or missing binary
- Severity: critical
- Scopes: Pod
- Docs: https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/, https://docs.docker.com/build/building/multi-platform/
The container process exited within seconds of starting with a non-zero exit code that indicates the entrypoint binary could not be executed at all:
- Exit code 126: binary found but not executable (wrong permissions or file type)
- Exit code 127: binary not found in PATH (missing from image, wrong entrypoint)
- "exec format error": image built for a different CPU architecture (e.g. amd64 image on arm64 node)
- "no such file or directory": entrypoint path does not exist inside the container
These are distinct from application crashes: the process never started, because the kernel or container runtime rejected the binary before any user code ran.
### TRG-POD-INIT-FAILED

Pod init container failed and is not in CrashLoopBackOff
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
An init container exited with a non-zero exit code, preventing the main containers from starting. Unlike CrashLoopBackOff (repeated restarts), this catches the case where the init container has failed but has not been retried yet (or is blocked for other reasons).
Common causes:
- Init container command references a script/binary that doesn't exist in the image.
- Init container depends on a service (database, API) that is not yet ready.
- Network policy blocks the init container from reaching an external service.
## Scheduling

### TRG-POD-PENDING-INSUFFICIENT-RESOURCES

Pod is Pending due to insufficient CPU or memory on all nodes
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
No node has enough allocatable CPU or memory to satisfy the pod's resource requests. The pod will remain Pending until a node with sufficient capacity becomes available.
Common causes:
- Resource requests are set too high relative to cluster capacity.
- All nodes are at capacity; scale up the cluster or reduce requests.
- Requests were written with the wrong unit (e.g. CPU `1000` instead of `1000m`), making them unexpectedly large.
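The basic fit check can be sketched with quantities already parsed to integers (CPU in millicores, memory in bytes). Real quantity parsing of `500m` or `128Mi` is omitted, and a real scheduler checks against allocatable capacity minus what existing pods already request:

```python
# Sketch: does any node have enough allocatable CPU and memory for the
# pod's requests? Field names are illustrative.
def fits(pod_requests: dict, node_allocatable: dict) -> bool:
    return (pod_requests.get("cpu_m", 0) <= node_allocatable.get("cpu_m", 0)
            and pod_requests.get("memory", 0) <= node_allocatable.get("memory", 0))

def schedulable(pod_requests: dict, nodes: list) -> bool:
    """True when at least one node can satisfy the requests."""
    return any(fits(pod_requests, n) for n in nodes)
```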
### TRG-POD-PENDING-SELECTOR-MISMATCH

Pod is Pending due to node selector or affinity mismatch
- Severity: medium
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
No node satisfies the pod's nodeSelector, nodeAffinity, or podAffinity rules. The pod will wait indefinitely until a matching node is available or the scheduling constraints are relaxed.
### TRG-POD-PENDING-TAINT-MISMATCH

Pod is Pending due to untolerated node taints
- Severity: medium
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
All nodes carry taints that this pod does not tolerate. The pod must declare tolerations for all required taints to be scheduled.
Common in: dedicated node groups, spot instances, GPU nodes, or nodes marked with NoSchedule for maintenance.
## Storage

### TRG-POD-PENDING-PVC-UNBOUND

Pod is Pending because a referenced PVC is not bound
- Severity: high
- Scopes: Pod
- Docs: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
A PersistentVolumeClaim (PVC) referenced by this pod is in Pending state — it has not been bound to a PersistentVolume. The pod cannot start until the PVC is bound.
Common causes:
- No PersistentVolume matches the PVC's storageClass, accessMode, or capacity.
- The StorageClass does not exist.
- Dynamic provisioning failed (CSI driver error, quota, permissions).