Version: v0.2.0

kubediag — Diagnostic Rules

This file is auto-generated by go run ./hack/docgen. Do not edit by hand.

28 rules across 10 categories.

Summary

| Rule ID | Title | Severity | Confidence | Scopes |
| --- | --- | --- | --- | --- |
| TRG-ACCESS-INSUFFICIENT-READ | kubediag has insufficient RBAC permissions; diagnosis is incomplete | info | | Pod, Deployment, Namespace, Cluster |
| TRG-CLUSTER-APISERVER-LATENCY | API server latency or availability events detected | high | | Cluster, Namespace |
| TRG-CLUSTER-NODE-NOT-READY | One or more cluster nodes are NotReady | critical | | Cluster, Namespace |
| TRG-CLUSTER-NODE-PRESSURE | One or more nodes have Memory, Disk, or PID pressure | high | | Cluster, Namespace |
| TRG-CLUSTER-QUOTA-EXHAUSTED | Namespace ResourceQuota is exhausted or nearly exhausted | high | | Namespace, Cluster |
| TRG-DEPLOY-ROLLOUT-STUCK | Deployment rollout has exceeded its progress deadline | critical | | Deployment |
| TRG-DEPLOY-UNAVAILABLE-REPLICAS | Deployment has unavailable replicas | high | | Deployment |
| TRG-NS-WARNING-EVENTS | Namespace has recent Warning events | medium | | Namespace |
| TRG-POD-BAD-ENV-REF | Pod env var references a missing key in a ConfigMap or Secret | high | | Pod |
| TRG-POD-CRASHLOOPBACKOFF | Container is in CrashLoopBackOff | critical | | Pod |
| TRG-POD-EXIT-IMMEDIATE | Container exits immediately — exec format error or missing binary | critical | | Pod |
| TRG-POD-IMAGE-AUTH | Container image pull failed due to authentication/authorisation error | high | | Pod |
| TRG-POD-IMAGE-NOT-FOUND | Container image tag or repository does not exist | high | | Pod |
| TRG-POD-IMAGEPULLBACKOFF | Container image pull is failing (ImagePullBackOff / ErrImagePull) | high | | Pod |
| TRG-POD-INIT-FAILED | Pod init container failed and is not in CrashLoopBackOff | high | | Pod |
| TRG-POD-LIVENESS-FAILING | Container liveness probe is failing (causing container restarts) | high | | Pod |
| TRG-POD-MISSING-CONFIGMAP | Pod references a ConfigMap that does not exist | high | | Pod |
| TRG-POD-MISSING-SECRET | Pod references a Secret that does not exist | high | | Pod |
| TRG-POD-OOMKILLED | Container was OOMKilled | high | | Pod |
| TRG-POD-PENDING-INSUFFICIENT-RESOURCES | Pod is Pending due to insufficient CPU or memory on all nodes | high | | Pod |
| TRG-POD-PENDING-PVC-UNBOUND | Pod is Pending because a referenced PVC is not bound | high | | Pod |
| TRG-POD-PENDING-SELECTOR-MISMATCH | Pod is Pending due to node selector or affinity mismatch | medium | | Pod |
| TRG-POD-PENDING-TAINT-MISMATCH | Pod is Pending due to untolerated node taints | medium | | Pod |
| TRG-POD-READINESS-FAILING | Container readiness probe is failing | medium | | Pod |
| TRG-POD-STARTUP-FAILING | Container startup probe is failing | high | | Pod |
| TRG-SVC-NO-ENDPOINTS | Service has no endpoints (no pods are selected) | high | | Pod, Namespace |
| TRG-SVC-PORT-MISMATCH | Service targetPort is not exposed by any selected pod | high | | Pod, Namespace |
| TRG-SVC-SELECTOR-MISMATCH | Service selector does not match any pod labels in the namespace | high | | Pod, Namespace |

Access

TRG-ACCESS-INSUFFICIENT-READ

kubediag has insufficient RBAC permissions; diagnosis is incomplete

kubediag was denied read access to one or more Kubernetes resources needed for a complete diagnosis. Results may be incomplete.

This is not necessarily a problem with your workloads — it is a signal that the user or service account running kubediag does not have the required RBAC permissions to perform a full inspection.

To diagnose with full fidelity, grant read (get/list) on: pods, events, deployments, replicasets, services, endpoints, configmaps, secrets, persistentvolumeclaims, nodes.
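A minimal read-only ClusterRole covering those resources might look like the following sketch. The role name, binding, and ServiceAccount are illustrative, not part of kubediag itself:

```yaml
# Hypothetical read-only role for kubediag; names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubediag-reader
rules:
  - apiGroups: [""]   # core API group
    resources: [pods, events, services, endpoints, configmaps, secrets, persistentvolumeclaims, nodes]
    verbs: [get, list]
  - apiGroups: [apps]
    resources: [deployments, replicasets]
    verbs: [get, list]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubediag-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubediag-reader
subjects:
  - kind: ServiceAccount
    name: kubediag        # whichever service account runs kubediag
    namespace: default
```

For a single-namespace deployment, a Role/RoleBinding pair scoped to that namespace works the same way (nodes, being cluster-scoped, still require a ClusterRole).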

Configuration

TRG-POD-BAD-ENV-REF

Pod env var references a missing key in a ConfigMap or Secret

An env var in the pod spec uses valueFrom.configMapKeyRef or valueFrom.secretKeyRef to reference a specific key inside a ConfigMap or Secret. The ConfigMap/Secret exists, but the referenced key does not.

Kubernetes will refuse to start the container and report CreateContainerConfigError with a message like:

  • "couldn't find key KEY in ConfigMap NS/NAME"
  • "couldn't find key KEY in Secret NS/NAME"

This is distinct from TRG-POD-MISSING-CONFIGMAP / TRG-POD-MISSING-SECRET, which fire when the whole ConfigMap or Secret is absent.
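A sketch of the failing shape, with hypothetical names (`app-config`, `DB_HOST`):

```yaml
# The ConfigMap exists but only defines LOG_LEVEL...
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: debug
---
# ...while the pod asks for DB_HOST, which is absent,
# so the container reports CreateContainerConfigError.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: example.com/app:1.0
      env:
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: DB_HOST   # key does not exist in app-config
```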

TRG-POD-MISSING-CONFIGMAP

Pod references a ConfigMap that does not exist

The pod spec references a ConfigMap (in a volume, envFrom, or env valueFrom) that does not exist in the same namespace. The pod will stay in ContainerCreating or Pending until the ConfigMap is created.

TRG-POD-MISSING-SECRET

Pod references a Secret that does not exist

The pod spec references a Secret (in a volume, envFrom, env valueFrom, or imagePullSecrets) that does not exist in the same namespace. The pod will remain in ContainerCreating or Pending until the Secret is created.

Image

TRG-POD-IMAGE-AUTH

Container image pull failed due to authentication/authorisation error

The registry rejected the pull because credentials are missing, expired, or incorrect. This can also mean the service account's imagePullSecret was not configured, or the secret's credentials have rotated.
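A common fix is to create a docker-registry Secret and reference it from the pod (or the service account). The registry host, secret name, and credentials below are placeholders:

```yaml
# Create the credential first, for example:
#   kubectl create secret docker-registry regcred \
#     --docker-server=registry.example.com \
#     --docker-username=<user> --docker-password=<token>
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
    - name: regcred          # must exist in the same namespace
  containers:
    - name: app
      image: registry.example.com/team/app:1.0
```

Attaching the secret to the namespace's ServiceAccount instead applies it to every pod using that account, which avoids repeating imagePullSecrets in each spec.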

TRG-POD-IMAGE-NOT-FOUND

Container image tag or repository does not exist

The registry responded that the specified image or tag was not found. This is almost always a typo in the image reference (wrong tag, wrong repo path, deleted image) rather than a network or auth issue.

TRG-POD-IMAGEPULLBACKOFF

Container image pull is failing (ImagePullBackOff / ErrImagePull)

The kubelet cannot pull the container image. Kubernetes backs off retries exponentially. The most common sub-causes are: wrong image name/tag, image not found, or authentication failure against a private registry.

See TRG-POD-IMAGE-NOT-FOUND and TRG-POD-IMAGE-AUTH for specialised rules that fire when the event message is more specific.

Networking

TRG-SVC-NO-ENDPOINTS

Service has no endpoints (no pods are selected)

The Service's selector does not match any Running+Ready pod, so its Endpoints object is empty. Traffic to this Service will be dropped or return connection refused.

Common causes:

  • Label mismatch between the Service selector and the pod labels.
  • All pods are failing their readiness probes.
  • No pods exist with the selected labels.
  • The pods are in a different namespace.
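For the first cause, the fix is to make the selector and the pod labels agree exactly. A sketch with a hypothetical `app: web` label:

```yaml
# The Service selector must match the pod labels exactly.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # must appear on the pods...
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: web-0
  labels:
    app: web          # ...like this
spec:
  containers:
    - name: web
      image: example.com/web:1.0
```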

TRG-SVC-PORT-MISMATCH

Service targetPort is not exposed by any selected pod

A Service's targetPort (the port traffic is forwarded to inside the pod) does not match any containerPort declared by the pods the Service selects.

Note: Kubernetes does not require containerPorts to be declared for traffic to flow — kube-proxy uses iptables/ipvs rules regardless. However, when containerPorts ARE declared and none match the targetPort, this almost always indicates a misconfiguration: a typo, a renamed port, or a changed application port that was not reflected in the Service spec.

This rule only fires when:

  1. The Service has a non-empty selector (not a headless/external service).
  2. At least one pod matches the selector.
  3. Those pods declare containerPorts.
  4. None of the declared containerPorts match the Service's targetPort.
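One robust pattern is to use a named port, so the Service follows the container port even if its number changes. Names below (`web`, `http`, the image) are illustrative:

```yaml
# targetPort may be a number or the *name* of a containerPort.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: http   # resolves to containerPort 8080 below
---
apiVersion: v1
kind: Pod
metadata:
  name: web-0
  labels:
    app: web
spec:
  containers:
    - name: web
      image: example.com/web:1.0
      ports:
        - name: http
          containerPort: 8080
```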

TRG-SVC-SELECTOR-MISMATCH

Service selector does not match any pod labels in the namespace

The Service's selector labels do not appear on any pod in the namespace, regardless of pod health or readiness. This is often a misconfiguration — either the pod labels were changed or the Service selector was mistyped.

Probes

TRG-POD-LIVENESS-FAILING

Container liveness probe is failing (causing container restarts)

The liveness probe is returning failure. When the probe fails enough times (failureThreshold), Kubernetes kills and restarts the container. Repeated liveness failures appear as increasing restart counts, sometimes leading to CrashLoopBackOff.

A misconfigured liveness probe (wrong path, too-short timeout) is a very common cause of unexpected pod restarts.
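An illustrative container-spec fragment with some headroom built in; the path and port are assumptions about the application:

```yaml
# Sketch: restart only after 3 consecutive failures, with a tolerant timeout.
livenessProbe:
  httpGet:
    path: /healthz        # must be a real, cheap endpoint in your app
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3     # 3 consecutive failures -> container restart
```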

TRG-POD-READINESS-FAILING

Container readiness probe is failing

The container's readiness probe is not returning success. Kubernetes removes the pod from Service Endpoints while readiness fails, so traffic is not routed to it. The pod stays Running but receives no traffic.

This is distinct from a liveness failure: failing readiness does not restart the container; it only removes it from load balancing.

TRG-POD-STARTUP-FAILING

Container startup probe is failing

The startup probe is not passing within its timeout window. Until the startup probe succeeds, both readiness and liveness probes are disabled. If the probe never passes (failureThreshold × periodSeconds elapsed), the container is killed.

This is most commonly misconfigured when the application has variable cold-start time that exceeds failureThreshold × periodSeconds.
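To size the window, multiply the two fields. This fragment (path and port are assumptions) allows up to 30 × 10s = 300s of cold start before the container is killed:

```yaml
# While the startupProbe runs, liveness and readiness probes are suspended.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # window = failureThreshold x periodSeconds = 300s
```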

ResourcePressure

TRG-CLUSTER-APISERVER-LATENCY

API server latency or availability events detected

Warning events have been detected that indicate the Kubernetes API server is experiencing elevated latency or transient unavailability. These events may originate from:

  • The API server itself (SlowReadResponse, SlowWriteResponse)
  • The etcd backend (timeout, leader election)
  • Controllers failing to reach the API (FailedToCreateEndpoint, context deadline exceeded)

API server latency causes cascading failures: controllers fall behind, pod readiness state becomes stale, and deployments stall. It is usually caused by:

  • etcd disk I/O saturation
  • Control-plane node resource exhaustion (CPU/memory)
  • Large object counts (too many secrets/configmaps/pods in cluster)
  • Network issues between control-plane and worker nodes

TRG-CLUSTER-NODE-NOT-READY

One or more cluster nodes are NotReady

At least one node has condition Ready=False or Ready=Unknown, meaning the kubelet on that node is not communicating with the control plane. Pods scheduled to that node will not start and existing pods may become Terminating.

Common causes:

  • Node lost network connectivity.
  • kubelet process crashed or is OOMKilled on the node.
  • Disk is full on the node (log or container image partition).
  • Node was forcibly shut down (spot instance reclaimed, maintenance).

TRG-CLUSTER-NODE-PRESSURE

One or more nodes have Memory, Disk, or PID pressure

One or more nodes report pressure conditions: MemoryPressure, DiskPressure, or PIDPressure. Under pressure, the kubelet will evict pods (starting with Best-Effort, then Burstable) until pressure is relieved.

Pressure conditions are early warnings — address them before pods start being evicted.

TRG-CLUSTER-QUOTA-EXHAUSTED

Namespace ResourceQuota is exhausted or nearly exhausted

A ResourceQuota in the namespace has one or more tracked resources at or near its hard limit. When a quota is fully exhausted, the API server will reject new object creation (pods, services, PVCs, etc.) with an "exceeded quota" error.

Thresholds used:

  • ≥100% used → Critical: quota fully consumed, new workloads will be rejected.
  • ≥95% used → High: quota nearly consumed, plan to increase or clean up.

Common causes:

  • Too many replicas scaled up without increasing quota.
  • Stale completed/failed pods not cleaned up (they still count toward pod quota).
  • The namespace was given too small a quota at creation and the workload has since grown.
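For reference, a minimal ResourceQuota sketch (namespace, name, and limits are hypothetical); current usage versus the hard limits can be inspected with `kubectl describe resourcequota -n <namespace>`:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    pods: "20"            # completed/failed pods still count toward this
    requests.cpu: "8"
    requests.memory: 16Gi
```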

TRG-POD-OOMKILLED

Container was OOMKilled

The container's process was killed by the kernel's out-of-memory (OOM) killer because the container exceeded its memory limit. This shows up in lastState.terminated.reason = "OOMKilled".

Common causes:

  • The memory limit is set too low for the application's actual working set.
  • A memory leak in the application.
  • A burst-workload that requires more memory than the limit.

Remediation involves either increasing the memory limit or reducing the application's memory footprint.
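The limit lives in the container's resources block. An illustrative fragment (the numbers are assumptions to size against your actual working set):

```yaml
# lastState.terminated.reason = "OOMKilled" means this limit was exceeded.
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi   # raise this if the working set legitimately exceeds it
```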

Rollout

TRG-DEPLOY-ROLLOUT-STUCK

Deployment rollout has exceeded its progress deadline

The Deployment's progressDeadlineSeconds was exceeded. Kubernetes marks the Deployment with condition Progressing=False / ProgressDeadlineExceeded. The rollout is stuck — new pods are not becoming ready within the deadline.

Common causes:

  • New pods crash (CrashLoopBackOff) and never become ready.
  • New pods are unschedulable (insufficient resources, taint mismatch).
  • The readiness probe on the new pods is never passing.
  • A PVC or Secret is missing that the new pods need.
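The deadline itself is set per Deployment. A sketch with hypothetical names, showing where the field sits (600 seconds is the Kubernetes default):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  progressDeadlineSeconds: 600   # rollout marked failed after 10 min without progress
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example.com/web:1.1
```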

TRG-DEPLOY-UNAVAILABLE-REPLICAS

Deployment has unavailable replicas

The Deployment has fewer ready replicas than desired. Some pods are either not running, crashing, or failing their readiness probe. This reduces the available capacity and may impact traffic.

Runtime

TRG-NS-WARNING-EVENTS

Namespace has recent Warning events

The namespace has recent Kubernetes Warning events across its workloads. This rule aggregates events by reason and provides a summary — it does not diagnose individual pods, but gives a namespace-level picture of what is failing.

Use this as a quick scan before running more specific "kubediag pod" commands.

TRG-POD-CRASHLOOPBACKOFF

Container is in CrashLoopBackOff

The container has crashed repeatedly and Kubernetes is backing off before restarting it again. The root cause is almost always in the container process itself: the command exits non-zero on every start.

Common causes:

  • The application panics or exits immediately due to a missing configuration file or env var.
  • The entrypoint binary does not exist in the image.
  • An OOMKilled container gets re-labelled as CrashLoopBackOff after several OOM events.
  • A health check fails so fast that the container never stabilises.

Check the previous container logs for the actual exit reason.

TRG-POD-EXIT-IMMEDIATE

Container exits immediately — exec format error or missing binary

The container process exited within seconds of starting with a non-zero exit code that indicates the entrypoint binary could not be executed at all:

  • Exit code 126: binary found but not executable (wrong permissions or file type)
  • Exit code 127: binary not found in PATH (missing from image, wrong entrypoint)
  • "exec format error": image built for a different CPU architecture (e.g. amd64 image on arm64 node)
  • "no such file or directory": entrypoint path does not exist inside the container

These are distinct from application crashes because the process never ran: the kernel or the container runtime rejected the entrypoint before any application code executed.

TRG-POD-INIT-FAILED

Pod init container failed and is not in CrashLoopBackOff

An init container exited with a non-zero exit code, preventing the main containers from starting. Unlike CrashLoopBackOff (repeated restarts), this catches the case where the init container has failed but has not been retried yet (or is blocked for other reasons).

Common causes:

  • Init container command references a script/binary that doesn't exist in the image.
  • Init container depends on a service (database, API) that is not yet ready.
  • Network policy blocks the init container from reaching an external service.

Scheduling

TRG-POD-PENDING-INSUFFICIENT-RESOURCES

Pod is Pending due to insufficient CPU or memory on all nodes

No node has enough allocatable CPU or memory to satisfy the pod's resource requests. The pod will remain Pending until a node with sufficient capacity becomes available.

Common causes:

  • Resource requests are set too high relative to cluster capacity.
  • All nodes are at capacity; scale up the cluster or reduce requests.
  • Requests were written with the wrong unit (e.g. memory in Gi instead of Mi, or CPU in cores instead of millicores), making them unexpectedly large.

TRG-POD-PENDING-SELECTOR-MISMATCH

Pod is Pending due to node selector or affinity mismatch

No node satisfies the pod's nodeSelector, nodeAffinity, or podAffinity rules. The pod will wait indefinitely until a matching node is available or the scheduling constraints are relaxed.

TRG-POD-PENDING-TAINT-MISMATCH

Pod is Pending due to untolerated node taints

All nodes carry taints that this pod does not tolerate. The pod must declare tolerations for all required taints to be scheduled.

Common in: dedicated node groups, spot instances, GPU nodes, or nodes marked with NoSchedule for maintenance.
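A pod-spec fragment tolerating a hypothetical dedicated-node taint `dedicated=gpu:NoSchedule` (key, value, and effect must match the taints actually on the nodes):

```yaml
tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
```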

Storage

TRG-POD-PENDING-PVC-UNBOUND

Pod is Pending because a referenced PVC is not bound

A PersistentVolumeClaim (PVC) referenced by this pod is in Pending state — it has not been bound to a PersistentVolume. The pod cannot start until the PVC is bound.

Common causes:

  • No PersistentVolume matches the PVC's storageClass, accessMode, or capacity.
  • The StorageClass does not exist.
  • Dynamic provisioning failed (CSI driver error, quota, permissions).
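A minimal PVC sketch for the first two causes; the claim name, storage class, and size are hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: standard   # must exist and support dynamic provisioning
  resources:
    requests:
      storage: 10Gi
```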