Troubleshoot GKE
This document lists troubleshooting documents for common issues that you might
encounter when using Google Kubernetes Engine (GKE). Whether you're diagnosing
workload errors like ImagePullBackOff and CrashLoopBackOff, debugging
cluster autoscaling behavior, resolving PersistentVolume issues, or
troubleshooting node registration problems, the documents listed here can help.
If you're new to troubleshooting in GKE, start with
Introduction to troubleshooting.
To diagnose and resolve issues you encounter, see the documents in the
following sections:
To troubleshoot GKE networking, see
Troubleshoot GKE networking
in the GKE networking documentation.
This document is for Admins and architects, Security specialists,
Networking specialists, or Storage specialists who troubleshoot
GKE configurations. To learn more about GKE roles,
see
Common GKE user roles and tasks.
Introduction to troubleshooting
Cluster setup
| Topic | Description |
| --- | --- |
| Cluster creation | Resolve issues with creating clusters. |
| Autopilot clusters | Diagnose and troubleshoot GKE Autopilot clusters, including cluster creation, namespace deletion, scaling, and workload issues. |
| Kubectl command-line tool | Troubleshoot the kubectl command-line tool in GKE, including issues with authentication and authorization. This page also includes advice on how to troubleshoot the Konnectivity proxy to check whether it's causing the kubectl logs, attach, exec, or port-forward commands to stop responding. |
| Standard node pools | Troubleshoot GKE Standard node pools, including issues with node pool creation, best-effort provisioning, corrupted instance metadata, and migrating workloads to new node pools. |
| Node NotReady status | Learn how to diagnose and resolve the node NotReady status in GKE by troubleshooting common causes such as resource shortages, network issues, and component failures. |
| Node registration | Troubleshoot issues that occur when adding nodes to your GKE Standard cluster, such as node registration failures and missing prerequisites for successful node registration. |
| Container runtime | Troubleshoot container runtimes in GKE, including issues with containerd, dockershim, and private registries. |
Autoscaling
| Topic | Description |
| --- | --- |
| Cluster autoscaler not scaling down | Diagnose and resolve common reasons your cluster isn't removing underutilized nodes. Learn how to check for issues like restrictive PodDisruptionBudgets, Pods with local storage, or specific annotations (for example, "cluster-autoscaler.kubernetes.io/safe-to-evict": "false") that prevent node eviction. |
| Cluster autoscaler not scaling up | Learn why the cluster autoscaler isn't adding new nodes to meet demand. Check for unschedulable Pods, verify that you haven't hit cluster or node pool size limits, and identify potential resource quota or regional VM availability issues. |
| Horizontal Pod autoscaling | Troubleshoot problems with the Horizontal Pod Autoscaler not scaling your application's Pod replicas. Resolve common issues, such as misconfigured HorizontalPodAutoscaler objects or problems with the metrics pipeline. |
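The scale-down blockers named in the table can be illustrated with a short Python sketch. It is not the cluster autoscaler's actual logic: it inspects a Pod manifest that has already been parsed into a dictionary and reports two of the blockers mentioned, the safe-to-evict annotation set to "false" and local storage volumes. The function name `scale_down_blockers` is hypothetical, and treating `emptyDir` and `hostPath` volumes as local storage is an assumption made for this example.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"

# Assumption: these volume types stand in for "local storage" in this sketch.
LOCAL_VOLUME_TYPES = ("emptyDir", "hostPath")


def scale_down_blockers(pod):
    """Return reasons a Pod might prevent its node from being removed.

    `pod` is a Pod manifest parsed into a dict. This is a hypothetical
    helper for illustration, not part of any GKE or Kubernetes API.
    """
    reasons = []
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    if annotations.get(SAFE_TO_EVICT) == "false":
        reasons.append(f'annotation {SAFE_TO_EVICT} is "false"')
    for volume in (pod.get("spec") or {}).get("volumes") or []:
        for volume_type in LOCAL_VOLUME_TYPES:
            if volume_type in volume:
                name = volume.get("name", "<unnamed>")
                reasons.append(f"volume {name} uses {volume_type}")
    return reasons


# Example manifest with both blockers present (made-up values).
example_pod = {
    "metadata": {
        "name": "cache",
        "annotations": {SAFE_TO_EVICT: "false"},
    },
    "spec": {
        "volumes": [{"name": "scratch", "emptyDir": {}}],
    },
}

for reason in scale_down_blockers(example_pod):
    print(reason)
```

A check like this only narrows down candidates; PodDisruptionBudgets and other conditions covered in the linked document still need to be reviewed separately.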
Storage
| Topic | Description |
| --- | --- |
| Storage | Troubleshoot storage, including issues with regional persistent disks, disk performance, and volume expansion. |
Cluster security
Cluster's root Certificate Authority expiring soon
Workloads
| Topic | Description |
| --- | --- |
| Deployed workloads | Troubleshoot errors for workloads running in a GKE cluster, including PodUnschedulable. Read the PodUnschedulable section for advice on errors like MatchNodeSelector and Does not have minimum availability. |
| Image pulls | Troubleshoot image pulls. Learn what causes statuses like ImagePullBackOff and ErrImagePull and how to resolve these statuses by fixing common issues like authentication and network connectivity. |
| CrashLoopBackOff events | Troubleshoot CrashLoopBackOff events in GKE. Diagnose issues like resource exhaustion, app misconfigurations, and liveness probe failures. |
| OOM events | Troubleshoot Kubernetes Out of Memory (OOM) events. Identify causes, distinguish event types, and apply effective solutions for both container- and node-level OOM kills. |
| Arm workloads | Troubleshoot issues with Arm workloads, including Pods on Arm nodes crashing. |
| TPUs | Troubleshoot TPUs, including issues with quota, node auto-provisioning, workload configuration, and scheduling. |
| GPUs | Troubleshoot GPUs, including issues with GPU driver installation, device plugin errors, and container images. |
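To decide which workload document to open first, it can help to map the status reasons a Pod reports to the topics in the table. The following Python sketch does that for a Pod object parsed into a dictionary. The function name `suggest_topics` and the mapping are hypothetical and only cover the statuses named on this page; the field paths (`status.containerStatuses[].state.waiting.reason` and `lastState.terminated.reason`) follow the standard Kubernetes Pod status layout.

```python
# Map status reasons to the troubleshooting topics listed on this page.
# The mapping itself is an illustrative assumption, not an official lookup.
REASON_TO_TOPIC = {
    "ImagePullBackOff": "Image pulls",
    "ErrImagePull": "Image pulls",
    "CrashLoopBackOff": "CrashLoopBackOff events",
    "OOMKilled": "OOM events",
}


def suggest_topics(pod):
    """Return the set of topics that match a Pod's container statuses."""
    topics = set()
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for container_status in statuses:
        # Current state, for example waiting with reason CrashLoopBackOff.
        waiting = (container_status.get("state") or {}).get("waiting") or {}
        # Previous termination, for example reason OOMKilled.
        terminated = (container_status.get("lastState") or {}).get("terminated") or {}
        for reason in (waiting.get("reason"), terminated.get("reason")):
            if reason in REASON_TO_TOPIC:
                topics.add(REASON_TO_TOPIC[reason])
    return topics


# Example status: a container that restarts repeatedly after being OOM-killed
# (made-up values).
example_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}

print(sorted(suggest_topics(example_pod)))
```

The input dictionary could come from the JSON output of `kubectl get pod` with the `-o json` flag, loaded with `json.loads`.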
Cluster management
| Topic | Description |
| --- | --- |
| Cluster upgrades | Troubleshoot and resolve GKE cluster and node upgrade issues, including long or incomplete upgrades, unexpected auto-upgrades, failures, and post-upgrade problems. |
| Webhooks | Understand how to troubleshoot and ensure the stability of your cluster control plane when using admission webhooks. |
| Namespace stuck in the Terminating state | Troubleshoot issues with namespaces stuck in the Terminating state by identifying and removing the unhealthy components that are blocking deletion. |
| Concurrent operations | Troubleshoot concurrent operations by learning how to identify these errors and resolve them by waiting for operations to complete. |
Monitoring
| Topic | Description |
| --- | --- |
| System metrics | Troubleshoot system metrics not appearing in Cloud Monitoring. |
| Monitoring dashboards | Troubleshoot monitoring dashboards, including issues with enabling monitoring, missing Kubernetes resources, and permissions. |
| Troubleshoot missing logs | Troubleshoot missing GKE logs. Learn how to check API status, cluster settings, permissions, quotas, filters, and application behavior. |
4xx errors
Known issues
| Topic | Description |
| --- | --- |
| Known issues | Identify and resolve known issues that might affect your use of GKE. |
What's next
Last updated 2025-11-11 UTC.