Some or all of the information on this page might not apply to Cloud de Confiance by S3NS. See Differences from Google Cloud for more details.

Troubleshooting scalability in GKE

Applies to: Autopilot and Standard clusters

High usage of the etcd database can cause cluster instability and resource shortages that prevent your Google Kubernetes Engine (GKE) clusters from scaling effectively.

Use this document to learn how to identify clusters where etcd usage is approaching its limit and find recommendations to free up space, helping to ensure that your cluster remains stable.

This information is important for Platform admins and operators responsible for maintaining the health and scalability of GKE clusters. For more information about the common roles and example tasks that we reference in Cloud de Confiance by S3NS content, see Common GKE user roles and tasks.

This document covers troubleshooting cluster stability related to high etcd usage. If you experience a different scalability problem, one of the following documents might help:

- Cluster autoscaler issues:
  - For troubleshooting why new nodes aren't being added, see Troubleshoot cluster autoscaler not scaling up.
  - For troubleshooting why underutilized nodes aren't being removed, see Troubleshoot cluster autoscaler not scaling down.
- Horizontal Pod Autoscaler issues: for troubleshooting why your Horizontal Pod Autoscaler isn't working, see Troubleshoot horizontal Pod autoscaling.
- Autopilot scaling issues: for more information about Autopilot-specific issues, including those related to scaling, see Troubleshoot Autopilot clusters.

Identify clusters where etcd usage is approaching the limit

GKE provides insights and recommendations for the scenario where etcd usage is approaching the limit. You can find these insights and recommendations in the following ways:

- Use the Cloud de Confiance console. Go to the Kubernetes clusters page. In the Notifications column for specific clusters, check for the "Free up space to reduce risk of cluster instability" recommendation.
- Use the gcloud CLI or Recommender API by specifying the ETCD_DB_USAGE_APPROACHING_LIMIT recommender subtype.

To query for this recommendation, run the following command, replacing LOCATION with the location of your cluster and PROJECT_ID with your project ID:

```
gcloud recommender recommendations list \
    --recommender=google.container.DiagnosisRecommender \
    --location=LOCATION \
    --project=PROJECT_ID \
    --format=yaml \
    --filter="recommenderSubtype:ETCD_DB_USAGE_APPROACHING_LIMIT"
```

To implement this recommendation, remove any unnecessary data from etcd to free up space. This might involve deleting old resources or moving large objects out of etcd. For more information, see Plan for large GKE clusters.
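
One way to decide where to start cleaning up is to see which resource types hold the most objects. The following is a minimal sketch, not an official GKE procedure. It assumes that your credentials allow reading the API server's /metrics endpoint with kubectl get --raw, and that the API server exposes the apiserver_storage_objects metric (newer Kubernetes versions might report this data under a different metric name). The helper name top_stored_objects is hypothetical. Object counts are only a proxy for space, because a few large objects can outweigh many small ones.

```shell
#!/usr/bin/env bash
# Sketch only: rank resource types by the number of stored objects.
# Assumption: the metric is named apiserver_storage_objects and carries a
# resource="..." label. Adjust the name if your cluster version differs.

# top_stored_objects [N]
# Reads Prometheus text-format metrics on stdin and prints the N resource
# types with the most stored objects as "<count> <resource>", largest first.
top_stored_objects() {
  local limit="${1:-10}"
  awk 'index($0, "apiserver_storage_objects{") == 1 &&
       match($0, /resource="[^"]+"/) {
         # Skip the 10 characters of resource=" and drop the closing quote.
         print $NF, substr($0, RSTART + 10, RLENGTH - 11)
       }' \
    | sort -rn \
    | head -n "$limit"
}
```

For example, to print the five resource types with the most stored objects, you might run:

```shell
kubectl get --raw /metrics | top_stored_objects 5
```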

Identify clusters where storage usage per object type is approaching the limit

GKE provides insights and recommendations for the scenario where the total size of etcd objects per type is approaching the limit. You can find these insights and recommendations in the following ways:

- Use the Cloud de Confiance console. Go to the Kubernetes clusters page. In the Notifications column for specific clusters, check for the "Reduce the size of resource type(s)" recommendation.
- Use the gcloud CLI or Recommender API by specifying the APISERVER_RESOURCE_TYPE_SIZE_EXCEEDS_LIMIT recommender subtype.

To query for this recommendation, run the following command:
```
gcloud recommender recommendations list \
    --recommender=google.container.DiagnosisRecommender \
    --location=LOCATION \
    --project=PROJECT_ID \
    --format=yaml \
    --filter="recommenderSubtype:APISERVER_RESOURCE_TYPE_SIZE_EXCEEDS_LIMIT"
```
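
If you manage clusters in several locations, you can wrap the two queries from this document in a loop. The following is a minimal sketch that uses only the flags shown in the preceding commands. The function name list_etcd_recommendations and the location values in the usage example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: query both etcd-related recommender subtypes for one project
# across several locations.

# list_etcd_recommendations PROJECT_ID LOCATION [LOCATION...]
list_etcd_recommendations() {
  local project="$1" location subtype
  shift
  for location in "$@"; do
    for subtype in ETCD_DB_USAGE_APPROACHING_LIMIT \
                   APISERVER_RESOURCE_TYPE_SIZE_EXCEEDS_LIMIT; do
      # Label each result so that empty responses are still attributable.
      echo "# location=${location} subtype=${subtype}"
      gcloud recommender recommendations list \
        --recommender=google.container.DiagnosisRecommender \
        --location="$location" \
        --project="$project" \
        --format=yaml \
        --filter="recommenderSubtype:${subtype}"
    done
  done
}
```

For example, `list_etcd_recommendations PROJECT_ID us-central1 europe-west1-b` runs four queries: two subtypes for each of the two locations.
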
To decide which objects to remove, you can use kubectl to list them. For example, if ConfigMaps are nearing the storage limit, the following command writes a list of all ConfigMaps across all namespaces to new_file.txt, which you can review to identify candidates for deletion:
```
kubectl get configmaps --all-namespaces > new_file.txt
```

To implement this recommendation and free up space, remove any unnecessary objects of the specified types from storage. This process might involve deleting old resources or moving large objects out of storage. For more information, see Plan for large GKE clusters.
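
A plain listing doesn't show which objects are large. As a rough aid, the following sketch ranks ConfigMaps by the byte length of their JSON representation. This length only approximates what each object occupies in storage, and the function name largest_configmaps is hypothetical. The sketch makes one kubectl call per ConfigMap, so it can be slow on clusters with many objects.

```shell
#!/usr/bin/env bash
# Sketch only: rank ConfigMaps by approximate size.
# Assumption: the JSON length of an object is a usable proxy for its
# storage footprint. It is not an exact measurement.

# largest_configmaps [N]
# Prints the N largest ConfigMaps as "<bytes> <namespace>/<name>".
largest_configmaps() {
  local limit="${1:-10}" ns name bytes
  kubectl get configmaps --all-namespaces --no-headers \
      -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name' \
    | while read -r ns name; do
        # Arithmetic expansion strips any padding that wc adds.
        bytes=$(( $(kubectl get configmap "$name" --namespace "$ns" \
                      -o json </dev/null | wc -c) ))
        printf '%d %s/%s\n' "$bytes" "$ns" "$name"
      done \
    | sort -rn \
    | head -n "$limit"
}
```

For example, `largest_configmaps 20` prints the 20 largest ConfigMaps. To inspect another resource type, you would change the resource names in both kubectl calls.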

What's next

If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on Stack Overflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.