Troubleshooting scalability in GKE

High usage of the etcd database can cause cluster instability and resource shortages that prevent your Google Kubernetes Engine (GKE) clusters from scaling effectively.

Use this document to learn how to identify clusters where etcd usage is approaching its limit, and to find recommendations for freeing up space so that your clusters remain stable.

This information is important for Platform admins and operators who are responsible for maintaining the health and scalability of GKE clusters. For more information about the common roles and example tasks that we reference in Trusted Cloud by S3NS content, see Common GKE user roles and tasks.

This document covers troubleshooting cluster stability related to high etcd usage. If you experience a different scalability problem, one of the following documents might help:

Identify clusters where etcd usage is approaching the limit

GKE provides insights and recommendations when a cluster's etcd usage is approaching its limit. You can find these insights and recommendations in the following ways:

  • Use the Trusted Cloud console. Go to the Kubernetes clusters page. In the Notifications column for specific clusters, check for the Free up space to reduce risk of cluster instability recommendation.
  • Use the gcloud CLI or Recommender API by specifying the ETCD_DB_USAGE_APPROACHING_LIMIT recommender subtype.

    To query for this recommendation, run the following command:

    gcloud recommender recommendations list \
        --recommender=google.container.DiagnosisRecommender \
        --location=LOCATION \
        --project=PROJECT_ID \
        --format=yaml \
        --filter="recommenderSubtype:ETCD_DB_USAGE_APPROACHING_LIMIT"

    Replace the following:

      • LOCATION: the location of your cluster.
      • PROJECT_ID: your project ID.

To implement this recommendation, remove any unnecessary data from etcd to free up space. This might involve deleting old resources or moving large objects out of etcd. For more information, see Plan for large GKE clusters.
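Before deleting anything, it can help to see which resource types hold the most objects. The following Bash sketch is one way to do that, not a GKE-provided tool. It assumes that your credentials can read the API server's metrics endpoint with kubectl get --raw /metrics, and that the endpoint exposes the upstream Kubernetes apiserver_storage_objects gauge. The helper name top_stored_resources is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: rank resource types by how many objects the API server
# reports as stored in etcd, using the apiserver_storage_objects gauge.
# Assumption: your credentials can read the API server /metrics endpoint.

# top_stored_resources [N]  (hypothetical helper name)
# Reads Prometheus text-format metrics on stdin and prints the N resource
# types (default 10) with the most stored objects, largest count first.
top_stored_resources() {
  local limit="${1:-10}"
  awk '
    # Match sample lines such as: apiserver_storage_objects{resource="pods"} 1234
    # (HELP and TYPE comment lines start with "#" and are skipped).
    /^apiserver_storage_objects[{]/ {
      if (match($0, /resource="[^"]*"/)) {
        # Strip the leading resource=" (10 chars) and the trailing quote.
        res = substr($0, RSTART + 10, RLENGTH - 11)
        # $NF is the sample value; "+ 0" also converts forms like 1e+06.
        printf "%d %s\n", $NF + 0, res
      }
    }
  ' | sort -k1,1nr | head -n "$limit"
}

# Usage against a live cluster:
#   kubectl get --raw /metrics | top_stored_resources 10
```

Object counts are only a proxy for etcd usage: a few very large objects can take up more space than many small ones, so also review the sizes of large objects before you decide what to move or delete.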

What's next