Troubleshoot load balancing in GKE

Load balancing issues in Google Kubernetes Engine (GKE) can lead to service disruptions, such as HTTP 502 errors, or prevent access to applications.

Use this document to learn how to troubleshoot 502 errors from external Ingress and how to use load balancer logs and diagnostic tools, such as check-gke-ingress, to identify problems.

This information is important for Platform admins and operators and Application developers who configure and maintain load-balanced services in GKE. For more information about the common roles and example tasks that we reference in Cloud de Confiance by S3NS content, see Common GKE user roles and tasks.

External Ingress produces HTTP 502 errors

Use the following guidance to troubleshoot HTTP 502 errors with external Ingress resources:

  1. Enable logs for each backend service associated with each GKE Service that is referenced by the Ingress.
  2. Use status details to identify causes for HTTP 502 responses. Status details that indicate the HTTP 502 response originated from the backend require troubleshooting within the serving Pods, not the load balancer.
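As a rough sketch of step 2, the following Python helper separates HTTP 502 responses that the backend itself returned from those generated at the load balancer. It assumes log entries exported as JSON that carry httpRequest.status and jsonPayload.statusDetails. The function name split_502s and the sample entries are hypothetical and only illustrate the idea.

```python
from typing import Any

# statusDetails value that load balancer logs use when the response was
# returned by the backend itself rather than generated by the proxy.
BACKEND_RESPONSE = "response_sent_by_backend"


def split_502s(entries: list[dict[str, Any]]) -> tuple[list[dict], list[dict]]:
    """Return (from_backend, from_load_balancer) lists of HTTP 502 entries."""
    from_backend, from_lb = [], []
    for entry in entries:
        if int(entry.get("httpRequest", {}).get("status", 0)) != 502:
            continue  # only 502 responses are relevant here
        details = entry.get("jsonPayload", {}).get("statusDetails", "")
        if details == BACKEND_RESPONSE:
            from_backend.append(entry)
        else:
            from_lb.append(entry)
    return from_backend, from_lb


# Hypothetical sample entries, trimmed to the fields the helper reads.
sample = [
    {"httpRequest": {"status": 502},
     "jsonPayload": {"statusDetails": "response_sent_by_backend"}},
    {"httpRequest": {"status": 502},
     "jsonPayload": {"statusDetails": "failed_to_connect_to_backend"}},
    {"httpRequest": {"status": 200},
     "jsonPayload": {"statusDetails": "response_sent_by_backend"}},
]
backend_502s, lb_502s = split_502s(sample)
print(len(backend_502s), len(lb_502s))
```

Entries in the first list point to the serving Pods. Entries in the second list point to the path between the load balancer and the backends.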

Unmanaged instance groups

You might experience HTTP 502 errors with external Ingress resources if your external Ingress uses unmanaged instance group backends. This issue occurs when all of the following conditions are met:

  • The cluster has a large total number of nodes among all node pools.
  • The serving Pods for one or more Services that are referenced by the Ingress are located on only a few nodes.
  • Services referenced by the Ingress use externalTrafficPolicy: Local.
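The conditions above can be checked roughly with a short Python helper. This is a minimal sketch that assumes you have already parsed the output of kubectl get with -o json for the Service, its serving Pods, and the cluster nodes. The function name serving_node_summary and the sample data are hypothetical. Because "large" and "few" are not quantified here, the sketch only reports counts and leaves the judgment to you.

```python
from typing import Any


def serving_node_summary(
    service: dict[str, Any],
    pods: list[dict[str, Any]],
    node_names: list[str],
) -> dict[str, Any]:
    """Summarize how a Service's Pods are spread across the cluster's nodes."""
    serving_nodes = sorted({
        pod["spec"]["nodeName"]
        for pod in pods
        if pod.get("spec", {}).get("nodeName")  # skip unscheduled Pods
    })
    policy = service.get("spec", {}).get("externalTrafficPolicy")
    return {
        "policy_is_local": policy == "Local",
        "total_nodes": len(node_names),
        "serving_nodes": serving_nodes,
        "nodes_without_pods": sorted(set(node_names) - set(serving_nodes)),
    }


# Hypothetical data: three nodes, both serving Pods on the same node.
service = {"spec": {"externalTrafficPolicy": "Local"}}
pods = [{"spec": {"nodeName": "node-a"}}, {"spec": {"nodeName": "node-a"}}]
summary = serving_node_summary(service, pods, ["node-a", "node-b", "node-c"])
print(summary)
```

A result where policy_is_local is true and nodes_without_pods covers most of the cluster matches the pattern described above.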

To determine if your external Ingress uses unmanaged instance group backends, do the following:

  1. Go to the Ingress page in the Cloud de Confiance console.

  2. Click the name of your external Ingress.

  3. Click the name of the Load balancer. The Load balancing details page displays.

  4. Check the table in the Backend services section to determine if your external Ingress uses NEGs or instance groups.

To resolve this issue, use one of the following solutions:

  • Use a VPC-native cluster.
  • Use externalTrafficPolicy: Cluster for each Service referenced by the external Ingress. With this solution, packets no longer preserve the original client IP address as their source.
  • Use the node.kubernetes.io/exclude-from-external-load-balancers=true label. Add the label to the nodes or node pools that don't run any serving Pod for any Service referenced by any external Ingress or LoadBalancer Service in your cluster.
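For the third solution, the following Python sketch builds the kubectl commands for the nodes that don't run any serving Pod. It assumes the key is applied with kubectl label, because Kubernetes lists node.kubernetes.io/exclude-from-external-load-balancers among its well-known node labels. The function name exclusion_commands and the node names are hypothetical. The sketch only prints commands so that you can review them before you run anything.

```python
EXCLUDE_KEY = "node.kubernetes.io/exclude-from-external-load-balancers"


def exclusion_commands(node_names: list[str], serving_nodes: list[str]) -> list[str]:
    """Build one kubectl command per node that runs no serving Pod.

    serving_nodes must cover every Service referenced by any external
    Ingress or LoadBalancer Service in the cluster.
    """
    idle_nodes = sorted(set(node_names) - set(serving_nodes))
    return [f"kubectl label nodes {name} {EXCLUDE_KEY}=true" for name in idle_nodes]


# Hypothetical cluster with three nodes, one of which serves traffic.
commands = exclusion_commands(["node-a", "node-b", "node-c"], ["node-a"])
for command in commands:
    print(command)
```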

L4 load balancer logging configuration

This section provides troubleshooting information if you have enabled logging for your external passthrough Network Load Balancer or internal passthrough Network Load Balancer.

Monitor status of logging configuration

The GKE L4LB controller reports the logging reconciliation status in the Service's status.conditions field. You can check this status by running the following command:

kubectl get svc SERVICE_NAME -o yaml

Replace the following:

  • SERVICE_NAME: the name of the Service.

In the output, look for the LoggingConfigManaged condition type. The following table describes the possible reasons for the condition:

Condition status  Reason      Description
True              Reconciled  The controller is actively enforcing the logging configuration defined in the L4LBConfig CRD.
False             Unmanaged   The logging section is missing from the L4LBConfig CRD, or the annotation was removed. The controller has stopped management and left the backend service in its last known state.
False             Missing     The L4LBConfig resource referenced in the Service annotation cannot be found.
False             Invalid     The L4LBConfig resource failed the cross validation of the optionalFields parameter.
False             Error       An error occurred during the reconciliation of the backend service.
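To avoid scanning the full YAML output by hand, a small Python helper can pull out just this condition. This is a minimal sketch that assumes the Service was fetched as JSON (the same command with -o json) and that each condition carries type, status, and reason keys. The function name logging_condition and the sample Service are hypothetical.

```python
from typing import Any, Optional


def logging_condition(service: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the LoggingConfigManaged condition, or None if it is absent."""
    conditions = service.get("status", {}).get("conditions") or []
    for condition in conditions:
        if condition.get("type") == "LoggingConfigManaged":
            return condition
    return None


# Hypothetical Service, trimmed to the fields the helper reads.
sample_service = {
    "status": {
        "conditions": [
            {"type": "LoggingConfigManaged", "status": "False", "reason": "Missing"},
        ]
    }
}
condition = logging_condition(sample_service)
if condition is None:
    print("LoggingConfigManaged condition not present")
else:
    print(condition["status"], condition["reason"])
```

To use the helper on a live Service, parse the kubectl JSON output with json.load and pass the resulting dictionary to logging_condition.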

Understand coast behavior

If the networking.gke.io/l4lb-config annotation is removed from the Service manifest, or the referenced L4LBConfig resource is deleted, the configuration enters a Coast state.

In this state, the GKE controller stops managing the logging settings but doesn't reset the Cloud de Confiance by S3NS backend service to its default settings. Instead, the backend service remains in its last known good state. A warning event is typically issued to notify you that Kubernetes is no longer controlling the configuration.
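One of the two triggers, a removed annotation, can be spotted from the Service object alone. The following Python sketch assumes the Service was fetched as JSON and looks for the networking.gke.io/l4lb-config annotation. The function name referenced_l4lb_config and the value my-l4lb-config are hypothetical. The sketch doesn't detect a deleted L4LBConfig resource, which requires a separate lookup.

```python
from typing import Any, Optional

L4LB_CONFIG_ANNOTATION = "networking.gke.io/l4lb-config"


def referenced_l4lb_config(service: dict[str, Any]) -> Optional[str]:
    """Return the annotation value, or None if the annotation is absent."""
    annotations = service.get("metadata", {}).get("annotations") or {}
    return annotations.get(L4LB_CONFIG_ANNOTATION)


# Hypothetical Services: one still annotated, one with the annotation removed.
annotated = {"metadata": {"annotations": {L4LB_CONFIG_ANNOTATION: "my-l4lb-config"}}}
unannotated = {"metadata": {"annotations": {}}}

for name, svc in (("annotated", annotated), ("unannotated", unannotated)):
    value = referenced_l4lb_config(svc)
    state = f"references {value}" if value else "has no annotation; logging may be coasting"
    print(f"{name}: {state}")
```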

Use load balancer logs to troubleshoot

You can use internal passthrough Network Load Balancer logs and external passthrough Network Load Balancer logs to troubleshoot issues with load balancers and correlate traffic from load balancers to GKE resources.

Logs are aggregated per-connection and exported in near real time. Logs are generated for each GKE node involved in the data path of a LoadBalancer Service, for both ingress and egress traffic. Log entries include additional fields for GKE resources, such as:

  • Cluster name
  • Cluster location
  • Service name
  • Service namespace
  • Pod name
  • Pod namespace

Use diagnostic tools to troubleshoot

The check-gke-ingress diagnostic tool inspects Ingress resources for common misconfigurations. You can use the check-gke-ingress tool in the following ways:

  • Run the gcpdiag command-line tool on your cluster. Ingress results appear in the check rule gke/ERR/2023_004 section.
  • Use the check-gke-ingress tool alone or as a kubectl plugin by following the instructions in check-gke-ingress.