This document describes how to enable and use predicted latency-based routing provided by llm-d within GKE Inference Gateway. By default, GKE Inference Gateway routes requests using a combination of load signals and prefix-cache affinity heuristics. Predicted latency-based routing replaces the static heuristic weights with an XGBoost model trained continuously on live traffic, making more accurate routing decisions as workload patterns shift.
When to use predicted latency-based routing
This feature is most effective when the following conditions apply to your workload:
- High variance in prompt and completion length: queue depth alone is a poor proxy for server load when request sizes vary significantly. The latency predictor accounts for actual prefill and decode cost per request.
- Per-request latency SLOs: when your applications specify Time-to-First-Token (TTFT) or Time-per-Output-Token (TPOT) targets on individual requests, the scheduler enforces these targets during routing. It does this by computing the headroom (SLO target minus predicted latency) for each candidate Pod.
- Fragile static weight tuning: if you are frequently re-tuning the balance between cache affinity and load signals as traffic patterns shift, the online-trained model adapts automatically.
How predicted latency-based routing works
This section details the architecture and the scheduling pipeline used by predicted latency-based routing.
Architecture
Predicted latency-based scheduling deploys two additional sidecar containers inside the EPP Pod, alongside the EPP itself:
| Component | Description |
|---|---|
| Training Server | Continuously retrains the XGBoost TTFT and TPOT models on completed request samples received from the EPP. Uses stratified bucketing over a sliding window so that rare traffic regimes are not forgotten. Writes updated models to a shared volume. |
| Prediction Servers | Serve TTFT and TPOT predictions to the EPP on the request hot path. Read the latest trained model from the shared volume. Horizontally scalable — each server instance sustains approximately 300 QPS of prediction work. Multiple instances are load-balanced by a Go coalescing proxy in the EPP that batches concurrent prediction requests within a 1ms window. |
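The batching behavior of the coalescing proxy can be sketched as follows. This is an illustrative Python asyncio sketch of the general technique, not the EPP's actual Go implementation; the `CoalescingProxy` class and its `backend` callable are assumptions introduced for illustration.

```python
import asyncio

class CoalescingProxy:
    """Illustrative sketch: coalesce concurrent prediction requests that
    arrive within a short window into a single batched backend call."""

    def __init__(self, backend, window_s=0.001):
        self.backend = backend      # async callable: list of features -> list of predictions
        self.window_s = window_s    # batching window (1 ms, as described above)
        self.pending = []           # (features, Future) pairs awaiting a flush
        self.flusher = None

    async def predict(self, features):
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((features, fut))
        if self.flusher is None:    # first arrival in this window starts the flush timer
            self.flusher = asyncio.create_task(self._flush_after_window())
        return await fut

    async def _flush_after_window(self):
        await asyncio.sleep(self.window_s)   # wait out the batching window
        batch, self.pending, self.flusher = self.pending, [], None
        results = await self.backend([f for f, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)
```

Requests that arrive while the window is open share one backend round trip, which is what lets each Prediction Server instance absorb many concurrent EPP prediction calls.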
llm-d EPP scheduling pipeline
When predicted latency-based scheduling is enabled, the EPP processes each request through the following sequence of composable plugins:
- predicted-latency-producer: calls the Prediction Server to obtain TTFT and TPOT estimates for every candidate Pod in the InferencePool, conditioned on each Pod's current KV-cache utilization, queue depth, prefix cache match score, and the incoming request features. After the response is returned to the client, the producer sends the observed TTFT and inter-token latency back to the Training Server as a new training sample.
  - Fallback behavior: if the Prediction Server is unreachable or returns an error, the EPP automatically falls back to a composite score based on KV-cache utilization, queue depth, and prefix cache match.
- prefix-cache-affinity-filter: narrows the candidate set to cache-warm Pods when any Pod's prefix cache match score exceeds the affinity threshold (default of 0.80). This threshold separates two populations observed in production: Pods that already have a conversation history cached from prior turns, and Pods that don't. The filter implements an epsilon-greedy explore-and-exploit strategy:
  - Exploit (default path): routes to cache-warm Pods so that scoring concentrates cache reuse on them.
  - Explore (small probability): bypasses the filter entirely on a configurable fraction of requests to seed cache entries on cold Pods and prevent cache fragmentation.
  - TTFT load gate: even on the exploit path, if the best cache-warm Pod's predicted TTFT exceeds the best overall Pod's TTFT by more than a configurable threshold (default of 5,000 ms), affinity is broken and the full candidate set is used.
- slo-headroom-tier-filter (SLO requests only): when the request includes SLO headers, splits candidate Pods into a positive tier (predicted to meet the SLO) and a negative tier (predicted to violate it).
- latency-scorer: scores candidate Pods. Without SLO headers, the Pod with the lowest predicted latency is selected. With SLO headers, the score is based on headroom (SLO minus predicted latency) using the headroomSelectionStrategy:
  - least (default): bin-pack. Routes to the Pod with the smallest positive headroom, maximizing utilization and keeping less loaded Pods free for future traffic bursts.
  - most: spread. Routes to the Pod with the most positive headroom, leaving more slack for unexpected load spikes.
- latency-slo-admitter (SLO requests only): rejects sheddable requests (priority less than 0) when no candidate Pod is predicted to meet the SLO, instead of consuming capacity on a request predicted to miss its target. This plugin has no effect when SLO headers are absent or when at least one Pod is predicted to meet the SLO.
- weighted-random-picker: selects the final Pod using weighted random selection over the scores. This spreads load while still favoring better-scoring Pods.
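The headroom tiering and scoring stages above can be made concrete with a minimal Python sketch. The `split_tiers` and `pick` function names and the dict-based Pod representation are illustrative assumptions, not the EPP's actual plugin API.

```python
def split_tiers(pods, slo_ttft_ms):
    """slo-headroom-tier-filter sketch: headroom = SLO minus predicted latency.
    Positive headroom means the Pod is predicted to meet the SLO."""
    positive, negative = [], []
    for pod in pods:
        headroom = slo_ttft_ms - pod["predicted_ttft_ms"]
        (positive if headroom > 0 else negative).append((pod, headroom))
    return positive, negative

def pick(pods, slo_ttft_ms, strategy="least"):
    """latency-scorer sketch for requests carrying an SLO header."""
    positive, negative = split_tiers(pods, slo_ttft_ms)
    if positive:
        if strategy == "least":  # bin-pack: smallest positive headroom wins
            return min(positive, key=lambda ph: ph[1])[0]
        return max(positive, key=lambda ph: ph[1])[0]  # "most": spread
    # No Pod is predicted to meet the SLO: pick the least-bad Pod
    # (the admitter may shed the request instead, for sheddable traffic).
    return max(negative, key=lambda ph: ph[1])[0]
```

With a 500 ms TTFT SLO and Pods predicting 100 ms, 400 ms, and 700 ms, `least` routes to the 400 ms Pod (smallest positive headroom), while `most` routes to the 100 ms Pod.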
Streaming mode
The predicted-latency-producer plugin supports two training modes, configured using the streamingMode parameter:
- streamingMode: false (default): trains on end-to-end (E2E) request latency. Use this mode if your workload mixes streaming and non-streaming responses, or if you only need latency-aware routing without per-request SLO enforcement.
- streamingMode: true: trains separate TTFT and TPOT models. TTFT is recorded on the first streamed chunk; TPOT is sampled across subsequent tokens. Use this mode if your workload is fully streaming and you need meaningful x-slo-ttft-ms / x-slo-tpot-ms enforcement.
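The two modes differ in what the producer records as a training sample. The following is a minimal sketch of how TTFT and mean TPOT samples could be derived from streamed chunk timestamps in streamingMode: true, assuming one token per chunk; the function name and signature are illustrative, not llm-d's API.

```python
def streaming_samples(request_start_s, chunk_times_s):
    """Illustrative: derive TTFT and mean TPOT training samples (in ms)
    from the arrival timestamps of streamed chunks (one token per chunk)."""
    if not chunk_times_s:
        return None, None
    # TTFT: time from request start to the first streamed chunk.
    ttft_ms = (chunk_times_s[0] - request_start_s) * 1000.0
    if len(chunk_times_s) < 2:
        return ttft_ms, None  # TPOT needs at least two tokens
    # TPOT: mean inter-token gap across subsequent chunks.
    deltas = [b - a for a, b in zip(chunk_times_s, chunk_times_s[1:])]
    tpot_ms = sum(deltas) / len(deltas) * 1000.0
    return ttft_ms, tpot_ms
```

In streamingMode: false, only a single end-to-end latency sample would be recorded per request, which is why per-request TTFT/TPOT SLO enforcement needs the streaming mode.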
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
- Enable the Compute Engine API, the Network Services API, and the Model Armor API if needed. Go to Enable access to APIs and follow the instructions.
- Ensure you have a working GKE Inference Gateway deployment. See Deploy GKE Inference Gateway.
- Ensure that your InferencePool uses a homogeneous set of Pods: identical GPU type, model weights, and serving configuration.
- Ensure your GKE cluster is version 1.32.3 or later.
- Install Helm. See the Helm installation guide.
Enable predicted latency-based scheduling
The following steps guide you through enabling predicted latency-based scheduling for your GKE Inference Gateway deployment.
Step 1: Install or upgrade the InferencePool with predicted latency enabled
The latencyPredictor.enabled=true flag deploys
the Training Server and Prediction Server sidecars inside the EPP Pod and wires
up the full scheduling plugin pipeline:
helm upgrade --install INFERENCE_POOL_NAME \
--set inferencePool.modelServers.matchLabels.app=MODEL_SERVER_LABEL \
--set provider.name=gke \
--set inferenceExtension.monitoring.gke.enabled=true \
--set inferenceExtension.latencyPredictor.enabled=true \
--version LLM_D_VERSION \
oci://LLM_D_REGISTRY_PATH
Replace the following:
- INFERENCE_POOL_NAME: the name of your InferencePool. For example, vllm-llama3-8b-instruct.
- MODEL_SERVER_LABEL: the label key used to select your model server Pods.
- LLM_D_VERSION: the llm-d Helm chart version to use.
- LLM_D_REGISTRY_PATH: the llm-d OCI registry path.
Step 2: Verify the deployment
Confirm the EPP Pod is running with all sidecar containers ready:
kubectl get pods -l app=INFERENCE_POOL_NAME-epp
The EPP Pod should show all containers in Running or Ready state: the EPP itself, the Training Server, and one or more Prediction Servers.
Step 3: Send a baseline request
Send a standard inference request to confirm that routing is working before enabling SLO headers:
curl -i -X POST GATEWAY_IP:PORT/v1/completions \
-H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H 'x-prediction-based-scheduling: true' \
-d '{
"model": "MODEL_NAME",
"prompt": "PROMPT_TEXT",
"max_tokens": MAX_TOKENS,
  "temperature": 0
}'
Replace the following:
- GATEWAY_IP: the IP address of your gateway service.
- PORT: the port number of your gateway service.
- MODEL_NAME: the name of the model to use for inference.
- PROMPT_TEXT: the input prompt.
- MAX_TOKENS: the maximum number of tokens to generate.
The x-prediction-based-scheduling: true header opts this request into the
predicted latency scheduling pipeline. During the predictor warm-up period, the
EPP falls back to heuristic routing.
Step 4: Send SLO-aware requests (optional)
To enable per-request SLO enforcement, add TTFT and TPOT latency objective headers:
curl -i -X POST GATEWAY_IP:PORT/v1/completions \
-H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H 'x-prediction-based-scheduling: true' \
-H 'x-slo-ttft-ms: 500' \
-H 'x-slo-tpot-ms: 50' \
-d '{
"model": "MODEL_NAME",
"prompt": "PROMPT_TEXT",
"max_tokens": MAX_TOKENS,
  "temperature": 0,
"stream": true
}'
Replace the following:
- GATEWAY_IP: the IP address of your gateway service.
- PORT: the port number of your gateway service.
- MODEL_NAME: the name of the model to use for inference.
- PROMPT_TEXT: the input prompt.
- MAX_TOKENS: the maximum number of tokens to generate.
Request headers:
- x-prediction-based-scheduling: true: opts the request into the predicted latency scheduling pipeline.
- x-slo-ttft-ms: maximum acceptable Time-to-First-Token in milliseconds.
- x-slo-tpot-ms: maximum acceptable Time-per-Output-Token in milliseconds.
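Given these headers, the admission decision described in the scheduling pipeline can be sketched as follows. The `admit` function and dict-based Pod representation are illustrative assumptions, not the EPP's actual API.

```python
def admit(pods, slo_ttft_ms, priority):
    """latency-slo-admitter sketch: shed a sheddable request (priority < 0)
    when no candidate Pod is predicted to meet its TTFT SLO."""
    any_meets_slo = any(p["predicted_ttft_ms"] <= slo_ttft_ms for p in pods)
    sheddable = priority < 0
    # Non-sheddable requests are always admitted; sheddable requests are
    # admitted only if some Pod is predicted to meet the target.
    return any_meets_slo or not sheddable
```

A rejected request returns immediately rather than consuming serving capacity on a response that is predicted to miss its target.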
Monitor predicted latency scheduling
When the latency predictor is enabled, the EPP exposes additional metrics through Cloud Monitoring.
| Metric | Description |
|---|---|
| inference_objective_request_ttft_seconds | Actual TTFT distribution (or E2E latency if streamingMode=false). |
| inference_objective_request_predicted_ttft_seconds | Predicted TTFT distribution (or E2E latency if streamingMode=false). |
| inference_objective_request_tpot_seconds | Actual TPOT distribution. |
| inference_objective_request_predicted_tpot_seconds | Predicted TPOT distribution. |
| inference_objective_request_ttft_slo_violation_total | Counter of TTFT SLO violations. |
Scale the Prediction Server
The EPP makes one prediction call per candidate Pod per incoming request. Each Prediction Server instance sustains approximately 300 QPS of prediction work.
Approximate guidance for Prediction Server instance count:
| Cluster QPS (100 Pods) | Prediction servers required |
|---|---|
| Up to 1,000 QPS | 1 server |
| Up to 5,000 QPS | 2 servers |
| Up to 10,000 QPS | 4 servers |
Add Prediction Server instances by updating the
latencyPredictor.predictionServerCount Helm value.
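The guidance table above can be expressed as a small lookup helper. This is an illustrative sketch that only covers the documented table (a 100-Pod pool); it deliberately does not extrapolate beyond it.

```python
def prediction_servers_required(cluster_qps: float) -> int:
    """Return the Prediction Server count from the documented guidance
    table for a 100-Pod pool. Raises beyond the documented range."""
    for max_qps, servers in [(1000, 1), (5000, 2), (10000, 4)]:
        if cluster_qps <= max_qps:
            return servers
    raise ValueError("beyond documented guidance; load-test to size")
```

The returned count would then be applied through the latencyPredictor.predictionServerCount Helm value.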
Limitations
- Homogeneous InferencePool required: mixed GPU types, model variants, or serving configurations within a single pool are not supported.
- Warm-up period: the XGBoost model requires sufficient live traffic samples before predictions become accurate.
- SLO enforcement: enforcement is at the routing layer only. The model server does not terminate requests that exceed the SLO target after selection.
- Status: this feature is in Preview. It is not recommended for production workloads with strict SLA requirements without thorough testing.