Serve Gemma open models using GPUs on GKE with vLLM


To serve Gemma 4 large language models (LLMs) on Google Kubernetes Engine (GKE) with the vLLM framework using GPUs, you must provision a GKE cluster with supported accelerators, such as NVIDIA H100 GPUs.

In this tutorial, the prebuilt vLLM container loads the model weights from the Cloud Storage bucket that you specify with the --model argument.

Once the weights are loaded, the vLLM container exposes an OpenAI-compatible API endpoint for high-throughput inference.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities to serve AI/ML workloads on H100 GPU hardware.

Before reading this page, ensure that you're familiar with the following:

Objectives

This tutorial provides a foundation for understanding and exploring practical LLM deployment for inference in a managed Kubernetes environment.

Prepare your environment with a GKE cluster in Autopilot mode.
Deploy a vLLM container to your cluster.
Use vLLM to serve the Gemma 4 model and interact with it by using curl.

Before you begin

In the Cloud de Confiance console, on the project selector page, select or create a Cloud de Confiance project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Verify that billing is enabled for your Cloud de Confiance project.
Enable the required API.
Roles required to enable APIs
To enable APIs, you need the serviceusage.services.enable permission. If you created the project, then you likely already have this permission through the Owner role (roles/owner). Otherwise, you can get this permission through the Service Usage Admin role (roles/serviceusage.serviceUsageAdmin). Learn how to grant roles.

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
1. In the Cloud de Confiance console, go to the IAM page.
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Cloud de Confiance console, go to the IAM page.
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the identifier for a user in a workforce identity pool. For details, see Represent workforce pool users in IAM policies, or contact your administrator.
5. Click Select a role, then search for the role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.

Ensure your project has sufficient quota for H100 GPUs. For more information, see About GPUs and Allocation quotas.

Prepare your environment

In this tutorial, you use kubectl and the gcloud CLI to manage resources hosted on Cloud de Confiance by S3NS. Make sure that the gcloud CLI is authenticated to access Cloud de Confiance by S3NS.

To set up your environment, set the default project in the gcloud CLI and export the following environment variables:

gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=u-france-east1
export CLUSTER_NAME=CLUSTER_NAME
export GSA_NAME=GSA_NAME
export KSA_NAME=KSA_NAME
export NAMESPACE=NAMESPACE
export PROJECT_NUMBER=$(gcloud projects describe "${PROJECT_ID}" --format="value(projectNumber)")
export MODEL_BUCKET_NAME=MODEL_BUCKET_NAME

Replace the following values:

PROJECT_ID: your Cloud de Confiance project ID.
REGION: a region that supports H100 GPUs, such as u-france-east1. To choose a different region, check which regions have H100 GPUs available.
CLUSTER_NAME: the name of your cluster.
GSA_NAME: the name for the Google Service Account, for example, gemma-gsa.
KSA_NAME: the name for the Kubernetes ServiceAccount, for example, gemma-ksa.
NAMESPACE: the Kubernetes namespace, for example, default.
MODEL_BUCKET_NAME: the name of the Cloud Storage bucket where the model weights are stored. It can be the same name as the selected model, such as gemma-4-26b-it.
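
Later commands can fail in confusing ways if one of these variables is empty. As a minimal sketch, the following POSIX shell function reports any variable that is unset. The helper name require_vars is hypothetical and is not part of gcloud or kubectl.

```shell
# Hypothetical helper: report any listed environment variable that is
# unset or empty. Returns non-zero if at least one is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (run after the export commands above):
# require_vars PROJECT_ID REGION CLUSTER_NAME GSA_NAME KSA_NAME NAMESPACE MODEL_BUCKET_NAME
```

The function only checks that each variable has a value. It does not check that the value refers to an existing project, region, or bucket.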

Create and configure Cloud de Confiance resources

Follow these instructions to create the required resources.

Create a GKE cluster and node pool

You can serve Gemma on GPUs in a GKE Autopilot cluster. An Autopilot cluster provides a fully managed Kubernetes experience.

Autopilot

In gcloud CLI, run the following command:

gcloud container clusters create-auto CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=REGION \
    --release-channel=rapid

Replace the following values:

PROJECT_ID: your Cloud de Confiance project ID.
CLUSTER_NAME: the name of your cluster.
REGION: the region where your cluster is located.

GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.

Create a Cloud Storage bucket

In gcloud CLI, run the following command:

gcloud storage buckets create gs://${MODEL_BUCKET_NAME} \
  --project=${PROJECT_ID} \
  --location=${REGION} \
  --uniform-bucket-level-access

This creates a Cloud Storage bucket to store the model files you download from Hugging Face.

Download and Upload Model Weights:

Obtain the Gemma model weights for the versions that you intend to serve, for example from Hugging Face or other official sources. Organize the downloaded files locally into one directory per model. For example:
- ./gemma-4-26b-it-local/ (containing all files for the 26B IT model)
- ./gemma-4-31b-it-local/ (containing all files for the 31B IT model)
Upload the contents of each directory to the root of the Cloud Storage bucket for that model. The deployment manifests in this tutorial expect one bucket per model, so set MODEL_BUCKET_NAME to the matching bucket before you run each command. Uploading both models to the same bucket root would mix their files.
```
# Upload files for the 26B IT model
gcloud storage cp --recursive ./gemma-4-26b-it-local/* gs://${MODEL_BUCKET_NAME}

# Upload files for the 31B IT model
gcloud storage cp --recursive ./gemma-4-31b-it-local/* gs://${MODEL_BUCKET_NAME}
```
Uploading to the bucket root places the model files at paths like gs://${MODEL_BUCKET_NAME}/config.json.
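
An incomplete local directory typically surfaces only later, when the container fails to load the model. The following is a minimal sketch of a pre-upload check, assuming that the model directory contains a config.json file as the example path suggests. The helper name check_model_dir is hypothetical.

```shell
# Hypothetical helper: confirm that a local model directory exists and
# contains config.json before you upload it.
check_model_dir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "Not a directory: $dir" >&2
    return 1
  fi
  if [ ! -f "$dir/config.json" ]; then
    echo "config.json not found in $dir" >&2
    return 1
  fi
  echo "OK: $dir"
}

# Example usage:
# check_model_dir ./gemma-4-26b-it-local && \
#   gcloud storage cp --recursive ./gemma-4-26b-it-local/* gs://${MODEL_BUCKET_NAME}
```

The check covers only config.json. It does not verify that the weight files themselves are complete.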

Configure Workload Identity for Cloud Storage Access

To let your Kubernetes Pods securely access the Cloud Storage bucket that contains the model weights, configure GKE Workload Identity.

Create the Google Service Account (GSA):

gcloud iam service-accounts create ${GSA_NAME} \
  --project=${PROJECT_ID}

Determine and Export GSA Email:

The email format depends on whether your ${PROJECT_ID} is domain-scoped (contains a colon).

if [[ $PROJECT_ID == *:* ]]; then
  DOMAIN=$(echo $PROJECT_ID | cut -d: -f1)
  PROJ_NAME=$(echo $PROJECT_ID | cut -d: -f2)
  export GSA_EMAIL="${GSA_NAME}@${PROJ_NAME}.${DOMAIN}.s3ns.iam.gserviceaccount.com"
else
  export GSA_EMAIL="${GSA_NAME}@${PROJECT_ID}.s3ns.iam.gserviceaccount.com"
fi
echo "Using GSA Email: ${GSA_EMAIL}"

Create the Kubernetes Service Account (KSA):

This KSA is used in your deployment manifest.

kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}

Verify that the KSA was created:

kubectl get serviceaccounts --namespace ${NAMESPACE}

Annotate the KSA to Link it to the GSA:

This annotation tells GKE which GSA the KSA can impersonate.

kubectl annotate serviceaccount ${KSA_NAME} \
  --namespace ${NAMESPACE} \
  iam.gke.io/gcp-service-account=${GSA_EMAIL}

Grant the KSA Permission to Impersonate the GSA:

This IAM binding on the GSA allows the KSA to act as the GSA.

if [[ $PROJECT_ID == *:* ]]; then
  DOMAIN=$(echo $PROJECT_ID | cut -d: -f1)
  PROJ_NAME=$(echo $PROJECT_ID | cut -d: -f2)
  export WI_MEMBER="serviceAccount:${PROJ_NAME}.${DOMAIN}.s3ns.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
else
  export WI_MEMBER="serviceAccount:${PROJECT_ID}.s3ns.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
fi

gcloud iam service-accounts add-iam-policy-binding ${GSA_EMAIL} \
  --role roles/iam.workloadIdentityUser \
  --member="${WI_MEMBER}" \
  --project=${PROJECT_ID}
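
The GSA email and the Workload Identity member are built with the same domain-splitting logic. As a sketch, that logic can be factored into one POSIX shell function and reused. The function names project_host, gsa_email, and wi_member are hypothetical, and the sketch assumes that a domain-scoped project ID contains exactly one colon.

```shell
# Hypothetical helper: print the host-style form of a project ID.
# "example.com:my-project" becomes "my-project.example.com";
# an ID without a colon is printed unchanged.
project_host() {
  case "$1" in
    *:*) echo "${1#*:}.${1%%:*}" ;;
    *)   echo "$1" ;;
  esac
}

# Build the GSA email from a GSA name and a project ID.
gsa_email() {
  echo "$1@$(project_host "$2").s3ns.iam.gserviceaccount.com"
}

# Build the Workload Identity member from a project ID, namespace, and KSA name.
wi_member() {
  echo "serviceAccount:$(project_host "$1").s3ns.svc.id.goog[$2/$3]"
}

# Example usage:
# export GSA_EMAIL=$(gsa_email "${GSA_NAME}" "${PROJECT_ID}")
# export WI_MEMBER=$(wi_member "${PROJECT_ID}" "${NAMESPACE}" "${KSA_NAME}")
```

Keeping the split in one place means the email and the member string cannot drift apart if the format changes.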

Grant the GSA Permission to Read from the Bucket:

Grant the GSA the storage.objectViewer role on the bucket.

gcloud storage buckets add-iam-policy-binding gs://${MODEL_BUCKET_NAME} \
  --member="serviceAccount:${GSA_EMAIL}" \
  --role="roles/storage.objectViewer" \
  --project=${PROJECT_ID}

Deploy Gemma 4 models on vLLM

To deploy Gemma 4 models, make sure that each model's weights are stored in its own Cloud Storage bucket, and then apply a Kubernetes Deployment manifest for your selected model size. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Procedure

Applying this manifest pulls the vLLM container image, requests an NVIDIA GPU, and loads the model weights from Cloud Storage to start the vLLM inference engine.

Gemma 4 26B-A4B-it

Follow these instructions to deploy the Gemma 4 26B-A4B instruction-tuned model.

Create the following vllm-4-26b-a4b-it.yaml manifest:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: a3-edgegpu-8g-nolssd
spec:
  priorities:
  - machineType: a3-edgegpu-8g-nolssd
    gpu:
      count: 8
      type: nvidia-h100-80gb
  nodePoolAutoCreation:
    enabled: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-26b-a4b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      serviceAccountName: KSA_NAME # Replace with the KSA that you annotated for Workload Identity
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
        command: ["./entrypoint.sh"] # Use the image's entrypoint
        args:
        - "python"
        - "-m"
        - "vllm.entrypoints.api_server"
        - "--host=0.0.0.0"
        - "--port=8080"
        - "--model=gs://gemma-4-26b-it" # YOUR Cloud Storage PATH
        - "--tensor-parallel-size=1"
        - "--enable-log-requests"
        - "--enable-chunked-prefill"
        - "--enable-prefix-caching"
        - "--enable-auto-tool-choice"
        - "--generation-config=auto"
        - "--dtype=bfloat16"
        - "--max-num-seqs=16"
        - "--max-model-len=16384"
        - "--gpu-memory-utilization=0.95"
        - "--limit_mm_per_prompt.image=1"
        - "--tool-call-parser=gemma4"
        - "--reasoning-parser=gemma4"
        - "--trust-remote-code"
        ports:
        - containerPort: 8080
        env:
        - name: GOOGLE_CLOUD_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: CLOUDSDK_CORE_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: GCS_URI_ARG_KEY
          value: "model"
        - name: GCS_URI_ENV_KEY
          value: "AIP_STORAGE_URI"
        - name: LORA_ADAPTER_ARG_KEY
          value: "lora-modules"
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "1"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/compute-class: a3-edgegpu-8g-nolssd
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
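
The manifest hard-codes an example bucket in the --model argument. Instead of editing the file by hand, you could substitute your own bucket when you apply it. The following is a minimal sketch that uses sed. The helper name render_manifest is hypothetical, and the sketch assumes that bucket names contain only letters, digits, hyphens, underscores, and dots.

```shell
# Hypothetical helper: print a manifest with the example model path replaced
# by your own bucket. Arguments: manifest file, example bucket, your bucket.
render_manifest() {
  manifest="$1"
  example_bucket="$2"
  your_bucket="$3"
  # Use | as the sed delimiter because the paths contain slashes.
  sed "s|--model=gs://${example_bucket}\"|--model=gs://${your_bucket}\"|" "$manifest"
}

# Example usage:
# render_manifest vllm-4-26b-a4b-it.yaml gemma-4-26b-it "${MODEL_BUCKET_NAME}" | kubectl apply -f -
```

The function writes the rendered manifest to standard output and leaves the original file unchanged, so you can review the result before piping it to kubectl.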

Apply the manifest:
```
kubectl apply -f vllm-4-26b-a4b-it.yaml
```
This example limits the context window to 16K tokens by using the vLLM option --max-model-len=16384. If you want a larger context window (up to 128K tokens), adjust your manifest and node pool configuration to provide more GPU capacity.

Gemma 4 31B-it

Follow these instructions to deploy the Gemma 4 31B instruction-tuned model.

Create the following vllm-4-31b-it.yaml manifest:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: a3-edgegpu-8g-nolssd
spec:
  priorities:
  - machineType: a3-edgegpu-8g-nolssd
    gpu:
      count: 8
      type: nvidia-h100-80gb
  nodePoolAutoCreation:
    enabled: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-31b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      serviceAccountName: KSA_NAME # Replace with the KSA that you annotated for Workload Identity
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
        command: ["./entrypoint.sh"] # Use the image's entrypoint
        args:
        - "python"
        - "-m"
        - "vllm.entrypoints.api_server"
        - "--host=0.0.0.0"
        - "--port=8080"
        - "--model=gs://gemma-4-31b-it" # YOUR Cloud Storage PATH
        - "--tensor-parallel-size=1"
        - "--enable-log-requests"
        - "--enable-chunked-prefill"
        - "--enable-prefix-caching"
        - "--enable-auto-tool-choice"
        - "--generation-config=auto"
        - "--dtype=bfloat16"
        - "--max-model-len=16384"
        - "--max-num-seqs=16"
        - "--gpu-memory-utilization=0.95"
        - "--trust-remote-code"
        - "--tool-call-parser=gemma4"
        - "--reasoning-parser=gemma4"
        ports:
        - containerPort: 8080
        env:
        - name: GOOGLE_CLOUD_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: CLOUDSDK_CORE_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: GCS_URI_ARG_KEY
          value: "model"
        - name: GCS_URI_ENV_KEY
          value: "AIP_STORAGE_URI"
        - name: LORA_ADAPTER_ARG_KEY
          value: "lora-modules"
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "1"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/compute-class: a3-edgegpu-8g-nolssd
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080

Apply the manifest:
```
kubectl apply -f vllm-4-31b-it.yaml
```
This example limits the context window to 16K tokens by using the vLLM option --max-model-len=16384. If you want a larger context window (up to 128K tokens), adjust your manifest and node pool configuration to provide more GPU capacity.

Verification

Wait for the Deployment to be available:

kubectl wait --for=condition=Available --timeout=1800s deployment/vllm-gemma-deployment

View the logs from the running Deployment:

kubectl logs -f -l app=gemma-server

The Deployment resource downloads the Gemma model data. This process can take a few minutes. The output is similar to the following:

  ...
  ...
  (APIServer pid=1) INFO:     Started server process [1]
  (APIServer pid=1) INFO:     Waiting for application startup.
  (APIServer pid=1) INFO:     Application startup complete.
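
To check for the startup message without reading the full log, you could filter the log output. The following is a minimal sketch. The helper name logs_show_ready is hypothetical, and it assumes that the server prints the Application startup complete line shown in the sample output.

```shell
# Hypothetical helper: read log text on stdin and succeed only if it
# contains the startup message shown in the sample output.
logs_show_ready() {
  grep -q "Application startup complete"
}

# Example usage (--tail=-1 requests all log lines instead of the most recent ones):
# kubectl logs -l app=gemma-server --tail=-1 | logs_show_ready && echo "Model server is ready"
```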

After the deployment is available, set up port forwarding to interact with the model.

Serve the model

In this section, you interact with the model. Make sure the model is fully downloaded before proceeding.

Set up port forwarding

Run the following command to set up port forwarding to the model:

kubectl port-forward svc/llm-service 8080:8080 --namespace default &

The output is similar to the following:

Forwarding from 127.0.0.1:8080 -> 8080

Interact with the model using curl

This section shows how you can perform a basic smoke test to verify your deployed Gemma 4 instruction-tuned models. For other models, replace gemma-4-26B-A4B-it with the name of the respective model.

This example shows how to test the Gemma 4 26B-A4B instruction-tuned model with text-only input.

In a new terminal session, use curl to chat with your model:

curl http://127.0.0.1:8080/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
        {
          "role": "user",
          "content": "Why is the sky blue?"
        }
    ],
    "chat_template_kwargs": {
         "enable_thinking": true
    },
    "skip_special_tokens": false
}'

The output looks similar to the following:

{
  "id": "chatcmpl-be75ccfcbdf753d1",
  "object": "chat.completion",
  "created": 1775006187,
  "model": "google/gemma-4-26B-A4B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The short answer is a phenomenon called **Rayleigh scattering**.\n\nTo understand how it works, you have to look at three things: sunlight, the Earth's atmosphere, and how light travels.\n\n### 1. Sunlight is a Rainbow\nAlthough sunlight looks white to us, it is actually made up of all the colors of the rainbow (red, orange, yellow, green, blue, indigo, and violet). Light travels as **waves**, and each color has a different wavelength:\n*   **Red light** travels in long, lazy, wide waves.\n*   **Blue and violet light** travel in short, choppy, tight waves.\n\n### 2. The Atmosphere is an Obstacle Course\nEarth's atmosphere is filled with gases (mostly nitrogen and oxygen). As sunlight travels through the atmosphere, it strikes the molecules of these gases. \n\nBecause the gas molecules are very small, they affect the colors differently based on their wavelength:\n*   The **long waves** (reds and yellows) pass through the atmosphere mostly straight, without hitting much. They are like large ocean waves that roll right over small pebbles.\n*   The **short waves** (blues and violets) strike the gas molecules and get scattered in every direction. They are like small ripples that hit a pebble and splash everywhere.\n\nBecause this blue light is being scattered in every direction by the air, when you look up, your eyes are catching that scattered blue light coming from every part of the sky.\n\n### 3. Why isn't the sky violet?\nIf violet light has an even shorter wavelength than blue light, you might wonder why the sky doesn't look purple. There are two main reasons:\n1.  **The Sun's output:** The Sun emits much more blue light than violet light.\n2.  **Human Biology:** Human eyes are much more sensitive to blue than to violet. Our eyes interpret the scatter of mixed blue and violet light simply as pale blue.\n\n---\n\n### Bonus: Why are sunsets red?\nWhen the sun is setting, it is much lower on the horizon. This means the sunlight has to travel through a much **thicker** layer of the atmosphere to reach your eyes. \n\nBy the time the light gets to you, the blue light has been scattered away completely. Only the long-wavelength colors—the reds, oranges, and pinks—are able to make it through that thick layer of air without being scattered away, creating the beautiful colors of a sunset.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "\"Why is the sky blue?\"\nScience/Physics (Atmospheric scattering).\nExplain the phenomenon clearly, accurately, and engagingly.\n\n    *   Sunlight (white light) is made of a spectrum of colors (ROYGBIV).\n    *   Earth's atmosphere is filled with gases (nitrogen, oxygen) and particles.\n    *   Rayleigh Scattering: Shorter wavelengths (blue/violet) scatter more easily than longer wavelengths (red/yellow) when hitting small gas molecules.\n    *   The human eye's sensitivity: Why isn't it violet?\n\n    A good scientific explanation should follow a logical flow:\n    *   *Direct Answer:* The core mechanism (Rayleigh Scattering).\n    *   *The Components:* Sunlight and the Atmosphere.\n    *   *The Mechanism:* How light interacts with gas molecules.\n    *   *The Wavelength Factor:* Comparing colors.\n    *   *The \"Wait, why not violet?\" question:* Addressing human perception.\n    *   *Bonus/Related concept:* Why sunsets are red.\n\n        *   Use the term **Rayleigh Scattering**.\n        *   Summarize: Its how sunlight interacts with the Earth's atmosphere.\n\n        *   Sunlight looks white, but it's actually a mix of all colors (the rainbow).\n        *   Each color travels as a different wavelength. Red = long/lazy waves; Blue/Violet = short/choppy waves.\n\n        *   The atmosphere is mostly Nitrogen and Oxygen.\n        *   When sunlight hits these tiny gas molecules, the light gets scattered in all directions.\n\n        *   Blue light travels in shorter, smaller waves.\n        *   Because these waves are small, they strike the gas molecules more frequently and get scattered more easily than the longer red/yellow waves.\n        *   Result: When you look up, your eyes are catching this \"scattered\" blue light coming from every direction.\n\n        *   *Technically*, violet light has an even shorter wavelength than blue, so it scatters *even more*. Why isn't the sky violet?\n        *   Two reasons: 1. The Sun emits more blue light than violet light. 2. Human eyes are much more sensitive to blue than violet.\n\n        *   Briefly mention sunsets to provide a complete picture.\n        *   At sunset, light travels through *more* atmosphere. The blue is scattered away completely, leaving only the long red/orange waves to reach your eyes.\n\n    *   *Tone Check:* Is it too academic? Use analogies (like waves in water or skipping stones) if needed, but keep it concise.\n    *   *Clarity:* Ensure the distinction between wavelength and scattering is clear."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 106,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 21,
    "total_tokens": 1122,
    "completion_tokens": 1101,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}

Troubleshoot issues

If you get the message Empty reply from server, it's possible the container has not finished downloading the model data. Check the Pod's logs again for the Application startup complete message, which indicates that the model is ready to serve.
If you see Connection refused, verify that your port forwarding is active.
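
In scripts, these two failures can be told apart by the curl exit code: curl exits with code 52 for an empty reply and code 7 when it cannot connect. The following is a minimal sketch that maps the exit code to the matching hint. The helper name explain_curl_exit is hypothetical.

```shell
# Hypothetical helper: translate a curl exit code into the matching hint
# from the troubleshooting notes.
explain_curl_exit() {
  case "$1" in
    0)  echo "curl received a response from the server." ;;
    7)  echo "Could not connect: verify that port forwarding is active." ;;
    52) echo "Empty reply from server: the model might still be loading; check the Pod logs." ;;
    *)  echo "curl exit code $1: see the curl documentation for details." ;;
  esac
}

# Example usage:
# curl -s -o /dev/null http://127.0.0.1:8080/v1/chat/completions; explain_curl_exit $?
```

An exit code of 0 only means that the server answered. It does not mean that the request itself was valid.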

Observe model performance

To view the dashboards for observability metrics of a model, follow these steps:

In the Cloud de Confiance console, go to the Deployed Models page.

To view details about the specific deployment, including its metrics, logs, and dashboards, click the model name in the list.
In the model details page, click the Observability tab to view the following dashboards. If prompted, click Enable to enable metrics collection for the cluster.
- The Infrastructure usage dashboard displays utilization metrics.
- The DCGM dashboard displays DCGM metrics.
- If you are using vLLM, then the Model performance dashboard is available and displays metrics for the vLLM model performance.

You can also view metrics in the vLLM dashboard integration in Cloud Monitoring. These metrics are aggregated for all vLLM deployments with no pre-set filters.

vLLM exposes metrics in Prometheus format by default; you don't need to install an additional exporter. For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the vLLM observability guidance in the Cloud Monitoring documentation.
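
For a quick look at the raw metrics without a dashboard, you could filter the Prometheus text output down to the vLLM series. The following is a minimal sketch. The helper name filter_vllm_metrics is hypothetical, and both the /metrics path on the forwarded port and the vllm: metric name prefix are assumptions about this container image.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and keep
# only sample lines whose metric name starts with "vllm:". Comment lines such
# as "# HELP" and "# TYPE" are dropped because they start with "#".
filter_vllm_metrics() {
  grep '^vllm:'
}

# Example usage (the /metrics path is an assumption):
# curl -s http://127.0.0.1:8080/metrics | filter_vllm_metrics
```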

Clean up

To avoid incurring charges to your Cloud de Confiance account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Cloud de Confiance account for the resources that you created in this guide, run the following command:

gcloud container clusters delete CLUSTER_NAME \
    --location=REGION

Replace the following values:

REGION: the region of your cluster.
CLUSTER_NAME: the name of your cluster.
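
Deleting the cluster does not remove the Cloud Storage bucket or the Google Service Account that you created earlier. The following is a minimal sketch that prints the commands to delete them so that you can review the commands before running them. The helper name print_cleanup_commands is hypothetical.

```shell
# Hypothetical helper: print (without running) the commands that delete the
# model bucket and the service account created in this tutorial.
print_cleanup_commands() {
  bucket="$1"
  gsa_email="$2"
  project_id="$3"
  echo "gcloud storage rm --recursive gs://${bucket}"
  echo "gcloud iam service-accounts delete ${gsa_email} --project=${project_id}"
}

# Example usage:
# print_cleanup_commands "${MODEL_BUCKET_NAME}" "${GSA_EMAIL}" "${PROJECT_ID}"
```

Printing instead of executing is a deliberate choice here: both deletions are irreversible, so a review step is worth the extra copy and paste.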

What's next

Learn more about GPUs in GKE.
Learn how to use Gemma with vLLM on other accelerators, including A100 and H100 GPUs, by viewing the sample code in GitHub.
Learn how to deploy GPU workloads in Autopilot.
Explore the vLLM GitHub repository and documentation.
Explore the Vertex AI Model Garden.
Discover how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
