
Serve Gemma open models using GPUs on GKE with vLLM


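To deploy Gemma 4 large language models (LLMs) on Google Kubernetes Engine (GKE) using GPUs with the vLLM framework, you must provision a GKE cluster that has supported accelerators, such as NVIDIA H100 GPUs.

To deploy a Gemma 4 model, the prebuilt vLLM container is configured to load the model weights from the Cloud Storage bucket that the --model argument specifies.

After the weights are loaded, the vLLM container exposes an OpenAI-compatible API endpoint for high-throughput inference.

This tutorial is intended for machine learning (ML) engineers, platform administrators and operators, and data and AI specialists who want to use Kubernetes container orchestration to serve AI/ML workloads on H100 GPU hardware.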

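Before reading this page, make sure that you're familiar with the following:

Objectives

This tutorial provides a foundation for understanding and exploring practical LLM deployment for inference in a managed Kubernetes environment.

Prepare your environment with a GKE cluster in Autopilot mode.
Deploy a vLLM container to your cluster.
Use vLLM to serve the Gemma 4 model through a curl interface.

Before you begin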

In the Cloud de Confiance console, on the project selector page, select or create a Cloud de Confiance project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Go to project selector
Verify that billing is enabled for your Cloud de Confiance project.
Enable the required API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Enable the API

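Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
1. In the Cloud de Confiance console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Cloud de Confiance console, go to the IAM page.
  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the identifier for a user in a workforce identity pool. For details, see Represent workforce pool users in IAM policies, or contact your administrator.
5. Click Select a role, then search for the role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.

Make sure that your project has sufficient quota for H100 GPUs. To learn more, see About GPUs and Allocation quotas.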

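Prepare your environment

In this tutorial, you use kubectl and the gcloud CLI to manage resources hosted on Cloud de Confiance by S3NS. To access Cloud de Confiance by S3NS, authenticate to the gcloud CLI.

To set up your environment with the gcloud CLI, set the default environment variables: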

gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=u-france-east1
export CLUSTER_NAME=CLUSTER_NAME
export GSA_NAME=GSA_NAME
export KSA_NAME=KSA_NAME
export NAMESPACE=NAMESPACE
export PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
export MODEL_BUCKET_NAME=MODEL_BUCKET_NAME

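Replace the following values:

PROJECT_ID: your Cloud de Confiance project ID.
REGION: a region that supports H100 GPUs. This tutorial uses u-france-east1. You can check which GPUs are available in which regions.
CLUSTER_NAME: the name of your cluster.
GSA_NAME: the name of the Google service account, for example gemma-gsa.
KSA_NAME: the name of the Kubernetes service account, for example gemma-ksa.
NAMESPACE: the Kubernetes namespace, for example default.
MODEL_BUCKET_NAME: the name of the Cloud Storage bucket that stores the model weights. It can be the same as the name of the chosen model, for example gemma-4-26b-it.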
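
Later commands in this tutorial expand these variables, so a variable that is still empty can produce a malformed resource name. As a minimal sketch, the following helper reports which of the named variables are unset or empty. The function name require_vars is hypothetical and is not part of the gcloud CLI or kubectl.

```sh
# Hypothetical helper: list any of the named variables that are unset or empty.
# The body runs in a subshell ( ... ), so its temporary variables do not leak.
require_vars() (
  missing=0
  for name in "$@"; do
    # Indirect lookup: copy the value of the variable called $name into $value.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      printf 'Missing: %s\n' "${name}" >&2
      missing=1
    fi
  done
  exit "${missing}"   # exits only the subshell; becomes the function's status
)
```

For example, `require_vars PROJECT_ID REGION CLUSTER_NAME GSA_NAME KSA_NAME NAMESPACE PROJECT_NUMBER MODEL_BUCKET_NAME` returns a non-zero status and names each variable that still needs a value.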

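Create and configure Cloud de Confiance resources

Follow these instructions to create the required resources.

Create a GKE cluster and node pool

You can serve Gemma on GPUs in a GKE Autopilot cluster. Autopilot clusters provide a fully managed Kubernetes experience.

Autopilot

In the gcloud CLI, run the following command: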

gcloud container clusters create-auto CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=REGION \
    --release-channel=rapid

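Replace the following values:

PROJECT_ID: your Cloud de Confiance project ID.
CLUSTER_NAME: the name of your cluster.
REGION: the region where the cluster is located.

GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.

Create a Cloud Storage bucket

In the gcloud CLI, run the following command: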

gcloud storage buckets create gs://${MODEL_BUCKET_NAME} \
  --project=${PROJECT_ID} \
  --location=${REGION} \
  --uniform-bucket-level-access

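This command creates a Cloud Storage bucket to store the model files that you download from Hugging Face.

Download and upload the model weights:

You need to obtain the Gemma model weights for the version that you want to deploy (for example, from Hugging Face or another official source). Organize the downloaded files into local directories. For example:
- ./gemma-4-26b-it-local/ (contains all files for the 26B IT model)
- ./gemma-4-31b-it-local/ (contains all files for the 31B IT model)
Upload the contents of the directory for the model that you want to deploy to the root of its Cloud Storage bucket, which is where the deployment manifest expects them:
```
# Upload files for the 26B IT model
gcloud storage cp --recursive ./gemma-4-26b-it-local/* gs://${MODEL_BUCKET_NAME}

# Upload files for the 31B IT model
gcloud storage cp --recursive ./gemma-4-31b-it-local/* gs://${MODEL_BUCKET_NAME}
```
Both commands copy to the root of gs://${MODEL_BUCKET_NAME}, so run only the one for your model, or set MODEL_BUCKET_NAME to a different bucket for each model. This command structure ensures that model files are located at paths such as gs://${MODEL_BUCKET_NAME}/config.json.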
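
Uploading a wrong or incomplete directory is only noticed later, when the container fails to load the model. As a minimal sketch, the following hypothetical check_model_dir function confirms that a local directory contains config.json (the example file named above) and reports how much data would be uploaded. Which other files a given model needs is not covered here.

```sh
# Hypothetical pre-upload check: the directory must exist and hold config.json.
# The body runs in a subshell ( ... ), so its temporary variables do not leak.
check_model_dir() (
  dir="$1"
  if [ ! -d "${dir}" ]; then
    printf 'Not a directory: %s\n' "${dir}" >&2
    exit 1
  fi
  if [ ! -f "${dir}/config.json" ]; then
    printf 'config.json not found in %s\n' "${dir}" >&2
    exit 1
  fi
  # Report what would be uploaded: file count and total size on disk.
  count=$(find "${dir}" -type f | wc -l | tr -d ' ')
  size=$(du -sh "${dir}" | cut -f1)
  printf '%s: %s file(s), %s\n' "${dir}" "${count}" "${size}"
)
```

For example, run `check_model_dir ./gemma-4-26b-it-local` before the corresponding gcloud storage cp command.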

Configure Workload Identity to access Cloud Storage

To let Kubernetes Pods securely access the Cloud Storage bucket that contains the model weights, you need to configure GKE Workload Identity.

Create a Google service account (GSA):

gcloud iam service-accounts create ${GSA_NAME} \
  --project=${PROJECT_ID}

Determine and export the GSA email address:

The email address format depends on whether your ${PROJECT_ID} is domain-scoped (contains a colon).

if [[ "${PROJECT_ID}" == *:* ]]; then
  DOMAIN=$(echo "${PROJECT_ID}" | cut -d: -f1)
  PROJ_NAME=$(echo "${PROJECT_ID}" | cut -d: -f2)
  export GSA_EMAIL="${GSA_NAME}@${PROJ_NAME}.${DOMAIN}.s3ns.iam.gserviceaccount.com"
else
  export GSA_EMAIL="${GSA_NAME}@${PROJECT_ID}.s3ns.iam.gserviceaccount.com"
fi
echo "Using GSA Email: ${GSA_EMAIL}"

Create a Kubernetes service account (KSA):

This KSA is used in your deployment manifest.

kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}

Verify that the KSA was created:

kubectl get serviceaccounts --namespace ${NAMESPACE}
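
The Deployment manifests later in this tutorial do not show a serviceAccountName field, and a Pod without that field runs as the namespace's default service account. Assuming the Pods are meant to run as this KSA, one option is to patch the Deployment after you apply it. The sketch below only builds the JSON merge patch; ksa_patch_json is a hypothetical helper name.

```sh
# Hypothetical helper: print a JSON merge patch that sets serviceAccountName
# on a Deployment's Pod template. Usage: ksa_patch_json KSA_NAME
ksa_patch_json() {
  printf '{"spec":{"template":{"spec":{"serviceAccountName":"%s"}}}}\n' "$1"
}
```

After the Deployment exists, the patch could be applied with `kubectl patch deployment vllm-gemma-deployment --namespace "${NAMESPACE}" --type merge --patch "$(ksa_patch_json "${KSA_NAME}")"`. An equivalent approach is to add serviceAccountName under spec.template.spec in the manifest before you apply it.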

Annotate the KSA to link it to the GSA:

This annotation tells GKE which GSA the KSA can impersonate.

kubectl annotate serviceaccount ${KSA_NAME} \
  --namespace ${NAMESPACE} \
  iam.gke.io/gcp-service-account=${GSA_EMAIL}

Grant the KSA permission to impersonate the GSA:

This IAM binding on the GSA allows the KSA to act as the GSA.

if [[ $PROJECT_ID == *:* ]]; then
  DOMAIN=$(echo $PROJECT_ID | cut -d: -f1)
  PROJ_NAME=$(echo $PROJECT_ID | cut -d: -f2)
  export WI_MEMBER="serviceAccount:${PROJ_NAME}.${DOMAIN}.s3ns.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
else
  export WI_MEMBER="serviceAccount:${PROJECT_ID}.s3ns.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
fi

gcloud iam service-accounts add-iam-policy-binding ${GSA_EMAIL} \
  --role roles/iam.workloadIdentityUser \
  --member="${WI_MEMBER}" \
  --project=${PROJECT_ID}
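
The GSA email and the Workload Identity member are both derived from the project ID in the same way. As a minimal sketch, the following functions factor that parsing into one place. The function names are hypothetical, and the string formats are copied from the commands in this section rather than from a separate specification.

```sh
# Hypothetical helper: print the identity prefix for a project ID.
#   "my-project"             becomes "my-project"
#   "example.com:my-project" becomes "my-project.example.com"
project_identity_prefix() {
  case "$1" in
    *:*) printf '%s.%s\n' "${1#*:}" "${1%%:*}" ;;   # project part, then domain
    *)   printf '%s\n' "$1" ;;
  esac
}

# usage: gsa_email GSA_NAME PROJECT_ID
gsa_email() {
  printf '%s@%s.s3ns.iam.gserviceaccount.com\n' \
    "$1" "$(project_identity_prefix "$2")"
}

# usage: wi_member PROJECT_ID NAMESPACE KSA_NAME
wi_member() {
  printf 'serviceAccount:%s.s3ns.svc.id.goog[%s/%s]\n' \
    "$(project_identity_prefix "$1")" "$2" "$3"
}
```

For example, `gsa_email gemma-gsa example.com:my-project` prints gemma-gsa@my-project.example.com.s3ns.iam.gserviceaccount.com, which matches the domain-scoped branch of the if statement above.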

Grant the GSA permission to read from the bucket:

Grant the GSA the storage.objectViewer role on the bucket.

gcloud storage buckets add-iam-policy-binding gs://${MODEL_BUCKET_NAME} \
  --member="serviceAccount:${GSA_EMAIL}" \
  --role="roles/storage.objectViewer" \
  --project=${PROJECT_ID}

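Deploy Gemma 4 models on vLLM

To deploy a Gemma 4 model, create a Cloud Storage bucket for each model to store its weights, and apply the Kubernetes Deployment manifest for your chosen model size. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Procedure

Applying this manifest pulls the vLLM container image, requests an NVIDIA GPU, and automatically connects to the model weights in the Cloud Storage bucket to start the vLLM inference engine.

Gemma 4 26B-A4B-it

Follow these instructions to deploy the Gemma 4 26B-A4B instruction-tuned model.

Create the following vllm-4-26b-a4b-it.yaml manifest: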

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: a3-edgegpu-8g-nolssd
spec:
  priorities:
  - machineType: a3-edgegpu-8g-nolssd
    gpu:
      count: 8
      type: nvidia-h100-80gb
  nodePoolAutoCreation:
    enabled: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-26b-a4b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
        command: ["./entrypoint.sh"] # Use the image's entrypoint
        args:
        - "python"
        - "-m"
        - "vllm.entrypoints.api_server"
        - "--host=0.0.0.0"
        - "--port=8080"
        - "--model=gs://gemma-4-26b-it" # YOUR Cloud Storage PATH
        - "--tensor-parallel-size=1"
        - "--enable-log-requests"
        - "--enable-chunked-prefill"
        - "--enable-prefix-caching"
        - "--enable-auto-tool-choice"
        - "--generation-config=auto"
        - "--dtype=bfloat16"
        - "--max-num-seqs=16"
        - "--max-model-len=16384"
        - "--gpu-memory-utilization=0.95"
        - "--limit_mm_per_prompt.image=1"
        - "--tool-call-parser=gemma4"
        - "--reasoning-parser=gemma4"
        - "--trust-remote-code"
        ports:
        - containerPort: 8080
        env:
        - name: GOOGLE_CLOUD_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: CLOUDSDK_CORE_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: GCS_URI_ARG_KEY
          value: "model"
        - name: GCS_URI_ENV_KEY
          value: "AIP_STORAGE_URI"
        - name: LORA_ADAPTER_ARG_KEY
          value: "lora-modules"
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "1"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/compute-class: a3-edgegpu-8g-nolssd
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080

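Apply the manifest:
```
kubectl apply -f vllm-4-26b-a4b-it.yaml
```
In this example, the vLLM option --max-model-len=16384 limits the context window size to 16K. If you need a larger context window (up to 128K), adjust the manifest and node pool configuration to add GPU capacity.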

Gemma 4 31B-it

Follow these instructions to deploy the Gemma 4 31B instruction-tuned model.

Create the following vllm-4-31b-it.yaml manifest:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: a3-edgegpu-8g-nolssd
spec:
  priorities:
  - machineType: a3-edgegpu-8g-nolssd
    gpu:
      count: 8
      type: nvidia-h100-80gb
  nodePoolAutoCreation:
    enabled: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-31b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
        command: ["./entrypoint.sh"] # Use the image's entrypoint
        args:
        - "python"
        - "-m"
        - "vllm.entrypoints.api_server"
        - "--host=0.0.0.0"
        - "--port=8080"
        - "--model=gs://gemma-4-31b-it" # YOUR Cloud Storage PATH
        - "--tensor-parallel-size=1"
        - "--enable-log-requests"
        - "--enable-chunked-prefill"
        - "--enable-prefix-caching"
        - "--enable-auto-tool-choice"
        - "--generation-config=auto"
        - "--dtype=bfloat16"
        - "--max-model-len=16384"
        - "--max-num-seqs=16"
        - "--gpu-memory-utilization=0.95"
        - "--trust-remote-code"
        - "--tool-call-parser=gemma4"
        - "--reasoning-parser=gemma4"
        ports:
        - containerPort: 8080
        env:
        - name: GOOGLE_CLOUD_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: CLOUDSDK_CORE_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: GCS_URI_ARG_KEY
          value: "model"
        - name: GCS_URI_ENV_KEY
          value: "AIP_STORAGE_URI"
        - name: LORA_ADAPTER_ARG_KEY
          value: "lora-modules"
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "1"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/compute-class: a3-edgegpu-8g-nolssd
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080

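Apply the manifest:
```
kubectl apply -f vllm-4-31b-it.yaml
```
In this example, the vLLM option --max-model-len=16384 limits the context window size to 16K. If you need a larger context window (up to 128K), adjust the manifest and node pool configuration to add GPU capacity.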

Verify

Wait for the Deployment to become available:

kubectl wait --for=condition=Available --timeout=1800s deployment/vllm-gemma-deployment

View the logs from the running Deployment:

kubectl logs -f -l app=gemma-server

The Deployment resource downloads the Gemma model data. This process can take a few minutes. The output is similar to the following:

  ...
  ...
  (APIServer pid=1) INFO:     Started server process [1]
  (APIServer pid=1) INFO:     Waiting for application startup.
  (APIServer pid=1) INFO:     Application startup complete.

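After the Deployment is available, set up port forwarding to interact with the model.

Serve the model

In this section, you interact with the model. Make sure that the model is fully downloaded before you proceed.

Set up port forwarding

Run the following command to set up port forwarding to the model: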

kubectl port-forward svc/llm-service 8080:8080 --namespace default &

The output is similar to the following:

Forwarding from 127.0.0.1:8080 -> 8080
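
Port forwarding can be active before the server is ready to answer requests. As a minimal sketch, the following hypothetical retry_until helper reruns a command until it succeeds or a fixed number of attempts is used up, which is one way to wait for the endpoint.

```sh
# Hypothetical helper: run a command until it succeeds or attempts run out.
# usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# The body runs in a subshell ( ... ), so its temporary variables do not leak.
retry_until() (
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "${i}" -le "${attempts}" ]; do
    if "$@"; then
      exit 0
    fi
    printf 'Attempt %s/%s failed\n' "${i}" "${attempts}" >&2
    i=$((i + 1))
    # Sleep between attempts, but not after the last one.
    if [ "${i}" -le "${attempts}" ]; then
      sleep "${delay}"
    fi
  done
  exit 1
)
```

For example, `retry_until 30 10 curl --silent --fail --output /dev/null http://127.0.0.1:8080/v1/models` polls for up to about five minutes. The /v1/models path is an assumption based on the server being OpenAI-compatible; this tutorial itself only uses /v1/chat/completions.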

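Interact with the model using curl

This section shows how to perform a basic smoke test to verify the deployed Gemma 4 instruction-tuned models. For other models, replace gemma-4-26B-A4B-it with the name of the corresponding model.

This example shows how to test the Gemma 4 26B instruction-tuned model with text-only input.

In a new terminal session, use curl to chat with the model: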

curl http://127.0.0.1:8080/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
        {
          "role": "user",
          "content": "Why is the sky blue?"
        }
    ],
    "chat_template_kwargs": {
         "enable_thinking": true
    },
    "skip_special_tokens": false
}'

The output is similar to the following:

{
  "id": "chatcmpl-be75ccfcbdf753d1",
  "object": "chat.completion",
  "created": 1775006187,
  "model": "google/gemma-4-26B-A4B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The short answer is a phenomenon called **Rayleigh scattering**.\n\nTo understand how it works, you have to look at three things: sunlight, the Earth's atmosphere, and how light travels.\n\n### 1. Sunlight is a Rainbow\nAlthough sunlight looks white to us, it is actually made up of all the colors of the rainbow (red, orange, yellow, green, blue, indigo, and violet). Light travels as **waves**, and each color has a different wavelength:\n*   **Red light** travels in long, lazy, wide waves.\n*   **Blue and violet light** travel in short, choppy, tight waves.\n\n### 2. The Atmosphere is an Obstacle Course\nEarth's atmosphere is filled with gases (mostly nitrogen and oxygen). As sunlight travels through the atmosphere, it strikes the molecules of these gases. \n\nBecause the gas molecules are very small, they affect the colors differently based on their wavelength:\n*   The **long waves** (reds and yellows) pass through the atmosphere mostly straight, without hitting much. They are like large ocean waves that roll right over small pebbles.\n*   The **short waves** (blues and violets) strike the gas molecules and get scattered in every direction. They are like small ripples that hit a pebble and splash everywhere.\n\nBecause this blue light is being scattered in every direction by the air, when you look up, your eyes are catching that scattered blue light coming from every part of the sky.\n\n### 3. Why isn't the sky violet?\nIf violet light has an even shorter wavelength than blue light, you might wonder why the sky doesn't look purple. There are two main reasons:\n1.  **The Sun's output:** The Sun emits much more blue light than violet light.\n2.  **Human Biology:** Human eyes are much more sensitive to blue than to violet. Our eyes interpret the scatter of mixed blue and violet light simply as pale blue.\n\n---\n\n### Bonus: Why are sunsets red?\nWhen the sun is setting, it is much lower on the horizon. This means the sunlight has to travel through a much **thicker** layer of the atmosphere to reach your eyes. \n\nBy the time the light gets to you, the blue light has been scattered away completely. Only the long-wavelength colors—the reds, oranges, and pinks—are able to make it through that thick layer of air without being scattered away, creating the beautiful colors of a sunset.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "\"Why is the sky blue?\"\nScience/Physics (Atmospheric scattering).\nExplain the phenomenon clearly, accurately, and engagingly.\n\n    *   Sunlight (white light) is made of a spectrum of colors (ROYGBIV).\n    *   Earth's atmosphere is filled with gases (nitrogen, oxygen) and particles.\n    *   Rayleigh Scattering: Shorter wavelengths (blue/violet) scatter more easily than longer wavelengths (red/yellow) when hitting small gas molecules.\n    *   The human eye's sensitivity: Why isn't it violet?\n\n    A good scientific explanation should follow a logical flow:\n    *   *Direct Answer:* The core mechanism (Rayleigh Scattering).\n    *   *The Components:* Sunlight and the Atmosphere.\n    *   *The Mechanism:* How light interacts with gas molecules.\n    *   *The Wavelength Factor:* Comparing colors.\n    *   *The \"Wait, why not violet?\" question:* Addressing human perception.\n    *   *Bonus/Related concept:* Why sunsets are red.\n\n        *   Use the term **Rayleigh Scattering**.\n        *   Summarize: Its how sunlight interacts with the Earth's atmosphere.\n\n        *   Sunlight looks white, but it's actually a mix of all colors (the rainbow).\n        *   Each color travels as a different wavelength. Red = long/lazy waves; Blue/Violet = short/choppy waves.\n\n        *   The atmosphere is mostly Nitrogen and Oxygen.\n        *   When sunlight hits these tiny gas molecules, the light gets scattered in all directions.\n\n        *   Blue light travels in shorter, smaller waves.\n        *   Because these waves are small, they strike the gas molecules more frequently and get scattered more easily than the longer red/yellow waves.\n        *   Result: When you look up, your eyes are catching this \"scattered\" blue light coming from every direction.\n\n        *   *Technically*, violet light has an even shorter wavelength than blue, so it scatters *even more*. Why isn't the sky violet?\n        *   Two reasons: 1. The Sun emits more blue light than violet light. 2. Human eyes are much more sensitive to blue than violet.\n\n        *   Briefly mention sunsets to provide a complete picture.\n        *   At sunset, light travels through *more* atmosphere. The blue is scattered away completely, leaving only the long red/orange waves to reach your eyes.\n\n    *   *Tone Check:* Is it too academic? Use analogies (like waves in water or skipping stones) if needed, but keep it concise.\n    *   *Clarity:* Ensure the distinction between wavelength and scattering is clear."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 106,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 21,
    "total_tokens": 1122,
    "completion_tokens": 1101,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
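
The request shown earlier in this section hard-codes the model name and the prompt. As a minimal sketch, a hypothetical chat_payload function can build the same request body with those values passed as arguments, which helps when you test a different model. The function is not part of vLLM or curl, and it does no JSON escaping.

```sh
# Hypothetical helper: print a chat-completions request body with the same
# fields as the curl example in this section.
# usage: chat_payload MODEL PROMPT [ENABLE_THINKING]
# The prompt is inserted verbatim, so it must not contain double quotes,
# backslashes, or newlines.
chat_payload() {
  printf '{\n'
  printf '  "model": "%s",\n' "$1"
  printf '  "messages": [{"role": "user", "content": "%s"}],\n' "$2"
  printf '  "chat_template_kwargs": {"enable_thinking": %s},\n' "${3:-true}"
  printf '  "skip_special_tokens": false\n'
  printf '}\n'
}
```

For example, `curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d "$(chat_payload google/gemma-4-26B-A4B-it "Why is the sky blue?")"` sends the same request as before. Passing false as the third argument turns off the enable_thinking option.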

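Troubleshooting

If you receive an Empty reply from server message, the container might not have finished downloading the model data. Check the Pod's logs again for the Connected message, which indicates that the model is ready to serve.
If you see Connection refused, verify that your port forwarding is active.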
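
In a script, the two cases above can be told apart by the exit status of curl: status 7 means that curl failed to connect, and status 52 means that the server returned an empty reply. As a minimal sketch, the following hypothetical helper maps those statuses to the hints above.

```sh
# Hypothetical helper: map a curl exit status to a troubleshooting hint.
# usage: explain_curl_status STATUS
explain_curl_status() {
  case "$1" in
    0)  echo "OK: the server answered." ;;
    7)  echo "Connection refused: check that kubectl port-forward is still running." ;;
    52) echo "Empty reply: the model may still be loading; check the Pod logs." ;;
    *)  echo "curl exited with status $1; see the curl manual for its meaning." ;;
  esac
}
```

For example, run a curl request and then call `explain_curl_status $?` on the next line.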

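Observe model performance

To view the dashboards for the model's observability metrics, follow these steps:

In the Cloud de Confiance console, go to the Deployed Models page.

Go to Deployed Models
To view details about a specific deployment, including its metrics, logs, and dashboards, click the model name in the list.
On the model details page, click the Observability tab to view the following dashboards. If prompted, click Enable to enable metrics collection for the cluster.
- The Infrastructure Usage dashboard displays utilization metrics.
- The DCGM dashboard displays DCGM metrics.
- If you're using vLLM, the Model Performance dashboard is available and displays metrics for vLLM model performance.

You can also view metrics in the vLLM dashboard integration in Cloud Monitoring. These metrics are aggregated for all vLLM deployments, with no preset filters.

vLLM exposes metrics in Prometheus format by default; you don't need to install an additional exporter. To learn how to use Google Cloud Managed Service for Prometheus to collect metrics from your model, see the vLLM observability guide in the Cloud Monitoring documentation.

Clean up

To avoid incurring charges to your Cloud de Confiance account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Cloud de Confiance account for the resources that you created in this guide, run the following command: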

gcloud container clusters delete CLUSTER_NAME \
    --location=REGION

Replace the following values:

REGION: the region where the cluster is located.
CLUSTER_NAME: the name of your cluster.
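
The command above deletes only the cluster. The Cloud Storage bucket and the Google service account that you created earlier remain until you delete them. As a minimal sketch, the following hypothetical helper prints, without running them, the commands that would remove those two resources, so that you can review them first.

```sh
# Hypothetical helper: print (not run) commands that remove the bucket and
# the Google service account created in this tutorial.
# usage: print_cleanup_commands BUCKET_NAME GSA_EMAIL PROJECT_ID
print_cleanup_commands() {
  # Deletes the bucket together with every object in it.
  printf 'gcloud storage rm --recursive gs://%s\n' "$1"
  printf 'gcloud iam service-accounts delete %s --project=%s\n' "$2" "$3"
}
```

For example, `print_cleanup_commands "${MODEL_BUCKET_NAME}" "${GSA_EMAIL}" "${PROJECT_ID}"` prints both commands with your values filled in. Deleting the bucket also deletes the uploaded model weights.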

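What's next

Learn more about GPUs in GKE.
View the sample code in GitHub to learn how to use Gemma with vLLM on other accelerators, including A100 and H100 GPUs.
Learn how to deploy GPU workloads in Autopilot.
Explore the vLLM GitHub repository and documentation.
Explore the Vertex AI Model Garden.
Learn how to run optimized AI/ML workloads with GKE platform orchestration capabilities.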
