Serve Gemma open models using GPUs on GKE with vLLM

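To deploy Gemma 4 large language models (LLMs) on Google Kubernetes Engine (GKE) by using GPUs with the vLLM framework, you must provision a GKE cluster that has supported accelerators, such as NVIDIA H100 GPUs.

To deploy a Gemma 4 model, a prebuilt vLLM container is configured to load the model weights from the Cloud Storage bucket that the --model argument specifies.

After the weights load, the vLLM container exposes an OpenAI-compatible API endpoint for high-throughput inference.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who want to use Kubernetes container orchestration capabilities to serve AI/ML workloads on H100 GPU hardware.

Before reading this page, make sure that you're familiar with the following: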

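Objectives

This tutorial gives you a foundation for understanding and exploring a practical LLM deployment for inference in a managed Kubernetes environment.

  1. Prepare your environment with a GKE cluster in Autopilot mode.
  2. Deploy a vLLM container to your cluster.
  3. Use vLLM to serve the Gemma 4 model through curl.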

Before you begin

  • In the Cloud de Confiance console, on the project selector page, select or create a Cloud de Confiance project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.


  • Verify that billing is enabled for your Cloud de Confiance project.

  • Enable the required API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.


  • Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin

    Check for the roles

    1. In the Cloud de Confiance console, go to the IAM page.

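    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Cloud de Confiance console, go to the IAM page.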

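    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the identifier for a user in a workforce identity pool. For details, see Represent workforce pool users in IAM policies, or contact your administrator.

    5. Click Select a role, then search for the role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.
  • Make sure that your project has sufficient quota for H100 GPUs. To learn more, see About GPUs and Allocation quotas.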

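Prepare your environment

In this tutorial, you use kubectl and the gcloud CLI to manage resources hosted on Cloud de Confiance by S3NS. You can use the gcloud CLI to authorize access to Cloud de Confiance by S3NS.

To set up your environment with the gcloud CLI, follow these steps:

  1. In the gcloud CLI, set the default environment variables: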

    gcloud config set project PROJECT_ID
    gcloud config set billing/quota_project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export REGION=u-france-east1
    export CLUSTER_NAME=CLUSTER_NAME
    export GSA_NAME=GSA_NAME
    export KSA_NAME=KSA_NAME
    export NAMESPACE=NAMESPACE
    export PROJECT_NUMBER=$(gcloud projects describe "${PROJECT_ID}" --format="value(projectNumber)")
    export MODEL_BUCKET_NAME=MODEL_BUCKET_NAME
    

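    Replace the following values:

    • PROJECT_ID: your Cloud de Confiance project ID.
    • REGION: a region that supports H100 GPUs; this tutorial uses u-france-east1. You can check which GPUs are available in which regions.
    • CLUSTER_NAME: the name of your cluster.
    • GSA_NAME: the name of the Google service account, for example gemma-gsa.
    • KSA_NAME: the name of the Kubernetes ServiceAccount, for example gemma-ksa.
    • NAMESPACE: the Kubernetes namespace, for example default.
    • MODEL_BUCKET_NAME: the name of the Cloud Storage bucket that stores the model weights. It can match the name of the model that you choose, for example gemma-4-26b-it.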
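Later steps fail in confusing ways if one of these variables is empty, so it can help to check them before continuing. The following is a minimal sketch in POSIX shell; check_tutorial_env is a hypothetical helper name, not part of the gcloud CLI or kubectl, and the variable list is taken from the list above.

```shell
# Report any tutorial variable that is unset or empty.
# Returns 0 when every variable has a value, 1 otherwise.
# check_tutorial_env is an illustrative helper, not a gcloud or kubectl command.
check_tutorial_env() {
  missing=0
  for name in PROJECT_ID REGION CLUSTER_NAME GSA_NAME KSA_NAME NAMESPACE MODEL_BUCKET_NAME; do
    # eval is safe here: $name only ever comes from the fixed list above.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run check_tutorial_env in the same shell session in which you exported the variables; it prints one line per missing variable.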

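Create and configure Cloud de Confiance resources

Follow these instructions to create the required resources.

Create a GKE cluster and node pool

You can serve Gemma on GPUs in a GKE Autopilot cluster. Autopilot clusters provide a fully managed Kubernetes experience.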


In the gcloud CLI, run the following command:

gcloud container clusters create-auto CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=REGION \
    --release-channel=rapid

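Replace the following values:

  • PROJECT_ID: your Cloud de Confiance project ID.
  • CLUSTER_NAME: the name of your cluster.
  • REGION: the region for your cluster.

GKE creates an Autopilot cluster with the CPU and GPU nodes that the deployed workloads request.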

Create a Cloud Storage bucket

  1. In the gcloud CLI, run the following command:

    gcloud storage buckets create gs://${MODEL_BUCKET_NAME} \
      --project=${PROJECT_ID} \
      --location=${REGION} \
      --uniform-bucket-level-access
    

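    This creates a Cloud Storage bucket to store the model files that you download from Hugging Face.

  2. Download and upload the model weights:

    Get the Gemma model weights for the version that you plan to serve, for example from Hugging Face or another official source. Organize the downloaded files into a local directory. For example:

    • ./gemma-4-26b-it-local/ (contains all files for the 26B IT model)
    • ./gemma-4-31b-it-local/ (contains all files for the 31B IT model)

    Upload the contents of the directory for the model that you plan to serve to the root of that model's Cloud Storage bucket. Each model needs its own bucket, so set MODEL_BUCKET_NAME to the matching bucket and run only the corresponding command: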

    # Upload files for the 26B IT model
    gcloud storage cp --recursive ./gemma-4-26b-it-local/* gs://${MODEL_BUCKET_NAME}
    
    # Upload files for the 31B IT model
    gcloud storage cp --recursive ./gemma-4-31b-it-local/* gs://${MODEL_BUCKET_NAME}
    

    This command structure makes sure that the model files are at paths such as gs://${MODEL_BUCKET_NAME}/config.json.
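Because the upload copies the contents of the local directory to the bucket root, config.json has to sit directly inside that directory. The following sketch checks this before you upload; check_model_dir is a hypothetical helper name, and config.json is the only file name it assumes, taken from the path above.

```shell
# Check that a local model directory has config.json at its top level,
# so that the upload places it at gs://BUCKET/config.json.
# check_model_dir is an illustrative helper, not a gcloud command.
check_model_dir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "Not a directory: $dir" >&2
    return 1
  fi
  if [ ! -f "$dir/config.json" ]; then
    echo "No config.json directly under $dir (is the model nested one level down?)" >&2
    return 1
  fi
  echo "OK: $dir"
}
```

For example, run check_model_dir ./gemma-4-26b-it-local before the matching gcloud storage cp command.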

Configure Workload Identity for Cloud Storage access

To let your Kubernetes Pods securely access the Cloud Storage bucket that contains the model weights, configure GKE Workload Identity.

  1. Create a Google service account (GSA):

    gcloud iam service-accounts create ${GSA_NAME} \
      --project=${PROJECT_ID}
    
  2. Determine and export the GSA email address:

    The email format depends on whether your ${PROJECT_ID} is domain-scoped (contains a colon).

    if [[ $PROJECT_ID == *:* ]]; then
      DOMAIN=$(echo $PROJECT_ID | cut -d: -f1)
      PROJ_NAME=$(echo $PROJECT_ID | cut -d: -f2)
      export GSA_EMAIL="${GSA_NAME}@${PROJ_NAME}.${DOMAIN}.s3ns.iam.gserviceaccount.com"
    else
      export GSA_EMAIL="${GSA_NAME}@${PROJECT_ID}.s3ns.iam.gserviceaccount.com"
    fi
    echo "Using GSA Email: ${GSA_EMAIL}"
    
  3. Create a Kubernetes ServiceAccount (KSA):

    Your Deployment manifest uses this KSA.

    kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}
    

    Verify the creation:

    kubectl get serviceaccounts --namespace ${NAMESPACE}
    
  4. Annotate the KSA to link it to the GSA:

    This annotation tells GKE which GSA the KSA can impersonate.

    kubectl annotate serviceaccount ${KSA_NAME} \
      --namespace ${NAMESPACE} \
      iam.gke.io/gcp-service-account=${GSA_EMAIL}
    
  5. Grant the KSA permission to impersonate the GSA:

    This IAM binding on the GSA allows the KSA to act as the GSA.

    if [[ $PROJECT_ID == *:* ]]; then
      DOMAIN=$(echo $PROJECT_ID | cut -d: -f1)
      PROJ_NAME=$(echo $PROJECT_ID | cut -d: -f2)
      export WI_MEMBER="serviceAccount:${PROJ_NAME}.${DOMAIN}.s3ns.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
    else
      export WI_MEMBER="serviceAccount:${PROJECT_ID}.s3ns.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
    fi
    
    gcloud iam service-accounts add-iam-policy-binding ${GSA_EMAIL} \
      --role roles/iam.workloadIdentityUser \
      --member="${WI_MEMBER}" \
      --project=${PROJECT_ID}
    
  6. Grant the GSA permission to read from the bucket:

    Grant the GSA the storage.objectViewer role on the bucket.

    gcloud storage buckets add-iam-policy-binding gs://${MODEL_BUCKET_NAME} \
      --member="serviceAccount:${GSA_EMAIL}" \
      --role="roles/storage.objectViewer" \
      --project=${PROJECT_ID}
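Steps 2 and 5 apply the same rule twice: a domain-scoped project ID of the form DOMAIN:PROJECT becomes PROJECT.DOMAIN inside the identity string. The following sketch factors that rule into one place so that you can inspect both values before running the gcloud commands. The function names project_host, gsa_email, and wi_member are hypothetical; the string formats are copied from the steps above.

```shell
# Turn a project ID into the host part used in the identity strings.
# "example.com:my-project" -> "my-project.example.com"; "my-project" -> "my-project"
project_host() {
  case "$1" in
    *:*) echo "${1#*:}.${1%%:*}" ;;
    *)   echo "$1" ;;
  esac
}

# usage: gsa_email GSA_NAME PROJECT_ID
gsa_email() {
  echo "$1@$(project_host "$2").s3ns.iam.gserviceaccount.com"
}

# usage: wi_member PROJECT_ID NAMESPACE KSA_NAME
wi_member() {
  echo "serviceAccount:$(project_host "$1").s3ns.svc.id.goog[$2/$3]"
}
```

For example, gsa_email "${GSA_NAME}" "${PROJECT_ID}" should print the same value as GSA_EMAIL, and wi_member "${PROJECT_ID}" "${NAMESPACE}" "${KSA_NAME}" the same value as WI_MEMBER.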
    

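Deploy a Gemma 4 model on vLLM

To deploy a Gemma 4 model, create a Cloud Storage bucket for each model to store its weights, and apply the Kubernetes Deployment manifest for the model size that you choose. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Procedure

Applying the manifest pulls the vLLM container image, requests an NVIDIA GPU, and automatically connects to the model weights in your Cloud Storage bucket to start the vLLM inference engine.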
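The manifests in this section hard-code an example bucket in the --model argument, so that value has to match your own bucket before you apply them. You can edit the file by hand, or rewrite the argument on the fly as in the following sketch. set_model_bucket is a hypothetical helper name, and the sketch assumes that the --model value is a double-quoted gs:// URI on a single line, as it is in the manifests here.

```shell
# Rewrite the --model argument of a manifest read from stdin so that it
# points at the given bucket. All other lines pass through unchanged.
# set_model_bucket is an illustrative helper, not a kubectl feature.
set_model_bucket() {
  bucket="$1"
  sed "s|--model=gs://[^\"]*|--model=gs://${bucket}|"
}
```

For example, set_model_bucket "${MODEL_BUCKET_NAME}" < vllm-4-26b-a4b-it.yaml | kubectl apply -f - applies the rewritten manifest without changing the file on disk.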

Gemma 4 26B-A4B-it

Follow these instructions to deploy the Gemma 4 26B-A4B instruction-tuned model.

  1. Create the following vllm-4-26b-a4b-it.yaml manifest:

    apiVersion: cloud.google.com/v1
    kind: ComputeClass
    metadata:
      name: a3-edgegpu-8g-nolssd
    spec:
      priorities:
      - machineType: a3-edgegpu-8g-nolssd
        gpu:
          count: 8
          type: nvidia-h100-80gb
      nodePoolAutoCreation:
        enabled: true
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-gemma-deployment
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gemma-server
      template:
        metadata:
          labels:
            app: gemma-server
            ai.gke.io/model: gemma-4-26b-a4b-it
            ai.gke.io/inference-server: vllm
            examples.ai.gke.io/source: user-guide
        spec:
          serviceAccountName: KSA_NAME # Replace with your Kubernetes ServiceAccount name
          containers:
          - name: inference-server
            image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
            resources:
              requests:
                cpu: "20"
                memory: "80Gi"
                ephemeral-storage: "80Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "20"
                memory: "80Gi"
                ephemeral-storage: "80Gi"
                nvidia.com/gpu: "1"
            command: ["./entrypoint.sh"] # Use the image's entrypoint
            args:
            - "python"
            - "-m"
            - "vllm.entrypoints.api_server"
            - "--host=0.0.0.0"
            - "--port=8080"
            - "--model=gs://gemma-4-26b-it" # YOUR Cloud Storage PATH
            - "--tensor-parallel-size=1"
            - "--max-num-seqs=128"
            - "--gpu-memory-utilization=0.9"
            - "--limit_mm_per_prompt.image=1"
            - "--enable-auto-tool-choice"
            - "--tool-call-parser=gemma4"
            - "--reasoning-parser=gemma4"
            ports:
            - containerPort: 8080
            env:
            - name: GOOGLE_CLOUD_UNIVERSE_DOMAIN
              value: "s3nsapis.fr"
            - name: CLOUDSDK_CORE_UNIVERSE_DOMAIN
              value: "s3nsapis.fr"
            - name: GCS_URI_ARG_KEY
              value: "model"
            - name: GCS_URI_ENV_KEY
              value: "AIP_STORAGE_URI"
            - name: LORA_ADAPTER_ARG_KEY
              value: "lora-modules"
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          nodeSelector:
            cloud.google.com/compute-class: a3-edgegpu-8g-nolssd
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: llm-service
    spec:
      selector:
        app: gemma-server
      type: ClusterIP
      ports:
        - protocol: TCP
          port: 8080
          targetPort: 8080
    
    
  2. Apply the manifest:

    kubectl apply -f vllm-4-26b-a4b-it.yaml
    

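    Optionally, you can limit the context window size to 16K with the vLLM option --max-model-len=16384. If you need a larger context window (up to 128K), adjust the manifest and node pool configuration to add GPU capacity.

Gemma 4 31B-it

Follow these instructions to deploy the Gemma 4 31B instruction-tuned model.

  1. Create the following vllm-4-31b-it.yaml manifest: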

    apiVersion: cloud.google.com/v1
    kind: ComputeClass
    metadata:
      name: a3-edgegpu-8g-nolssd
    spec:
      priorities:
      - machineType: a3-edgegpu-8g-nolssd
        gpu:
          count: 8
          type: nvidia-h100-80gb
      nodePoolAutoCreation:
        enabled: true
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-gemma-deployment
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gemma-server
      template:
        metadata:
          labels:
            app: gemma-server
            ai.gke.io/model: gemma-4-31b-it
            ai.gke.io/inference-server: vllm
            examples.ai.gke.io/source: user-guide
        spec:
          serviceAccountName: KSA_NAME # Replace with your Kubernetes ServiceAccount name
          containers:
          - name: inference-server
            image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
            resources:
              requests:
                cpu: "20"
                memory: "80Gi"
                ephemeral-storage: "80Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "20"
                memory: "80Gi"
                ephemeral-storage: "80Gi"
                nvidia.com/gpu: "1"
            command: ["./entrypoint.sh"] # Use the image's entrypoint
            args:
            - "python"
            - "-m"
            - "vllm.entrypoints.api_server"
            - "--host=0.0.0.0"
            - "--port=8080"
            - "--model=gs://gemma-4-31b-it" # YOUR Cloud Storage PATH
            - "--tensor-parallel-size=1"
            - "--max-num-seqs=128"
            - "--gpu-memory-utilization=0.9"
            - "--limit_mm_per_prompt.image=1"
            - "--enable-auto-tool-choice"
            - "--tool-call-parser=gemma4"
            - "--reasoning-parser=gemma4"
            ports:
            - containerPort: 8080
            env:
            - name: GOOGLE_CLOUD_UNIVERSE_DOMAIN
              value: "s3nsapis.fr"
            - name: CLOUDSDK_CORE_UNIVERSE_DOMAIN
              value: "s3nsapis.fr"
            - name: GCS_URI_ARG_KEY
              value: "model"
            - name: GCS_URI_ENV_KEY
              value: "AIP_STORAGE_URI"
            - name: LORA_ADAPTER_ARG_KEY
              value: "lora-modules"
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          nodeSelector:
            cloud.google.com/compute-class: a3-edgegpu-8g-nolssd
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: llm-service
    spec:
      selector:
        app: gemma-server
      type: ClusterIP
      ports:
        - protocol: TCP
          port: 8080
          targetPort: 8080
    
    
  2. Apply the manifest:

    kubectl apply -f vllm-4-31b-it.yaml
    

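    The manifest doesn't set a context window limit. Optionally, you can limit the context window size to 16K by adding the vLLM option --max-model-len=16384 to the container args. If you need a larger context window (up to 128K), adjust the manifest and node pool configuration to add GPU capacity.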

Verify

  1. Wait for the Deployment to become available:

    kubectl wait --for=condition=Available --timeout=1800s deployment/vllm-gemma-deployment
    
  2. View the logs from the running Deployment:

    kubectl logs -f -l app=gemma-server
    

    The Deployment resource downloads the Gemma model data. This process can take a few minutes. The output is similar to the following:

      ...
      ...
      (APIServer pid=1) INFO:     Started server process [1]
      (APIServer pid=1) INFO:     Waiting for application startup.
      (APIServer pid=1) INFO:     Application startup complete.
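Instead of watching the log stream by hand, you can poll for the startup line shown above. The following sketch retries a log command until the text Application startup complete appears; wait_for_startup is a hypothetical helper name, and the retry count and pause are values that you choose.

```shell
# Poll a log-producing command until the startup message appears.
# usage: wait_for_startup MAX_TRIES SLEEP_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the message is seen, 1 after MAX_TRIES attempts.
wait_for_startup() {
  tries="$1"
  pause="$2"
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" 2>/dev/null | grep -q "Application startup complete"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  return 1
}
```

For example, wait_for_startup 60 30 kubectl logs -l app=gemma-server --tail=-1 checks every 30 seconds for up to 30 minutes. The --tail=-1 flag matters because kubectl logs returns only the most recent lines by default when you select Pods by label.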
    

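After the Deployment is available, set up port forwarding to interact with the model.

Serve the model

In this section, you interact with the model. Make sure that the model is fully downloaded before you continue.

Set up port forwarding

Run the following command to set up port forwarding to the model: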

kubectl port-forward svc/llm-service 8080:8080 --namespace default &

The output is similar to the following:

Forwarding from 127.0.0.1:8080 -> 8080

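Interact with the model by using curl

This section shows how to run a basic smoke test to verify the deployed Gemma 4 instruction-tuned model. For other models, replace gemma-4-26B-A4B-it with the name of the respective model.

This example shows how to test the Gemma 4 26B instruction-tuned model with text-only input.

In a new terminal session, use curl to chat with the model: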

curl http://127.0.0.1:8080/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
        {
          "role": "user",
          "content": "Why is the sky blue?"
        }
    ],
    "chat_template_kwargs": {
         "enable_thinking": true
    },
    "skip_special_tokens": false
}'

The output is similar to the following:

{
  "id": "chatcmpl-be75ccfcbdf753d1",
  "object": "chat.completion",
  "created": 1775006187,
  "model": "google/gemma-4-26B-A4B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The short answer is a phenomenon called **Rayleigh scattering**.\n\nTo understand how it works, you have to look at three things: sunlight, the Earth's atmosphere, and how light travels.\n\n### 1. Sunlight is a Rainbow\nAlthough sunlight looks white to us, it is actually made up of all the colors of the rainbow (red, orange, yellow, green, blue, indigo, and violet). Light travels as **waves**, and each color has a different wavelength:\n*   **Red light** travels in long, lazy, wide waves.\n*   **Blue and violet light** travel in short, choppy, tight waves.\n\n### 2. The Atmosphere is an Obstacle Course\nEarth's atmosphere is filled with gases (mostly nitrogen and oxygen). As sunlight travels through the atmosphere, it strikes the molecules of these gases. \n\nBecause the gas molecules are very small, they affect the colors differently based on their wavelength:\n*   The **long waves** (reds and yellows) pass through the atmosphere mostly straight, without hitting much. They are like large ocean waves that roll right over small pebbles.\n*   The **short waves** (blues and violets) strike the gas molecules and get scattered in every direction. They are like small ripples that hit a pebble and splash everywhere.\n\nBecause this blue light is being scattered in every direction by the air, when you look up, your eyes are catching that scattered blue light coming from every part of the sky.\n\n### 3. Why isn't the sky violet?\nIf violet light has an even shorter wavelength than blue light, you might wonder why the sky doesn't look purple. There are two main reasons:\n1.  **The Sun's output:** The Sun emits much more blue light than violet light.\n2.  **Human Biology:** Human eyes are much more sensitive to blue than to violet. Our eyes interpret the scatter of mixed blue and violet light simply as pale blue.\n\n---\n\n### Bonus: Why are sunsets red?\nWhen the sun is setting, it is much lower on the horizon. 
This means the sunlight has to travel through a much **thicker** layer of the atmosphere to reach your eyes. \n\nBy the time the light gets to you, the blue light has been scattered away completely. Only the long-wavelength colors—the reds, oranges, and pinks—are able to make it through that thick layer of air without being scattered away, creating the beautiful colors of a sunset.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "\"Why is the sky blue?\"\nScience/Physics (Atmospheric scattering).\nExplain the phenomenon clearly, accurately, and engagingly.\n\n    *   Sunlight (white light) is made of a spectrum of colors (ROYGBIV).\n    *   Earth's atmosphere is filled with gases (nitrogen, oxygen) and particles.\n    *   Rayleigh Scattering: Shorter wavelengths (blue/violet) scatter more easily than longer wavelengths (red/yellow) when hitting small gas molecules.\n    *   The human eye's sensitivity: Why isn't it violet?\n\n    A good scientific explanation should follow a logical flow:\n    *   *Direct Answer:* The core mechanism (Rayleigh Scattering).\n    *   *The Components:* Sunlight and the Atmosphere.\n    *   *The Mechanism:* How light interacts with gas molecules.\n    *   *The Wavelength Factor:* Comparing colors.\n    *   *The \"Wait, why not violet?\" question:* Addressing human perception.\n    *   *Bonus/Related concept:* Why sunsets are red.\n\n        *   Use the term **Rayleigh Scattering**.\n        *   Summarize: Its how sunlight interacts with the Earth's atmosphere.\n\n        *   Sunlight looks white, but it's actually a mix of all colors (the rainbow).\n        *   Each color travels as a different wavelength. Red = long/lazy waves; Blue/Violet = short/choppy waves.\n\n        *   The atmosphere is mostly Nitrogen and Oxygen.\n        *   When sunlight hits these tiny gas molecules, the light gets scattered in all directions.\n\n        *   Blue light travels in shorter, smaller waves.\n        *   Because these waves are small, they strike the gas molecules more frequently and get scattered more easily than the longer red/yellow waves.\n        *   Result: When you look up, your eyes are catching this \"scattered\" blue light coming from every direction.\n\n        *   *Technically*, violet light has an even shorter wavelength than blue, so it scatters *even more*. Why isn't the sky violet?\n        *   Two reasons: 1. 
The Sun emits more blue light than violet light. 2. Human eyes are much more sensitive to blue than violet.\n\n        *   Briefly mention sunsets to provide a complete picture.\n        *   At sunset, light travels through *more* atmosphere. The blue is scattered away completely, leaving only the long red/orange waves to reach your eyes.\n\n    *   *Tone Check:* Is it too academic? Use analogies (like waves in water or skipping stones) if needed, but keep it concise.\n    *   *Clarity:* Ensure the distinction between wavelength and scattering is clear."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 106,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 21,
    "total_tokens": 1122,
    "completion_tokens": 1101,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
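To repeat the smoke test with a different model name or prompt, it can be convenient to generate the request body instead of editing the JSON by hand. The following sketch builds a minimal chat request; json_escape and chat_payload are hypothetical helper names, the sketch assumes a single-line prompt, and it omits the optional chat_template_kwargs and skip_special_tokens fields from the example above.

```shell
# Escape backslashes and double quotes so text can sit inside a JSON string.
# Assumption: the input is a single line with no control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# usage: chat_payload MODEL PROMPT
# Prints a minimal chat completions request body with one user message.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}
```

For example, pass the result to curl with -d "$(chat_payload google/gemma-4-26B-A4B-it "Why is the sky blue?")" in place of the inline JSON.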

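Troubleshoot issues

  • If you get the message Empty reply from server, the container might not have finished downloading the model data. Check the Pod's logs again for the Connected message, which indicates that the model is ready to serve.
  • If you see Connection refused, verify that your port forwarding is active.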
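In a script, the two cases above can be told apart by curl's exit status: 52 means that the server returned an empty reply, and 7 means that curl failed to connect. The following sketch maps those codes to the hints in this section; curl_hint is a hypothetical helper name.

```shell
# Map a curl exit status to the matching troubleshooting hint.
# 52 is curl's "empty reply from server"; 7 is "failed to connect".
# curl_hint is an illustrative helper, not part of curl.
curl_hint() {
  case "$1" in
    0)  echo "ok" ;;
    7)  echo "connection refused: check that port forwarding is active" ;;
    52) echo "empty reply: the model may still be downloading; check the Pod logs" ;;
    *)  echo "curl exited with status $1" ;;
  esac
}
```

For example, run curl -s -o /dev/null http://127.0.0.1:8080/v1/models followed by curl_hint $? on the next line. The /v1/models path is an assumption based on the OpenAI-compatible API; any endpoint that the server answers works for this check.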

Observe model performance

To view the dashboards for a model's observability metrics, follow these steps:

  1. In the Cloud de Confiance console, go to the Deployed Models page.


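  2. To view details about a specific deployment, including its metrics, logs, and dashboards, click the model name in the list.

  3. In the model details page, click the Observability tab to view the following dashboards. If prompted, click Enable to enable metrics collection for the cluster.

    • The Infrastructure usage dashboard displays utilization metrics.
    • The DCGM dashboard displays DCGM metrics.
    • If you're using vLLM, the Model performance dashboard is available and displays metrics for vLLM model performance.

You can also view metrics in the vLLM dashboard integration in Cloud Monitoring. These metrics are aggregated across all vLLM deployments and have no preset filters.

vLLM exposes metrics in Prometheus format by default; you don't need to install an additional exporter. To learn how to use Google Cloud Managed Service for Prometheus to collect metrics from your model, see the vLLM observability guidance in the Cloud Monitoring documentation.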
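To see which vLLM metrics a running server emits before you set up collection, you can read the Prometheus text output directly. The following sketch lists the distinct metric names that carry the vllm: prefix; vllm_metric_names is a hypothetical helper name, and the sketch assumes that the container serves the metrics at the /metrics path on the forwarded port 8080.

```shell
# Print the distinct vLLM metric names found in Prometheus text on stdin.
# Comment lines (# HELP, # TYPE) and metrics without the vllm: prefix are skipped.
vllm_metric_names() {
  grep '^vllm:' | sed 's/[{ ].*//' | sort -u
}
```

For example, with port forwarding active, run curl -s http://127.0.0.1:8080/metrics | vllm_metric_names.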

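Clean up

To avoid incurring charges to your Cloud de Confiance account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To delete the cluster that you created in this guide, run the following command: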

gcloud container clusters delete CLUSTER_NAME \
    --location=REGION

Replace the following values:

  • REGION: the region for your cluster.
  • CLUSTER_NAME: the name of your cluster.

What's next