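To serve Gemma 4 large language models (LLMs) on Google Kubernetes Engine (GKE) with GPUs using the vLLM framework, you must provision a GKE cluster with supported accelerators, such as NVIDIA H100 GPUs.
To serve a Gemma 4 model, a prebuilt vLLM container is configured to load the model weights from a Cloud Storage bucket, which is specified by the --model argument.
After the weights are loaded, the vLLM container exposes an OpenAI-compatible API endpoint for high-throughput inference.
This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities to serve AI/ML workloads on H100 GPU hardware.
Before reading this page, make sure that you're familiar with the following: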
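Objectives
This tutorial provides a foundation for understanding and exploring practical LLM deployment for inference in a managed Kubernetes environment.
- Prepare your environment with a GKE cluster in Autopilot mode.
- Deploy a vLLM container to your cluster.
- Use vLLM to serve the Gemma 4 model through a curl interface.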
Before you begin
- In the Cloud de Confiance console, on the project selector page, select or create a Cloud de Confiance project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Cloud de Confiance project.
- Enable the required API.
  Roles required to enable APIs
  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
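- Make sure that you have the following roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
  Check for the roles
  - In the Cloud de Confiance console, go to the IAM page.
  - Select the project.
  - In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
  - For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
  Grant the roles
  - In the Cloud de Confiance console, go to the IAM page.
  - Select the project.
  - Click Grant access.
  - In the New principals field, enter your user identifier. This is typically the identifier for a user in a workforce identity pool. For details, see "Represent workforce pool users in IAM policies", or contact your administrator.
  - Click Select a role, then search for the role.
  - To grant additional roles, click Add another role and add each additional role.
  - Click Save.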
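Prepare your environment
In this tutorial, you use kubectl and the gcloud CLI to manage resources hosted on Cloud de Confiance by S3NS. You use the gcloud CLI to authorize access to Cloud de Confiance by S3NS.
To set up your environment with the gcloud CLI, follow these steps:
Set the default environment variables in the gcloud CLI:
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=u-france-east1
export CLUSTER_NAME=CLUSTER_NAME
export GSA_NAME=GSA_NAME
export KSA_NAME=KSA_NAME
export NAMESPACE=NAMESPACE
export PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
export MODEL_BUCKET_NAME=MODEL_BUCKET_NAME
Replace the following values:
- PROJECT_ID: your Cloud de Confiance project ID.
- REGION: a region that supports H100 GPUs, such as u-france-east1. You can check which GPUs are available in which regions.
- CLUSTER_NAME: the name of your cluster.
- GSA_NAME: the name of the Google service account, for example gemma-gsa.
- KSA_NAME: the name of the Kubernetes ServiceAccount, for example gemma-ksa.
- NAMESPACE: the Kubernetes namespace, for example default.
- MODEL_BUCKET_NAME: the name of the Cloud Storage bucket that stores the model weights. It can match the name of the model that you chose, for example gemma-4-26b-it.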
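Later commands silently expand an unset variable to an empty string, so it can help to confirm that every variable is set before continuing. The following is a minimal sketch in POSIX shell; the function names missing_vars and check_tutorial_env are hypothetical helpers, not part of gcloud or kubectl.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the given variables that are
# unset or empty, separated by spaces.
missing_vars() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="${missing:+$missing }$name"
    fi
  done
  printf '%s' "$missing"
}

# Hypothetical helper: check the variables that this tutorial defines.
# Returns non-zero and prints a message to stderr if any are missing.
check_tutorial_env() {
  missing=$(missing_vars PROJECT_ID REGION CLUSTER_NAME GSA_NAME KSA_NAME NAMESPACE MODEL_BUCKET_NAME)
  if [ -n "$missing" ]; then
    echo "Unset variables: $missing" >&2
    return 1
  fi
  echo "All tutorial variables are set."
}
```

After defining the functions in your shell session, run check_tutorial_env before you create any resources.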
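Create and configure Cloud de Confiance resources
Follow these instructions to create the required resources.
Create a GKE cluster and node pool
You can serve Gemma on GPUs in a GKE Autopilot cluster. Autopilot clusters provide a fully managed Kubernetes experience.
In the gcloud CLI, run the following command: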
gcloud container clusters create-auto CLUSTER_NAME \
--project=PROJECT_ID \
--location=REGION \
--release-channel=rapid
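Replace the following values:
- PROJECT_ID: your Cloud de Confiance project ID.
- CLUSTER_NAME: the name of your cluster.
- REGION: the region where the cluster is located.
GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.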
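Create a Cloud Storage bucket
In the gcloud CLI, run the following command:
gcloud storage buckets create gs://${MODEL_BUCKET_NAME} \
  --project=${PROJECT_ID} \
  --location=${REGION} \
  --uniform-bucket-level-access
This creates a Cloud Storage bucket to store the model files that you download from Hugging Face.
Download and upload the model weights
You must obtain the weights for the Gemma model that you want to serve, for example from Hugging Face or another official source. Organize the downloaded files in a local directory. For example:
- ./gemma-4-26b-it-local/ (contains all files for the 26B IT model)
- ./gemma-4-31b-it-local/ (contains all files for the 31B IT model)
Upload the contents of the directory for the model that you want to serve to the root of the Cloud Storage bucket, which is where the Deployment manifest expects them. Both commands below write to the bucket root, so if you upload both models, use a separate bucket (a different MODEL_BUCKET_NAME value) for each one:
# Upload files for the 26B IT model
gcloud storage cp --recursive ./gemma-4-26b-it-local/* gs://${MODEL_BUCKET_NAME}
# Upload files for the 31B IT model
gcloud storage cp --recursive ./gemma-4-31b-it-local/* gs://${MODEL_BUCKET_NAME}
This command structure ensures that the model files are at paths such as gs://${MODEL_BUCKET_NAME}/config.json.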
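Because an incomplete download only surfaces later as a server startup failure, a quick local check before uploading can save time. The sketch below assumes a Hugging Face style layout: config.json comes from the path shown above, while the *.safetensors weight files are an assumption about how the weights are packaged. The function name check_model_dir is hypothetical.

```shell
#!/bin/sh
# Hypothetical preflight check: returns 0 if the directory looks like a
# complete model download, non-zero otherwise.
check_model_dir() {
  dir=$1
  if [ ! -f "$dir/config.json" ]; then
    echo "missing $dir/config.json" >&2
    return 1
  fi
  # Assumption: weights are stored as one or more *.safetensors files.
  for f in "$dir"/*.safetensors; do
    if [ -f "$f" ]; then
      return 0
    fi
  done
  echo "no *.safetensors files in $dir" >&2
  return 1
}
```

For example, `check_model_dir ./gemma-4-26b-it-local && gcloud storage cp --recursive ./gemma-4-26b-it-local/* gs://${MODEL_BUCKET_NAME}` uploads only when the check passes.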
Configure Workload Identity to access Cloud Storage
To let Kubernetes Pods securely access the Cloud Storage bucket that contains the model weights, configure GKE Workload Identity.
Create a Google service account (GSA):
gcloud iam service-accounts create ${GSA_NAME} \
  --project=${PROJECT_ID}
Determine and export the GSA email:
The email format depends on whether ${PROJECT_ID} is domain-scoped (contains a colon).
if [[ $PROJECT_ID == *:* ]]; then
  DOMAIN=$(echo $PROJECT_ID | cut -d: -f1)
  PROJ_NAME=$(echo $PROJECT_ID | cut -d: -f2)
  export GSA_EMAIL="${GSA_NAME}@${PROJ_NAME}.${DOMAIN}.s3ns.iam.gserviceaccount.com"
else
  export GSA_EMAIL="${GSA_NAME}@${PROJECT_ID}.s3ns.iam.gserviceaccount.com"
fi
echo "Using GSA Email: ${GSA_EMAIL}"
Create a Kubernetes ServiceAccount (KSA):
This KSA is used in the Deployment manifest.
kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}
Verify that the KSA was created:
kubectl get serviceaccounts --namespace ${NAMESPACE}
Annotate the KSA to link it to the GSA:
This annotation tells GKE which GSA the KSA can impersonate.
kubectl annotate serviceaccount ${KSA_NAME} \
  --namespace ${NAMESPACE} \
  iam.gke.io/gcp-service-account=${GSA_EMAIL}
Grant the KSA permission to impersonate the GSA:
This IAM binding on the GSA lets the KSA act as the GSA.
if [[ $PROJECT_ID == *:* ]]; then
  DOMAIN=$(echo $PROJECT_ID | cut -d: -f1)
  PROJ_NAME=$(echo $PROJECT_ID | cut -d: -f2)
  export WI_MEMBER="serviceAccount:${PROJ_NAME}.${DOMAIN}.s3ns.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
else
  export WI_MEMBER="serviceAccount:${PROJECT_ID}.s3ns.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
fi
gcloud iam service-accounts add-iam-policy-binding ${GSA_EMAIL} \
  --role roles/iam.workloadIdentityUser \
  --member="${WI_MEMBER}" \
  --project=${PROJECT_ID}
Grant the GSA permission to read from the bucket:
Grant the GSA the storage.objectViewer role on the bucket.
gcloud storage buckets add-iam-policy-binding gs://${MODEL_BUCKET_NAME} \
  --member="serviceAccount:${GSA_EMAIL}" \
  --role="roles/storage.objectViewer" \
  --project=${PROJECT_ID}
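The GSA email and the Workload Identity member string both parse a domain-scoped project ID in the same way. As a sketch, that logic could be factored into small POSIX shell functions, which also makes the resulting strings easy to inspect before you pass them to gcloud. The function names project_prefix, gsa_email, and wi_member are hypothetical and not part of gcloud or kubectl; the string formats are copied from the steps in this section.

```shell
#!/bin/sh
# Hypothetical helper: turn a project ID into the DNS-style prefix used in
# the identity strings, for example
# "example.com:my-project" -> "my-project.example.com".
project_prefix() {
  case $1 in
    *:*) printf '%s.%s' "${1#*:}" "${1%%:*}" ;;
    *)   printf '%s' "$1" ;;
  esac
}

# usage: gsa_email GSA_NAME PROJECT_ID
gsa_email() {
  printf '%s@%s.s3ns.iam.gserviceaccount.com' "$1" "$(project_prefix "$2")"
}

# usage: wi_member PROJECT_ID NAMESPACE KSA_NAME
wi_member() {
  printf 'serviceAccount:%s.s3ns.svc.id.goog[%s/%s]' \
    "$(project_prefix "$1")" "$2" "$3"
}
```

For example, `gsa_email "$GSA_NAME" "$PROJECT_ID"` prints the value that the tutorial exports as GSA_EMAIL.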
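Deploy a Gemma 4 model on vLLM
To deploy a Gemma 4 model, create a Cloud Storage bucket for each model to store its weights, and apply the Kubernetes Deployment manifest for the model size that you chose. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.
Procedure
Applying the manifest pulls the vLLM container image, requests NVIDIA GPUs, and starts the vLLM inference engine, which connects automatically to the model weights in the Cloud Storage bucket.
Gemma 4 26B-A4B-it
Follow these instructions to deploy the Gemma 4 26B-A4B instruction-tuned model.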
Create the following vllm-4-26b-a4b-it.yaml manifest:
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: a3-edgegpu-8g-nolssd
spec:
  priorities:
  - machineType: a3-edgegpu-8g-nolssd
    gpu:
      count: 8
      type: nvidia-h100-80gb
  nodePoolAutoCreation:
    enabled: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-26b-a4b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      serviceAccountName: KSA_NAME # Replace with the KSA that you created earlier
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
        command: ["./entrypoint.sh"] # Use the image's entrypoint
        args:
        - "python"
        - "-m"
        - "vllm.entrypoints.api_server"
        - "--host=0.0.0.0"
        - "--port=8080"
        - "--model=gs://gemma-4-26b-it" # YOUR Cloud Storage PATH
        - "--tensor-parallel-size=1"
        - "--max-num-seqs=128"
        - "--gpu-memory-utilization=0.9"
        - "--limit_mm_per_prompt.image=1"
        - "--enable-auto-tool-choice"
        - "--tool-call-parser=gemma4"
        - "--reasoning-parser=gemma4"
        ports:
        - containerPort: 8080
        env:
        - name: GOOGLE_CLOUD_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: CLOUDSDK_CORE_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: GCS_URI_ARG_KEY
          value: "model"
        - name: GCS_URI_ENV_KEY
          value: "AIP_STORAGE_URI"
        - name: LORA_ADAPTER_ARG_KEY
          value: "lora-modules"
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "1"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/compute-class: a3-edgegpu-8g-nolssd
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
Apply the manifest:
kubectl apply -f vllm-4-26b-a4b-it.yaml
To limit the context window size, you can add the vLLM option --max-model-len=16384 to cap it at 16K. To use a larger context window (up to 128K), adjust the manifest and node pool configuration to add GPU capacity.
Gemma 4 31B-it
Follow these instructions to deploy the Gemma 4 31B instruction-tuned model.
Create the following vllm-4-31b-it.yaml manifest:
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: a3-edgegpu-8g-nolssd
spec:
  priorities:
  - machineType: a3-edgegpu-8g-nolssd
    gpu:
      count: 8
      type: nvidia-h100-80gb
  nodePoolAutoCreation:
    enabled: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-31b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      serviceAccountName: KSA_NAME # Replace with the KSA that you created earlier
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
        command: ["./entrypoint.sh"] # Use the image's entrypoint
        args:
        - "python"
        - "-m"
        - "vllm.entrypoints.api_server"
        - "--host=0.0.0.0"
        - "--port=8080"
        - "--model=gs://gemma-4-31b-it" # YOUR Cloud Storage PATH
        - "--tensor-parallel-size=1"
        - "--max-num-seqs=128"
        - "--gpu-memory-utilization=0.9"
        - "--limit_mm_per_prompt.image=1"
        - "--enable-auto-tool-choice"
        - "--tool-call-parser=gemma4"
        - "--reasoning-parser=gemma4"
        ports:
        - containerPort: 8080
        env:
        - name: GOOGLE_CLOUD_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: CLOUDSDK_CORE_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: GCS_URI_ARG_KEY
          value: "model"
        - name: GCS_URI_ENV_KEY
          value: "AIP_STORAGE_URI"
        - name: LORA_ADAPTER_ARG_KEY
          value: "lora-modules"
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "1"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/compute-class: a3-edgegpu-8g-nolssd
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
Apply the manifest:
kubectl apply -f vllm-4-31b-it.yaml
To limit the context window size, you can add the vLLM option --max-model-len=16384 to cap it at 16K. To use a larger context window (up to 128K), adjust the manifest and node pool configuration to add GPU capacity.
Verify
Wait for the Deployment to become available:
kubectl wait --for=condition=Available --timeout=1800s deployment/vllm-gemma-deployment
View the logs from the running Deployment:
kubectl logs -f -l app=gemma-server
The Deployment resource downloads the Gemma model data. This process can take a few minutes. The output is similar to the following:
...
...
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
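If you would rather script the readiness check than watch the log stream, the startup line shown in the sample output can be matched with grep. This is a minimal sketch: server_ready and wait_until_ready are hypothetical helper names, POLL_SECONDS is a hypothetical setting, and the matched text is taken from the sample output, so it may differ in other vLLM versions.

```shell
#!/bin/sh
# Hypothetical helper: reads log text on stdin and succeeds only if the
# startup-complete line is present.
server_ready() {
  grep -q 'Application startup complete\.'
}

# Hypothetical helper: run COMMAND up to MAX_ATTEMPTS times, feeding its
# output to server_ready, and pause POLL_SECONDS (default 10) between tries.
# usage: wait_until_ready MAX_ATTEMPTS COMMAND [ARGS...]
wait_until_ready() {
  attempts=$1
  shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" | server_ready; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_SECONDS:-10}"
  done
  return 1
}
```

For example, `wait_until_ready 60 kubectl logs -l app=gemma-server --tail=-1` polls the Pod logs for up to about ten minutes. The --tail=-1 flag matters here because kubectl logs returns only the last 10 lines by default when you use a label selector.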
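After the Deployment is available, set up port forwarding to interact with the model.
Serve the model
In this section, you interact with the model. Make sure that the model is fully downloaded before you continue.
Set up port forwarding
Run the following command to set up port forwarding to the model:
kubectl port-forward svc/llm-service 8080:8080 --namespace default &
The output is similar to the following:
Forwarding from 127.0.0.1:8080 -> 8080
Interact with the model using curl
This section shows how to run a basic smoke test to verify the deployed Gemma 4 instruction-tuned model. For other models, replace gemma-4-26B-A4B-it with the name of the corresponding model.
This example shows how to test the Gemma 4 26B instruction-tuned model with text-only input.
In a new terminal session, use curl to chat with the model: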
curl http://127.0.0.1:8080/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-26B-A4B-it",
"messages": [
{
"role": "user",
"content": "Why is the sky blue?"
}
],
"chat_template_kwargs": {
"enable_thinking": true
},
"skip_special_tokens": false
}'
The output looks similar to the following:
{
"id": "chatcmpl-be75ccfcbdf753d1",
"object": "chat.completion",
"created": 1775006187,
"model": "google/gemma-4-26B-A4B-it",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The short answer is a phenomenon called **Rayleigh scattering**.\n\nTo understand how it works, you have to look at three things: sunlight, the Earth's atmosphere, and how light travels.\n\n### 1. Sunlight is a Rainbow\nAlthough sunlight looks white to us, it is actually made up of all the colors of the rainbow (red, orange, yellow, green, blue, indigo, and violet). Light travels as **waves**, and each color has a different wavelength:\n* **Red light** travels in long, lazy, wide waves.\n* **Blue and violet light** travel in short, choppy, tight waves.\n\n### 2. The Atmosphere is an Obstacle Course\nEarth's atmosphere is filled with gases (mostly nitrogen and oxygen). As sunlight travels through the atmosphere, it strikes the molecules of these gases. \n\nBecause the gas molecules are very small, they affect the colors differently based on their wavelength:\n* The **long waves** (reds and yellows) pass through the atmosphere mostly straight, without hitting much. They are like large ocean waves that roll right over small pebbles.\n* The **short waves** (blues and violets) strike the gas molecules and get scattered in every direction. They are like small ripples that hit a pebble and splash everywhere.\n\nBecause this blue light is being scattered in every direction by the air, when you look up, your eyes are catching that scattered blue light coming from every part of the sky.\n\n### 3. Why isn't the sky violet?\nIf violet light has an even shorter wavelength than blue light, you might wonder why the sky doesn't look purple. There are two main reasons:\n1. **The Sun's output:** The Sun emits much more blue light than violet light.\n2. **Human Biology:** Human eyes are much more sensitive to blue than to violet. Our eyes interpret the scatter of mixed blue and violet light simply as pale blue.\n\n---\n\n### Bonus: Why are sunsets red?\nWhen the sun is setting, it is much lower on the horizon. 
This means the sunlight has to travel through a much **thicker** layer of the atmosphere to reach your eyes. \n\nBy the time the light gets to you, the blue light has been scattered away completely. Only the long-wavelength colors—the reds, oranges, and pinks—are able to make it through that thick layer of air without being scattered away, creating the beautiful colors of a sunset.",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": [],
"reasoning": "\"Why is the sky blue?\"\nScience/Physics (Atmospheric scattering).\nExplain the phenomenon clearly, accurately, and engagingly.\n\n * Sunlight (white light) is made of a spectrum of colors (ROYGBIV).\n * Earth's atmosphere is filled with gases (nitrogen, oxygen) and particles.\n * Rayleigh Scattering: Shorter wavelengths (blue/violet) scatter more easily than longer wavelengths (red/yellow) when hitting small gas molecules.\n * The human eye's sensitivity: Why isn't it violet?\n\n A good scientific explanation should follow a logical flow:\n * *Direct Answer:* The core mechanism (Rayleigh Scattering).\n * *The Components:* Sunlight and the Atmosphere.\n * *The Mechanism:* How light interacts with gas molecules.\n * *The Wavelength Factor:* Comparing colors.\n * *The \"Wait, why not violet?\" question:* Addressing human perception.\n * *Bonus/Related concept:* Why sunsets are red.\n\n * Use the term **Rayleigh Scattering**.\n * Summarize: Its how sunlight interacts with the Earth's atmosphere.\n\n * Sunlight looks white, but it's actually a mix of all colors (the rainbow).\n * Each color travels as a different wavelength. Red = long/lazy waves; Blue/Violet = short/choppy waves.\n\n * The atmosphere is mostly Nitrogen and Oxygen.\n * When sunlight hits these tiny gas molecules, the light gets scattered in all directions.\n\n * Blue light travels in shorter, smaller waves.\n * Because these waves are small, they strike the gas molecules more frequently and get scattered more easily than the longer red/yellow waves.\n * Result: When you look up, your eyes are catching this \"scattered\" blue light coming from every direction.\n\n * *Technically*, violet light has an even shorter wavelength than blue, so it scatters *even more*. Why isn't the sky violet?\n * Two reasons: 1. The Sun emits more blue light than violet light. 2. 
Human eyes are much more sensitive to blue than violet.\n\n * Briefly mention sunsets to provide a complete picture.\n * At sunset, light travels through *more* atmosphere. The blue is scattered away completely, leaving only the long red/orange waves to reach your eyes.\n\n * *Tone Check:* Is it too academic? Use analogies (like waves in water or skipping stones) if needed, but keep it concise.\n * *Clarity:* Ensure the distinction between wavelength and scattering is clear."
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": 106,
"token_ids": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 21,
"total_tokens": 1122,
"completion_tokens": 1101,
"prompt_tokens_details": null
},
"prompt_logprobs": null,
"prompt_token_ids": null,
"kv_transfer_params": null
}
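To try other prompts or the 31B model without hand-editing JSON, the request body can be generated by a small helper. The sketch below mirrors the fields used in the curl example above. The function names json_escape and chat_body are hypothetical, and the escaping covers only backslashes and double quotes, so prompts that contain newlines or control characters need a proper JSON tool instead.

```shell
#!/bin/sh
# Hypothetical helper: escape backslashes and double quotes so that the text
# can be embedded in a JSON string. Newlines are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build the chat completions request body.
# usage: chat_body MODEL PROMPT [ENABLE_THINKING]   (ENABLE_THINKING: true|false)
chat_body() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "chat_template_kwargs": {"enable_thinking": %s}, "skip_special_tokens": false}' \
    "$(json_escape "$1")" "$(json_escape "$2")" "${3:-true}"
}
```

For example: `curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d "$(chat_body google/gemma-4-26B-A4B-it 'Why is the sky blue?')"`.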
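Troubleshoot issues
- If you get the message Empty reply from server, the container might not have finished downloading the model data. Check the Pod's logs again for the Application startup complete message, which indicates that the model is ready to serve.
- If you see Connection refused, verify that port forwarding is active.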
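When the smoke test runs from a script, the two failures above can be told apart by the curl exit status: curl exits with 7 when it can't connect to the host and with 52 when the server returns an empty reply. The following sketch maps those codes to the checks in this section; explain_curl_exit is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: translate a curl exit status into a next step.
# usage: explain_curl_exit STATUS
explain_curl_exit() {
  case $1 in
    0)  echo "Request succeeded." ;;
    7)  echo "Connection refused: confirm that kubectl port-forward is still running." ;;
    52) echo "Empty reply from server: the model may still be loading; check the Pod logs." ;;
    *)  echo "curl exited with status $1; see the curl manual for exit codes." ;;
  esac
}
```

Call it right after the request, for example `curl -s http://127.0.0.1:8080/v1/chat/completions ...; explain_curl_exit $?`.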
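Observe model performance
To view the dashboards for a model's observability metrics, follow these steps:
In the Cloud de Confiance console, go to the Deployed Models page.
To view details about a specific deployment, including its metrics, logs, and dashboards, click the model name in the list.
On the model details page, click the Observability tab to view the following dashboards. If prompted, click Enable to enable metrics collection for the cluster.
- The Infrastructure usage dashboard displays utilization metrics.
- The DCGM dashboard displays DCGM metrics.
- If you're using vLLM, the Model performance dashboard is available and displays metrics for vLLM model performance.
You can also view metrics in the vLLM dashboard integration in Cloud Monitoring. These metrics are aggregated across all vLLM deployments, with no preset filters.
vLLM exposes metrics in Prometheus format by default, so you don't need to install an additional exporter. To learn how to collect model metrics with Google Cloud Managed Service for Prometheus, see the vLLM observability guide in the Cloud Monitoring documentation.
Clean up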
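To avoid incurring charges to your Cloud de Confiance account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Cloud de Confiance account for the resources that you created in this guide, run the following command: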
gcloud container clusters delete CLUSTER_NAME \
--location=REGION
Replace the following values:
- REGION: the region where the cluster is located.
- CLUSTER_NAME: the name of your cluster.
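What's next
- Learn more about GPUs in GKE.
- To learn how to use Gemma with vLLM on other accelerators, including A100 and H100 GPUs, view the sample code in GitHub.
- Learn how to deploy GPU workloads in Autopilot.
- Explore the vLLM GitHub repository and documentation.
- Explore the Vertex AI Model Garden.
- Learn how to run optimized AI/ML workloads with GKE platform orchestration capabilities.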