Serve Gemma open models using GPUs on GKE with vLLM

Autopilot

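To serve Gemma 4 large language models (LLMs) on Google Kubernetes Engine (GKE) with the vLLM framework on GPUs, you provision a GKE cluster with a supported accelerator, such as NVIDIA H100 GPUs.

To serve a Gemma 4 model, you configure a prebuilt vLLM container to load the model weights. The weights are loaded from the Cloud Storage bucket specified in the --model argument.

After the weights are loaded, the vLLM container exposes an OpenAI-compatible API endpoint for high-throughput inference.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities to serve AI/ML workloads on H100 GPU hardware.

Before reading this page, make sure that you're familiar with the following:

Objectives

This tutorial gives you a foundation in the practical deployment of LLMs for inference in a managed Kubernetes environment. It covers the following steps:

Prepare your environment with a GKE cluster in Autopilot mode.
Deploy a vLLM container to your cluster.
Use vLLM to serve the Gemma 4 model through the curl interface.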

Before you begin

In the Cloud de Confiance console, on the project selector page, select or create a Cloud de Confiance project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Go to project selector
Verify that billing is enabled for your Cloud de Confiance project.
Enable the required API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Enable the API

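Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
1. In the Cloud de Confiance console, go to the [IAM] page.
  Go to IAM
2. Select the project.
3. In the [Principal] column, find all rows that identify you or a group that you belong to. To learn which groups you belong to, contact your administrator.
4. In the rows that contain your email address, check the [Role] column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Cloud de Confiance console, go to the [IAM] page.
  Go to IAM
2. Select the project.
3. Click [Grant access].
4. In the [New principals] field, enter your user identifier. This is typically the identifier for a user in a workforce identity pool. For details, see Represent workforce pool users in IAM policies, or contact your administrator.
5. Click [Select a role], then search for the role.
6. To grant additional roles, click [Add another role] and add each additional role.
7. Click [Save].

Make sure that your project has sufficient quota for H100 GPUs. For more information, see About GPUs and Allocation quotas.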

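Prepare your environment

In this tutorial, you use kubectl and the gcloud CLI to manage resources hosted on Cloud de Confiance by S3NS. Authenticate the gcloud CLI so that it can access Cloud de Confiance by S3NS.

To set up your environment, set the default project and environment variables with the gcloud CLI: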

gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=u-france-east1
export CLUSTER_NAME=CLUSTER_NAME
export GSA_NAME=GSA_NAME
export KSA_NAME=KSA_NAME
export NAMESPACE=NAMESPACE
export PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
export MODEL_BUCKET_NAME=MODEL_BUCKET_NAME

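Replace the following values:

PROJECT_ID: your Cloud de Confiance project ID.
REGION: a region that supports H100 GPUs, such as u-france-east1. You can check which GPUs are available in which regions.
CLUSTER_NAME: the name of your cluster.
GSA_NAME: the name of the Google service account (for example, gemma-gsa).
KSA_NAME: the name of the Kubernetes ServiceAccount (for example, gemma-ksa).
NAMESPACE: the Kubernetes namespace (for example, default).
MODEL_BUCKET_NAME: the name of the Cloud Storage bucket that stores the model weights. It can match the name of the model that you choose (for example, gemma-4-26b-it).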
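
Every later command depends on these variables, so it can help to confirm that none of them is empty before you continue. The following is a minimal sketch and not part of the tutorial itself: check_tutorial_env is a hypothetical helper name, and its variable list simply mirrors the exports above.

```shell
# Hypothetical helper: report which tutorial variables are unset or empty.
# Returns 0 when all of them are set, 1 otherwise.
check_tutorial_env() {
  missing=0
  for name in PROJECT_ID REGION CLUSTER_NAME GSA_NAME KSA_NAME \
              NAMESPACE PROJECT_NUMBER MODEL_BUCKET_NAME; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

After the exports, running `check_tutorial_env && echo "environment looks complete"` prints the confirmation only when every variable has a value.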

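Create and configure Cloud de Confiance resources

Follow these steps to create the required resources.

Create a GKE cluster and node pool

You can serve Gemma on GPUs in a GKE Autopilot cluster. Autopilot clusters provide a fully managed Kubernetes experience.

Autopilot

Run the following command in the gcloud CLI: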

gcloud container clusters create-auto CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=REGION \
    --release-channel=rapid

Replace the following values:

PROJECT_ID: your Cloud de Confiance project ID.
CLUSTER_NAME: the name of your cluster.
REGION: the region where your cluster is located.

GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.

Create a Cloud Storage bucket

Run the following command in the gcloud CLI:
```
gcloud storage buckets create gs://${MODEL_BUCKET_NAME} \
  --project=${PROJECT_ID} \
  --location=${REGION} \
  --uniform-bucket-level-access
```
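This command creates a Cloud Storage bucket to store the model files that you download from Hugging Face.
Download and upload the model weights:

Get the model weights for the Gemma version that you want to serve, from Hugging Face or another official source. Organize the downloaded files into local directories, for example:
- ./gemma-4-26b-it-local/ (contains all files for the 26B IT model)
- ./gemma-4-31b-it-local/ (contains all files for the 31B IT model)
Upload the directory for the model that you want to serve to the root of its Cloud Storage bucket, which is where the deployment manifest expects to find the weights. Both of the following commands write to the bucket root, so use a separate bucket (a different MODEL_BUCKET_NAME value) for each model.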
```
# Upload files for the 26B IT model
gcloud storage cp --recursive ./gemma-4-26b-it-local/* gs://${MODEL_BUCKET_NAME}

# Upload files for the 31B IT model
gcloud storage cp --recursive ./gemma-4-31b-it-local/* gs://${MODEL_BUCKET_NAME}
```
With this command structure, the model files land at paths such as gs://${MODEL_BUCKET_NAME}/config.json.

Configure Workload Identity for Cloud Storage access

To let your Kubernetes Pods securely access the Cloud Storage bucket that contains the model weights, configure GKE Workload Identity.

Create a Google service account (GSA):

gcloud iam service-accounts create ${GSA_NAME} \
  --project=${PROJECT_ID}

Determine and export the GSA email address.

The email format depends on whether ${PROJECT_ID} is domain-scoped (contains a colon).

if [[ $PROJECT_ID == *:* ]]; then
  DOMAIN=$(echo $PROJECT_ID | cut -d: -f1)
  PROJ_NAME=$(echo $PROJECT_ID | cut -d: -f2)
  export GSA_EMAIL="${GSA_NAME}@${PROJ_NAME}.${DOMAIN}.s3ns.iam.gserviceaccount.com"
else
  export GSA_EMAIL="${GSA_NAME}@${PROJECT_ID}.s3ns.iam.gserviceaccount.com"
fi
  echo "Using GSA Email: ${GSA_EMAIL}"
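
To make the two formats concrete, the following sketch wraps the same rule in a small function and prints the result for one plain and one domain-scoped project ID. The function name gsa_email and the example IDs (my-project and example.com:my-project) are hypothetical and serve only as illustration.

```shell
# Hypothetical helper: print the GSA email for a GSA name and a project ID.
# Usage: gsa_email GSA_NAME PROJECT_ID
gsa_email() {
  case "$2" in
    *:*)
      # Domain-scoped ID of the form DOMAIN:PROJECT.
      domain=${2%%:*}   # text before the first colon
      project=${2#*:}   # text after the first colon
      printf '%s@%s.%s.s3ns.iam.gserviceaccount.com\n' "$1" "${project}" "${domain}"
      ;;
    *)
      # Plain project ID without a domain prefix.
      printf '%s@%s.s3ns.iam.gserviceaccount.com\n' "$1" "$2"
      ;;
  esac
}

gsa_email gemma-gsa my-project
gsa_email gemma-gsa example.com:my-project
```

The first call prints gemma-gsa@my-project.s3ns.iam.gserviceaccount.com and the second prints gemma-gsa@my-project.example.com.s3ns.iam.gserviceaccount.com, which shows that the project name comes before the domain in the domain-scoped form.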

Create a Kubernetes service account (KSA).

This KSA is used in the deployment manifest.
```
kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}
```
Verify that the service account was created:
```
kubectl get serviceaccounts --namespace ${NAMESPACE}
```
Annotate the KSA to link it to the GSA.

This annotation tells GKE which GSA the KSA can impersonate.
```
kubectl annotate serviceaccount ${KSA_NAME} \
  --namespace ${NAMESPACE} \
  iam.gke.io/gcp-service-account=${GSA_EMAIL}
```

Grant the KSA permission to impersonate the GSA.

This IAM binding on the GSA lets the KSA act as the GSA.

if [[ $PROJECT_ID == *:* ]]; then
  DOMAIN=$(echo $PROJECT_ID | cut -d: -f1)
  PROJ_NAME=$(echo $PROJECT_ID | cut -d: -f2)
  export WI_MEMBER="serviceAccount:${PROJ_NAME}.${DOMAIN}.s3ns.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
else
  export WI_MEMBER="serviceAccount:${PROJECT_ID}.s3ns.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
fi

gcloud iam service-accounts add-iam-policy-binding ${GSA_EMAIL} \
  --role roles/iam.workloadIdentityUser \
  --member="${WI_MEMBER}" \
  --project=${PROJECT_ID}

Grant the GSA permission to read from the bucket.

Grant the GSA the storage.objectViewer role on the bucket:

gcloud storage buckets add-iam-policy-binding gs://${MODEL_BUCKET_NAME} \
  --member="serviceAccount:${GSA_EMAIL}" \
  --role="roles/storage.objectViewer" \
  --project=${PROJECT_ID}

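Deploy a Gemma 4 model on vLLM

To deploy a Gemma 4 model, you create a Cloud Storage bucket for each model to store its weights, and then apply the Kubernetes Deployment manifest for the model size that you choose. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods distributed among the nodes in a cluster.

Steps

When you apply the manifest, the vLLM container image is pulled, NVIDIA GPUs are requested, and the vLLM inference engine starts and automatically loads the model weights from the Cloud Storage bucket.

Gemma 4 26B-A4B-it

Follow these steps to deploy the Gemma 4 26B-A4B instruction-tuned model.

Create the following vllm-4-26b-a4b-it.yaml manifest: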

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: a3-edgegpu-8g-nolssd
spec:
  priorities:
  - machineType: a3-edgegpu-8g-nolssd
    gpu:
      count: 8
      type: nvidia-h100-80gb
  nodePoolAutoCreation:
    enabled: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-26b-a4b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      serviceAccountName: KSA_NAME # Replace with your Kubernetes ServiceAccount so the Pod can use Workload Identity
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
        command: ["./entrypoint.sh"] # Use the image's entrypoint
        args:
        - "python"
        - "-m"
        - "vllm.entrypoints.api_server"
        - "--host=0.0.0.0"
        - "--port=8080"
        - "--model=gs://gemma-4-26b-it" # YOUR Cloud Storage PATH
        - "--tensor-parallel-size=1"
        - "--enable-log-requests"
        - "--enable-chunked-prefill"
        - "--enable-prefix-caching"
        - "--enable-auto-tool-choice"
        - "--generation-config=auto"
        - "--dtype=bfloat16"
        - "--max-num-seqs=16"
        - "--max-model-len=16384"
        - "--gpu-memory-utilization=0.95"
        - "--limit_mm_per_prompt.image=1"
        - "--tool-call-parser=gemma4"
        - "--reasoning-parser=gemma4"
        - "--trust-remote-code"
        ports:
        - containerPort: 8080
        env:
        - name: GOOGLE_CLOUD_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: CLOUDSDK_CORE_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: GCS_URI_ARG_KEY
          value: "model"
        - name: GCS_URI_ENV_KEY
          value: "AIP_STORAGE_URI"
        - name: LORA_ADAPTER_ARG_KEY
          value: "lora-modules"
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "1"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/compute-class: a3-edgegpu-8g-nolssd
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080

Apply the manifest:
```
kubectl apply -f vllm-4-26b-a4b-it.yaml
```
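This example limits the context window to 16K by using the vLLM option --max-model-len=16384. If you want a larger context window (up to 128K), adjust the manifest and node pool configuration to provide more GPU capacity.

Gemma 4 31B-it

Follow these steps to deploy the Gemma 4 31B instruction-tuned model.

Create the following vllm-4-31b-it.yaml manifest: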

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: a3-edgegpu-8g-nolssd
spec:
  priorities:
  - machineType: a3-edgegpu-8g-nolssd
    gpu:
      count: 8
      type: nvidia-h100-80gb
  nodePoolAutoCreation:
    enabled: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-31b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      serviceAccountName: KSA_NAME # Replace with your Kubernetes ServiceAccount so the Pod can use Workload Identity
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
        command: ["./entrypoint.sh"] # Use the image's entrypoint
        args:
        - "python"
        - "-m"
        - "vllm.entrypoints.api_server"
        - "--host=0.0.0.0"
        - "--port=8080"
        - "--model=gs://gemma-4-31b-it" # YOUR Cloud Storage PATH
        - "--tensor-parallel-size=1"
        - "--enable-log-requests"
        - "--enable-chunked-prefill"
        - "--enable-prefix-caching"
        - "--enable-auto-tool-choice"
        - "--generation-config=auto"
        - "--dtype=bfloat16"
        - "--max-model-len=16384"
        - "--max-num-seqs=16"
        - "--gpu-memory-utilization=0.95"
        - "--trust-remote-code"
        - "--tool-call-parser=gemma4"
        - "--reasoning-parser=gemma4"
        ports:
        - containerPort: 8080
        env:
        - name: GOOGLE_CLOUD_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: CLOUDSDK_CORE_UNIVERSE_DOMAIN
          value: "s3nsapis.fr"
        - name: GCS_URI_ARG_KEY
          value: "model"
        - name: GCS_URI_ENV_KEY
          value: "AIP_STORAGE_URI"
        - name: LORA_ADAPTER_ARG_KEY
          value: "lora-modules"
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "1"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/compute-class: a3-edgegpu-8g-nolssd
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080

Apply the manifest:
```
kubectl apply -f vllm-4-31b-it.yaml
```
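This example limits the context window to 16K by using the vLLM option --max-model-len=16384. If you want a larger context window (up to 128K), adjust the manifest and node pool configuration to provide more GPU capacity.

Verification

Wait for the Deployment to become available: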

kubectl wait --for=condition=Available --timeout=1800s deployment/vllm-gemma-deployment

View the logs from the running Deployment:
```
kubectl logs -f -l app=gemma-server
```
The Deployment resource downloads the Gemma model data. This process can take a few minutes. The output is similar to the following:
```
  ...
  ...
  (APIServer pid=1) INFO:     Started server process [1]
  (APIServer pid=1) INFO:     Waiting for application startup.
  (APIServer pid=1) INFO:     Application startup complete.
```

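After the Deployment is available, set up port forwarding to interact with the model.

Serve the model

In this section, you interact with the model. Make sure that the model is fully downloaded before you continue.

Set up port forwarding

Run the following command to set up port forwarding to the model: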

kubectl port-forward svc/llm-service 8080:8080 --namespace default &

The output is similar to the following:

Forwarding from 127.0.0.1:8080 -> 8080
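
Before you send requests, you might want to wait until the forwarded port actually answers. The following sketch polls a URL with curl and retries a fixed number of times. It assumes that the server exposes a /health route, as upstream vLLM servers do, so confirm this for the container image that you use. The function name wait_for_ready and its arguments are hypothetical.

```shell
# Hypothetical helper: poll URL until it returns an HTTP success status.
# Usage: wait_for_ready URL ATTEMPTS DELAY_SECONDS
wait_for_ready() {
  url=$1
  attempts=$2
  delay=$3
  i=1
  while [ "${i}" -le "${attempts}" ]; do
    # --fail makes curl exit non-zero on HTTP error responses (4xx/5xx).
    if curl --silent --fail --output /dev/null --max-time 5 "${url}"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "${delay}"
  done
  echo "not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

For example, `wait_for_ready http://127.0.0.1:8080/health 30 10` checks the forwarded port up to 30 times, pausing 10 seconds after each failed attempt.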

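Interact with the model using curl

This section shows how to run a basic smoke test to verify the deployed Gemma 4 instruction-tuned model. For other models, replace gemma-4-26B-A4B-it with the name of your model.

This example shows how to test the Gemma 4 26B instruction-tuned model with text-only input.

In a new terminal session, use curl to chat with the model: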

curl http://127.0.0.1:8080/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
        {
          "role": "user",
          "content": "Why is the sky blue?"
        }
    ],
    "chat_template_kwargs": {
         "enable_thinking": true
    },
    "skip_special_tokens": false
}'

The output is similar to the following:

{
  "id": "chatcmpl-be75ccfcbdf753d1",
  "object": "chat.completion",
  "created": 1775006187,
  "model": "google/gemma-4-26B-A4B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The short answer is a phenomenon called **Rayleigh scattering**.\n\nTo understand how it works, you have to look at three things: sunlight, the Earth's atmosphere, and how light travels.\n\n### 1. Sunlight is a Rainbow\nAlthough sunlight looks white to us, it is actually made up of all the colors of the rainbow (red, orange, yellow, green, blue, indigo, and violet). Light travels as **waves**, and each color has a different wavelength:\n*   **Red light** travels in long, lazy, wide waves.\n*   **Blue and violet light** travel in short, choppy, tight waves.\n\n### 2. The Atmosphere is an Obstacle Course\nEarth's atmosphere is filled with gases (mostly nitrogen and oxygen). As sunlight travels through the atmosphere, it strikes the molecules of these gases. \n\nBecause the gas molecules are very small, they affect the colors differently based on their wavelength:\n*   The **long waves** (reds and yellows) pass through the atmosphere mostly straight, without hitting much. They are like large ocean waves that roll right over small pebbles.\n*   The **short waves** (blues and violets) strike the gas molecules and get scattered in every direction. They are like small ripples that hit a pebble and splash everywhere.\n\nBecause this blue light is being scattered in every direction by the air, when you look up, your eyes are catching that scattered blue light coming from every part of the sky.\n\n### 3. Why isn't the sky violet?\nIf violet light has an even shorter wavelength than blue light, you might wonder why the sky doesn't look purple. There are two main reasons:\n1.  **The Sun's output:** The Sun emits much more blue light than violet light.\n2.  **Human Biology:** Human eyes are much more sensitive to blue than to violet. Our eyes interpret the scatter of mixed blue and violet light simply as pale blue.\n\n---\n\n### Bonus: Why are sunsets red?\nWhen the sun is setting, it is much lower on the horizon. This means the sunlight has to travel through a much **thicker** layer of the atmosphere to reach your eyes. \n\nBy the time the light gets to you, the blue light has been scattered away completely. Only the long-wavelength colors—the reds, oranges, and pinks—are able to make it through that thick layer of air without being scattered away, creating the beautiful colors of a sunset.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "\"Why is the sky blue?\"\nScience/Physics (Atmospheric scattering).\nExplain the phenomenon clearly, accurately, and engagingly.\n\n    *   Sunlight (white light) is made of a spectrum of colors (ROYGBIV).\n    *   Earth's atmosphere is filled with gases (nitrogen, oxygen) and particles.\n    *   Rayleigh Scattering: Shorter wavelengths (blue/violet) scatter more easily than longer wavelengths (red/yellow) when hitting small gas molecules.\n    *   The human eye's sensitivity: Why isn't it violet?\n\n    A good scientific explanation should follow a logical flow:\n    *   *Direct Answer:* The core mechanism (Rayleigh Scattering).\n    *   *The Components:* Sunlight and the Atmosphere.\n    *   *The Mechanism:* How light interacts with gas molecules.\n    *   *The Wavelength Factor:* Comparing colors.\n    *   *The \"Wait, why not violet?\" question:* Addressing human perception.\n    *   *Bonus/Related concept:* Why sunsets are red.\n\n        *   Use the term **Rayleigh Scattering**.\n        *   Summarize: Its how sunlight interacts with the Earth's atmosphere.\n\n        *   Sunlight looks white, but it's actually a mix of all colors (the rainbow).\n        *   Each color travels as a different wavelength. Red = long/lazy waves; Blue/Violet = short/choppy waves.\n\n        *   The atmosphere is mostly Nitrogen and Oxygen.\n        *   When sunlight hits these tiny gas molecules, the light gets scattered in all directions.\n\n        *   Blue light travels in shorter, smaller waves.\n        *   Because these waves are small, they strike the gas molecules more frequently and get scattered more easily than the longer red/yellow waves.\n        *   Result: When you look up, your eyes are catching this \"scattered\" blue light coming from every direction.\n\n        *   *Technically*, violet light has an even shorter wavelength than blue, so it scatters *even more*. Why isn't the sky violet?\n        *   Two reasons: 1. The Sun emits more blue light than violet light. 2. Human eyes are much more sensitive to blue than violet.\n\n        *   Briefly mention sunsets to provide a complete picture.\n        *   At sunset, light travels through *more* atmosphere. The blue is scattered away completely, leaving only the long red/orange waves to reach your eyes.\n\n    *   *Tone Check:* Is it too academic? Use analogies (like waves in water or skipping stones) if needed, but keep it concise.\n    *   *Clarity:* Ensure the distinction between wavelength and scattering is clear."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 106,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 21,
    "total_tokens": 1122,
    "completion_tokens": 1101,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
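
With thinking enabled, the response carries the final answer in choices[0].message.content and the model's reasoning in choices[0].message.reasoning, as the output above shows. As a small sketch, assuming that jq is installed on your workstation, the following hypothetical helpers read a response on standard input and print one field each.

```shell
# Hypothetical helpers; both expect a chat completion JSON document on stdin
# and assume that the jq command-line tool is available.
extract_answer() {
  # -r prints the raw string instead of a JSON-quoted value.
  jq -r '.choices[0].message.content'
}

extract_reasoning() {
  jq -r '.choices[0].message.reasoning'
}
```

To use them, add the --silent flag to the curl command above and pipe its output into extract_answer or extract_reasoning.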

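Troubleshoot issues

If you get the message Empty reply from server, the container might not have finished downloading the model data. Check the Pod's logs again for the Application startup complete message, which indicates that the model is ready to serve.
If you see Connection refused, verify that port forwarding is active.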

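Monitor model performance

To view the dashboards for a model's observability metrics, follow these steps:

In the Cloud de Confiance console, go to the [Deployed Models] page.

Go to [Deployed Models]
To view details about a specific deployment, including its metrics, logs, and dashboards, click the model name in the list.
On the model details page, click the [Observability] tab to view the following dashboards. If prompted, click [Enable] to enable metrics collection for the cluster.
- The [Infrastructure usage] dashboard displays utilization metrics.
- The [DCGM] dashboard displays DCGM metrics.
- If you're using vLLM, the [Model performance] dashboard is available and displays metrics for vLLM model performance.

You can also view metrics in the vLLM dashboard integration in Cloud Monitoring. These metrics are aggregated across all vLLM deployments, with no preset filters.

vLLM exposes metrics in Prometheus format by default, so you don't need to install an additional exporter. To learn how to use Google Cloud Managed Service for Prometheus to collect metrics from your model, see the vLLM observability guidance in the Cloud Monitoring documentation.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Cloud de Confiance account for the resources that you created in this guide, run the following command: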

gcloud container clusters delete CLUSTER_NAME \
    --location=REGION

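Replace the following values:

REGION: the region of your cluster.
CLUSTER_NAME: the name of your cluster.

What's next

Learn more about GPUs in GKE.
View the sample code on GitHub to learn how to use Gemma with vLLM on other accelerators, such as A100 and H100 GPUs.
Learn how to deploy GPU workloads in Autopilot.
Explore the vLLM GitHub repository and documentation.
Explore the Vertex AI Model Garden.
Learn how to run optimized AI/ML workloads with GKE platform orchestration capabilities.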
