Train a model with GPUs on GKE Standard mode
This quickstart tutorial shows you how to train a model with GPUs in Google Kubernetes Engine (GKE) and store the predictions in Cloud Storage. This tutorial uses a TensorFlow model and GKE Standard clusters. You can also run these workloads on Autopilot clusters with fewer setup steps. For instructions, see Train a model with GPUs on GKE Autopilot mode.
This document is intended for GKE administrators who have existing Standard clusters and want to run GPU workloads for the first time.
Before you begin
In the Cloud de Confiance console, on the project selector page, select or create a Cloud de Confiance project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Cloud de Confiance project.
Enable the Kubernetes Engine and Cloud Storage APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
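You can also enable both APIs from the command line once Cloud Shell is active (next step). A minimal sketch, assuming the standard service names for the Kubernetes Engine and Cloud Storage APIs:

gcloud services enable container.googleapis.com storage.googleapis.com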
In the Cloud de Confiance console, activate Cloud Shell.
At the bottom of the Cloud de Confiance console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Clone the sample repository
In Cloud Shell, run the following command:
git clone https://github.com/GoogleCloudPlatform/ai-on-gke/ ai-on-gke
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu
Create a Standard mode cluster and a GPU node pool
Use Cloud Shell to do the following:
Create a Standard cluster that uses Workload Identity Federation for GKE and installs the Cloud Storage FUSE driver:
gcloud container clusters create gke-gpu-cluster \
    --addons GcsFuseCsiDriver \
    --location=us-central1 \
    --num-nodes=1 \
    --workload-pool=PROJECT_ID.s3ns.svc.id.goog

Replace PROJECT_ID with your Cloud de Confiance project ID.

Cluster creation might take several minutes.
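Optionally, verify that the cluster was created with the workload pool that you specified. A quick check using standard gcloud output formatting:

gcloud container clusters describe gke-gpu-cluster \
    --location=us-central1 \
    --format="value(workloadIdentityConfig.workloadPool)"

The command prints PROJECT_ID.s3ns.svc.id.goog if Workload Identity Federation for GKE is enabled.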
Create a GPU node pool:
gcloud container node-pools create gke-gpu-pool-1 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
    --machine-type=n1-standard-16 \
    --num-nodes=1 \
    --location=us-central1 \
    --cluster=gke-gpu-cluster
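Optionally, confirm that the GPU node registered its accelerator. A minimal sketch, assuming cluster creation already configured your kubectl context in Cloud Shell (otherwise, run gcloud container clusters get-credentials gke-gpu-cluster --location=us-central1 first):

kubectl describe nodes -l cloud.google.com/gke-nodepool=gke-gpu-pool-1 | grep "nvidia.com/gpu"

Each GPU node should report nvidia.com/gpu in its capacity and allocatable resources.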
Create a Cloud Storage bucket
In the Cloud de Confiance console, go to the Create a bucket page.
In the Name your bucket field, enter the following name:
PROJECT_ID-gke-gpu-bucket
Click Continue.
For Location type, select Region.
In the Region list, select us-central1 (Iowa), and click Continue.
In the Choose a storage class for your data section, click Continue.
In the Choose how to control access to objects section, for Access control, select Uniform.
Click Create.
In the Public access will be prevented dialog, ensure that the Enforce public access prevention on this bucket checkbox is selected, and click Confirm.
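Alternatively, you can create an equivalent bucket from Cloud Shell. A sketch using the gcloud storage CLI (flag spellings are an assumption; check gcloud storage buckets create --help in your environment):

gcloud storage buckets create gs://PROJECT_ID-gke-gpu-bucket \
    --location=us-central1 \
    --uniform-bucket-level-access \
    --public-access-prevention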
Configure your cluster to access the bucket using Workload Identity Federation for GKE
To give your cluster access to the Cloud Storage bucket, do the following:
- Create a Cloud de Confiance service account.
- Create a Kubernetes ServiceAccount in your cluster.
- Bind the Kubernetes ServiceAccount to the Cloud de Confiance service account.
Create a Cloud de Confiance service account
In the Cloud de Confiance console, go to the Create service account page.
In the Service account ID field, enter gke-ai-sa.
Click Create and continue.
In the Role list, select the Cloud Storage > Storage Insights Collector Service role.
Click Add another role.
In the Select a role list, select the Cloud Storage > Storage Object Admin role.
Click Continue, and then click Done.
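Alternatively, the service account and its role grants can be created from Cloud Shell. A minimal sketch; the role IDs correspond to the console role names above, and the service account address follows the format used elsewhere in this guide:

gcloud iam service-accounts create gke-ai-sa

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com" \
    --role="roles/storage.insightsCollectorService"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"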
Create a Kubernetes ServiceAccount in your cluster
In Cloud Shell, do the following:
Create a Kubernetes namespace:

kubectl create namespace gke-ai-namespace

Create a Kubernetes ServiceAccount in the namespace:
kubectl create serviceaccount gpu-k8s-sa --namespace=gke-ai-namespace
Bind the Kubernetes ServiceAccount to the Cloud de Confiance service account
In Cloud Shell, run the following commands:
Add an IAM binding to the Cloud de Confiance service account:

gcloud iam service-accounts add-iam-policy-binding gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.s3ns.svc.id.goog[gke-ai-namespace/gpu-k8s-sa]"

The --member flag provides the full identity of the Kubernetes ServiceAccount in Cloud de Confiance.

Annotate the Kubernetes ServiceAccount:

kubectl annotate serviceaccount gpu-k8s-sa \
    --namespace gke-ai-namespace \
    iam.gke.io/gcp-service-account=gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com
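To confirm that both halves of the binding are in place, you can read back the annotation and the IAM policy (the backslash-escaped dots are kubectl's usual jsonpath syntax for keys that contain periods):

kubectl get serviceaccount gpu-k8s-sa -n gke-ai-namespace \
    -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'

gcloud iam service-accounts get-iam-policy gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com

The first command prints the Cloud de Confiance service account email; the second lists the roles/iam.workloadIdentityUser binding that you added.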
Verify that Pods can access the Cloud Storage bucket
In Cloud Shell, create the following environment variables:

export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

Replace PROJECT_ID with your Cloud de Confiance project ID.

Create a Pod that has a TensorFlow container:
envsubst < src/gke-config/standard-tensorflow-bash.yaml | kubectl --namespace=gke-ai-namespace apply -f -

This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.
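If envsubst is new to you, it replaces $VAR references in its input with the values of the corresponding shell environment variables. A minimal illustration using the variables that you exported:

echo 'serviceAccountName: $K8S_SA_NAME' | envsubst

This prints serviceAccountName: gpu-k8s-sa.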
Create a sample file in the bucket:

touch sample-file
gcloud storage cp sample-file gs://PROJECT_ID-gke-gpu-bucket

Wait for your Pod to become ready:

kubectl wait --for=condition=Ready pod/test-tensorflow-pod -n=gke-ai-namespace --timeout=180s

When the Pod is ready, the output is the following:

pod/test-tensorflow-pod condition met

Open a shell in the TensorFlow container:

kubectl -n gke-ai-namespace exec --stdin --tty test-tensorflow-pod --container tensorflow -- /bin/bash

Try to read the sample file that you created:

ls /data

The output shows the sample file, because the bucket is mounted at /data through the Cloud Storage FUSE driver.

Verify that TensorFlow can detect the GPU attached to the Pod:

python3 -m pip install 'tensorflow[and-cuda]'
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

The output shows the GPU attached to the Pod, similar to the following:

... PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

Exit the container:

exit

Delete the sample Pod:

kubectl delete -f src/gke-config/standard-tensorflow-bash.yaml \
    --namespace=gke-ai-namespace
Train and predict using the MNIST dataset
In this section, you run a training workload on the MNIST example dataset.
Copy the example data to the Cloud Storage bucket:
gcloud storage cp src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Create the following environment variables:

export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

Review the training Job manifest at src/gke-config/standard-tf-mnist-train.yaml.
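The repository contains the actual manifest; purely as an illustration, a training Job that requests a GPU and mounts the bucket through the Cloud Storage FUSE CSI driver typically looks like the following sketch. The image and command here are placeholder assumptions, not the repository's values, and the $K8S_SA_NAME and $BUCKET_NAME references are the ones that envsubst fills in when you deploy the Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-training-job
spec:
  template:
    metadata:
      annotations:
        gke-gcsfuse/volumes: "true"        # enables the Cloud Storage FUSE sidecar
    spec:
      serviceAccountName: $K8S_SA_NAME     # the ServiceAccount you bound earlier
      restartPolicy: OnFailure
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu   # placeholder image
        command: ["python3", "train.py"]          # placeholder entrypoint
        resources:
          limits:
            nvidia.com/gpu: 1              # schedules the Pod onto the GPU node pool
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data                 # the bucket appears here, as in the earlier test Pod
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: $BUCKET_NAME       # replaced with PROJECT_ID-gke-gpu-bucket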
Deploy the training Job:
envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n gke-ai-namespace apply -f -

This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status:

kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-training-job --timeout=180s

The output is similar to the following:

job.batch/mnist-training-job condition met

Check the logs from the TensorFlow container:
kubectl logs -f jobs/mnist-training-job -c tensorflow -n gke-ai-namespace

The output shows that the following events occur:
- Install required Python packages
- Download the MNIST dataset
- Train the model using a GPU
- Save the model
- Evaluate the model
...
Epoch 12/12
927/938 [============================>.] - ETA: 0s - loss: 0.0188 - accuracy: 0.9954
Learning rate for epoch 12 is 9.999999747378752e-06
938/938 [==============================] - 5s 6ms/step - loss: 0.0187 - accuracy: 0.9954 - lr: 1.0000e-05
157/157 [==============================] - 1s 4ms/step - loss: 0.0424 - accuracy: 0.9861
Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446
Training finished. Model saved

Delete the training workload:
kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-train.yaml
Deploy an inference workload
In this section, you deploy an inference workload that takes a sample dataset as input and returns predictions.
Copy the images for prediction to the bucket:
gcloud storage cp data/mnist_predict gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Review the inference workload manifest at src/gke-config/standard-tf-mnist-batch-predict.yaml. It references the same $K8S_SA_NAME and $BUCKET_NAME variables, which envsubst substitutes in the next step.
Deploy the inference workload:
envsubst < src/gke-config/standard-tf-mnist-batch-predict.yaml | kubectl -n gke-ai-namespace apply -f -

This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status:

kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-batch-prediction-job --timeout=180s

The output is similar to the following:

job.batch/mnist-batch-prediction-job condition met

Check the logs from the TensorFlow container:
kubectl logs -f jobs/mnist-batch-prediction-job -c tensorflow -n gke-ai-namespace

The output is the prediction for each image and the model's confidence in the prediction, similar to the following:
Found 10 files belonging to 1 classes.
1/1 [==============================] - 2s 2s/step
The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
The image /data/mnist_predict/1.png is the number 1 with a 99.99 percent confidence.
The image /data/mnist_predict/2.png is the number 2 with a 100.00 percent confidence.
The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
The image /data/mnist_predict/4.png is the number 4 with a 100.00 percent confidence.
The image /data/mnist_predict/5.png is the number 5 with a 100.00 percent confidence.
The image /data/mnist_predict/6.png is the number 6 with a 99.97 percent confidence.
The image /data/mnist_predict/7.png is the number 7 with a 100.00 percent confidence.
The image /data/mnist_predict/8.png is the number 8 with a 100.00 percent confidence.
The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
Clean up
To avoid incurring charges to your Cloud de Confiance account for the resources that you created in this guide, do one of the following:
- Keep the GKE cluster: Delete the Kubernetes resources in the cluster and the Cloud de Confiance resources
- Keep the Cloud de Confiance project: Delete the GKE cluster and the Cloud de Confiance resources
- Delete the project
Delete the Kubernetes resources in the cluster and the Cloud de Confiance resources
Delete the Kubernetes namespace and the workloads that you deployed:

kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-batch-predict.yaml
kubectl delete namespace gke-ai-namespace

Delete the Cloud Storage bucket:
Go to the Buckets page.
Select the checkbox for PROJECT_ID-gke-gpu-bucket.
Click Delete.
To confirm deletion, type DELETE and click Delete.
Delete the Cloud de Confiance service account:
Go to the Service accounts page.
Select your project.
Select the checkbox for gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com.
Click Delete.
To confirm deletion, click Delete.
Delete the GKE cluster and the Cloud de Confiance resources
Delete the GKE cluster:
Go to the Clusters page.
Select the checkbox for gke-gpu-cluster.
Click Delete.
To confirm deletion, type gke-gpu-cluster and click Delete.
Delete the Cloud Storage bucket:
Go to the Buckets page.
Select the checkbox for PROJECT_ID-gke-gpu-bucket.
Click Delete.
To confirm deletion, type DELETE and click Delete.
Delete the Cloud de Confiance service account:
Go to the Service accounts page.
Select your project.
Select the checkbox for gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com.
Click Delete.
To confirm deletion, click Delete.
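Alternatively, you can delete all three resources from Cloud Shell. A sketch of the equivalent commands (deleting the bucket recursively removes its contents first):

gcloud container clusters delete gke-gpu-cluster --location=us-central1

gcloud storage rm --recursive gs://PROJECT_ID-gke-gpu-bucket

gcloud iam service-accounts delete gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com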
Delete the project
- In the Cloud de Confiance console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.