Train a model with GPUs on GKE Standard mode
This quickstart tutorial shows you how to train a model with GPUs in Google Kubernetes Engine (GKE) and store the predictions in Cloud Storage. This tutorial uses a TensorFlow model and GKE Standard clusters. You can also run these workloads on Autopilot clusters with fewer setup steps. For instructions, see Train a model with GPUs on GKE Autopilot mode.
This document is intended for GKE administrators who have existing Standard clusters and want to run GPU workloads for the first time.
Before you begin
In the Cloud de Confiance console, on the project selector page, select or create a Cloud de Confiance project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Cloud de Confiance project.
Enable the Kubernetes Engine and Cloud Storage APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
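You can also enable both APIs from the command line once Cloud Shell is active (next step). A minimal sketch, assuming the standard service names for the Kubernetes Engine and Cloud Storage APIs:

gcloud services enable container.googleapis.com storage.googleapis.com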
In the Cloud de Confiance console, activate Cloud Shell.
At the bottom of the Cloud de Confiance console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Clone the sample repository
In Cloud Shell, run the following command:
git clone https://github.com/GoogleCloudPlatform/ai-on-gke/ ai-on-gke
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu
Create a Standard mode cluster and a GPU node pool
Use Cloud Shell to do the following:
Create a Standard cluster that uses Workload Identity Federation for GKE and installs the Cloud Storage FUSE driver:
gcloud container clusters create gke-gpu-cluster \
    --addons GcsFuseCsiDriver \
    --location=us-central1 \
    --num-nodes=1 \
    --workload-pool=PROJECT_ID.s3ns.svc.id.goog

Replace PROJECT_ID with your Cloud de Confiance project ID.

Cluster creation might take several minutes.
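Optionally, verify that the cluster was created with the workload pool that you specified. A quick check using standard gcloud output formatting:

gcloud container clusters describe gke-gpu-cluster \
    --location=us-central1 \
    --format="value(workloadIdentityConfig.workloadPool)"

The command prints PROJECT_ID.s3ns.svc.id.goog if Workload Identity Federation for GKE is enabled.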
Create a GPU node pool:
gcloud container node-pools create gke-gpu-pool-1 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
    --machine-type=n1-standard-16 \
    --num-nodes=1 \
    --location=us-central1 \
    --cluster=gke-gpu-cluster
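Optionally, confirm that the GPU node registered its accelerator. A minimal sketch, assuming cluster creation already configured your kubectl context in Cloud Shell (otherwise, run gcloud container clusters get-credentials gke-gpu-cluster --location=us-central1 first):

kubectl describe nodes -l cloud.google.com/gke-nodepool=gke-gpu-pool-1 | grep "nvidia.com/gpu"

Each GPU node should report nvidia.com/gpu in its capacity and allocatable resources.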
Create a Cloud Storage bucket
In the Cloud de Confiance console, go to the Create a bucket page.
In the Name your bucket field, enter the following name:
PROJECT_ID-gke-gpu-bucket
Click Continue.
For Location type, select Region.
In the Region list, select us-central1 (Iowa), and click Continue.
In the Choose a storage class for your data section, click Continue.
In the Choose how to control access to objects section, for Access control, select Uniform.
Click Create.
In the Public access will be prevented dialog, ensure that the Enforce public access prevention on this bucket checkbox is selected, and click Confirm.
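Alternatively, you can create an equivalent bucket from Cloud Shell. A sketch using the gcloud storage CLI (flag spellings are an assumption; check gcloud storage buckets create --help in your environment):

gcloud storage buckets create gs://PROJECT_ID-gke-gpu-bucket \
    --location=us-central1 \
    --uniform-bucket-level-access \
    --public-access-prevention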
Configure your cluster to access the bucket using Workload Identity Federation for GKE
To give your cluster access to the Cloud Storage bucket, do the following:
- Create a Cloud de Confiance service account.
- Create a Kubernetes ServiceAccount in your cluster.
- Bind the Kubernetes ServiceAccount to the Cloud de Confiance service account.
Create a Cloud de Confiance service account
In the Cloud de Confiance console, go to the Create service account page.
In the Service account ID field, enter gke-ai-sa.
Click Create and continue.
In the Role list, select the Cloud Storage > Storage Insights Collector Service role.
Click Add another role.
In the Select a role list, select the Cloud Storage > Storage Object Admin role.
Click Continue, and then click Done.
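Alternatively, the service account and its role grants can be created from Cloud Shell. A minimal sketch; the role IDs correspond to the console role names above, and the service account address follows the format used elsewhere in this guide:

gcloud iam service-accounts create gke-ai-sa

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com" \
    --role="roles/storage.insightsCollectorService"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"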
Create a Kubernetes ServiceAccount in your cluster
In Cloud Shell, do the following:
Create a Kubernetes namespace:

kubectl create namespace gke-ai-namespace

Create a Kubernetes ServiceAccount in the namespace:
kubectl create serviceaccount gpu-k8s-sa --namespace=gke-ai-namespace
Bind the Kubernetes ServiceAccount to the Cloud de Confiance service account
In Cloud Shell, run the following commands:
Add an IAM binding to the Cloud de Confiance service account:

gcloud iam service-accounts add-iam-policy-binding gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.s3ns.svc.id.goog[gke-ai-namespace/gpu-k8s-sa]"

The --member flag provides the full identity of the Kubernetes ServiceAccount in Cloud de Confiance.

Annotate the Kubernetes ServiceAccount:

kubectl annotate serviceaccount gpu-k8s-sa \
    --namespace gke-ai-namespace \
    iam.gke.io/gcp-service-account=gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com
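To confirm that both halves of the binding are in place, you can read back the annotation and the IAM policy (the backslash-escaped dots are kubectl's usual jsonpath syntax for keys that contain periods):

kubectl get serviceaccount gpu-k8s-sa -n gke-ai-namespace \
    -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'

gcloud iam service-accounts get-iam-policy gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com

The first command prints the Cloud de Confiance service account email; the second lists the roles/iam.workloadIdentityUser binding that you added.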
Verify that Pods can access the Cloud Storage bucket
In Cloud Shell, create the following environment variables:

export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

Replace PROJECT_ID with your Cloud de Confiance project ID.

Create a Pod that has a TensorFlow container:
envsubst < src/gke-config/standard-tensorflow-bash.yaml | kubectl --namespace=gke-ai-namespace apply -f -

This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.
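If envsubst is new to you, it replaces $VAR references in its input with the values of the corresponding shell environment variables. A minimal illustration using the variables that you exported:

echo 'serviceAccountName: $K8S_SA_NAME' | envsubst

This prints serviceAccountName: gpu-k8s-sa.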
Create a sample file in the bucket:

touch sample-file
gcloud storage cp sample-file gs://PROJECT_ID-gke-gpu-bucket

Wait for your Pod to become ready:

kubectl wait --for=condition=Ready pod/test-tensorflow-pod -n=gke-ai-namespace --timeout=180s

When the Pod is ready, the output is the following:

pod/test-tensorflow-pod condition met

Open a shell in the TensorFlow container:

kubectl -n gke-ai-namespace exec --stdin --tty test-tensorflow-pod --container tensorflow -- /bin/bash

Try to read the sample file that you created:

ls /data

The output shows the sample file, because the bucket is mounted at /data through the Cloud Storage FUSE driver.

Verify that TensorFlow can detect the GPU attached to the Pod:

python3 -m pip install 'tensorflow[and-cuda]'
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

The output shows the GPU attached to the Pod, similar to the following:

... PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

Exit the container:

exit

Delete the sample Pod:

kubectl delete -f src/gke-config/standard-tensorflow-bash.yaml \
    --namespace=gke-ai-namespace
Train and predict using the MNIST dataset
In this section, you run a training workload on the MNIST example dataset.
Copy the example data to the Cloud Storage bucket:
gcloud storage cp src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Create the following environment variables:

export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket

Review the training Job manifest at src/gke-config/standard-tf-mnist-train.yaml.
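The repository contains the actual manifest; purely as an illustration, a training Job that requests a GPU and mounts the bucket through the Cloud Storage FUSE CSI driver typically looks like the following sketch. The image and command here are placeholder assumptions, not the repository's values, and the $K8S_SA_NAME and $BUCKET_NAME references are the ones that envsubst fills in when you deploy the Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-training-job
spec:
  template:
    metadata:
      annotations:
        gke-gcsfuse/volumes: "true"        # enables the Cloud Storage FUSE sidecar
    spec:
      serviceAccountName: $K8S_SA_NAME     # the ServiceAccount you bound earlier
      restartPolicy: OnFailure
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu   # placeholder image
        command: ["python3", "train.py"]          # placeholder entrypoint
        resources:
          limits:
            nvidia.com/gpu: 1              # schedules the Pod onto the GPU node pool
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data                 # the bucket appears here, as in the earlier test Pod
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: $BUCKET_NAME       # replaced with PROJECT_ID-gke-gpu-bucket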
Deploy the training Job:
envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n gke-ai-namespace apply -f -

This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status:

kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-training-job --timeout=180s

The output is similar to the following:

job.batch/mnist-training-job condition met

Check the logs from the TensorFlow container:
kubectl logs -f jobs/mnist-training-job -c tensorflow -n gke-ai-namespace

The output shows that the following events occur:
- Install required Python packages
- Download the MNIST dataset
- Train the model using a GPU
- Save the model
- Evaluate the model
...
Epoch 12/12
927/938 [============================>.] - ETA: 0s - loss: 0.0188 - accuracy: 0.9954
Learning rate for epoch 12 is 9.999999747378752e-06
938/938 [==============================] - 5s 6ms/step - loss: 0.0187 - accuracy: 0.9954 - lr: 1.0000e-05
157/157 [==============================] - 1s 4ms/step - loss: 0.0424 - accuracy: 0.9861
Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446
Training finished. Model saved

Delete the training workload:
kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-train.yaml
Deploy an inference workload
In this section, you deploy an inference workload that takes a sample dataset as input and returns predictions.
Copy the images for prediction to the bucket:
gcloud storage cp data/mnist_predict gs://PROJECT_ID-gke-gpu-bucket/ --recursive

Review the inference workload manifest at src/gke-config/standard-tf-mnist-batch-predict.yaml. It references the same $K8S_SA_NAME and $BUCKET_NAME variables, which envsubst substitutes in the next step.
Deploy the inference workload:
envsubst < src/gke-config/standard-tf-mnist-batch-predict.yaml | kubectl -n gke-ai-namespace apply -f -

This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status:

kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-batch-prediction-job --timeout=180s

The output is similar to the following:

job.batch/mnist-batch-prediction-job condition met

Check the logs from the TensorFlow container:
kubectl logs -f jobs/mnist-batch-prediction-job -c tensorflow -n gke-ai-namespace

The output is the prediction for each image and the model's confidence in the prediction, similar to the following:
Found 10 files belonging to 1 classes.
1/1 [==============================] - 2s 2s/step
The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
The image /data/mnist_predict/1.png is the number 1 with a 99.99 percent confidence.
The image /data/mnist_predict/2.png is the number 2 with a 100.00 percent confidence.
The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
The image /data/mnist_predict/4.png is the number 4 with a 100.00 percent confidence.
The image /data/mnist_predict/5.png is the number 5 with a 100.00 percent confidence.
The image /data/mnist_predict/6.png is the number 6 with a 99.97 percent confidence.
The image /data/mnist_predict/7.png is the number 7 with a 100.00 percent confidence.
The image /data/mnist_predict/8.png is the number 8 with a 100.00 percent confidence.
The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
Clean up
To avoid incurring charges to your Cloud de Confiance account for the resources that you created in this guide, do one of the following:
- Keep the GKE cluster: Delete the Kubernetes resources in the cluster and the Cloud de Confiance resources
- Keep the Cloud de Confiance project: Delete the GKE cluster and the Cloud de Confiance resources
- Delete the project
Delete the Kubernetes resources in the cluster and the Cloud de Confiance resources
Delete the Kubernetes namespace and the workloads that you deployed:

kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-batch-predict.yaml
kubectl delete namespace gke-ai-namespace

Delete the Cloud Storage bucket:
Go to the Buckets page.
Select the checkbox for PROJECT_ID-gke-gpu-bucket.
Click Delete.
To confirm deletion, type DELETE and click Delete.
Delete the Cloud de Confiance service account:
Go to the Service accounts page.
Select your project.
Select the checkbox for gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com.
Click Delete.
To confirm deletion, click Delete.
Delete the GKE cluster and the Cloud de Confiance resources
Delete the GKE cluster:
Go to the Clusters page.
Select the checkbox for gke-gpu-cluster.
Click Delete.
To confirm deletion, type gke-gpu-cluster and click Delete.
Delete the Cloud Storage bucket:
Go to the Buckets page.
Select the checkbox for PROJECT_ID-gke-gpu-bucket.
Click Delete.
To confirm deletion, type DELETE and click Delete.
Delete the Cloud de Confiance service account:
Go to the Service accounts page.
Select your project.
Select the checkbox for gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com.
Click Delete.
To confirm deletion, click Delete.
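Alternatively, you can delete all three resources from Cloud Shell. A sketch of the equivalent commands (deleting the bucket recursively removes its contents first):

gcloud container clusters delete gke-gpu-cluster --location=us-central1

gcloud storage rm --recursive gs://PROJECT_ID-gke-gpu-bucket

gcloud iam service-accounts delete gke-ai-sa@PROJECT_ID.s3ns.iam.gserviceaccount.com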
Delete the project
- In the Cloud de Confiance console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.