Encrypt GPU workload data in use with Confidential GKE Nodes


You can encrypt GPU workload data in use by running the workloads on encrypted Confidential Google Kubernetes Engine Nodes. This page shows Security engineers and Operators how to improve the security of data in accelerated workloads, such as AI/ML tasks. You should be familiar with the following concepts:

About running GPU workloads on Confidential GKE Nodes

You can request Confidential GKE Nodes for your GPU workloads by using one of the following methods:

  • ComputeClasses: declaratively define the Confidential GKE Nodes configuration and let GKE create matching nodes.
  • Manual configuration in Standard mode: enable Confidential GKE Nodes for an entire Standard cluster or for specific GPU node pools.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
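
For example, you can complete both of these tasks from the command line. The project ID in the first command is a placeholder:

    # Enable the Google Kubernetes Engine API for your project.
    gcloud services enable container.googleapis.com --project=PROJECT_ID

    # Get the latest gcloud CLI components.
    gcloud components update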

Requirements and limitations

Regardless of the Confidential GKE Nodes configuration method that you choose, you must meet all of the following requirements:

  • The nodes must be in a zone that supports NVIDIA Confidential Computing. For more information, see View supported zones.
  • The nodes must use only one NVIDIA H100 80 GB GPU and the a3-highgpu-1g machine type.
  • The nodes must use the Intel TDX Confidential Computing technology.
  • You must have quota for preemptible H100 80 GB GPUs (compute.googleapis.com/preemptible_nvidia_h100_gpus) in your node locations. For more information about managing your quota, see View and manage quotas.
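
For example, you can list a region's quotas from the command line and filter for H100 entries. This is a sketch; the exact quota names in the output are an assumption, and the Quotas page in the console is the authoritative view:

    # List quotas for a region and show H100-related entries with their limits.
    # Replace REGION with a region that contains your node locations.
    gcloud compute regions describe REGION --format="json(quotas)" | grep -i -B 1 -A 1 h100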

In addition to these requirements, you must meet specific conditions depending on the Confidential GKE Nodes configuration method that you choose:

ComputeClasses

  • Requirements:
    • Use a cluster that runs GKE version 1.33.3-gke.1392000 or later.
  • Limitations:
    • You can't use flex-start with queued provisioning with ComputeClasses.
    • You can't use GPU sharing features like time-sharing or multi-instance GPUs.

Manual configuration in Standard mode

  • Requirements:
    • Use Spot VMs, preemptible VMs, flex-start (Preview), or flex-start with queued provisioning.
    • Use one of the following GKE versions:
      • Manual GPU driver installation: 1.32.2-gke.1297000 or later.
      • Automatic GPU driver installation: 1.33.3-gke.1392000 or later.
      • Flex-start with queued provisioning: 1.32.2-gke.1652000 or later.
  • Limitations:
    • You can't use flex-start (Preview) if you enable Confidential GKE Nodes for the entire cluster.
    • You can't use GPU sharing features like time-sharing or multi-instance GPUs.

Required roles

To get the permissions that you need to create Confidential GKE Nodes, ask your administrator to grant you the following IAM roles on the Trusted Cloud project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
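
For example, an administrator can grant a role on the project with a command like the following. The role shown here (roles/container.admin) is only an illustration, not necessarily the specific role that this task requires:

    # Grant a predefined GKE role to a user on the project.
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="user:USER_EMAIL" \
        --role="roles/container.admin"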

Use ComputeClasses to run confidential GPU workloads

You can define your Confidential GKE Nodes configuration in a ComputeClass. ComputeClasses are Kubernetes custom resources that let you declaratively set node configurations for GKE autoscaling and scheduling. You can follow the steps in this section in any Autopilot or Standard cluster that runs GKE version 1.33.3-gke.1392000 or later.

To use a ComputeClass to run GPU workloads on Confidential GKE Nodes, follow these steps:

  1. Save the following ComputeClass manifest as a YAML file:

    apiVersion: cloud.google.com/v1
    kind: ComputeClass
    metadata:
      name: COMPUTECLASS_NAME
    spec:
      nodePoolConfig:
        confidentialNodeType: TDX
      priorityDefaults:
        location:
          zones: ['ZONE1','ZONE2']
      priorities:
      - gpu:
          type: nvidia-h100-80gb
          count: 1
          driverVersion: default
        spot: true
      activeMigration:
        optimizeRulePriority: true
      nodePoolAutoCreation:
        enabled: true
      whenUnsatisfiable: DoNotScaleUp
    

    Replace the following:

    • COMPUTECLASS_NAME: a name for the ComputeClass.
    • ZONE1,ZONE2: a comma-separated list of zones to create nodes in, such as ['us-central1-a','us-central1-b']. Specify zones that support the Intel TDX Confidential Computing technology. For more information, see View supported zones.
  2. Create the ComputeClass:

    kubectl apply -f PATH_TO_MANIFEST
    

    Replace PATH_TO_MANIFEST with the path to the ComputeClass manifest file.

  3. To run your GPU workload on Confidential GKE Nodes, select the ComputeClass in the workload manifest. For example, save the following Deployment manifest, which selects a ComputeClass and GPUs, as a YAML file:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: confidential-gpu-deployment
      labels:
        app: conf-gpu
    spec:
      selector:
        matchLabels:
          app: conf-gpu
      replicas: 1
      template:
        metadata:
          labels:
            app: conf-gpu
        spec:
          nodeSelector:
            cloud.google.com/compute-class: COMPUTECLASS_NAME
          containers:
          - name: example-app
            image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
            resources:
              limits:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: 1
              requests:
                cpu: "4"
                memory: "16Gi"
    

    Replace COMPUTECLASS_NAME with the name of the ComputeClass that you created.

  4. Create the Deployment:

    kubectl apply -f PATH_TO_DEPLOYMENT_MANIFEST
    

    Replace PATH_TO_DEPLOYMENT_MANIFEST with the path to the Deployment manifest.

When you create your GPU workload, GKE uses the configuration in the ComputeClass to create Confidential GKE Nodes with attached GPUs.
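
To confirm that GKE created the expected resources, you can optionally inspect the ComputeClass, the Pod, and the node with kubectl. This is a sketch; the label keys matched by the grep pattern are assumptions, so check the labels that your nodes actually carry:

    # Check that the ComputeClass exists and that the workload's Pod is running.
    kubectl get computeclass COMPUTECLASS_NAME
    kubectl get pods -l app=conf-gpu -o wide

    # Inspect the node that the Pod was scheduled on. Replace NODE_NAME with the
    # node name from the previous command.
    kubectl describe node NODE_NAME | grep -iE 'confidential|accelerator|compute-class'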

Manually configure Confidential GKE Nodes in GKE Standard

You can run GPU workloads on Confidential GKE Nodes in Standard mode clusters or node pools. For GPU workloads, your Confidential GKE Nodes must use the Intel TDX Confidential Computing technology.

Enable Confidential GKE Nodes in new Standard clusters

You can enable Confidential GKE Nodes for your entire Standard cluster, so that every GPU node pool that you create uses the same Confidential Computing technology. When you create a new Standard mode cluster that uses Confidential GKE Nodes for GPU workloads, ensure that you specify the following cluster settings:

  • Location: a region or a zone that supports NVIDIA Confidential Computing. For more information, see View supported zones.
  • Confidential Computing type: Intel TDX
  • Cluster version: one of the following versions, depending on how you want to install your GPU drivers:

    • Manual GPU driver installation: 1.32.2-gke.1297000 or later.
    • Automatic GPU driver installation: 1.33.3-gke.1392000 or later.

You can optionally configure GPUs for the default node pool that GKE creates in your cluster. However, we recommend that you use a separate node pool for your GPUs, so that at least one node pool in the cluster can run any workload.
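
For example, a cluster-creation command that reflects these settings might look like the following sketch, which assumes that your gcloud CLI version supports the --confidential-node-type flag on gcloud container clusters create:

    # Create a Standard cluster with Intel TDX Confidential GKE Nodes enabled.
    # Use 1.32.2-gke.1297000 or later instead if you plan to install GPU drivers manually.
    gcloud container clusters create CLUSTER_NAME \
        --location=LOCATION \
        --confidential-node-type=tdx \
        --cluster-version=1.33.3-gke.1392000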

For more information, see Enable Confidential GKE Nodes on Standard clusters.

Use Confidential GKE Nodes with GPUs in Standard node pools

If your cluster doesn't have Confidential GKE Nodes enabled, you can enable Confidential GKE Nodes in specific new or existing GPU node pools. The control plane and node pools must meet the requirements in the Availability section. When you configure the node pool, you can choose to install GPU drivers automatically or manually.

  • To create a new GPU node pool that uses Confidential GKE Nodes, select one of the following options:

    Console

    1. In the Trusted Cloud console, go to the Kubernetes clusters page:

      Go to Kubernetes clusters

    2. Click the name of the Standard mode cluster to modify.

    3. Click Add node pool. The Add a node pool page opens.

    4. On the Node pool details pane, do the following:

      1. Select Specify node locations.
      2. Select only the supported zones that are listed in the Availability section.
      3. Ensure that the control plane version is one of the versions that's listed in the Availability section.
    5. In the navigation menu, click Nodes.

    6. On the Configure node settings pane, do the following:

      1. In the Machine configuration section, click GPUs.
      2. In the GPU type menu, select NVIDIA H100 80GB.
      3. In the Number of GPUs menu, select 1.
      4. Ensure that Enable GPU sharing isn't selected.
      5. In the GPU Driver installation section, select one of the following options:

        • Google-managed: GKE automatically installs a driver. If you select this option, in the Version drop-down list, select one of the following driver versions:

          • Default: install the default driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
          • Latest: install the latest driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
        • User-managed: skip automatic driver installation. If you select this option, you must manually install a compatible GPU driver. Requires GKE version 1.32.2-gke.1297000 or later.

      6. In the Machine type section, ensure that the machine type is a3-highgpu-1g.

      7. Select Enable nodes on spot VMs or configure flex-start with queued provisioning.

    7. When you're ready to create the node pool, click Create.

    gcloud

    You can create GPU node pools that run Confidential GKE Nodes on Spot VMs or by using flex-start with queued provisioning.

    • Create a GPU node pool that runs Confidential GKE Nodes on Spot VMs:

      gcloud container node-pools create NODE_POOL_NAME \
          --cluster=CLUSTER_NAME \
          --confidential-node-type=tdx --location=LOCATION \
          --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
          --spot --accelerator=type=nvidia-h100-80gb,count=1,gpu-driver-version=DRIVER_VERSION \
          --machine-type=a3-highgpu-1g
      

      Replace the following:

      • NODE_POOL_NAME: a name for your new node pool.
      • CLUSTER_NAME: the name of your existing cluster.
      • LOCATION: the location for your new node pool. The location must support using GPUs in Confidential GKE Nodes.
      • NODE_LOCATION1,NODE_LOCATION2,...: a comma-separated list of zones to run the nodes in. These zones must support using NVIDIA Confidential Computing. For more information, see View supported zones.
      • DRIVER_VERSION: the GPU driver version to install. Specify one of the following values:

        • default: install the default driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
        • latest: install the latest driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
        • disabled: skip automatic driver installation. If you specify this value, you must manually install a compatible GPU driver. Requires GKE version 1.32.2-gke.1297000 or later.

    • Create a GPU node pool that runs Confidential GKE Nodes by using flex-start with queued provisioning:

      gcloud container node-pools create NODE_POOL_NAME \
          --cluster=CLUSTER_NAME \
          --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
          --machine-type=a3-highgpu-1g --confidential-node-type=tdx \
          --location=LOCATION \
          --flex-start --enable-queued-provisioning \
          --enable-autoscaling --num-nodes=0 --total-max-nodes=TOTAL_MAX_NODES \
          --location-policy=ANY --reservation-affinity=none --no-enable-autorepair \
          --accelerator=type=nvidia-h100-80gb,count=1,gpu-driver-version=DRIVER_VERSION
      

      Replace TOTAL_MAX_NODES with the maximum number of nodes that the node pool can automatically scale to.

      For more information about the configuration options in flex-start with queued provisioning, see Run a large-scale workload with flex-start with queued provisioning.

  • To update your existing node pools to use the Intel TDX Confidential Computing technology, see Update an existing node pool.
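
After you create or update a GPU node pool with any of these methods, you can optionally verify its configuration. This is a sketch; the exact field paths for the confidential node settings in the describe output are assumptions:

    # Show the machine type, accelerators, and confidential node settings of the node pool.
    gcloud container node-pools describe NODE_POOL_NAME \
        --cluster=CLUSTER_NAME \
        --location=LOCATION \
        --format="yaml(config.machineType, config.accelerators, config.confidentialNodes)"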

Manually install GPU drivers that support Confidential GKE Nodes

If you didn't enable automatic driver installation when you created or updated your node pools, you must manually install a GPU driver that supports Confidential GKE Nodes.

This change requires recreating the nodes, which can cause disruption to your running workloads. For details about this specific change, find the corresponding row in the table of manual changes that recreate the nodes using a node upgrade strategy without respecting maintenance policies. To learn more about node updates, see Planning for node update disruptions.

For instructions, see the "COS" tab in Manually install NVIDIA GPU drivers.
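
A minimal sketch of that manual step follows. The manifest URL and whether that driver build supports Confidential GKE Nodes are assumptions, so use the exact manifest from the linked instructions:

    # Apply the NVIDIA driver installer DaemonSet for Container-Optimized OS (COS) nodes.
    # Verify the correct manifest for your GKE version in the linked page.
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml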

Troubleshoot

For troubleshooting information, see Troubleshoot GPUs in GKE.
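
As a quick first check, you can verify that a Pod that requests a GPU can actually see the device. This sketch assumes the earlier example Deployment and the path where GKE typically mounts the NVIDIA utilities into GPU containers; the exact path and invocation can vary:

    # Open a shell in a Pod from the example Deployment and query the GPU.
    kubectl exec -it deployment/confidential-gpu-deployment -- /usr/local/nvidia/bin/nvidia-smi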

What's next