Pass-through cc-disabled H100 to a non-confidential VM

gzs715 · May 23, 2024, 1:39pm

We are testing SEV-SNP with an H100. CC mode with a single GPU works fine when we follow the deployment guide. Now we want to test non-CC mode with a regular VM. First we turn CC mode off with --set-cc-mode=off:

$ sudo python3 nvidia_gpu_tools.py --gpu-bdf=43:00.0 --set-cc-mode=off --reset-after-cc-mode-switch
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['nvidia_gpu_tools.py', '--gpu-bdf=43:00.0', '--set-cc-mode=off', '--reset-after-cc-mode-switch']
2024-05-09,03:02:58.309 WARNING  GPU 0000:43:00.0 ? 0x2330 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
Topo:
  PCI 0000:40:01.1 0x1022:0x14ab
   PCI 0000:41:00.0 0x1000:0xc030
    PCI 0000:42:00.0 0x1000:0xc030
     GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000
2024-05-09,03:02:58.311 INFO     Selected GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000
2024-05-09,03:02:58.311 WARNING  GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 has CC mode on, some functionality may not work
2024-05-09,03:02:58.413 INFO     GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 CC mode set to off. It will be active after GPU reset.
2024-05-09,03:03:00.071 INFO     GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 was reset to apply the new CC mode.
2024-05-09,03:03:00.072 WARNING  GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 restoring power control to auto

Querying the settings confirms that CC is disabled (enable = 0):

$ sudo python3 nvidia_gpu_tools.py --gpu-bdf=43:00.0 --query-cc-settings
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['nvidia_gpu_tools.py', '--gpu-bdf=43:00.0', '--query-cc-settings']
2024-05-09,03:03:19.437 WARNING  GPU 0000:43:00.0 ? 0x2330 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
Topo:
  PCI 0000:40:01.1 0x1022:0x14ab
   PCI 0000:41:00.0 0x1000:0xc030
    PCI 0000:42:00.0 0x1000:0xc030
     GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000
2024-05-09,03:03:19.439 INFO     Selected GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000
2024-05-09,03:03:19.551 INFO     GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 CC settings:
2024-05-09,03:03:19.551 INFO       enable = 0
2024-05-09,03:03:19.551 INFO       enable-devtools = 0
2024-05-09,03:03:19.551 INFO       enable-bar0-filter = 0
2024-05-09,03:03:19.552 INFO       enable-allow-inband-control = 1
2024-05-09,03:03:19.552 INFO       enable-devtools-allow-inband-control = 1
2024-05-09,03:03:19.552 INFO       enable-bar0-filter-allow-inband-control = 1
2024-05-09,03:03:19.552 WARNING  GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 restoring power control to auto
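To check the CC state from a script rather than by eye, the query output above can be parsed. This is only a minimal sketch: the helper name parse_cc_settings is hypothetical, and it assumes the log line layout stays exactly as shown in this post (an INFO line ending in "name = value").

```python
import re

# Matches lines such as "2024-05-09,03:03:19.551 INFO       enable = 0".
# Assumption: the tool keeps this "INFO <name> = <int>" layout.
SETTING_RE = re.compile(r"INFO\s+([\w-]+)\s*=\s*(\d+)\s*$")


def parse_cc_settings(log_text):
    """Return {setting_name: int_value} parsed from --query-cc-settings output."""
    settings = {}
    for line in log_text.splitlines():
        match = SETTING_RE.search(line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


# Lines copied from the query output in this post.
sample = """\
2024-05-09,03:03:19.551 INFO     GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 CC settings:
2024-05-09,03:03:19.551 INFO       enable = 0
2024-05-09,03:03:19.551 INFO       enable-devtools = 0
2024-05-09,03:03:19.551 INFO       enable-bar0-filter = 0
2024-05-09,03:03:19.552 INFO       enable-allow-inband-control = 1
"""

settings = parse_cc_settings(sample)
cc_is_off = settings.get("enable") == 0
print(settings, cc_is_off)
```

In practice the text would come from capturing the output of the nvidia_gpu_tools.py command shown above instead of the hard-coded sample.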

Then we invoke launch_vm.sh with -x to set docc=false. The VM starts fine and nvidia-smi can see the GPU:

Thu May  9 03:22:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:01:00.0 Off |                    0 |
| N/A   25C    P0             67W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

But when we try to run a test program that uses CUDA, it fails with Error 802:

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/home/zgu/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
 return torch._C._cuda_getDeviceCount() > 0
False
>>>
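To rule out PyTorch itself, the same initialization can be probed against the driver directly. The sketch below is an illustration, not part of the deployment guide: probe_cuda_driver is a hypothetical helper, and it assumes the driver library is installed in the guest as libcuda.so.1. It calls the CUDA driver API functions cuInit and cuGetErrorName through ctypes and prints the raw result. In the driver API, code 802 is named CUDA_ERROR_SYSTEM_NOT_READY.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly.

    Returns (code, name), or None if the driver library cannot be loaded.
    Assumption: lib_name is where the driver library lives on this system.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None

    # CUresult cuInit(unsigned int Flags)
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    # CUresult cuGetErrorName(CUresult error, const char **pStr)
    libcuda.cuGetErrorName.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
    libcuda.cuGetErrorName.restype = ctypes.c_int

    code = libcuda.cuInit(0)
    name = ctypes.c_char_p()
    libcuda.cuGetErrorName(code, ctypes.byref(name))
    return code, (name.value.decode() if name.value else "unknown")


result = probe_cuda_driver()
print(result)
```

If this also reports 802 inside the VM, the failure is at driver initialization and not specific to the PyTorch installation.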

Is there any way to test the GPU in non-CC mode? Am I missing a step?
