We are testing with SEV-SNP + H100. CC mode with a single GPU works fine when we follow the deployment guide. Now we want to test non-CC mode with a regular VM. First, we turn CC mode off with --set-cc-mode=off:
$ sudo python3 nvidia_gpu_tools.py --gpu-bdf=43:00.0 --set-cc-mode=off --reset-after-cc-mode-switch
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['nvidia_gpu_tools.py', '--gpu-bdf=43:00.0', '--set-cc-mode=off', '--reset-after-cc-mode-switch']
2024-05-09,03:02:58.309 WARNING GPU 0000:43:00.0 ? 0x2330 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
Topo:
PCI 0000:40:01.1 0x1022:0x14ab
PCI 0000:41:00.0 0x1000:0xc030
PCI 0000:42:00.0 0x1000:0xc030
GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000
2024-05-09,03:02:58.311 INFO Selected GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000
2024-05-09,03:02:58.311 WARNING GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 has CC mode on, some functionality may not work
2024-05-09,03:02:58.413 INFO GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 CC mode set to off. It will be active after GPU reset.
2024-05-09,03:03:00.071 INFO GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 was reset to apply the new CC mode.
2024-05-09,03:03:00.072 WARNING GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 restoring power control to auto
Querying the settings confirms that CC mode is disabled (enable = 0):
$ sudo python3 nvidia_gpu_tools.py --gpu-bdf=43:00.0 --query-cc-settings
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['nvidia_gpu_tools.py', '--gpu-bdf=43:00.0', '--query-cc-settings']
2024-05-09,03:03:19.437 WARNING GPU 0000:43:00.0 ? 0x2330 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
Topo:
PCI 0000:40:01.1 0x1022:0x14ab
PCI 0000:41:00.0 0x1000:0xc030
PCI 0000:42:00.0 0x1000:0xc030
GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000
2024-05-09,03:03:19.439 INFO Selected GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000
2024-05-09,03:03:19.551 INFO GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 CC settings:
2024-05-09,03:03:19.551 INFO enable = 0
2024-05-09,03:03:19.551 INFO enable-devtools = 0
2024-05-09,03:03:19.551 INFO enable-bar0-filter = 0
2024-05-09,03:03:19.552 INFO enable-allow-inband-control = 1
2024-05-09,03:03:19.552 INFO enable-devtools-allow-inband-control = 1
2024-05-09,03:03:19.552 INFO enable-bar0-filter-allow-inband-control = 1
2024-05-09,03:03:19.552 WARNING GPU 0000:43:00.0 H100-SXM 0x2330 BAR0 0xab042000000 restoring power control to auto
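For anyone who wants to check this state from a script rather than by eye, here is a minimal sketch that parses the `--query-cc-settings` log text shown above. The function names `parse_cc_settings` and `cc_is_off` are our own (hypothetical helpers, not part of nvidia_gpu_tools.py), and the sketch assumes the `INFO <name> = <value>` line format stays as in this output.

```python
import re

# Matches setting lines such as "2024-05-09,03:03:19.551 INFO enable = 0".
# Assumption: the tool keeps printing settings as "INFO <name> = <integer>".
_SETTING_RE = re.compile(r"INFO\s+([\w-]+)\s*=\s*(\d+)\s*$")


def parse_cc_settings(output: str) -> dict:
    """Collect every '<name> = <value>' setting found in the tool's log output."""
    settings = {}
    for line in output.splitlines():
        match = _SETTING_RE.search(line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


def cc_is_off(settings: dict) -> bool:
    """True only when the 'enable' setting is present and equal to 0."""
    return settings.get("enable") == 0


if __name__ == "__main__":
    sample = (
        "2024-05-09,03:03:19.551 INFO enable = 0\n"
        "2024-05-09,03:03:19.551 INFO enable-devtools = 0\n"
        "2024-05-09,03:03:19.552 INFO enable-allow-inband-control = 1\n"
    )
    parsed = parse_cc_settings(sample)
    print(parsed, "CC off:", cc_is_off(parsed))
```

The captured text could come from running the query command through `subprocess.run(..., capture_output=True, text=True)`; whether the tool logs to stdout or stderr is something we have not verified, so checking both streams is the safer choice.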
Then we invoke launch_vm.sh with -x to set docc=false. The VM starts fine and nvidia-smi can see the GPU:
Thu May 9 03:22:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:01:00.0 Off | 0 |
| N/A 25C P0 67W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
But when we try to run a test program that uses CUDA, it fails with Error 802:
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/home/zgu/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
>>>
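To make the failure easier to search for, here is a small sketch that pulls the numeric code out of the warning and maps it to its CUDA runtime enum name. As far as we can tell, 802 is `cudaErrorSystemNotReady`, which the CUDA runtime documentation describes as the system not yet being ready to start CUDA work, with the advice to verify that the system configuration is valid and all required driver daemons are running. The helper `describe_cuda_error` and the small lookup table are our own illustration, not part of PyTorch or CUDA, and the table covers only a handful of codes.

```python
import re

# Small, incomplete subset of CUDA runtime error codes (cudaError enum names).
KNOWN_CUDA_ERRORS = {
    3: "cudaErrorInitializationError",
    35: "cudaErrorInsufficientDriver",
    100: "cudaErrorNoDevice",
    802: "cudaErrorSystemNotReady",
    803: "cudaErrorSystemDriverMismatch",
}

# Matches "Error 802: system not yet initialized (" and stops before the "(".
_ERROR_RE = re.compile(r"Error (\d+):\s*([^(]+)")


def describe_cuda_error(message: str):
    """Return (code, enum_name, text) parsed from a warning, or None if absent."""
    match = _ERROR_RE.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    name = KNOWN_CUDA_ERRORS.get(code, "unknown")
    return code, name, match.group(2).strip()


if __name__ == "__main__":
    warning = (
        "CUDA initialization: Unexpected error from cudaGetDeviceCount(). "
        "Error 802: system not yet initialized "
        "(Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)"
    )
    print(describe_cuda_error(warning))
```

This only decodes the message; it does not tell us which daemon or configuration step is missing in the non-CC VM.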
Is there any way to test the GPU in non-CC mode? Am I missing some steps?