Use CC in a multi-GPU system (with NvSwitch)

The host platform is an 8-GPU HGX system with NvSwitch.

https://github.com/NVIDIA/nvtrust/blob/main/host_tools/python/gpu_cc_tool.py does not provide code for configuring NvSwitch.

Does the NvSwitch need to be properly configured in order to enable confidential computing mode?

In the current Early Access, we do not provide multi-GPU CC support. We will provide the appropriate code when we release a version with multi-GPU support.

You may, however, use the tool to configure multiple GPUs assigned to a single CVM. They will operate independently as expected, just without leveraging any NVLinks that may connect them (for example, in a Deep Learning Inference server).
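For reference, a minimal sketch of driving the tool once per GPU assigned to the CVM (it only assumes the gpu_cc_tool.py flags shown later in this thread; the GPU indices 0 and 1 are placeholders for whichever GPUs you assign):

# Sketch: switch each GPU assigned to one CVM into devtools CC mode.
# Assumes gpu_cc_tool.py from nvtrust and the --gpu / --set-cc-mode /
# --reset-after-cc-mode-switch flags used elsewhere in this thread.
# The GPU indices below are placeholders.
import subprocess

for gpu_index in (0, 1):  # GPUs assigned to the CVM
    subprocess.run(
        ["sudo", "python3", "gpu_cc_tool.py",
         f"--gpu={gpu_index}",
         "--set-cc-mode", "devtools",
         "--reset-after-cc-mode-switch"],
        check=True,
    )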

The host is a multi-GPU platform. The requirement is to dedicate one GPU to the guest CVM for confidential computing. Is this usage supported in Early Access?

Yes, you may run several CVMs with any number of GPUs attached.

Any CVM with multiple GPUs will report PeerAccess as disabled, and you will not be able to use the GPUs for any P2P accesses.
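Inside the CVM you can confirm this with a quick peer-access check between the attached GPUs. A minimal sketch, assuming PyTorch is available in the guest (this is not part of the nvtrust tooling):

# Sketch: report peer-access capability for every GPU pair in the CVM.
# In a CC-enabled CVM this is expected to print "disabled" for all pairs,
# since P2P/NVLink access is not supported in the Early Access release.
import torch

count = torch.cuda.device_count()
for dev in range(count):
    for peer in range(count):
        if dev == peer:
            continue
        ok = torch.cuda.can_device_access_peer(dev, peer)
        print(f"GPU {dev} -> GPU {peer}: peer access {'enabled' if ok else 'disabled'}")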

I tried using the tool on one of the 8 GPUs, setting the CC mode to devtools:

> sudo python3 gpu_cc_tool.py --gpu=0 --set-cc-mode devtools --reset-after-cc-mode-switch

2023-09-21,00:21:48.000 INFO     Selected GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000
2023-09-21,00:21:48.193 INFO     GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000 CC mode set to devtools. It will be active after GPU reset.
2023-09-21,00:21:49.864 INFO     GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000 was reset to apply the new CC mode.

I attached this GPU to a CVM. However, any CUDA application causes the kernel driver to crash. The crash location in the driver code:

// kernel-open/nvidia-uvm/uvm_channel.c

static NV_STATUS channel_manager_create_conf_computing_pools(uvm_channel_manager_t *manager, unsigned *preferred_ce)
{
    // Skip some code

    status = channel_pool_add(manager, UVM_CHANNEL_POOL_TYPE_SEC2, 0, &sec2_pool); // This function returned with status NV_OK

    // Skip some code

    status = channel_pool_add(manager, UVM_CHANNEL_POOL_TYPE_WLC, wlc_lcic_ce_index, &wlc_pool); // Crashed here
}

Driver output:

[30838.220082] nvidia-uvm: uvm_channel.c:1532 uvm_channel_check_errors[pid:2503] Channel error likely caused by push 'GPFIFO submit to 'ID 1:3 (0x1:0x3) WLC 2' via 'UVM_CHANNEL_TYPE_SEC2'' started at uvm_channel.c:1247 in submit_ctrl_gpfifo_indirect()
[30838.222513] nvidia-uvm: uvm_channel.c:1547 uvm_channel_check_errors[pid:2503] Assert failed, condition 0 not true: Fatal error: Generic RC error [NV_ERR_RC_ERROR]

I added a printf in uvm_channel_get_status (called by uvm_channel_check_errors). The errorNotifier->status is -1 (0xffff).

I did not configure the NvSwitch device. Furthermore, I did not pass the NvSwitch through to the CVM.

What’s the problem here?

By the way, if I set the CC mode to on, the error happens in a different place. Its output is shown in Is "CC mode = on" available? - #3 by Yifan-Tan.

Solved: see Nvidia H100 driver in guest · Issue #19 · NVIDIA/nvtrust · GitHub
