Use CC in multi-GPU system (with NvSwitch)

The host platform is an 8-GPU HGX system with NvSwitch. The tool does not provide code for configuring NvSwitch.

Does NvSwitch need to be properly configured in order to enable confidential computing mode?

Currently in the Early Access, we do not provide multi-GPU CC support. We will provide the appropriate code when we release the version with multi-GPU support.

You may, however, use the tool to configure multiple GPUs assigned to a single CVM. They will operate independently as expected, just without leveraging the NVLinks that may connect them (for example, in a deep learning inference server).

The host is a multi-GPU platform. The requirement is to dedicate one GPU to the guest CVM for confidential computing. Is this usage supported in Early Access?

Yes, you may run several CVMs with any number of GPUs attached.

Any CVM with multiple GPUs will report PeerAccess as disabled, and you will not be able to use them for any P2P accesses.
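The disabled peer-access behavior can be confirmed from inside the CVM. Below is a minimal sketch (not from this thread) that queries the CUDA runtime's `cudaDeviceCanAccessPeer` through `ctypes`; it assumes `libcudart` is discoverable on the loader path and degrades to `None` when the CUDA runtime is unavailable:

```python
import ctypes
import ctypes.util


def can_access_peer(device: int, peer: int):
    """Return 1/0 from cudaDeviceCanAccessPeer, or None if the CUDA
    runtime cannot be loaded or the call fails (e.g. no GPU present)."""
    name = ctypes.util.find_library("cudart")
    if name is None:
        return None
    try:
        cudart = ctypes.CDLL(name)
    except OSError:
        return None
    flag = ctypes.c_int(0)
    # cudaError_t cudaDeviceCanAccessPeer(int *canAccessPeer, int device, int peerDevice)
    err = cudart.cudaDeviceCanAccessPeer(ctypes.byref(flag), device, peer)
    if err != 0:  # cudaSuccess == 0
        return None
    return flag.value


if __name__ == "__main__":
    print(can_access_peer(0, 1))
```

In a CC-enabled CVM with two passed-through GPUs, this is expected to report 0 (peer access disabled), matching the statement above.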

I tried utilizing the tool on one of the 8 GPUs, setting the CC mode to devtools:

> sudo python3 --gpu=0 --set-cc-mode devtools --reset-after-cc-mode-switch

2023-09-21,00:21:48.000 INFO     Selected GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000
2023-09-21,00:21:48.193 INFO     GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000 CC mode set to devtools. It will be active after GPU reset.
2023-09-21,00:21:49.864 INFO     GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000 was reset to apply the new CC mode.

I attached this GPU to a CVM. However, any CUDA application causes the kernel driver to crash. Crash location in the driver code:

// kernel-open/nvidia-uvm/uvm_channel.c

static NV_STATUS channel_manager_create_conf_computing_pools(uvm_channel_manager_t *manager, unsigned *preferred_ce)
    // Skip some code

    status = channel_pool_add(manager, UVM_CHANNEL_POOL_TYPE_SEC2, 0, &sec2_pool); // This function returned with status NV_OK

    // Skip some code

    status = channel_pool_add(manager, UVM_CHANNEL_POOL_TYPE_WLC, wlc_lcic_ce_index, &wlc_pool); // Crashed here

Driver output:

[30838.220082] nvidia-uvm: uvm_channel.c:1532 uvm_channel_check_errors[pid:2503] Channel error likely caused by push 'GPFIFO submit to 'ID 1:3 (0x1:0x3) WLC 2' via 'UVM_CHANNEL_TYPE_SEC2'' started at uvm_channel.c:1247 in submit_ctrl_gpfifo_indirect()
[30838.222513] nvidia-uvm: uvm_channel.c:1547 uvm_channel_check_errors[pid:2503] Assert failed, condition 0 not true: Fatal error: Generic RC error [NV_ERR_RC_ERROR]

I added a printf in uvm_channel_get_status (called by uvm_channel_check_errors). The errorNotifier->status is -1 (0xffff).

I did not configure the NvSwitch device, and I did not pass the NvSwitch through to the CVM.

What’s the problem here?

By the way, if I set the CC mode to on, the error happens in a different place. Its output is shown in Is "CC mode = on" available? - #3 by Yifan-Tan.

Solved: Nvidia H100 driver in guest · Issue #19 · NVIDIA/nvtrust · GitHub
