The host platform is an 8-GPU HGX system with NvSwitch.
https://github.com/NVIDIA/nvtrust/blob/main/host_tools/python/gpu_cc_tool.py does not provide code for configuring the NvSwitch.
Does the NvSwitch need to be properly configured to enable confidential computing mode?
Currently, in the Early Access release, we do not provide multi-GPU CC support. We will provide the appropriate code when we release the version with multi-GPU support.
You may, however, use the tool to configure multiple GPUs assigned to a single CVM. They will operate independently as expected, just without leveraging the NVLinks that may connect them (for example, in a deep learning inference server).
The host is a multi-GPU platform. The requirement is to dedicate one GPU to the guest CVM for confidential computing. Is this usage supported in Early Access?
Yes, you may run several CVMs with any number of GPUs attached.
Any CVM with multiple GPUs will report PeerAccess as disabled, and you will not be able to use P2P access between them.
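To see this from inside the guest, you can query peer capability directly. The following is a minimal CUDA sketch (illustrative only, assuming at least two passed-through GPUs are visible); with the GPUs in CC mode, `cudaDeviceCanAccessPeer` should report that peer access is unavailable:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n < 2) {
        printf("fewer than two GPUs visible\n");
        return 0;
    }
    // Can device 0 map device 1's memory? In CC mode this is
    // expected to be 0, since NVLink/P2P is not supported in
    // the Early Access release.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    printf("peer access 0 -> 1: %s\n", canAccess ? "enabled" : "disabled");
    return 0;
}
```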
I tried using the tool on one of the 8 GPUs, setting the CC mode to devtools:
> sudo python3 gpu_cc_tool.py --gpu=0 --set-cc-mode devtools --reset-after-cc-mode-switch
2023-09-21,00:21:48.000 INFO Selected GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000
2023-09-21,00:21:48.193 INFO GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000 CC mode set to devtools. It will be active after GPU reset.
2023-09-21,00:21:49.864 INFO GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000 was reset to apply the new CC mode.
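As a sanity check after the reset, the tool can also report the mode it believes is active; this assumes your nvtrust checkout includes the query flag (verify with `--help` if unsure):

```shell
# Confirm the CC mode now active on the selected GPU
sudo python3 gpu_cc_tool.py --gpu=0 --query-cc-mode
```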
I attached this GPU to a CVM. However, any CUDA application makes the kernel driver crash. The crash location in the driver code:
static NV_STATUS channel_manager_create_conf_computing_pools(uvm_channel_manager_t *manager, unsigned *preferred_ce)
// Skip some code
status = channel_pool_add(manager, UVM_CHANNEL_POOL_TYPE_SEC2, 0, &sec2_pool); // This function returned with status NV_OK
// Skip some code
status = channel_pool_add(manager, UVM_CHANNEL_POOL_TYPE_WLC, wlc_lcic_ce_index, &wlc_pool); // Crashed here
[30838.220082] nvidia-uvm: uvm_channel.c:1532 uvm_channel_check_errors[pid:2503] Channel error likely caused by push 'GPFIFO submit to 'ID 1:3 (0x1:0x3) WLC 2' via 'UVM_CHANNEL_TYPE_SEC2'' started at uvm_channel.c:1247 in submit_ctrl_gpfifo_indirect()
[30838.222513] nvidia-uvm: uvm_channel.c:1547 uvm_channel_check_errors[pid:2503] Assert failed, condition 0 not true: Fatal error: Generic RC error [NV_ERR_RC_ERROR]
I added a printf in uvm_channel_get_status (called by uvm_channel_check_errors); errorNotifier->status is -1 (0xffff).
I did not configure the NvSwitch device, nor did I pass the NvSwitch through to the CVM.
What’s the problem here?
By the way, if I set the CC mode to on, the error happens in a different place. Its output is shown at Is "CC mode = on" available? - #3 by Yifan-Tan.