The current Early Access release does not provide multi-GPU CC support. We will provide the appropriate code when we release a version with multi-GPU support.
You may, however, use the tool to configure multiple GPUs assigned to a single CVM. They will operate independently as expected, just without using any NVLinks that may connect them (for example, in a Deep Learning Inference server).
The host is a multi-GPU platform. I need to set aside one GPU for the guest CVM to use with confidential computing. Is this usage supported in Early Access?
I tried utilizing the tool on one of the 8 GPUs, setting the CC mode to devtools:
> sudo python3 gpu_cc_tool.py --gpu=0 --set-cc-mode devtools --reset-after-cc-mode-switch
2023-09-21,00:21:48.000 INFO Selected GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000
2023-09-21,00:21:48.193 INFO GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000 CC mode set to devtools. It will be active after GPU reset.
2023-09-21,00:21:49.864 INFO GPU 0000:18:00.0 H100-PCIE 0x2324 BAR0 0xee042000000 was reset to apply the new CC mode.
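For anyone who wants to repeat this invocation for several GPU indices, here is a minimal Python sketch that only builds and prints the command lines (a dry run). It assumes the flags are exactly the ones shown in the command above; build_cc_mode_commands is a hypothetical helper, not part of gpu_cc_tool.py, and nothing here addresses the crash described in this post.

```python
import shlex


def build_cc_mode_commands(gpu_indices, mode="devtools", tool="gpu_cc_tool.py"):
    """Return one argv list per GPU index.

    Hypothetical helper for illustration only. The flags mirror the
    command quoted in this thread; no other flags are assumed.
    """
    commands = []
    for index in gpu_indices:
        commands.append([
            "sudo", "python3", tool,
            f"--gpu={index}",
            "--set-cc-mode", mode,
            "--reset-after-cc-mode-switch",
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in build_cc_mode_commands([0, 1]):
        print(shlex.join(argv))
```

Printing first makes it easy to check the GPU indices before running anything with sudo.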
I attached this GPU to a CVM. However, any CUDA application causes the kernel driver to crash. The crash location in the driver code:
// kernel-open/nvidia-uvm/uvm_channel.c
static NV_STATUS channel_manager_create_conf_computing_pools(uvm_channel_manager_t *manager, unsigned *preferred_ce)
{
    // Skip some code
    status = channel_pool_add(manager, UVM_CHANNEL_POOL_TYPE_SEC2, 0, &sec2_pool); // This function returned with status NV_OK
    // Skip some code
    status = channel_pool_add(manager, UVM_CHANNEL_POOL_TYPE_WLC, wlc_lcic_ce_index, &wlc_pool); // Crashed here
}
Driver output:
[30838.220082] nvidia-uvm: uvm_channel.c:1532 uvm_channel_check_errors[pid:2503] Channel error likely caused by push 'GPFIFO submit to 'ID 1:3 (0x1:0x3) WLC 2' via 'UVM_CHANNEL_TYPE_SEC2'' started at uvm_channel.c:1247 in submit_ctrl_gpfifo_indirect()
[30838.222513] nvidia-uvm: uvm_channel.c:1547 uvm_channel_check_errors[pid:2503] Assert failed, condition 0 not true: Fatal error: Generic RC error [NV_ERR_RC_ERROR]
I added a printf to uvm_channel_get_status (called by uvm_channel_check_errors). The errorNotifier->status value is -1 (0xffff).
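A note on reading that value: assuming the status field is 16 bits wide (which the 0xffff value suggests, but which is not confirmed here), the same bit pattern prints as -1 when treated as signed and as 65535 when treated as unsigned. The small Python sketch below shows both readings; interpret_status16 is a hypothetical name used only for illustration.

```python
import struct


def interpret_status16(value):
    """Return (unsigned, signed) readings of a 16-bit pattern.

    Hypothetical helper; assumes the status field is 16 bits wide.
    """
    raw = struct.pack("<H", value & 0xFFFF)  # little-endian, 16 bits
    unsigned = struct.unpack("<H", raw)[0]   # read as an unsigned 16-bit integer
    signed = struct.unpack("<h", raw)[0]     # read as a signed 16-bit integer
    return unsigned, signed


if __name__ == "__main__":
    unsigned, signed = interpret_status16(0xFFFF)
    print(f"unsigned={unsigned} signed={signed}")
```

So whether a printf shows -1 or 65535 depends only on the format specifier and the type the value is cast to; the underlying bits are the same.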
I did not configure the NVSwitch device. Furthermore, I did not pass the NVSwitch through to the CVM.