We recently got an 8x H100 + 2x 8468 CPU system. Unfortunately, one GPU can't be detected by the driver, so the topology is:
We are running a bus bandwidth test with NVLink SHARP on this system, but we only get a busBW of around 375, even with NCCL_ALGO=NVLS.
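For reference, here is a minimal sketch of how a busBW figure relates to the raw algorithm bandwidth. It assumes the number comes from an all-reduce run in nccl-tests (not stated above), where busBW = algBW * 2 * (n - 1) / n for n ranks. The function names are hypothetical helpers, not part of NCCL.

```python
def allreduce_bus_bw(alg_bw: float, n_ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth to bus bandwidth.

    Assumes the nccl-tests convention: busBW = algBW * 2 * (n - 1) / n.
    """
    if n_ranks < 2:
        raise ValueError("an all-reduce needs at least 2 ranks")
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


def allreduce_alg_bw(bus_bw: float, n_ranks: int) -> float:
    """Inverse of allreduce_bus_bw: recover algBW from a reported busBW."""
    if n_ranks < 2:
        raise ValueError("an all-reduce needs at least 2 ranks")
    return bus_bw * n_ranks / (2 * (n_ranks - 1))


if __name__ == "__main__":
    # Compare the 7-GPU case (one GPU missing) with a full 8-GPU node,
    # using the busBW value quoted in the post (unit as reported by the test).
    for n in (7, 8):
        print(f"{n} ranks: algBW = {allreduce_alg_bw(375.0, n):.2f}")
```

The correction factor differs slightly between 7 and 8 ranks, so the same busBW corresponds to a slightly different algBW on the degraded node.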
We also tested the multi_node_p2p example from the NVIDIA/multi-gpu-programming-models repository on GitHub (master branch), which leverages NVLink SHARP to calculate the L2 norm, but the code gives an error:
According to the CUDA Driver API documentation (CUDA Driver API :: CUDA Toolkit Documentation), it simply means the IMEX channel is not correctly configured.
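As a quick sanity check on the IMEX side, the sketch below lists which IMEX channel device nodes exist. It assumes a Linux host where the driver exposes channels as `channelN` entries under `/dev/nvidia-caps-imex-channels`; the helper name `list_imex_channels` is hypothetical, and an empty result only suggests that no channel node has been created, not why.

```python
import os

# Assumed default location of IMEX channel device nodes on Linux.
IMEX_DIR = "/dev/nvidia-caps-imex-channels"


def list_imex_channels(dev_dir: str = IMEX_DIR) -> list:
    """Return the sorted names of channel nodes found in dev_dir.

    Returns an empty list when the directory does not exist.
    """
    if not os.path.isdir(dev_dir):
        return []
    return sorted(
        name for name in os.listdir(dev_dir) if name.startswith("channel")
    )


if __name__ == "__main__":
    channels = list_imex_channels()
    if channels:
        print("IMEX channel nodes found:", ", ".join(channels))
    else:
        print("No IMEX channel nodes found under", IMEX_DIR)
```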
My questions are:
If one GPU is lost on an 8x H100 system, is it still possible to enable NVLink SHARP?
If so, how can it be enabled? Do we need to configure the driver manually using nvidia-modprobe?