I’m reaching out because I’m having trouble understanding whether or not my system should have P2P capabilities.
I’m coming at this from a PyTorch bug, where you can find a more detailed discussion:
With nvidia-driver 535.* and 550.* I don’t have any P2P capabilities
With nvidia-driver 545.*, I do have P2P capabilities, and the NVIDIA P2P script does show improved bandwidth
The NCCL tests (NVIDIA/nccl-tests on GitHub) hang with driver 545.* and work when I disable P2P for NCCL with NCCL_P2P_DISABLE=1
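For anyone wanting to reproduce the workaround from a script, here is a minimal sketch of launching a command with NCCL_P2P_DISABLE=1 set. The helper names `nccl_env_without_p2p` and `run_without_p2p` are made up for illustration; only the environment variable itself comes from the post above.

```python
import os
import subprocess


def nccl_env_without_p2p(base_env=None):
    """Return a copy of the environment with NCCL's P2P transport disabled.

    `base_env` defaults to the current process environment. The input
    mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"
    return env


def run_without_p2p(command):
    """Run `command` (a list of arguments) with NCCL_P2P_DISABLE=1.

    Hypothetical helper: it only forwards the modified environment to the
    child process and returns the CompletedProcess.
    """
    return subprocess.run(command, env=nccl_env_without_p2p(), check=False)
```

Assuming nccl-tests was built in its default `./build` directory, something like `run_without_p2p(["./build/all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", "2"])` would run the all-reduce test on two GPUs with P2P disabled.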
There isn’t any way to know if you should have P2P capability, except for the tests provided by NVIDIA in the form of cudaDeviceCanAccessPeer().
And of course bugs are always possible. I don’t know what GPUs you have, although I can see they are cc8.9 GPUs, probably with 16GB. The lowest-level cc8.9 enterprise GPU I am familiar with is the L4, which has 24GB of memory, so I imagine these are GeForce GPUs (RTX 40-series, Ada generation, RTX 4080 perhaps). In recent years NVIDIA has not supported P2P on most GeForce GPUs that I am familiar with unless an NVLink bridge is installed. And not all GeForce GPUs support NVLink bridges.
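Since this thread started from PyTorch, here is a rough sketch of running that check for every pair of GPUs. It assumes PyTorch is installed and uses `torch.cuda.can_device_access_peer`, which wraps cudaDeviceCanAccessPeer(). The function names `peer_access_matrix` and `report_with_torch` are hypothetical. Note that this only shows what the driver reports, and a buggy driver can report P2P support that does not actually work.

```python
def peer_access_matrix(device_count, can_access_peer):
    """Query every ordered (src, dst) device pair.

    `can_access_peer` is any callable taking (src, dst) and returning a
    truthy value when the driver reports that src can access dst.
    The diagonal is None because a device is not its own peer.
    """
    matrix = []
    for src in range(device_count):
        row = []
        for dst in range(device_count):
            if src == dst:
                row.append(None)
            else:
                row.append(bool(can_access_peer(src, dst)))
        matrix.append(row)
    return matrix


def report_with_torch():
    """Build the matrix from what the installed driver reports via PyTorch."""
    import torch  # imported lazily so the helper above works without it

    count = torch.cuda.device_count()
    return peer_access_matrix(count, torch.cuda.can_device_access_peer)
```

On a two-GPU machine, `report_with_torch()` returning `[[None, False], [False, None]]` would mean the driver reports no P2P support in either direction.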
Hi, thanks for taking the time to answer.
I’ve been wondering this past week whether the problem was coming from the motherboard/PCIe configuration or from the drivers.
Indeed, the setup is:
2 GPUs: GeForce RTX 4070 Ti SUPER 16GB
Motherboard: MSI X670E Carbon WiFi
CPU: AMD Ryzen 7900X
So based on your input, it seems that there is a bug in driver 545.* that makes the GPUs report that they can do PHB P2P when they actually can’t, which explains why everything fails when I have this driver installed.
Should I report this bug somewhere?
Also, do you agree that the solution is to update my drivers to 550.* if I don’t want to wait for a new patch on the 545.* version?
Thanks for filing a bug ticket. This is a known issue to us. Your observation is correct: the P2P report was broken across the whole R545 series. It is fixed in R535, R550, and later releases.
We are sorry for the breakage, but we have no plans to backport fixes to R545. Please move forward with the R550 drivers.
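For scripts that need to guard against this, a minimal sketch of flagging the affected branch from the driver version string might look like the following. The function names are hypothetical, and the only rule encoded is the one stated above (the P2P report is unreliable on R545). The version is read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
import subprocess


def driver_branch(version):
    """Return the major branch number of a driver version string.

    For example, a version of the form "545.x.y" yields 545.
    """
    return int(version.strip().split(".")[0])


def p2p_report_is_unreliable(version):
    """True when the driver belongs to the R545 branch named in this thread."""
    return driver_branch(version) == 545


def installed_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU listed."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return output.splitlines()[0].strip()
```

A launcher could call `p2p_report_is_unreliable(installed_driver_version())` and, when it returns True, either set NCCL_P2P_DISABLE=1 or ask the user to move to R550.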