Direct GPU <-> GPU communication does not seem to work properly

Hello there!

I’m reaching out because I’m having trouble understanding whether my system should have P2P capabilities.
I’m coming at this from a PyTorch bug report, where you can find a more detailed discussion:

Summary so far:

Running device_to_device_memcpy_read_ce.
 Invalid value when checking the pattern at <0x7fefec000000>
 Current offset [ 0/67108864]

So the data is not copied properly, but driver 545.* still reports that P2P is available.
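The failure above can be reproduced from PyTorch with a copy-and-compare check. This is a minimal sketch, not the test that produced the output above: the function name `check_gpu_to_gpu_copy` and the pattern size are made up for illustration, and whether PyTorch takes the direct P2P path for the copy depends on PyTorch and on what the driver reports.

```python
from typing import Optional


def check_gpu_to_gpu_copy(src: int = 0, dst: int = 1,
                          num_elements: int = 1 << 20) -> Optional[bool]:
    """Copy a known pattern from GPU `src` to GPU `dst` and verify it.

    Returns True if the data arrived intact, False if it was corrupted,
    and None if the check could not run (no PyTorch, or too few GPUs).
    The function name and defaults are hypothetical, for illustration only.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available() or torch.cuda.device_count() <= max(src, dst):
        return None

    # Known pattern on the source GPU.
    pattern = torch.arange(num_elements, dtype=torch.int32, device=f"cuda:{src}")
    # Device-to-device copy; PyTorch decides whether to use P2P for this.
    copied = pattern.to(f"cuda:{dst}")
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    # Compare on the host so the comparison itself does not rely on P2P.
    return torch.equal(pattern.cpu(), copied.cpu())


if __name__ == "__main__":
    result = check_gpu_to_gpu_copy()
    if result is None:
        print("Check skipped: PyTorch with at least two CUDA GPUs is required.")
    elif result:
        print("GPU0 -> GPU1 copy is intact.")
    else:
        print("GPU0 -> GPU1 copy is corrupted.")
```

A `False` result alongside a driver that reports P2P as available would match the symptom described here.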

Is there a way to know, in principle, whether I should have P2P capabilities? Did I miss anything obvious to test?

There isn’t any way to know if you should have P2P capability, except for the tests provided by NVIDIA in the form of cudaDeviceCanAccessPeer().
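For anyone coming from the PyTorch side, the same query is exposed as `torch.cuda.can_device_access_peer`. The sketch below prints what the driver reports for every pair of GPUs; the helper name `peer_access_matrix` is made up for illustration. Note that this only shows what the driver claims, so on a driver with a broken P2P report the answer itself may be wrong.

```python
from typing import Callable, List


def peer_access_matrix(num_devices: int,
                       can_access: Callable[[int, int], bool]) -> List[List[bool]]:
    """Return matrix[i][j] = True when device i reports it can access device j.

    The diagonal is always False: a device is not its own peer.
    (Hypothetical helper, named for illustration only.)
    """
    return [
        [i != j and bool(can_access(i, j)) for j in range(num_devices)]
        for i in range(num_devices)
    ]


def main() -> None:
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; nothing to query.")
        return
    n = torch.cuda.device_count()
    if n < 2:
        print(f"Found {n} CUDA device(s); at least two are needed for P2P.")
        return
    # torch.cuda.can_device_access_peer(device, peer_device) returns what
    # the driver reports for that pair.
    matrix = peer_access_matrix(n, torch.cuda.can_device_access_peer)
    for i, row in enumerate(matrix):
        for j, ok in enumerate(row):
            if i != j:
                print(f"GPU{i} -> GPU{j}: {'P2P reported' if ok else 'no P2P'}")


if __name__ == "__main__":
    main()
```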

And of course bugs are always possible. I don’t know what GPUs you have although I can see they are cc8.9 GPUs with probably 16GB. The lowest level cc8.9 enterprise GPU I am familiar with is the L4, which has 24GB of memory, so I imagine these are GeForce GPUs (RTX 40-series, Ada generation, RTX 4080 perhaps). In recent years NVIDIA has not supported P2P on most GeForce GPUs that I am familiar with unless an NVLink bridge is installed. And not all GeForce GPUs support NVLink bridges.

From the OP’s first link:
_CudaDeviceProperties(name='NVIDIA Graphics Device', major=8, minor=9, total_memory=15868MB, multi_processor_count=66)

So possibly RTX 4070 Ti Super.

Yep, agreed, based on the SM count.

Hi, thanks for taking the time to answer.
I’ve been wondering this past week whether the problem was coming from the motherboard/PCIe configuration or the drivers.

Indeed, the setup is:

  • 2 × GeForce RTX 4070 Ti SUPER 16 GB
  • Motherboard MSI X670E Carbon WiFi
  • CPU AMD Ryzen 7900X

So based on your input, it seems there is a bug in driver 545.* that makes the GPUs believe they can do PHB P2P when they actually can’t, which explains why everything fails when this driver is installed.
Should I report this bug somewhere?

Also, do you agree that the solution is then to update my drivers to 550.* to fix this issue if I don’t want to wait for a new patch on the 545.* version?

I suspect it is probably already fixed if the R550 drivers do not show the issue. It’s possible that a later R545 driver also has the fix.

Anyone can file a bug if they wish at any time. The instructions are linked to a sticky post at the top of this sub-forum.

Yes, if that were my system, I would use the R550 driver.

Perfect. Thanks for your help!

[Public] Hi Morgan,

Thanks for filing a bug ticket. This is a known issue. Your observation is correct: the P2P report was broken across the entire R545 series. It is fixed in R535, R550, and later releases.
We are sorry for the breakage, but we have no plans to backport the fix to R545. Please move forward with the R550 drivers.

Hope this helps.

Best,
Yuki
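Based on the reply above, a script can flag whether the installed driver belongs to the affected branch. This is a rough sketch: the function names are made up for illustration, the only rule encoded is "branch 545 is affected" as stated in the reply, and the driver version is read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
import subprocess
from typing import Optional

# Per the NVIDIA reply above: the P2P report is broken across R545,
# while R535, R550 and later releases are fine.
AFFECTED_BRANCH = 545


def driver_branch(version: str) -> int:
    """Return the branch number of a driver version string (text before the first dot)."""
    return int(version.strip().split(".")[0])


def p2p_report_is_suspect(version: str) -> bool:
    """True when the driver belongs to the branch whose P2P report is broken."""
    return driver_branch(version) == AFFECTED_BRANCH


def installed_driver_version() -> Optional[str]:
    """Ask nvidia-smi for the driver version; None if it cannot be queried."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = out.strip().splitlines()  # one line per GPU; all share one driver
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("Could not query the driver version with nvidia-smi.")
    elif p2p_report_is_suspect(version):
        print(f"Driver {version} is in the R545 branch: do not trust its P2P report.")
    else:
        print(f"Driver {version} is outside the R545 branch.")
```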
