Multi-GPU Peer to Peer access failing on Tesla K80

I apologize for dropping the ball here. I'm not sure why I didn't see your last two updates.

If you are still pursuing this, I would first like to point out that the only permanent fix is an updated BIOS from the system vendor. Supermicro is certainly aware of the underlying issue, as they have applied the necessary fix via BIOS update to some of their other GPU-enabled products.

Anyway, to proceed, you would need to disable ACS on the motherboard and then re-run the simpleP2P test (without rebooting). The commands to disable ACS would be:

setpci -s 83:08.0 f2a.w=0000
setpci -s 83:10.0 f2a.w=0000

Note that writing PCI configuration registers with setpci requires root privilege, so run these commands as root or prefix them with sudo. You can then verify the changed settings by re-running the earlier lspci commands:

sudo lspci -s 83:08.0 -vvvv | grep -i acs
sudo lspci -s 83:10.0 -vvvv | grep -i acs

For each device, you should then see an ACSCtl line with every setting negative (each flag followed by a minus sign):

ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
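If you would rather not check the flags by eye, the verification can be sketched as a small shell helper. This is only a minimal sketch under a few assumptions: the function names acs_all_disabled and report_acs are made up for illustration (they are not part of pciutils), and the check simply treats any "+" on the ACSCtl line as "some ACS feature is still enabled".

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of pciutils).

# acs_all_disabled LINE
#   returns 0 if LINE is an ACSCtl line with no enabled ("+") flags,
#   returns 1 if at least one flag is still enabled,
#   returns 2 if LINE is not an ACSCtl line at all.
acs_all_disabled() {
    case "$1" in
        *ACSCtl:*) ;;
        *) return 2 ;;
    esac
    case "$1" in
        *+*) return 1 ;;
        *)   return 0 ;;
    esac
}

# report_acs DEVICE...
#   runs lspci for each device address and reports the ACS state.
#   Needs root so that lspci can read the extended capabilities.
report_acs() {
    for dev in "$@"; do
        line=$(lspci -s "$dev" -vvvv | grep 'ACSCtl:') || line=""
        if acs_all_disabled "$line"; then
            echo "$dev: all ACS features disabled"
        else
            echo "$dev: ACS still enabled or not reported: $line"
        fi
    done
}
```

After sourcing the file as root, the two devices from this thread could be checked with: report_acs 83:08.0 83:10.0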

Once both devices show ACS disabled, re-run the simpleP2P test to see whether the verification errors are still reported.
This change is not permanent: a reboot will restore the machine to its previous state.

If this fixes the issue, then you can leave it as-is if you wish, or else provide this information to your system vendor. They can advise you on the availability of an SBIOS with such a fix.

If it does not fix the issue, I am out of ideas, and it is probably best to refer back to your system vendor.

Regarding your additional question, I suspect that if you used two Tesla K40c GPUs, for example, you would not see this issue, and P2P transfers would “just work”, but I am just guessing on that. You would have to try it to be sure.

A related Supermicro FAQ entry is here:

http://www.supermicro.com/support/faqs/faq.cfm?faq=20732