CUDA peer to peer example ./simpleP2P failing

Hello everyone,

I am having the following problem with my multi-GPU setup. The CUDA simpleP2P does not work:

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = " Tesla K40c" IS capable of Peer-to-Peer (P2P)
> GPU1 = " Tesla K40c" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer-to-Peer (P2P) access from Tesla K40c (GPU0) -> Tesla K40c (GPU1) : No
> Peer-to-Peer (P2P) access from Tesla K40c (GPU1) -> Tesla K40c (GPU0) : No
Two or more GPUs with SM 2.0 or higher capability are required for ./simpleP2P.
Peer to Peer access is not available between GPU0 <-> GPU1, waiving test.

I have checked whether the two cards are on the same PCI-E root, and I believe they are:

$ lspci | grep NVIDIA
81:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1)
82:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1)

$ lspci -t
..
..
+-[0000:80]-+-02.0-[81]----00.0
|           +-03.0-[82]----00.0
|           +-05.0
|           +-05.1
|           +-05.2
|           \-05.4
...
...

Also, nvidia-smi tells me that the GPUs communicate through a PCIe host bridge:

$ nvidia-smi topo -m

            GPU0        GPU1       CPU Affinity
GPU0          X         PHB        8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31
GPU1        PHB           X        8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31

Legend:

    X = Self
    SOC = Path traverses a socket-level link (e.g. QPI)
    PHB = Path traverses a PCIe host bridge
    PXB = Path traverses multiple PCIe internal switches
    PIX = Path traverses a PCIe internal switch

I am running the system on Fedora 20 with the official CUDA installation. All other examples work. Any ideas why peer-to-peer memory access does not work?

look at the smi output, and note the meaning of the phb legend

and look at the output of the program when it starts - it detects 2 separate gpus that allow p2p, but finds that the 2 gpus can not engage in p2p with respect to each other

you should be able to tell from the schematics of the motherboard whether the gpus are slotted on the same root

you can probably also look at the sample source, to determine how the pre-tests are conducted, to further know why the sample is aborting

Hi little_jimmy,

But there is something I don't get. If I am not mistaken, the output of ‘lspci -t’ says that they do indeed share the same PCI-E root?

my point being: there is the human definition of a root complex, and the machine definition thereof
in many cases i have found it useful to note the machine definition of x
evidently, from your sample output, cuda feels all is not well, such that it preemptively aborts the sample program
in this case, how does cuda determine if all is well? look at the source code

i shall see if i can determine from the source, how cuda determines if a and b are on the same root complex

cuda calls cudaDeviceCanAccessPeer() to determine whether a can access b, according to simpleP2P.cu

one really wishes to know what is inside cudaDeviceCanAccessPeer() to truly know how cuda, etc sees/ defines ‘on the same root complex’

you are not running a 32 bit application, are you? i see p2p is disabled for 32 bit

i almost want to say p2p hates sli too; i am not sure whether you truly have ‘multiple’ devices with sli, even though you have multiple devices

I am not using SLI, and the ./simpleP2P application should have been compiled for 64-bit; I just ran ‘make’ on the NVIDIA samples.
About cudaDeviceCanAccessPeer(): yes, that is the function that answers in the negative.

The computer is a 2U rackmount server with two CPU sockets. It can handle up to 4 GPUs.
Each socket handles 2 GPUs; they go in pairs, two on each side.
Since I have 2 GPUs, I put both on the same side, so they are handled by the same CPU socket.
I could try putting one GPU on each side, but that would mean they would be handled by different CPU sockets, which I doubt will help. What do you think?

probably not

i think the take-away at this point is that p2p can ‘fail’/ be prohibited, based on a number of factors/ conditions, including, but not limited to the devices being/ not being on the same root complex
thus, even if the devices are on the same root complex, p2p may still fail/ be prohibited, for other reasons

i still wish to know how the driver - or whoever responsible - determines whether 2 devices are on the same root complex; it must make some kernel call or something, i do not know
i do not know if it is possible to peek at cudaDeviceCanAccessPeer(), and i can not find the time to look into this right now; i shall try a bit later

Yes, you are probably right. In the past I was using both GPUs in a gaming system running the Arch Linux distro, and simpleP2P got stuck in a very suspicious way. For that reason, when we bought this new server I thought it would be a nice opportunity to install Fedora 20, which is officially supported by CUDA, and be able to use P2P. But unfortunately the result was negative :(

Let's hope we can find the reason why it is failing. If I find anything useful I will post back.
Thanks for the help, little_jimmy.

cudaDeviceCanAccessPeer() returns a value/ error code

can you perhaps grab this?

use the debugger and step the sample to catch the value returned

if the sample as-is would not allow this, you may need to copy it, and add a few lines such that you can catch the value returned

cudaDeviceCanAccessPeer() seems to be merely a wrapper function, but i need more time
it would already be helpful just to know which functions it is calling

here are the disassembly, control flow and pseudo code of cudaDeviceCanAccessPeer()

unfortunately, it is less comprehensible than i hoped for

the function makes a number of lower level calls too

hence, i am not sure whether it only tests for ‘on the same root complex’, or how it tests for ‘on the same root complex’

a number of new points have come to my attention:

the hardware must support bus mastering - the mother board, etc

bus mastering must be enabled for the particular device; i note cudaDeviceEnablePeerAccess(), but i do not know whether it is necessary to ensure p2p is enabled for the device, prior to calling cudaDeviceCanAccessPeer(), and/ or whether the sample takes care of cudaDeviceEnablePeerAccess()

seemingly, the OS may block bus mastering too, if it views this as in its best interest; hence, you should probably check whether your OS allows/ has prohibited bus mastering
cudadevicecanaccesspeer.txt (5.44 KB)
cudaDeviceCanAccessPeer_pseudo.txt (1.88 KB)
cudadevicecanaccesspeer_control_flow.pdf (20.1 KB)

Hi little_jimmy.

Finally the problem has been solved. I moved the GPUs to PCI-E slots 1 and 2 and after a reboot it worked.

What I believe is that the motherboard plus the OS (Fedora) is configured so that the PCI-E slots must be filled in order. That is the only explanation I have for why it works now and did not before.

Thank you a lot for the help and care you took in this matter! I hope this post will be helpful for other people who may have the same problem in the future.

i am glad you got that baby to bed

i note one can enable/ disable bus mastering in bios

also, lspci -vb seems to reveal whether the bus master flag is up/ down for a device, and also shows how the devices are linked on the pci bus, making it rather easy to note whether a ‘direct link’ (bus mastering capable link) exists/ can exist

are the pci slots of the motherboard all the same? perhaps some are x16, x8, x4, etc.
i do not know whether this too may have an impact; perhaps