P2P Transfer not working on 2 Tesla C2070 with PEX8647 switch

I am using a system with 2 Tesla C2070 under the same IOH. There are PEX8647 switches between the IOH and the GPUs. The diagram of the system is attached.

Strangely, the simpleP2P SDK example does not work. It fails with the error message: simpleP2P.cu(154) : cudaSafeCall() Runtime API error : invalid argument.

Checking the source code, the error comes from this line: cutilSafeCall(cudaDeviceEnablePeerAccess(gpuid_tesla[1], gpuid_tesla[0]));

Does the failure have anything to do with the PEX8647 switch? Should P2P work in this configuration?

The CUDA driver is from the 4.0 RC2 release.

Thanks.

What is the output of “/sbin/lspci -tv |grep -y nvidia”?

| | -08.0-[0000:87]--+-00.0 nVidia Corporation Unknown device 06d1
| |                 \-00.1 nVidia Corporation Unknown device 0be5
| | -08.0-[0000:83]--+-00.0 nVidia Corporation Unknown device 06d1
| |                 \-00.1 nVidia Corporation Unknown device 0be5
         |                                         \-08.0-[0000:0c]--+-00.0  nVidia Corporation Unknown device 06d1
         |                                                           \-00.1  nVidia Corporation Unknown device 0be5

Actually there is another C2070 connected to the other IOH. But the P2P test was run on the 2 C2070 under the same IOH. I am not sure if these 2 C2070 are connected to the same PEX8647 switch.

It is possible that you are selecting two GPUs under two different IOHs.
The device order in the driver and the order in the lspci output may be different.
nvidia-smi -q should give you the PCI slot information for each card.

Could you try to play with different combinations of devices to see if you find a working one?
Before starting the test, set the variable CUDA_VISIBLE_DEVICES.

If you have 4 cards, these should cover all the possible combinations:
export CUDA_VISIBLE_DEVICES=0,1
export CUDA_VISIBLE_DEVICES=0,2
export CUDA_VISIBLE_DEVICES=0,3
export CUDA_VISIBLE_DEVICES=1,2
export CUDA_VISIBLE_DEVICES=1,3
export CUDA_VISIBLE_DEVICES=2,3
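
As a complement to trying each pair by hand, here is a minimal sketch that asks the CUDA runtime which pairs it considers P2P-capable and prints the PCI bus and device number of each GPU, so the CUDA ordering can be matched against the lspci output. It assumes CUDA 4.0 or later and should be run without CUDA_VISIBLE_DEVICES set so that all cards are enumerated. The file name and build line are only illustrative.

```cuda
// Build (assumed): nvcc -o p2p_probe p2p_probe.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print the PCI location of every CUDA device so that the CUDA
    // device index can be compared with the bus numbers shown by lspci.
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, d);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties(%d): %s\n",
                    d, cudaGetErrorString(err));
            return 1;
        }
        printf("device %d: %s, PCI bus 0x%02x, PCI device 0x%02x\n",
               d, prop.name, prop.pciBusID, prop.pciDeviceID);
    }

    // For every ordered pair (a, b), ask whether device a can map
    // memory that lives on device b.
    for (int a = 0; a < n; ++a) {
        for (int b = 0; b < n; ++b) {
            if (a == b) continue;
            int can = 0;
            err = cudaDeviceCanAccessPeer(&can, a, b);
            if (err != cudaSuccess) {
                fprintf(stderr, "cudaDeviceCanAccessPeer(%d, %d): %s\n",
                        a, b, cudaGetErrorString(err));
                return 1;
            }
            printf("device %d -> device %d: peer access %s\n",
                   a, b, can ? "possible" : "not possible");
        }
    }
    return 0;
}
```

A pair reported as "not possible" is not worth testing with simpleP2P. A pair reported as "possible" can still be exercised with the CUDA_VISIBLE_DEVICES settings above.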

One more question, what is the server model?

Using the following commands:

numactl --cpunodebind=0 --membind=0 ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest --device=0 --memory=pinned

numactl --cpunodebind=1 --membind=1 ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest --device=1 --memory=pinned

numactl --cpunodebind=1 --membind=1 ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest --device=2 --memory=pinned

I got the same H2D and D2H bandwidth on all 3 GPUs, so my conclusion was that GPU 1 and 2 are under the same IOH. Do you think this is incorrect?

I tried the P2P example on GPUs 0 and 1, but it hung and the process could not be killed. Only a reboot fixes that. The last output of the P2P example is “Creating event handles…”.

As I mentioned, P2P on GPU 1 and 2 failed with “invalid argument” while enabling peer access.

I am not sure what the server model is, but it is from Novatte.

We got it working.

Apparently the second parameter in this function call, cutilSafeCall(cudaDeviceEnablePeerAccess(gpuid_tesla[1], gpuid_tesla[0])), is a flags argument that should be 0, not gpuid_tesla[0]. The error was thrown because I had set gpuid_tesla[0] to 1. It’s a bug in the SDK example.
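
For anyone hitting the same error, here is a minimal sketch of the corrected pattern, not the SDK's own code. The helper name enablePeer and the device indices 0 and 1 are assumptions for illustration. The device that gains access is chosen with cudaSetDevice, the first argument to cudaDeviceEnablePeerAccess is the peer device, and the second is the flags value, which must be 0.

```cuda
// Build (assumed): nvcc -o enable_p2p enable_p2p.cu
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: let device `dev` access memory on device `peer`.
// Returns true on success.
static bool enablePeer(int dev, int peer) {
    int can = 0;
    cudaError_t err = cudaDeviceCanAccessPeer(&can, dev, peer);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceCanAccessPeer: %s\n", cudaGetErrorString(err));
        return false;
    }
    if (!can) {
        fprintf(stderr, "device %d cannot access device %d\n", dev, peer);
        return false;
    }

    // Peer access is enabled for the *current* device, so select it first.
    err = cudaSetDevice(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d): %s\n", dev, cudaGetErrorString(err));
        return false;
    }

    // Second argument is the flags word, not a device index; it must be 0.
    err = cudaDeviceEnablePeerAccess(peer, 0);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError();  // clear the error state; access is already on
        return true;
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceEnablePeerAccess(%d, 0): %s\n",
                peer, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    const int gpu0 = 0;  // assumed device indices; adjust to your system
    const int gpu1 = 1;

    // Enable access in both directions, as simpleP2P does.
    bool ok = enablePeer(gpu0, gpu1) && enablePeer(gpu1, gpu0);
    printf("peer access between %d and %d: %s\n",
           gpu0, gpu1, ok ? "enabled" : "failed");
    return ok ? 0 : 1;
}
```

With the original SDK line, the call only succeeds when gpuid_tesla[0] happens to be 0, which is why the bug does not show up on the default device selection.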

Thanks!