P2P access not enabled, is this a software or a hardware issue?

Hello Forum,

I’m just starting to get my feet wet with multi-GPU. I’m running GNU/Linux x86_64 (Ubuntu 14.04 LTS).

I have installed the latest version of the CUDA Toolkit (CUDA 7.5).

I attempted to run the simpleP2P sample from the CUDA samples and I’m getting the following:

/usr/local/cuda-7.5/samples/bin/x86_64/linux/release$ ./simpleP2P 
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "     Tesla K20c" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "    Tesla C2070" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla K20c (GPU0) -> Tesla C2070 (GPU1) : No
> Peer access from Tesla C2070 (GPU1) -> Tesla K20c (GPU0) : No
Two or more GPUs with SM 2.0 or higher capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

The GPUs are both SM 2.0 or higher, so I guess that’s not the problem. This is a partial output from the deviceQuery sample:

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K20c"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
...
Device 1: "Tesla C2070"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    2.0

I’ve also run the simpleMultiGPU sample and it fails, which I assume is due to the missing P2P access. Is my assumption correct?

/usr/local/cuda-7.5/samples/bin/x86_64/linux/release$ ./simpleMultiGPU 
Starting simpleMultiGPU
CUDA-capable device count: 2
Generating input data...

CUDA error at simpleMultiGPU.cu:121 code=2(cudaErrorMemoryAllocation) "cudaStreamCreate(&plan[i].stream)"

Another developer experiencing a similar problem also posted the following diagnostics:

  • Checking whether the GPU cards share the same PCI-E root:
$ lspci | grep NVIDIA
03:00.0 VGA compatible controller: NVIDIA Corporation GF100GL [Tesla C2050 / C2070] (rev a3)
03:00.1 Audio device: NVIDIA Corporation GF100 High Definition Audio Controller (rev a1)
04:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c] (rev a1)

Relevant output from lspci -t:

\-[0000:00]-+-00.0
             +-01.0-[01-02]----00.0-[02]--
             +-03.0-[03]--+-00.0
             |            \-00.1
             +-07.0-[04]----00.0

In addition, here is the topology information reported by the nvidia-smi tool:

$ nvidia-smi topo -m
	GPU0	GPU1	CPU Affinity
GPU0	 X 	PHB	
GPU1	PHB	 X 	0-5

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

Peer to Peer access generally requires the GPUs to be of the same architectural generation. So a cc2.0 and a cc3.5 GPU, while each individually capable of P2P, cannot participate in P2P transactions with each other. The architectural issues here are not fully specified in the documentation, so your only recourse is to use the result of cudaDeviceCanAccessPeer() to assess the viability of this feature, as indicated in the documentation:

[url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#peer-to-peer-memory-access[/url]
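
For reference, a minimal sketch of how that check can be done programmatically (this roughly mirrors what simpleP2P does internally; error checking is omitted for brevity):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    // Query peer access capability for every ordered pair of devices.
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("Peer access from device %d -> device %d : %s\n",
                   i, j, canAccess ? "Yes" : "No");
        }
    }
    return 0;
}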

(As an aside, posting “diagnostics” run by another developer, if they were run on a different system, may have no bearing on the issue you are reporting. The topology investigation needs to be performed on the system in question for it to have any relevance.)

The failure of simpleMultiGPU is a separate issue; it does not depend on P2P access and I can successfully run that sample on a CUDA 7.5 system with a cc2.0 and a cc3.5 GPU. It’s a rather curious error so I’m at a loss to speculate what may be causing it. You might try rebooting the system and see if that error persists, and also run other codes like vectorAdd to confirm proper operation of your CUDA install.
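
For a quick sanity check beyond the shipped samples, something along these lines (just a sketch I’m typing here, not one of the samples) will attempt a small allocation on every visible device and report any failure:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceProp props;
        cudaGetDeviceProperties(&props, dev);

        // Try a small (1 MiB) allocation on this device and free it again.
        void *d_ptr = NULL;
        cudaError_t err = cudaMalloc(&d_ptr, 1 << 20);
        printf("Device %d (%s): cudaMalloc %s\n",
               dev, props.name,
               err == cudaSuccess ? "OK" : cudaGetErrorString(err));
        if (err == cudaSuccess) cudaFree(d_ptr);
    }
    return 0;
}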

Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or GPU family does not necessarily indicate it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support are the tools provided that query the runtime via cudaDeviceCanAccessPeer. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.

Hello txbob,

Thank you for clarifying the requirement of matching GPU architectures. (I also posted on Stack Overflow about another system where I use the same GPUs: http://stackoverflow.com/questions/33563326/p2p-memory-access-fail-while-running-multi-gpu-cuda-sample-simplep2p)

Yes, the topology is from my own system. What I meant is that another developer had posted here before about their problems running the simpleP2P sample, and those were the “diagnostics” they used, so I ran the same ones on my machine.

Since you mention that you would like to see how vectorAdd behaves in order to confirm the proper operation of my CUDA install, I’m copying the output I got from that sample, first with the cc 2.0 device and then with the cc 3.5 one.

First, here are the GPU IDs as reported by nvidia-smi:

$ nvidia-smi
Fri Nov  6 12:05:54 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla C2070         Off  | 0000:03:00.0      On |                  144 |
| 44%   88C    P8    N/A /  N/A |    465MiB /  5367MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:04:00.0     Off |                    0 |
| 30%   35C    P8    14W / 225W |     13MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

And here are the vectorAdd runs for both of them:

$ CUDA_VISIBLE_DEVICES=0 ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

$ CUDA_VISIBLE_DEVICES=1 ./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code out of memory)!

With the last one the program complains about failing to allocate memory. Just before performing these runs, I rebooted the system. Previously, I hadn’t experienced any problems running my own CUDA kernels on either of these GPUs.

Many thanks!!

If vectorAdd won’t run properly, that also suggests that the simpleMultiGPU test is not likely to work correctly.

In fact, I would guess that CUDA_VISIBLE_DEVICES=1 is selecting the C2070, not the K20c.

Furthermore, the C2070 is indicating 144 uncorrectable ECC errors in your nvidia-smi output.

My guess is that there is something wrong with the C2070 device.
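
One way to confirm the mapping is to print each visible device’s name and PCI bus ID and compare against the nvidia-smi listing (a quick sketch, nothing more):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    // Print the runtime's enumeration order alongside each device's PCI bus ID,
    // so it can be compared against the ordering shown by nvidia-smi.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp props;
        cudaGetDeviceProperties(&props, dev);
        printf("CUDA device %d: %s (cc %d.%d, PCI %02x:%02x.0)\n",
               dev, props.name, props.major, props.minor,
               props.pciBusID, props.pciDeviceID);
    }
    return 0;
}

By default the runtime enumerates devices “fastest first”, whereas nvidia-smi lists them by PCI bus ID, so the two numberings don’t have to agree. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID in the environment (CUDA 7.0 and later) makes the runtime follow the PCI ordering as well.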

Actually, that posting is different in at least two ways:

  1. You are not using the same GPUs (this thread involves a K20c and a C2070, the SO posting involves two K20c devices).

  2. The system topologies are different.

The SO system is restricted by the system topology (the socket-level link prevents P2P). The system in this thread is not restricted by topology (PHB does not necessarily indicate that P2P is not possible.)

Hello txbob,

You’re right. CUDA_VISIBLE_DEVICES=1 is picking up the C2070. I added these lines to the vectorAdd sample in order to confirm:

    // Verify which device is running
    cudaDeviceProp props;
    int device;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&props, device);
    printf("GPU Device %s : with compute capability %d.%d \n",
           props.name, props.major, props.minor);

And after rerunning the sample using the CUDA_VISIBLE_DEVICES variable, I got:

$ CUDA_VISIBLE_DEVICES=0 ./vectorAdd 
GPU Device Tesla K20c : with compute capability 3.5 
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

$ CUDA_VISIBLE_DEVICES=1 ./vectorAdd 
GPU Device Tesla C2070 : with compute capability 2.0 
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code out of memory)!

Is there any software tool I can use to test the device further?

Thanks!

You can study the usage of nvidia-smi, which gives some options for inspection and modification of the device.

The GPU Deployment Kit also has a healthmon utility, which is more involved to use.
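
From the command line, nvidia-smi -q -d ECC prints the detailed ECC counters. If you prefer to query them programmatically, the NVML library (which nvidia-smi is built on top of) exposes the same counters. A rough sketch, assuming nvml.h from the GPU Deployment Kit and libnvidia-ml (-lnvidia-ml) are available on your system:

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    // Initialize NVML and query the volatile uncorrected ECC count on GPU 0.
    if (nvmlInit() != NVML_SUCCESS) {
        printf("Failed to initialize NVML\n");
        return 1;
    }

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        unsigned long long ecc = 0;
        if (nvmlDeviceGetTotalEccErrors(dev,
                                        NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                        NVML_VOLATILE_ECC,
                                        &ecc) == NVML_SUCCESS)
            printf("GPU 0 volatile uncorrected ECC errors: %llu\n", ecc);
        else
            printf("ECC query not supported on this device\n");
    }

    nvmlShutdown();
    return 0;
}

Note that NVML numbers devices in PCI bus order, the same order nvidia-smi uses, so index 0 here corresponds to GPU 0 in your nvidia-smi listing (the C2070).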

I think it’s very likely that any Tesla C2070 at this point would be out of warranty. If a system reboot does not restore any functionality, you might want to discard it like you would a disk drive that had gone bad.

This GPU had been working fine until recently, when I updated to the latest CUDA Toolkit. Rebooting was not making the problem go away. In the end I found this report on Stack Overflow, http://stackoverflow.com/questions/12295768/disabled-ecc-support-for-tesla-c2070-and-ubuntu-12-04, from another developer with a dual-GPU configuration who was experiencing ECC errors with a Tesla C2070.

The way he solved it was by switching the primary display setting in the BIOS configuration (changing the primary display to be the motherboard’s on-board VGA). I don’t have that option in my BIOS; instead, I just changed the order in which the VGA controllers are initialized, and that solved the problem.

Now both the vectorAdd and the simpleMultiGPU code samples are working.

$ ./simpleMultiGPU 
Starting simpleMultiGPU
CUDA-capable device count: 2
Generating input data...

Computing with 2 GPUs...
  GPU Processing time: 14.985000 (ms)

Computing with Host CPU...

Comparing GPU and Host CPU results...
  GPU sum: 16777280.000000
  CPU sum: 16777294.395033
  Relative difference: 8.580068E-07

Also, I no longer find ECC errors when using nvidia-smi:

$ nvidia-smi 
Tue Nov 10 12:54:54 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla C2070         Off  | 0000:03:00.0      On |                    0 |
| 45%   88C    P8    N/A /  N/A |    305MiB /  5367MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:04:00.0     Off |                    0 |
| 30%   35C    P8    14W / 225W |     13MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Big thanks to txbob for his help and suggestions throughout this long debugging session.