CUDA P2P crash with Threadripper

Dear all,

I’ve just installed two GTX 1080 Ti cards on a Threadripper 1950X. However, when I run the P2P benchmarks from the CUDA samples (such as simpleP2P and p2pBandwidthLatencyTest), they crash.
The crash appears to be caused by the following function call:

cudaMemcpy(g1, g0, buf_size, cudaMemcpyDefault)

g0 and g1 are allocated as follows:
float *g0;
checkCudaErrors(cudaMalloc(&g0, buf_size));
float *g1;
checkCudaErrors(cudaMalloc(&g1, buf_size));

I’ve also enabled AMD-Vi and IOMMU, but it still does not work. Does this mean that CUDA’s UVA only works on Intel platforms?
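On the UVA question, one way to see what the driver itself reports is to query each device and each device pair. The sketch below is a minimal example and not part of the CUDA samples: the CHECK macro is a local stand-in for the samples' checkCudaErrors, and the file name is made up. It prints the unifiedAddressing property of every GPU and the result of cudaDeviceCanAccessPeer for every ordered pair. A positive answer only means the driver considers peer access possible; it does not by itself show that transfers work.

```cuda
// p2p_query.cu (hypothetical file name): report UVA and peer-access
// capability as seen by the CUDA runtime.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Local stand-in for the samples' checkCudaErrors helper (an assumption).
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));

    // Per-device: does the runtime report unified virtual addressing?
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        CHECK(cudaGetDeviceProperties(&prop, i));
        printf("GPU%d: %s, unifiedAddressing=%d\n",
               i, prop.name, prop.unifiedAddressing);
    }

    // Per ordered pair: does the runtime report peer access as possible?
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            CHECK(cudaDeviceCanAccessPeer(&canAccess, i, j));
            printf("GPU%d -> GPU%d peer access: %s\n",
                   i, j, canAccess ? "yes" : "no");
        }
    }
    return 0;
}
```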

Looking forward to your help.

Sorry, I made a mistake. The code that causes the problem is:

printf("Run kernel on GPU%d, taking source data from GPU%d and writing to GPU%d...\n",
       gpuid[0], gpuid[1], gpuid[0]);
checkCudaErrors(cudaSetDevice(gpuid[0]));
SimpleKernel<<<blocks, threads>>>(g1, g0);
checkCudaErrors(cudaDeviceSynchronize());
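To separate this failure from the rest of the sample, a reduced test along these lines may help. It is a sketch under assumptions, not the code of simpleP2P: device ordinals 0 and 1, the buffer size, the scaleKernel name and the CHECK macro are all chosen here for illustration. It enables peer access in both directions, lets a kernel on GPU0 read a buffer that lives on GPU1, and then verifies the result on the host, so that a fault or corrupted data shows up as either a CUDA error or a nonzero mismatch count.

```cuda
// p2p_kernel_check.cu (hypothetical file name)
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Local stand-in for the samples' checkCudaErrors helper (an assumption).
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Illustrative kernel: reads src (peer memory here) and writes 2*src to dst.
__global__ void scaleKernel(const float *src, float *dst, size_t n) {
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (idx < n) dst[idx] = src[idx] * 2.0f;
}

int main() {
    const int dev0 = 0, dev1 = 1;          // assumed device ordinals
    const size_t n = 1 << 20;              // assumed element count
    const size_t bytes = n * sizeof(float);

    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) { printf("Need at least two GPUs.\n"); return 0; }

    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, dev0, dev1));
    CHECK(cudaDeviceCanAccessPeer(&can10, dev1, dev0));
    if (!can01 || !can10) { printf("Peer access not reported.\n"); return 0; }

    // Allocate one buffer per GPU and enable peer access both ways.
    float *g0 = nullptr, *g1 = nullptr;
    CHECK(cudaSetDevice(dev0));
    CHECK(cudaDeviceEnablePeerAccess(dev1, 0));
    CHECK(cudaMalloc(&g0, bytes));
    CHECK(cudaSetDevice(dev1));
    CHECK(cudaDeviceEnablePeerAccess(dev0, 0));
    CHECK(cudaMalloc(&g1, bytes));

    // Fill the GPU1 buffer with a known pattern from the host.
    std::vector<float> host(n);
    for (size_t i = 0; i < n; ++i) host[i] = float(i % 4096);
    CHECK(cudaMemcpy(g1, host.data(), bytes, cudaMemcpyHostToDevice));

    // Kernel runs on GPU0, reads GPU1 memory directly, writes locally.
    CHECK(cudaSetDevice(dev0));
    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    scaleKernel<<<blocks, threads>>>(g1, g0, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Copy the result back and compare against the expected values.
    std::vector<float> result(n);
    CHECK(cudaMemcpy(result.data(), g0, bytes, cudaMemcpyDeviceToHost));
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (result[i] != host[i] * 2.0f) ++bad;
    printf("%zu of %zu elements mismatched\n", bad, n);

    CHECK(cudaFree(g0));
    CHECK(cudaSetDevice(dev1));
    CHECK(cudaFree(g1));
    return bad == 0 ? 0 : 1;
}
```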

Similar problem here. We replaced a few of our old Intel Xeon nodes with Threadripper 1920 systems with two Titan X GPUs each. P2P transfers in MXNet fail. Syslog shows thousands of entries like this:

Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000.... flags=0x0030]

Both GPUs stay at 100% utilization permanently. If I run a separate task on each GPU (no P2P), there are no issues.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 32%   71C    P2   104W / 250W |   1157MiB / 12207MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 00000000:41:00.0  On |                  N/A |
| 37%   77C    P2   110W / 250W |   1171MiB / 12202MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

I’ve solved the problem and posted the fix in another thread with the same title. The solution is to disable IOMMU in the BIOS settings. NVIDIA has its own memory management mechanism.

Thank you, great advice! I can confirm that this solves the problem on an ASUS X399-A mainboard.

The respective setting can be found under:

Advanced => AMD CBS => NBIO Common Options => IOMMU Configuration
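If the machines run Linux (the syslog entries above suggest so, but that is an assumption), the kernel exposes its IOMMU groups under /sys/kernel/iommu_groups, which gives a quick way to confirm whether the BIOS change took effect. The helper below is a small host-only sketch in C++17; the function name countIommuGroups is hypothetical and it makes no CUDA calls.

```cuda
// Host-only helper, no GPU required.
#include <cstddef>
#include <filesystem>
#include <string>
#include <system_error>

// Counts the entries in an IOMMU-groups directory.
// Returns 0 when the directory is empty, missing, or unreadable.
std::size_t countIommuGroups(
        const std::string &root = "/sys/kernel/iommu_groups") {
    namespace fs = std::filesystem;
    std::error_code ec;
    std::size_t groups = 0;
    // The error_code overloads avoid exceptions if the path does not exist.
    for (fs::directory_iterator it(root, ec), end;
         !ec && it != end; it.increment(ec)) {
        ++groups;
    }
    return groups;
}
```

A nonzero result from countIommuGroups() suggests that an IOMMU is active; with the IOMMU disabled in the BIOS the directory is expected to be empty or absent, and the function returns 0. A kernel booted in passthrough mode may still list groups, so the count is a hint rather than proof.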

It’s my pleasure. Glad to hear that you’ve solved it.