CUDA P2P crash with Threadripper

Dear all,

I’ve just installed two GTX 1080 Ti cards on a Threadripper 1950X system. However, when I run the P2P benchmarks from the CUDA samples (such as simpleP2P and p2pBandwidthLatencyTest), they crash.
The crash appears to be caused by the following function call:

cudaMemcpy(g1, g0, buf_size, cudaMemcpyDefault)

where g0 and g1 are defined as:
float *g0;
checkCudaErrors(cudaMalloc(&g0, buf_size));
float *g1;
checkCudaErrors(cudaMalloc(&g1, buf_size));

I’ve also enabled AMD-Vi and the IOMMU, but it still does not work. Does this mean that CUDA’s UVA only works on Intel platforms?

Looking forward to your help.

Sorry, I made a mistake. The code that causes the problem is:

printf("Run kernel on GPU%d, taking source data from GPU%d and writing to GPU%d...\n",
gpuid[0], gpuid[1], gpuid[0]);
SimpleKernel<<<blocks, threads>>>(g1, g0);

Not sure what you mean by “crash”.
Do you mean an error is indicated? If so, what error? (Just paste the actual output.)

Anyway, P2P is generally based on a “whitelist” mechanism. If your platform is not in the whitelist, the driver will not enable P2P support.

It’s entirely possible that the driver doesn’t recognize your motherboard.
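
Whether the driver has enabled P2P for a given pair of GPUs can be queried at runtime instead of guessed. Here is a minimal sketch, assuming the CUDA runtime API and that the GPUs of interest are visible to the process. It only prints what `cudaDeviceCanAccessPeer` reports for each ordered pair and performs no transfers:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Ask the driver, for every ordered pair (i, j), whether device i
    // is allowed to access memory that lives on device j.
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int can_access = 0;
            err = cudaDeviceCanAccessPeer(&can_access, i, j);
            if (err != cudaSuccess) {
                fprintf(stderr, "cudaDeviceCanAccessPeer(%d, %d): %s\n",
                        i, j, cudaGetErrorString(err));
                return 1;
            }
            printf("GPU%d -> GPU%d: P2P %s\n", i, j,
                   can_access ? "supported" : "not supported");
        }
    }
    return 0;
}
```

If a pair is reported as not supported, the driver has not enabled P2P for that pair, and a kernel on one GPU that dereferences a pointer allocated on the other should not be expected to work.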

Well, I solved this problem.

By “crash” I meant that the system became completely unresponsive. Sometimes the kernel would also report errors like (NMI watchdog: Bug: soft lockup …).

The problem occurred because the IOMMU option on my motherboard was set to Auto. I first switched it to Enabled, but CUDA P2P failed in both of these modes. Finally I set the IOMMU to Disabled, and the problem was solved.
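
After changing the IOMMU setting, a small round-trip test can confirm that peer copies deliver correct data before going back to the full samples. This is a rough sketch, assuming GPU 0 and GPU 1 are the pair in question. The `CHECK` macro, buffer size, and fill pattern are illustrative choices, not taken from the CUDA samples. If the platform problem is still present, this may hang the machine the same way the samples did.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main() {
    const size_t n = 1 << 20;               // arbitrary element count
    const size_t bytes = n * sizeof(float);

    // Stop early if the driver does not allow P2P between GPU0 and GPU1.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    if (!can01 || !can10) {
        printf("P2P not supported between GPU0 and GPU1\n");
        return 0;
    }

    // Enable peer access in both directions and allocate one buffer per GPU.
    float *g0 = nullptr, *g1 = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaMalloc(&g0, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));
    CHECK(cudaMalloc(&g1, bytes));

    std::vector<float> src(n), dst(n, 0.0f);
    for (size_t i = 0; i < n; ++i) src[i] = static_cast<float>(i % 1024);

    CHECK(cudaMemcpy(g0, src.data(), bytes, cudaMemcpyDefault));  // host -> GPU0
    CHECK(cudaMemcpy(g1, g0, bytes, cudaMemcpyDefault));          // GPU0 -> GPU1
    CHECK(cudaMemcpy(dst.data(), g1, bytes, cudaMemcpyDefault));  // GPU1 -> host

    // Compare what came back from GPU1 with what was sent to GPU0.
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (src[i] != dst[i]) ++mismatches;
    }
    printf("mismatched elements: %zu of %zu\n", mismatches, n);

    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(g0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaFree(g1));
    return mismatches == 0 ? 0 : 1;
}
```

A nonzero mismatch count or a hang at the GPU0 -> GPU1 copy would suggest that peer traffic is still being disturbed by the platform configuration.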

Thanks for your attention.

Thank you! You are a life saver! I couldn’t figure out why my Titan RTXs kept crashing while running both CUDA workloads and things like basic 3D applications (games or Unreal Engine 4). Disabling IOMMU on my Threadripper 3970X solved the issue completely.

For some reason if I have IOMMU enabled, I get constant Nvidia driver crashes and sometimes system lockups while running any CUDA workloads or while working inside of Unreal Engine 4 (or playing games). Not sure what the issue is but Nvidia might want to look into it. I’m going to submit a bug report.
I also noticed strange behavior with IOMMU enabled, such as a Code 43 error appearing on one or both GPUs seemingly at random after a cold boot. I am not running Windows inside a VM; it’s running natively. I would have to DDU the drivers for the Code 43 to disappear. Since disabling IOMMU, all issues have disappeared.