Is cudaMemcpyDeviceToDevice between a WDDM device and a TCC device possible?

makoe · October 13, 2021, 1:03pm

cudaMemcpyDeviceToDevice between two WDDM devices works fine.
cudaMemcpyDeviceToDevice between two TCC devices works fine, too.

But copying between WDDM and TCC corrupts the data and the copied image is partially broken.

njuffa · October 13, 2021, 11:01pm

That’s an interesting question. Do you have a simple reproducer program that you can post here so others can try?

I wonder whether the issue is operating system dependent, driver version dependent, transfer size dependent, or hardware dependent. I can think of reasons why this should work and reasons why this might run into issues. I assume you have carefully checked the documentation and it doesn’t address this scenario one way or the other, i.e. this is a gray area.

My (possibly long outdated) knowledge is that cudaMemcpyDeviceToDevice is only supported for copies on the same device, not across devices. Your case of a TCC-controlled device and a second WDDM-controlled device would be a special sub-case of the latter scenario.

Robert_Crovella · October 14, 2021, 2:55am

Are you doing careful, rigorous CUDA error checking for your test case?

makoe · October 14, 2021, 8:19am

Yes, every function call is tested against errors and there are none.

makoe · October 14, 2021, 8:33am

cudaMemcpyDeviceToDevice works fine from one device to another device if they use the same driver type. Unfortunately, I haven’t found any information in the documentation whether this should be possible or not.

I don’t have a simple test program yet but the following pseudo code should do it:

cudaSetDevice(0);
void* ptr0 = cudaMallocPitch(100MB);
cudaMemset2D(ptr0,…);

cudaSetDevice(1);
void* ptr1 = cudaMallocPitch(100MB);

cudaSetDevice(0);
cudaMemcpy2D(ptr1, ptr0, …);

Then compare the content of ptr0 with the content of ptr1.
Maybe I’ll write a simple test program to reproduce this issue soon.

njuffa · October 14, 2021, 9:56am

The point of a simple reproducer is that you can confirm that it fails for you and I can then try it on my two-GPU system here to see whether it behaves the same on my system. If you decide to file a bug report with NVIDIA you would also need a reproducer.

makoe · October 14, 2021, 12:08pm

Attached is a simple test program.
Used in VS2019 with CUDA 11.4 Update 2 in Debug/Release in x64 and all NVIDIA drivers are up to date.

CudaWddmTccTransfer.zip (3.2 KB)

Robert_Crovella · October 14, 2021, 6:58pm

I’m not able to see the issue. I tried using CUDA 11.4.2 in a Debug project, and got this:

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>nvidia-smi
Thu Oct 14 11:55:52 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 471.41       Driver Version: 471.41       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000       WDDM  | 00000000:04:00.0 Off |                  N/A |
| 46%   26C    P8     5W / 105W |    138MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P4000        TCC  | 00000000:06:00.0 Off |                  N/A |
| 46%   24C    P8     4W / 105W |      8MiB /  8120MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4124    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      5060    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A      6372    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      7028    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A      7076    C+G   ...ty\Common7\IDE\devenv.exe    N/A      |
|    0   N/A  N/A      9168    C+G   ...y\ShellExperienceHost.exe    N/A      |
+-----------------------------------------------------------------------------+

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>test1_11.4
Find TCC and WDDM device
TCC  Device: Quadro P4000
WDDM Device: Quadro P4000

Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory

        NO CUDA Error

Comparing WDDM with TCC
Memories are equal!!

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>

njuffa · October 14, 2021, 7:20pm

The program executes without errors for me using CUDA 11.1 (Windows 10 Pro for Workstations, single Intel Xeon W-2133 CPU):

C:\Users\Norbert\My Programs>nvidia-smi
Thu Oct 14 12:18:04 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 462.31       Driver Version: 462.31       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000        TCC  | 00000000:17:00.0 Off |                  N/A |
| 48%   39C    P8     7W /  75W |      8MiB /  5053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 4000    WDDM  | 00000000:65:00.0  On |                  N/A |
| 30%   37C    P8     4W / 125W |    936MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

C:\Users\Norbert\My Programs>CudaWddmTccTransfer.exe
Find TCC and WDDM device
WDDM Device: Quadro RTX 4000
TCC  Device: Quadro P2000

Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory

        NO CUDA Error

Comparing WDDM with TCC
Memories are equal!!

njuffa · October 15, 2021, 12:43am

I am able to reproduce the issue after updating to the latest available driver:

C:\Users\Norbert\My Programs>nvidia-smi
Thu Oct 14 17:40:15 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 472.12       Driver Version: 472.12       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000        TCC  | 00000000:17:00.0 Off |                  N/A |
| 50%   42C    P8     7W /  75W |      8MiB /  5053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 4000    WDDM  | 00000000:65:00.0  On |                  N/A |
| 30%   41C    P8     3W / 125W |    267MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

C:\Users\Norbert\My Programs>CudaWddmTccTransfer.exe
Find TCC and WDDM device
WDDM Device: Quadro RTX 4000
TCC  Device: Quadro P2000

Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory

        NO CUDA Error

Comparing WDDM with TCC
Memories are unequal! First error at: (0,0); WDDM: 32; TCC: -100

So this would appear to be a recent regression in the driver.

Robert_Crovella · October 15, 2021, 2:08pm

I’m not able to witness the problem with 472.12 either.

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>nvidia-smi
Fri Oct 15 07:01:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 472.12       Driver Version: 472.12       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000       WDDM  | 00000000:04:00.0 Off |                  N/A |
| 46%   25C    P8     5W / 105W |    127MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P4000        TCC  | 00000000:06:00.0 Off |                  N/A |
| 46%   23C    P8     5W / 105W |      8MiB /  8120MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4124    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      6372    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      7028    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A      9168    C+G   ...y\ShellExperienceHost.exe    N/A      |
+-----------------------------------------------------------------------------+

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>test1_11.4
Find TCC and WDDM device
TCC  Device: Quadro P4000
WDDM Device: Quadro P4000

Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory

        NO CUDA Error

Comparing WDDM with TCC
Memories are equal!!

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>

makoe · October 15, 2021, 2:52pm

Maybe it’s the exact same GPUs that aren’t causing a problem?

Robert_Crovella · October 15, 2021, 3:39pm

It could be. You haven’t indicated which GPUs you are using anywhere that I can see.

My suggestion would be to file a bug.

njuffa · October 15, 2021, 7:23pm

I second the suggestion to file a bug.

From what I can tell, the problem seems to be dependent on transfer size. In my limited experiments, if the memory block copied is smaller than 2²¹ bytes = 2 MB in size, the copy success.

Hypothesis: This might be the size of an internally allocated DMA transfer buffer, with the copy failing when multiple transfers through that buffer are necessary.

Yuki_Ni · February 11, 2022, 10:09am

I’ll help post the internal Dev team conclusion of the ticket here to more users .
The issue will be fixed in a future CUDA release (Driver fix) .
The current WAR is to turn on hardware scheduling may help if you are on it’s supported win OS . See https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/

Topic		Replies	Views
Got out of memory from cudaMemcpy CUDA Programming and Performance	13	3993	January 28, 2022
One GPU NOT capable of Peer-to-Peer (P2P) CUDA Programming and Performance	22	5065	November 27, 2018
deviceQuery passes and then fails CUDA Setup and Installation	4	2146	July 6, 2016
Possible bug in cudaMemcpyDeviceToDevice? CUDA Programming and Performance	3	1443	August 6, 2009
concurrent D2H+H2D transfers? CUDA Programming and Performance	5	2475	May 10, 2016
Simple CUDA program hitting size limits/errors on Windows but not Linux CUDA Programming and Performance	23	1906	January 12, 2019
P2P broken on Cuda5.0 driver CUDA Programming and Performance	5	3608	July 2, 2012
Nvidia Quadro RTX 5000 support for TCC? CUDA Programming and Performance	12	3878	July 22, 2023
FAO: Nvidia Engineers:- Memory Leak in cudaMemcpyAsync Only occurs on Host To Device memory transfer CUDA Programming and Performance	4	5872	August 18, 2010
Unable to run several CUDA samples. CUDA Programming and Performance	2	821	April 1, 2019

Is cudaMemcpyDeviceToDevice between a WDDM device and a TCC device possible?

Related topics