Is cudaMemcpyDeviceToDevice between a WDDM device and a TCC device possible?

Is cudaMemcpyDeviceToDevice between a WDDM device and a TCC device possible?

cudaMemcpyDeviceToDevice between two WDDM devices works fine.
cudaMemcpyDeviceToDevice between two TCC devices works fine, too.

But copying between WDDM and TCC corrupts the data and the copied image is partially broken.

That’s an interesting question. Do you have a simple reproducer program that you can post here so others can try?

I wonder whether the issue is operating system dependent, driver version dependent, transfer size dependent, or hardware dependent. I can think of reasons why this should work and reasons why this might run into issues. I assume you have carefully checked the documentation and it doesn’t address this scenario one way or the other, i.e. this is a gray area.

My (possibly long outdated) knowledge is that cudaMemcpyDeviceToDevice is only supported for copies on the same device, not across devices. Your case of a TCC-controlled device and a second WDDM-controlled device would be a special sub-case of the latter scenario.

Are you doing careful, rigorous CUDA error checking for your test case?

Yes, every function call is tested against errors and there are none.

cudaMemcpyDeviceToDevice works fine from one device to another device if they use the same driver type. Unfortunately, I haven’t found any information in the documentation whether this should be possible or not.

I don’t have a simple test program yet but the following pseudo code should do it:

cudaSetDevice(0);
void* ptr0 = cudaMallocPitch(100MB);
cudaMemset2D(ptr0,…);

cudaSetDevice(1);
void* ptr1 = cudaMallocPitch(100MB);

cudaSetDevice(0);
cudaMemcpy2D(ptr1, ptr0, …);

Then compare the content of ptr0 with the content of ptr1.
Maybe I’ll write a simple test program to reproduce this issue soon.

The point of a simple reproducer is that you can confirm that it fails for you and I can then try it on my two-GPU system here to see whether it behaves the same on my system. If you decide to file a bug report with NVIDIA you would also need a reproducer.

Attached is a simple test program.
Used in VS2019 with CUDA 11.4 Update 2 in Debug/Release in x64 and all NVIDIA drivers are up to date.

CudaWddmTccTransfer.zip (3.2 KB)

I’m not able to see the issue. I tried using CUDA 11.4.2 in a Debug project, and got this:

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>nvidia-smi
Thu Oct 14 11:55:52 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 471.41       Driver Version: 471.41       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000       WDDM  | 00000000:04:00.0 Off |                  N/A |
| 46%   26C    P8     5W / 105W |    138MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P4000        TCC  | 00000000:06:00.0 Off |                  N/A |
| 46%   24C    P8     4W / 105W |      8MiB /  8120MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4124    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      5060    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A      6372    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      7028    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A      7076    C+G   ...ty\Common7\IDE\devenv.exe    N/A      |
|    0   N/A  N/A      9168    C+G   ...y\ShellExperienceHost.exe    N/A      |
+-----------------------------------------------------------------------------+

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>test1_11.4
Find TCC and WDDM device
TCC  Device: Quadro P4000
WDDM Device: Quadro P4000

Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory

        NO CUDA Error

Comparing WDDM with TCC
Memories are equal!!

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>

The program executes without errors for me using CUDA 11.1 (Windows 10 Pro for Workstations, single Intel Xeon W-2133 CPU):

C:\Users\Norbert\My Programs>nvidia-smi
Thu Oct 14 12:18:04 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 462.31       Driver Version: 462.31       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000        TCC  | 00000000:17:00.0 Off |                  N/A |
| 48%   39C    P8     7W /  75W |      8MiB /  5053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 4000    WDDM  | 00000000:65:00.0  On |                  N/A |
| 30%   37C    P8     4W / 125W |    936MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

C:\Users\Norbert\My Programs>CudaWddmTccTransfer.exe
Find TCC and WDDM device
WDDM Device: Quadro RTX 4000
TCC  Device: Quadro P2000

Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory

        NO CUDA Error

Comparing WDDM with TCC
Memories are equal!!

I am able to reproduce the issue after updating to the latest available driver:

C:\Users\Norbert\My Programs>nvidia-smi
Thu Oct 14 17:40:15 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 472.12       Driver Version: 472.12       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000        TCC  | 00000000:17:00.0 Off |                  N/A |
| 50%   42C    P8     7W /  75W |      8MiB /  5053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 4000    WDDM  | 00000000:65:00.0  On |                  N/A |
| 30%   41C    P8     3W / 125W |    267MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

C:\Users\Norbert\My Programs>CudaWddmTccTransfer.exe
Find TCC and WDDM device
WDDM Device: Quadro RTX 4000
TCC  Device: Quadro P2000

Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory

        NO CUDA Error

Comparing WDDM with TCC
Memories are unequal! First error at: (0,0); WDDM: 32; TCC: -100

So this would appear to be a recent regression in the driver.

I’m not able to witness the problem with 472.12 either.

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>nvidia-smi
Fri Oct 15 07:01:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 472.12       Driver Version: 472.12       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000       WDDM  | 00000000:04:00.0 Off |                  N/A |
| 46%   25C    P8     5W / 105W |    127MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P4000        TCC  | 00000000:06:00.0 Off |                  N/A |
| 46%   23C    P8     5W / 105W |      8MiB /  8120MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4124    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      6372    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      7028    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A      9168    C+G   ...y\ShellExperienceHost.exe    N/A      |
+-----------------------------------------------------------------------------+

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>test1_11.4
Find TCC and WDDM device
TCC  Device: Quadro P4000
WDDM Device: Quadro P4000

Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory

        NO CUDA Error

Comparing WDDM with TCC
Memories are equal!!

C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>

Maybe it’s the exact same GPUs that aren’t causing a problem?

It could be. You haven’t indicated which GPUs you are using anywhere that I can see.

My suggestion would be to file a bug.

I second the suggestion to file a bug.

From what I can tell, the problem seems to be dependent on transfer size. In my limited experiments, if the memory block copied is smaller than 221 bytes = 2 MB in size, the copy success.

Hypothesis: This might be the size of an internally allocated DMA transfer buffer, with the copy failing when multiple transfers through that buffer are necessary.

I’ll help post the internal Dev team conclusion of the ticket here to more users .
The issue will be fixed in a future CUDA release (Driver fix) .
The current WAR is to turn on hardware scheduling may help if you are on it’s supported win OS . See https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/