makoe
October 13, 2021, 1:03pm
1
Is cudaMemcpyDeviceToDevice between a WDDM device and a TCC device possible?
cudaMemcpyDeviceToDevice between two WDDM devices works fine.
cudaMemcpyDeviceToDevice between two TCC devices works fine, too.
But copying between WDDM and TCC corrupts the data and the copied image is partially broken.
njuffa
October 13, 2021, 11:01pm
2
That’s an interesting question. Do you have a simple reproducer program that you can post here so others can try?
I wonder whether the issue is operating system dependent, driver version dependent, transfer size dependent, or hardware dependent. I can think of reasons why this should work and reasons why this might run into issues. I assume you have carefully checked the documentation and it doesn’t address this scenario one way or the other, i.e. this is a gray area.
My (possibly long outdated) knowledge is that cudaMemcpyDeviceToDevice
is only supported for copies on the same device, not across devices. Your case of a TCC-controlled device and a second WDDM-controlled device would be a special sub-case of the latter scenario.
Are you doing careful, rigorous CUDA error checking for your test case?
makoe
October 14, 2021, 8:19am
4
Yes, every function call is tested against errors and there are none.
makoe
October 14, 2021, 8:33am
5
cudaMemcpyDeviceToDevice
works fine from one device to another device if they use the same driver type. Unfortunately, I haven’t found any information in the documentation whether this should be possible or not.
I don’t have a simple test program yet but the following pseudo code should do it:
cudaSetDevice(0);
void* ptr0 = cudaMallocPitch(100MB);
cudaMemset2D(ptr0,…);
cudaSetDevice(1);
void* ptr1 = cudaMallocPitch(100MB);
cudaSetDevice(0);
cudaMemcpy2D(ptr1, ptr0, …);
Then compare the content of ptr0 with the content of ptr1.
Maybe I’ll write a simple test program to reproduce this issue soon.
njuffa
October 14, 2021, 9:56am
6
The point of a simple reproducer is that you can confirm that it fails for you and I can then try it on my two-GPU system here to see whether it behaves the same on my system. If you decide to file a bug report with NVIDIA you would also need a reproducer.
makoe
October 14, 2021, 12:08pm
7
Attached is a simple test program.
Used in VS2019 with CUDA 11.4 Update 2 in Debug/Release in x64 and all NVIDIA drivers are up to date.
CudaWddmTccTransfer.zip (3.2 KB)
I’m not able to see the issue. I tried using CUDA 11.4.2 in a Debug project, and got this:
C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>nvidia-smi
Thu Oct 14 11:55:52 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 471.41 Driver Version: 471.41 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P4000 WDDM | 00000000:04:00.0 Off | N/A |
| 46% 26C P8 5W / 105W | 138MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P4000 TCC | 00000000:06:00.0 Off | N/A |
| 46% 24C P8 4W / 105W | 8MiB / 8120MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4124 C+G Insufficient Permissions N/A |
| 0 N/A N/A 5060 C+G ...5n1h2txyewy\SearchApp.exe N/A |
| 0 N/A N/A 6372 C+G Insufficient Permissions N/A |
| 0 N/A N/A 7028 C+G ...2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 7076 C+G ...ty\Common7\IDE\devenv.exe N/A |
| 0 N/A N/A 9168 C+G ...y\ShellExperienceHost.exe N/A |
+-----------------------------------------------------------------------------+
C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>test1_11.4
Find TCC and WDDM device
TCC Device: Quadro P4000
WDDM Device: Quadro P4000
Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory
NO CUDA Error
Comparing WDDM with TCC
Memories are equal!!
C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>
njuffa
October 14, 2021, 7:20pm
9
The program executes without errors for me using CUDA 11.1 (Windows 10 Pro for Workstations, single Intel Xeon W-2133 CPU):
C:\Users\Norbert\My Programs>nvidia-smi
Thu Oct 14 12:18:04 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 462.31 Driver Version: 462.31 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P2000 TCC | 00000000:17:00.0 Off | N/A |
| 48% 39C P8 7W / 75W | 8MiB / 5053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro RTX 4000 WDDM | 00000000:65:00.0 On | N/A |
| 30% 37C P8 4W / 125W | 936MiB / 8192MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
C:\Users\Norbert\My Programs>CudaWddmTccTransfer.exe
Find TCC and WDDM device
WDDM Device: Quadro RTX 4000
TCC Device: Quadro P2000
Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory
NO CUDA Error
Comparing WDDM with TCC
Memories are equal!!
njuffa
October 15, 2021, 12:43am
10
I am able to reproduce the issue after updating to the latest available driver:
C:\Users\Norbert\My Programs>nvidia-smi
Thu Oct 14 17:40:15 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 472.12 Driver Version: 472.12 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P2000 TCC | 00000000:17:00.0 Off | N/A |
| 50% 42C P8 7W / 75W | 8MiB / 5053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro RTX 4000 WDDM | 00000000:65:00.0 On | N/A |
| 30% 41C P8 3W / 125W | 267MiB / 8192MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
C:\Users\Norbert\My Programs>CudaWddmTccTransfer.exe
Find TCC and WDDM device
WDDM Device: Quadro RTX 4000
TCC Device: Quadro P2000
Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory
NO CUDA Error
Comparing WDDM with TCC
Memories are unequal! First error at: (0,0); WDDM: 32; TCC: -100
So this would appear to be a recent regression in the driver.
I’m not able to witness the problem with 472.12 either.
C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>nvidia-smi
Fri Oct 15 07:01:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 472.12 Driver Version: 472.12 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P4000 WDDM | 00000000:04:00.0 Off | N/A |
| 46% 25C P8 5W / 105W | 127MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P4000 TCC | 00000000:06:00.0 Off | N/A |
| 46% 23C P8 5W / 105W | 8MiB / 8120MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4124 C+G Insufficient Permissions N/A |
| 0 N/A N/A 6372 C+G Insufficient Permissions N/A |
| 0 N/A N/A 7028 C+G ...2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 9168 C+G ...y\ShellExperienceHost.exe N/A |
+-----------------------------------------------------------------------------+
C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>test1_11.4
Find TCC and WDDM device
TCC Device: Quadro P4000
WDDM Device: Quadro P4000
Allocate and initialize host memory
Allocate and initialize TCC memory
Allocate WDDM memory
Copy from TCC to WDDM
Download WDDM memory
NO CUDA Error
Comparing WDDM with TCC
Memories are equal!!
C:\Users\Administrator\source\repos\test1_11.4\x64\Debug>
makoe
October 15, 2021, 2:52pm
12
Maybe it’s the exact same GPUs that aren’t causing a problem?
It could be. You haven’t indicated which GPUs you are using anywhere that I can see.
My suggestion would be to file a bug .
njuffa
October 15, 2021, 7:23pm
14
I second the suggestion to file a bug.
From what I can tell, the problem seems to be dependent on transfer size. In my limited experiments, if the memory block copied is smaller than 221 bytes = 2 MB in size, the copy success.
Hypothesis: This might be the size of an internally allocated DMA transfer buffer, with the copy failing when multiple transfers through that buffer are necessary.
Yuki_Ni
February 11, 2022, 10:09am
15
I’ll help post the internal Dev team conclusion of the ticket here to more users .
The issue will be fixed in a future CUDA release (Driver fix) .
The current WAR is to turn on hardware scheduling may help if you are on it’s supported win OS . See https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/