There is so little info on this topic that most of my GL/DX<>CUDA interop experiments feel like criminal investigations. :)
Furthermore, the DX11 CUDA 7.5 Sample doesn’t even compile anymore on Win10. The rest of the samples are ancient DX9/10 code.
My latest experiment is to create a minimalistic DX11<>CUDA interop example to match my GL<>CUDA interop implementation.
I was also looking for a way to write to the back buffer (as a CUDA surface) since the explicit DX11 swap chain seems like an ideal interface for spraying pixels.
The good news is that interacting with the DXGI swap chain is really easy. Going full screen reveals just how fast you can flip the swap chain of back buffers. It’s a semi-meaningless result, but 10,000+ FPS shows that a standard Windows message loop is probably never going to be your bottleneck. You can easily do this in one page of code.
The bad news is that the CUDA interop routines don’t seem to be able to write to the back buffer.
Here is what the Runtime API manual states:
OK, that’s not encouraging.
However, if you abuse the CUDA interop API and register > map > unmap > unregister every frame, it looks like you can write directly to the back buffer, but with horrible performance (140 FPS full screen) because you’re burning up almost 10 milliseconds per frame in the interop calls.
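For anyone who wants to reproduce this, here is a rough sketch of the per-frame sequence described above. It is not my exact code. The function name `render_frame_slow`, the kernel `fill_kernel`, and the `CUDA_CHECK` macro are made up for illustration, and it assumes a swap chain with a 32-bit RGBA back buffer that was created elsewhere. Whether the register call succeeds at all may depend on the driver, given what the manual says.

```cuda
// Sketch only: needs Windows, D3D11 and the CUDA runtime. Names are hypothetical.
#include <cstdio>
#include <cstdlib>
#include <d3d11.h>
#include <dxgi.h>
#include <cuda_runtime.h>
#include <cuda_d3d11_interop.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: fills the surface with one opaque color.
__global__ void fill_kernel(cudaSurfaceObject_t surf, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    // surf2Dwrite takes the x coordinate as a byte offset.
    surf2Dwrite(make_uchar4(255, 128, 0, 255), surf,
                x * (int)sizeof(uchar4), y);
}

// Register > map > write > unmap > unregister, once per frame (the slow path).
bool render_frame_slow(IDXGISwapChain* swap_chain, int width, int height)
{
    ID3D11Texture2D* back_buffer = nullptr;
    HRESULT hr = swap_chain->GetBuffer(0, __uuidof(ID3D11Texture2D),
                                       reinterpret_cast<void**>(&back_buffer));
    if (FAILED(hr)) return false;

    cudaGraphicsResource_t res = nullptr;
    CUDA_CHECK(cudaGraphicsD3D11RegisterResource(
        &res, back_buffer, cudaGraphicsRegisterFlagsSurfaceLoadStore));
    CUDA_CHECK(cudaGraphicsMapResources(1, &res, 0));

    cudaArray_t array = nullptr;
    CUDA_CHECK(cudaGraphicsSubResourceGetMappedArray(&array, res, 0, 0));

    cudaResourceDesc desc = {};
    desc.resType = cudaResourceTypeArray;
    desc.res.array.array = array;
    cudaSurfaceObject_t surf = 0;
    CUDA_CHECK(cudaCreateSurfaceObject(&surf, &desc));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fill_kernel<<<grid, block>>>(surf, width, height);
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaDestroySurfaceObject(surf));
    CUDA_CHECK(cudaGraphicsUnmapResources(1, &res, 0));
    CUDA_CHECK(cudaGraphicsUnregisterResource(res));  // the expensive part

    back_buffer->Release();
    return SUCCEEDED(swap_chain->Present(0, 0));
}
```

You would call `render_frame_slow(swap_chain, width, height)` once per iteration of the message loop. The register and unregister calls sit inside the per-frame function on purpose, since that is exactly the abuse being measured.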
Here is some nvprof output that shows you just how expensive the cudaGraphicsXXX functions are and why you should never do what I just did:
==6092== NVPROF is profiling process 6092, command: dx
==6092== Profiling application: dx
==6092== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
100.00%  1.06770s    663  1.6104ms  866.23us  1.8698ms  pxl_kernel                             <-- surf2Dwrite() 4K pixels
==6092== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 65.72%  4.24575s    663  6.4039ms  2.6965ms  18.527ms  cudaGraphicsUnregisterResource         <-- SHOULD
 31.98%  2.06573s    663  3.1157ms  1.9470ms  97.755ms  cudaGraphicsD3D11RegisterResource      <-- NEVER BE
  1.16%  74.672ms    663  112.63us  58.027us  238.37us  cudaGraphicsMapResources               <-- CALLED
  0.76%  49.191ms    663  74.194us  49.493us  180.91us  cudaGraphicsUnmapResources             <-- IN A LOOP
  0.21%  13.851ms    663  20.891us  13.369us  49.493us  cudaLaunch
  0.10%  6.7152ms    663  10.128us  5.4040us  23.040us  cudaCreateSurfaceObject
  0.02%  1.3978ms    663  2.1080us  1.1380us  14.506us  cudaGraphicsSubResourceGetMappedArray
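If you want to look at the surface write on its own, without any D3D11 in the picture, something like the following sketch should do. It is not the `pxl_kernel` from the profile: the kernel `gradient_kernel` and the helper `run_surface_write` are hypothetical names, and it writes into a plain CUDA array allocated with `cudaArraySurfaceLoadStore` instead of a mapped back buffer, then reads the pixels back to check them.

```cuda
// Standalone sketch: surface write into an ordinary CUDA array, no interop.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: RGBA8 gradient, red = x mod 256, green = y mod 256.
__global__ void gradient_kernel(cudaSurfaceObject_t surf, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    uchar4 px = make_uchar4(x & 0xFF, y & 0xFF, 0, 255);
    // surf2Dwrite takes the x coordinate as a byte offset.
    surf2Dwrite(px, surf, x * (int)sizeof(uchar4), y);
}

// Returns true if every pixel read back matches what the kernel should write.
bool run_surface_write(int width, int height)
{
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<uchar4>();
    cudaArray_t array = nullptr;
    if (cudaMallocArray(&array, &fmt, width, height,
                        cudaArraySurfaceLoadStore) != cudaSuccess)
        return false;

    cudaResourceDesc desc = {};
    desc.resType = cudaResourceTypeArray;
    desc.res.array.array = array;
    cudaSurfaceObject_t surf = 0;
    if (cudaCreateSurfaceObject(&surf, &desc) != cudaSuccess) {
        cudaFreeArray(array);
        return false;
    }

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    gradient_kernel<<<grid, block>>>(surf, width, height);
    cudaError_t err = cudaDeviceSynchronize();

    // Copy the array back to the host; row width is given in bytes.
    std::vector<uchar4> host(static_cast<size_t>(width) * height);
    if (err == cudaSuccess)
        err = cudaMemcpy2DFromArray(host.data(), width * sizeof(uchar4), array,
                                    0, 0, width * sizeof(uchar4), height,
                                    cudaMemcpyDeviceToHost);

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(array);
    if (err != cudaSuccess) return false;

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            uchar4 p = host[static_cast<size_t>(y) * width + x];
            if (p.x != (x & 0xFF) || p.y != (y & 0xFF) || p.z != 0 || p.w != 255)
                return false;
        }
    return true;
}
```

Calling `run_surface_write(3840, 2160)` from a `main()` and running it under nvprof gives a kernel time with no cudaGraphicsXXX calls in the way, assuming "4K" above means a 3840x2160 target.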
I would be really interested in hearing from an NVIDIA engineer on whether writing directly to the back buffer is possible, since DX11 Compute Shaders reportedly have no such limitation.