DX11 <> CUDA interop is slow compared to GL <> CUDA

There is so little info on this topic that most of my GL/DX<>CUDA interop experiments feel like criminal investigations. :)

Furthermore, the DX11 CUDA 7.5 Sample doesn’t even compile anymore on Win10. The rest of the samples are ancient DX9/10 code.

My latest experiment is to create a minimalistic DX11<>CUDA interop example to match my GL<>CUDA interop implementation.

I was also looking for a way to write to the back buffer (as a CUDA surface) since the explicit DX11 swap chain seems like an ideal interface for spraying pixels.

The good news is that interacting with the DXGI swap chain is really, really easy. Going full screen reveals just how fast you can flip the swap chain's back buffers. It's a semi-meaningless result, but 10,000+ FPS shows that a standard Windows message loop is probably never going to be your bottleneck. You can easily do this in one page of code.
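For anyone who wants to reproduce the baseline, the skeleton is roughly the sketch below. This is a minimal reconstruction, not verbatim my code: a bare window, a double-buffered swap chain, and a PeekMessage() loop that does nothing but Present(). Error handling is omitted.

// Minimal D3D11 swap chain skeleton (sketch). Link with d3d11.lib.
#include <windows.h>
#include <d3d11.h>

static LRESULT CALLBACK WndProc(HWND h, UINT m, WPARAM w, LPARAM l)
{
  if (m == WM_DESTROY) { PostQuitMessage(0); return 0; }
  return DefWindowProc(h, m, w, l);
}

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int)
{
  // Register a trivial window class and create the output window.
  WNDCLASS wc      = {};
  wc.lpfnWndProc   = WndProc;
  wc.hInstance     = hInst;
  wc.lpszClassName = TEXT("dx");
  RegisterClass(&wc);

  HWND hwnd = CreateWindow(wc.lpszClassName, TEXT("dx"), WS_OVERLAPPEDWINDOW,
                           CW_USEDEFAULT, CW_USEDEFAULT, 1024, 768,
                           NULL, NULL, hInst, NULL);
  ShowWindow(hwnd, SW_SHOW);

  // Describe a simple double-buffered swap chain on that window.
  DXGI_SWAP_CHAIN_DESC scd = {};
  scd.BufferCount       = 2;
  scd.BufferDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
  scd.BufferUsage       = DXGI_USAGE_RENDER_TARGET_OUTPUT;
  scd.OutputWindow      = hwnd;
  scd.SampleDesc.Count  = 1;
  scd.Windowed          = TRUE;
  scd.SwapEffect        = DXGI_SWAP_EFFECT_DISCARD;

  ID3D11Device*        device    = NULL;
  ID3D11DeviceContext* context   = NULL;
  IDXGISwapChain*      swapChain = NULL;

  D3D11CreateDeviceAndSwapChain(NULL, D3D_DRIVER_TYPE_HARDWARE, NULL, 0,
                                NULL, 0, D3D11_SDK_VERSION, &scd,
                                &swapChain, &device, NULL, &context);

  // Bare message loop: flip the back buffers as fast as Present() allows.
  MSG msg = {};
  while (msg.message != WM_QUIT)
  {
    if (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
    {
      TranslateMessage(&msg);
      DispatchMessage(&msg);
    }
    else
    {
      swapChain->Present(0, 0); // vsync off
    }
  }

  swapChain->Release();
  context->Release();
  device->Release();
  return 0;
}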

The bad news is that the CUDA interop routines don’t seem to be able to write to the back buffer.

Here is what the Runtime API manual lists among the limitations of cudaGraphicsD3D11RegisterResource: "The primary rendertarget may not be registered with CUDA."

OK, that’s not encouraging.

However, if you abuse the CUDA interop API and register>map>unmap>unregister every frame, it looks like you can write directly to the back buffer, but with horrible performance (140 FPS full screen) because you're burning up almost 10 milliseconds per frame in the register/unregister calls alone.
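For completeness, the per-frame abuse looks roughly like this sketch (my reconstruction, not verbatim; pxl_kernel here is a stand-in that just sprays a gradient with surf2Dwrite()). Again: this is the pattern you should not use.

// DON'T do this in production: register/map/unmap/unregister the back
// buffer every frame so CUDA can surface-write into it.
#include <d3d11.h>
#include <cuda_runtime.h>
#include <cuda_d3d11_interop.h>

__global__ void pxl_kernel(cudaSurfaceObject_t surf, int width, int height)
{
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height) return;
  const uchar4 rgba = make_uchar4(x & 0xFF, y & 0xFF, 0, 0xFF); // gradient fill
  surf2Dwrite(rgba, surf, x * sizeof(uchar4), y);               // x is in bytes
}

void render_frame_the_wrong_way(IDXGISwapChain* swapChain, int width, int height)
{
  // Grab the current back buffer from the swap chain.
  ID3D11Texture2D* backBuffer = NULL;
  swapChain->GetBuffer(0, __uuidof(ID3D11Texture2D), (void**)&backBuffer);

  // Register + map it for CUDA surface load/store -- every single frame.
  cudaGraphicsResource_t res;
  cudaGraphicsD3D11RegisterResource(&res, backBuffer,
                                    cudaGraphicsRegisterFlagsSurfaceLoadStore);
  cudaGraphicsMapResources(1, &res, 0);

  cudaArray_t array;
  cudaGraphicsSubResourceGetMappedArray(&array, res, 0, 0);

  cudaResourceDesc rd = {};
  rd.resType         = cudaResourceTypeArray;
  rd.res.array.array = array;

  cudaSurfaceObject_t surf;
  cudaCreateSurfaceObject(&surf, &rd);

  const dim3 block(16, 16);
  const dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
  pxl_kernel<<<grid, block>>>(surf, width, height);

  // Tear everything back down so D3D can present the buffer.
  cudaDestroySurfaceObject(surf);
  cudaGraphicsUnmapResources(1, &res, 0);
  cudaGraphicsUnregisterResource(res);   // ~6.4 ms average in the profile below
  backBuffer->Release();

  swapChain->Present(0, 0);
}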

Here is some nvprof output that shows you just how expensive the cudaGraphicsXXX functions are and why you should never do what I just did:

==6092== NVPROF is profiling process 6092, command: dx
==6092== Profiling application: dx
==6092== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  1.06770s       663  1.6104ms  866.23us  1.8698ms  pxl_kernel          <-- surf2DWrite() 4K pixels

==6092== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 65.72%  4.24575s       663  6.4039ms  2.6965ms  18.527ms  cudaGraphicsUnregisterResource    <-- SHOULD 
 31.98%  2.06573s       663  3.1157ms  1.9470ms  97.755ms  cudaGraphicsD3D11RegisterResource <-- NEVER BE
  1.16%  74.672ms       663  112.63us  58.027us  238.37us  cudaGraphicsMapResources          <-- CALLED
  0.76%  49.191ms       663  74.194us  49.493us  180.91us  cudaGraphicsUnmapResources        <-- IN A LOOP
  0.21%  13.851ms       663  20.891us  13.369us  49.493us  cudaLaunch
  0.10%  6.7152ms       663  10.128us  5.4040us  23.040us  cudaCreateSurfaceObject
  0.02%  1.3978ms       663  2.1080us  1.1380us  14.506us  cudaGraphicsSubResourceGetMappedArray

I would really be interested in hearing from an NVIDIA engineer on whether writing directly to the back buffer is possible since reportedly DX11 Compute Shaders have no such limitation.

Update:

Creating a dedicated CUDA interop surface and copying it to the back buffer is the only practical approach you should consider since resource-to-resource copies are going to be fast.
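In sketch form (reusing the pxl_kernel stand-in from above), the recommended structure is: register the dedicated texture once at startup, then per frame map → write → unmap → CopyResource() into the back buffer → Present().

// Init (once): create a dedicated texture and register it with CUDA.
ID3D11Texture2D*       tex = NULL;
cudaGraphicsResource_t res = NULL;

void init_interop(ID3D11Device* device, int width, int height)
{
  D3D11_TEXTURE2D_DESC td = {};
  td.Width            = width;
  td.Height           = height;
  td.MipLevels        = 1;
  td.ArraySize        = 1;
  td.Format           = DXGI_FORMAT_R8G8B8A8_UNORM; // must match the back buffer
  td.SampleDesc.Count = 1;
  td.Usage            = D3D11_USAGE_DEFAULT;
  td.BindFlags        = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;
  device->CreateTexture2D(&td, NULL, &tex);

  cudaGraphicsD3D11RegisterResource(&res, tex,
                                    cudaGraphicsRegisterFlagsSurfaceLoadStore);
}

// Per frame: map, surface-write with CUDA, unmap, then copy to the back buffer.
void render_frame(IDXGISwapChain* swapChain, ID3D11DeviceContext* context,
                  int width, int height)
{
  cudaGraphicsMapResources(1, &res, 0);

  cudaArray_t array;
  cudaGraphicsSubResourceGetMappedArray(&array, res, 0, 0);

  cudaResourceDesc rd = {};
  rd.resType         = cudaResourceTypeArray;
  rd.res.array.array = array;

  cudaSurfaceObject_t surf;
  cudaCreateSurfaceObject(&surf, &rd);

  const dim3 block(16, 16);
  const dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
  pxl_kernel<<<grid, block>>>(surf, width, height);

  cudaDestroySurfaceObject(surf);
  cudaGraphicsUnmapResources(1, &res, 0);

  // Fast resource-to-resource copy into the swap chain's back buffer, then flip.
  ID3D11Texture2D* backBuffer = NULL;
  swapChain->GetBuffer(0, __uuidof(ID3D11Texture2D), (void**)&backBuffer);
  context->CopyResource(backBuffer, tex);
  backBuffer->Release();

  swapChain->Present(0, 0);
}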

But beware of DX11 interop… the cudaGraphicsMap/UnmapResources() routines appear to be dog slow!

Even worse, the DX avg. “map” time jumps when the interop app switches to fullscreen mode.

Running the same kernel on DX11 and GL simple interop skeletons reveals quite a difference:

DX11 → ID3D11Texture2D → CUDA Surface:

==10480== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 43.76%  439.40ms      1069  411.04us  50.062us  823.47us  cudaGraphicsUnmapResources
 40.44%  406.03ms      1069  379.82us  52.338us  577.99us  cudaGraphicsMapResources

OpenGL → GL_RENDERBUFFER → CUDA Surface:

==7464== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 25.31%  145.64ms      8232  17.692us  9.9550us  77.369us  cudaGraphicsMapResources
 16.59%  95.480ms      8232  11.598us  10.240us  71.111us  cudaGraphicsUnmapResources

[ Configuration: Latest drivers, CUDA 7.5, GTX 750 Ti. ]
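The GL skeleton, again as a rough sketch (assuming GLEW for extension loading and reusing the same pxl_kernel stand-in): a GL_RENDERBUFFER registered once with cudaGraphicsGLRegisterImage(), written through a CUDA surface each frame, then blitted to the default framebuffer.

// Init (once): a GL renderbuffer registered as a CUDA image.
#include <GL/glew.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

GLuint                 fbo    = 0, rbo = 0;
cudaGraphicsResource_t gl_res = NULL;

void init_gl_interop(int width, int height)
{
  glGenRenderbuffers(1, &rbo);
  glBindRenderbuffer(GL_RENDERBUFFER, rbo);
  glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, width, height);

  glGenFramebuffers(1, &fbo);
  glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
  glFramebufferRenderbuffer(GL_READ_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                            GL_RENDERBUFFER, rbo);

  cudaGraphicsGLRegisterImage(&gl_res, rbo, GL_RENDERBUFFER,
                              cudaGraphicsRegisterFlagsSurfaceLoadStore);
}

// Per frame: the map/unmap pair here is the ~17 us / ~11 us shown above.
void render_frame_gl(int width, int height)
{
  cudaGraphicsMapResources(1, &gl_res, 0);

  cudaArray_t array;
  cudaGraphicsSubResourceGetMappedArray(&array, gl_res, 0, 0);

  cudaResourceDesc rd = {};
  rd.resType         = cudaResourceTypeArray;
  rd.res.array.array = array;

  cudaSurfaceObject_t surf;
  cudaCreateSurfaceObject(&surf, &rd);

  const dim3 block(16, 16);
  const dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
  pxl_kernel<<<grid, block>>>(surf, width, height);

  cudaDestroySurfaceObject(surf);
  cudaGraphicsUnmapResources(1, &gl_res, 0);

  // Blit the renderbuffer to the default framebuffer and swap.
  glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
  glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
  glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                    GL_COLOR_BUFFER_BIT, GL_NEAREST);
  // SwapBuffers(hdc) / glfwSwapBuffers(window) goes here, depending on the
  // windowing layer.
}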

I’ll be sticking with OpenGL for now.

Old thread, but it’s stuck in my mind; I appreciated you posting your findings here.

Just replaced the DX11 rendering system in my engine with OpenGL, and it appears to be performing a lot better now. I'll have to run some benchmarks just to make sure, but the performance increase seems fairly noticeable.

Slightly disappointed with the CUDA 8 RC samples though: no DX12 or Vulkan interop samples. I tried implementing a DX12 version some time ago, not realising that CUDA didn't support it, which was probably a saving grace; with the amount of initialisation code and legwork you have to do, the implementation just got really dull and boring.

I know this thread is really old, but maybe so are the NvDecoder examples. :-(

I really need some recent information: what is the difference in speed, stability, and debuggability between the OpenGL interop and the DX11 one?

I want to use NvDecoder to decode into a single texture that is sampled and analyzed via CUDA with the runtime API, and then render that texture with added graphics on top of it.

I have merged the NvDecoder sample with my CUDA code; it got hung up on context issues, and when I made everything work on one context it kept locking up the card.
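For reference, the way I got everything onto one context was to retain the device's primary context (which is what the runtime API uses implicitly) and hand that same CUcontext to the decoder. The sketch below only shows the context part; the exact decoder constructor arguments depend on the SDK version, so treat it as an outline.

// Share one CUDA context between the runtime API and the decoder by
// retaining the device's primary context (driver API; link with cuda.lib).
#include <cuda.h>

CUcontext get_shared_context()
{
  cuInit(0);

  CUdevice dev;
  cuDeviceGet(&dev, 0);            // assuming a single-GPU setup

  // The primary context is the one the runtime API (cudaMalloc, kernel
  // launches, etc.) uses implicitly, so the decoder and the analysis kernels
  // end up in the same context instead of two competing ones.
  CUcontext ctx;
  cuDevicePrimaryCtxRetain(&ctx, dev);
  cuCtxSetCurrent(ctx);

  return ctx;  // pass this to the decoder instead of cuCtxCreate()'ing a new one
               // (and call cuDevicePrimaryCtxRelease(dev) at shutdown)
}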

This is not for a game; the display is mostly for use during development, but it will continue on as a monitor.