So, internally we’ve officially reverted to software rendering (Microsoft’s software OpenGL implementation, drawing to a bitmap), and from there we upload to CUDA… because it’s faster than hardware-accelerated OpenGL + CUDA/GL interop…
Sad, huh?
I know the green team have been saying they’re working on improving CUDA/GL interop performance, but seriously guys… is anything actually being done? (The new interop APIs in 3.0 don’t make much of a difference.)
As it stands, if I render on the CPU (Microsoft’s software OpenGL) and upload to CUDA, I can get all of that done in about 2ms (~1ms rendering, ~1ms upload).
With hardware-accelerated GL, it takes 2-3ms to render, and another 2-3ms to lock/map the textures into CUDA… so that’s 4-6ms to do everything in hardware. How can it possibly be this slow???
cuGraphicsMapResources still takes 500µs-3ms (it spikes wildly) to map two 320x240 images (one GL_RGB8UI_EXT, one GL_R32F).
Then hardware-accelerated OpenGL takes 2-3ms to render those two 320x240 images (MRT outputs - a passthrough shader plus depth output - of a ~400-triangle mesh with a ~256^2 GL_LUMINANCE texture).
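For concreteness, the interop path I’m timing looks roughly like this, using the CUDA 3.0 driver-API interop (a sketch with placeholder texture handles, not our exact code):

```cpp
#include <GL/glew.h>   // any GL header defining GL_TEXTURE_2D works
#include <cuda.h>
#include <cudaGL.h>

CUgraphicsResource res[2];

// One-time setup: register the two 320x240 MRT output textures with CUDA.
// texColour / texDepth are placeholders for the real GL texture handles.
void registerTargets(GLuint texColour, GLuint texDepth)
{
    cuGraphicsGLRegisterImage(&res[0], texColour, GL_TEXTURE_2D,
                              CU_GRAPHICS_REGISTER_FLAGS_READ_ONLY);
    cuGraphicsGLRegisterImage(&res[1], texDepth, GL_TEXTURE_2D,
                              CU_GRAPHICS_REGISTER_FLAGS_READ_ONLY);
}

// Per frame: map both resources with one call, fetch the backing arrays.
void mapAndProcess()
{
    cuGraphicsMapResources(2, res, 0);   // <-- the 500µs-3ms call
    CUarray colourArray, depthArray;
    cuGraphicsSubResourceGetMappedArray(&colourArray, res[0], 0, 0);
    cuGraphicsSubResourceGetMappedArray(&depthArray,  res[1], 0, 0);
    // ... launch the CUDA kernels that consume both arrays ...
    cuGraphicsUnmapResources(2, res, 0);
}
```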
Is anyone able to offer advice or feedback from their own experience with OpenGL interop in a CUDA development environment? Should I just give up? Should our company invest 3 months in writing a rasterizer/raytracer in CUDA? Or should I stick to the CPU?
(Sorry if this sounds like a rant - it somewhat is. I just finished making the final changes to revert to the CPU for this, and it still doesn’t seem right to be doing it…)
And to preempt the likely response of “do something while the GPU is rendering”: all we have to do is CUDA work, and executing CUDA kernels after issuing GL commands simply postpones the GL commands…
I was having some OpenGL interop performance issues a while ago, and I eventually discovered that I was hitting a sub-optimal code path by not passing NULL as the data pointer when creating my OpenGL buffer objects.
I found this to be the case regardless of the usage hint I passed in. I also found that the time taken on the slow path was proportional to the buffer size, which might explain why your interop overhead is smaller than what I observed. And if I created a buffer on the fast path but then uploaded data from host memory using glBufferSubData at any time during the buffer object’s lifetime, it would put me back on the slow path.
So the moral of the story: when using OpenGL interop, always use CUDA to upload data from host memory into the OpenGL buffer. If you use OpenGL to do it, you’ll end up on the slow code path.
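In code, the difference is just whether glBufferData ever sees a real data pointer - roughly this (function and parameter names are just for illustration):

```cpp
#include <GL/glew.h>   // assuming GLEW for the buffer-object entry points

// size: buffer size in bytes; hostData: your vertex data (placeholders).
GLuint makeBuffer(GLsizeiptr size, const void *hostData)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    // Fast path: allocate storage only - no host data pointer.
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STATIC_DRAW);

    // Slow path: passing hostData here (or calling glBufferSubData at
    // any later point) dropped the buffer onto the slow path for me:
    // glBufferData(GL_ARRAY_BUFFER, size, hostData, GL_STATIC_DRAW);
    (void)hostData;   // only used on the slow path above

    return vbo;
}
```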
I did this test on Windows XP with CUDA 2.3 desktop drivers.
Very interesting, but sadly it doesn’t quite help me.
All of the OpenGL data I have is relatively static (uploaded to OpenGL once, never again), so uploading data to OpenGL isn’t my bottleneck.
All of my performance problems seem to come down to OpenGL taking longer to rasterize in HW than in SW (which seems odd - a Core 2 Duo vs. a GTX 260), and to getting the framebuffer (colour / depth) into CUDA…
I’m using the new CUDA 3.0 interop API to get the framebuffer from OpenGL into CUDA when using HW-accelerated OpenGL, and it seems to take a few milliseconds (almost as if it’s still converting/copying the textures from OpenGL into CUDA memory)…
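For what it’s worth, after the map I’d pull the colour target into linear CUDA memory with something like this (a sketch; devColour is a hypothetical pre-allocated buffer), so any extra conversion the driver does would be on top of this copy:

```cpp
#include <cuda.h>

// colourArray: from cuGraphicsSubResourceGetMappedArray() after mapping.
// devColour: hypothetical pre-allocated linear buffer (320*3*240 bytes).
void copyOut(CUarray colourArray, CUdeviceptr devColour)
{
    CUDA_MEMCPY2D copy = {0};
    copy.srcMemoryType = CU_MEMORYTYPE_ARRAY;
    copy.srcArray      = colourArray;
    copy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    copy.dstDevice     = devColour;
    copy.dstPitch      = 320 * 3;   // tightly packed RGB8 rows
    copy.WidthInBytes  = 320 * 3;
    copy.Height        = 240;
    cuMemcpy2D(&copy);
}
```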
How do you measure the times? I was also quite confused by the timing of cuGraphicsMapResources until I noticed that it was actually the framebuffer operations before the first CUDA call in my render method that were influencing my measurements. Be sure to put a glFinish() in front of your timer-start call for cuGraphicsMapResources. In my case I was doing heavy FBO+MRT work, which slowed things down.
I also recommend putting cutilSafeCall( cudaThreadSynchronize() ); before your CPU timer start/stop commands.
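So a safe harness for timing the map call ends up looking something like this (a sketch using the SDK’s cutil timers; `res` stands in for your registered resources, and any millisecond timer works):

```cpp
#include <cstdio>
#include <GL/glew.h>
#include <cuda.h>
#include <cutil.h>   // SDK timer helpers (cutCreateTimer etc.)

extern CUgraphicsResource res[2];   // your registered GL textures

void timeMap()
{
    unsigned int timer = 0;
    cutCreateTimer(&timer);

    glFinish();           // drain pending GL work (the FBO/MRT rendering!)
    cuCtxSynchronize();   // drain pending CUDA work
                          // (cudaThreadSynchronize() for the runtime API)

    cutStartTimer(timer);
    cuGraphicsMapResources(2, res, 0);   // the call under test
    cuCtxSynchronize();
    cutStopTimer(timer);

    printf("map took %.3f ms\n", cutGetTimerValue(timer));
}
```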
Even if your OpenGL data is static, I found that you still need to upload it to the GPU using the CUDA API (cudaMemcpy, etc.). If you upload your data using the OpenGL API (glBufferData or glBufferSubData, etc.), that buffer will be on the slow code path and you’ll pay a significant performance penalty (in my case about 20ms) every time you map and unmap it, regardless of whether or not your data is static.
So uploading the data to OpenGL might indeed be your bottleneck, even though you only do it once, because it might be putting your buffer on the slow code path. To make sure this isn’t happening, always pass a NULL pointer to glBufferData when allocating the OpenGL buffer.
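Putting both points together, the pattern that kept me on the fast path was roughly this (runtime-API sketch; names are illustrative):

```cpp
#include <GL/glew.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// hostData / sizeBytes are placeholders for your static geometry.
GLuint uploadViaCuda(const void *hostData, size_t sizeBytes)
{
    // 1. Allocate the GL buffer with a NULL data pointer (fast path).
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeBytes, NULL, GL_STATIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // 2. Register the buffer with CUDA (once per buffer).
    cudaGraphicsResource *vboRes = 0;
    cudaGraphicsGLRegisterBuffer(&vboRes, vbo, cudaGraphicsMapFlagsNone);

    // 3. Upload through CUDA instead of glBufferData/glBufferSubData.
    void  *devPtr = 0;
    size_t mappedSize = 0;
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer(&devPtr, &mappedSize, vboRes);
    cudaMemcpy(devPtr, hostData, sizeBytes, cudaMemcpyHostToDevice);
    cudaGraphicsUnmapResources(1, &vboRes, 0);

    return vbo;
}
```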
Indeed, I’m syncing both OpenGL (glFinish) and CUDA (cuCtxSynchronize) - that’s how I determined HW-accelerated OpenGL is slower than SW OpenGL for such simple scenes (3ms in HW vs. ~1ms in SW).
Still, mapping the framebuffer textures takes a few milliseconds even after a glFinish.
I’ll look into using CUDA to transfer data to OpenGL (though I’m not entirely sure that’s feasible, since most of our GL code lives in another library)… I really hope this isn’t the case, though.