This is an old topic, but since CUDA-OpenGL/DirectX interoperability is important for visualization and high-performance applications, I want to raise the issue once again.
There are two main reasons why we need high-performance CUDA-OpenGL/DirectX interoperability:
We want to exploit special graphics hardware features or rendering tricks from OpenGL/DirectX
We don't want visualization to be the bottleneck of our application
People have raised the problem of low bandwidth between an OpenGL pixel buffer and CUDA memory, which is almost the same bandwidth we can achieve by transferring the data back to the CPU and sending it to OpenGL. Theoretically, we would like the bandwidth between CUDA and OpenGL to match the device-to-device bandwidth within CUDA. This issue has been considered a driver bug and was supposed to be fixed in the CUDA 2.0 release.
Now CUDA 2.0 has been released. What I see is only a small improvement in bandwidth (450 MB/s vs. 390 MB/s), which is far below the 62 GB/s device-to-device bandwidth and even lower than the 1.2 GB/s pageable CPU-GPU bandwidth, so I wonder whether the bug has really been fixed.
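For reference, here is roughly how I measure the CUDA-to-PBO copy bandwidth, using the CUDA 2.0-era interop calls (cudaGLRegisterBufferObject and friends). This is a minimal sketch, not a benchmark tool: it assumes a valid OpenGL context, an already-created PBO, and a device source buffer, and it omits error checking.

```cuda
// Sketch: time repeated copies from a CUDA device buffer into a mapped
// OpenGL PBO. Assumes a GL context is current and `pbo` holds `size` bytes.
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <stdio.h>

void measure_interop_bandwidth(GLuint pbo, size_t size, const void* d_src)
{
    cudaGLRegisterBufferObject(pbo);          // make the PBO visible to CUDA

    void* d_pbo = 0;
    cudaGLMapBufferObject(&d_pbo, pbo);       // get a device pointer to it

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 100;
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_pbo, d_src, size, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("CUDA -> PBO: %.1f MB/s\n",
           (double)size * iters / (ms / 1000.0) / (1024.0 * 1024.0));

    cudaGLUnmapBufferObject(pbo);
    cudaGLUnregisterBufferObject(pbo);
}
```

Note that this maps once and copies many times, so it measures the raw copy path; if you map/unmap inside the loop you also pay the (significant) map cost every iteration.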
My questions are:
Has the bandwidth issue been fixed? Can it be fixed?
If it has been fixed, what should I do to achieve the maximum bandwidth?
Is there any official tool/application to measure this bandwidth?
Is there any difference (performance-wise) between CUDA-OpenGL and CUDA-DirectX interoperability?
So if someone has the same issue, or has found a solution for it, please share your opinions and solutions here. If possible, share some bandwidth measurements along with your machine configuration and OS.
We’re aware there are still performance issues with CUDA/OpenGL interop and are working on improving this (wow I sound like a marketing guy!). There were some fixes in CUDA 2.0, but it’s still not where we’d like it to be.
If you have specific test cases, please send them to me directly.
Direct3D 9 interop has the ability to read and write to D3D textures directly (see the simpleD3D9Texture sample in the SDK). This is an advantage over OpenGL interop, which has to go through buffer objects currently.
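For readers who haven't looked at that sample: the D3D9 path lets CUDA write straight into a texture's memory. The sketch below follows the shape of the CUDA 2.x D3D9 interop API as I recall it (cudaD3D9SetDirect3DDevice, cudaD3D9RegisterResource, etc.); verify the exact names and signatures against cuda_d3d9_interop.h in your toolkit, and note error checking is omitted.

```cuda
// Sketch: give CUDA direct write access to a D3D9 texture, in the spirit
// of the simpleD3D9Texture SDK sample. API names assumed from CUDA 2.x;
// check them against your cuda_d3d9_interop.h before use.
#include <cuda_d3d9_interop.h>

void write_texture_from_cuda(IDirect3DDevice9* dev, IDirect3DTexture9* tex)
{
    cudaD3D9SetDirect3DDevice(dev);                       // once, at startup
    cudaD3D9RegisterResource(tex, cudaD3D9RegisterFlagsNone);

    cudaD3D9MapResources(1, (IDirect3DResource9**)&tex);
    void* d_ptr = 0;
    cudaD3D9ResourceGetMappedPointer(&d_ptr, tex, 0, 0);  // face 0, level 0
    // ... launch a kernel that writes pixels into d_ptr here ...
    cudaD3D9UnmapResources(1, (IDirect3DResource9**)&tex);

    cudaD3D9UnregisterResource(tex);
}
```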
This is still a huge issue for us as well. While it is nice that D3D works well, D3D is not an option for a large chunk of the embedded (vehicles) industry, thus we must have device-to-device transfer in OpenGL.
Another big limitation is that you can't register and map a PBO and then have OpenGL write to it. You can register/map after a glReadPixels (for example), but this is actually slower than doing a glReadPixels to host memory and then using cudaMemcpy to send the data back to the device! ARGH!!
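To be concrete, the pattern I mean looks roughly like this (assuming `pbo`, `w`, and `h` are set up elsewhere; error checking omitted). On paper the data never leaves the device, yet in practice it loses to the host round trip:

```cuda
// Sketch: read the framebuffer into a PBO with GL, then hand that same
// buffer to CUDA. The register must happen after GL has finished writing.
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

void framebuffer_to_cuda(GLuint pbo, int w, int h)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, 0); // into the PBO
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    cudaGLRegisterBufferObject(pbo);
    void* d_pixels = 0;
    cudaGLMapBufferObject(&d_pixels, pbo);
    // ... run CUDA kernels on d_pixels ...
    cudaGLUnmapBufferObject(pbo);
    cudaGLUnregisterBufferObject(pbo);
}
```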
I hate to say this… but upper management here has lost interest in CUDA now that 2.0 has arrived and GL interop still has major performance issues. I really want to use CUDA, but I will need to show something soon for that to happen.
Yes, it does; it is not the first time we have heard something like this. Anyway, thanks Simon. At least I know someone is aware of the problem, the issue is not closed, and people are working hard to solve it.
It sounds like kinda good news; at least I know the problem can be solved. However, I don't use DirectX 9, and many others are in the same position, so we keep waiting and hope the fix comes soon, long before CUDA 3.0 or whatever else is announced. It was supposed to be fixed in 2.0.
Once again, thanks Simon for shedding some light on the problem; hoping to hear some good news from you guys soon.
According to the release notes it is solved. There are also some extra calls needed to do OpenGL interop (if you run OpenGL interop programs from the 2.0 SDK on 2.1, they fail when running the calculation on a C1060 Tesla). DirectX 10 interop has also been added according to the release notes, but it would be nice to hear confirmation from people who were having problems before.
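If the extra call in question is cudaGLSetGLDevice (which I believe newer toolkits require before any OpenGL interop), updating a 2.0-era sample looks roughly like this; the device index 0 is just an assumption for the display GPU:

```cuda
// Sketch: CUDA 2.1-style setup. The GL-capable device must be selected
// explicitly before any interop call; 2.0-era samples that skip this can
// fail when the default device is a non-display board such as a Tesla C1060.
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

void init_interop(GLuint pbo)
{
    // ... create the OpenGL context first, then:
    cudaGLSetGLDevice(0);             // index of the device driving the display
    cudaGLRegisterBufferObject(pbo);  // interop calls are now legal
}
```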
Unfortunately, it looks like it might be the same for my company too. I've still got them slightly interested in CUDA, but I expect my time will be redirected elsewhere in the near future (potentially forever?) due to the inability to get the required performance.