cudaMalloc vs mapped VBO


Now that the CUDA <-> OpenGL API is out, I’m adapting my previous code to use it instead of transferring the data back to the CPU and then to OpenGL.

My question is: Is there any overhead in using a mapped VBO instead of allocating the same memory using cudaMalloc?

To give a little context, the application I’m working on is a generic deformable body simulator, supporting multiple models (mass-springs, FEM, SPH fluids, …) as well as different integration algorithms (explicit Euler, RK4, implicit, …). To do this, the system relies on a set of data vectors holding the current state as well as temporary values. My problem is that I don’t know in advance which vector will contain the final state to be rendered. So I can either:

1. copy the final state to an OpenGL VBO (using a device-to-device memcpy)

2. allocate all vectors as VBOs (either as separate VBOs or a single large VBO)

3. change the design so that the final state is always stored in the same vector

Obviously the first solution adds the overhead of one additional copy per frame, while the third might require lots of changes in the code. So the second one would be the easiest and most efficient solution, but I’m not sure if the API would handle it well…

PS: the result of this will be released as open-source in the SOFA simulation framework, hopefully within the next few weeks :)

There shouldn’t be any performance differences between using a mapped VBO/PBO and memory allocated by CUDA - they’re both allocated as linear GPU memory.

Assuming the amount of data you’re transferring is relatively small, copying the state with a device-to-device memcpy is probably the easiest solution to your problem.
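For illustration, here is a minimal sketch of what that per-frame copy could look like. It assumes the graphics interop entry points of the CUDA runtime (`cudaGraphicsMapResources` and friends) and a VBO that has already been registered once at start-up; the helper name `copyStateToVbo` and its parameters are hypothetical and not taken from any existing code.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: copies `numBytes` of simulation state from a
// cudaMalloc'ed buffer into an OpenGL VBO.
// Assumption: `vboResource` was registered once beforehand, e.g. with
//   cudaGraphicsGLRegisterBuffer(&vboResource, vboId,
//                                cudaGraphicsRegisterFlagsWriteDiscard);
// (declared in cuda_gl_interop.h), while an OpenGL context was current.
cudaError_t copyStateToVbo(cudaGraphicsResource_t vboResource,
                           const void* d_finalState,
                           size_t numBytes)
{
    // Map the VBO so CUDA can access it; OpenGL must not use it while mapped.
    cudaError_t err = cudaGraphicsMapResources(1, &vboResource, 0);
    if (err != cudaSuccess) return err;

    void*  d_vbo    = nullptr;
    size_t vboBytes = 0;
    err = cudaGraphicsResourceGetMappedPointer(&d_vbo, &vboBytes, vboResource);
    if (err == cudaSuccess) {
        // Refuse to overrun the VBO; otherwise do the device-to-device copy.
        err = (numBytes <= vboBytes)
            ? cudaMemcpy(d_vbo, d_finalState, numBytes, cudaMemcpyDeviceToDevice)
            : cudaErrorInvalidValue;
    }

    // Always unmap, even on failure, so OpenGL can render from the buffer.
    cudaError_t unmapErr = cudaGraphicsUnmapResources(1, &vboResource, 0);
    return (err != cudaSuccess) ? err : unmapErr;
}
```

Called once per frame with whichever vector ended up holding the final state, this keeps the simulation's own vectors as ordinary `cudaMalloc` allocations and confines the interop code to a single place.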

Your simulation work sounds interesting, let us know what your results are like!

Thanks for your response.
I will go for the solution of copying the result for now. I’m using OGRE 3D for the rendering part, so I need to figure out how to adapt its hardware buffer abstraction to link to CUDA, but it shouldn’t be too difficult.

I don’t have any definite results yet (as many kernels are not optimized), but I obtained speed-ups of up to 16x in the first tests, which is quite encouraging.
However, the main drawback currently is that I can only process one object at a time on the GPU, as CUDA (or the GPU hardware?) supports only one kernel in flight. Because each object in my simulation can be of a different nature, each requires its own datasets and kernels, and it would be difficult to unify them in a “super-kernel” where thread groups process different objects in parallel…
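To make the “super-kernel” idea concrete, here is a rough sketch of one possible shape for it: a table of per-object descriptors, with each object assigned a contiguous range of thread blocks and a type tag selecting the per-object code path. Everything here (`ModelKind`, `ObjectDesc`, `superKernel`, and the two toy update rules) is hypothetical and only stands in for real per-model kernels; it is not how the simulator described above is implemented.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical type tags standing in for different per-object kernels.
enum ModelKind { EULER_STEP = 0, DAMPING = 1 };

// Hypothetical descriptor: which blocks and which data belong to one object.
struct ObjectDesc {
    ModelKind kind;
    int firstBlock;   // index of the first thread block assigned to this object
    int count;        // number of elements in this object
    float* x;         // positions (device pointer)
    float* v;         // velocities (device pointer)
};

__global__ void superKernel(const ObjectDesc* objs, int numObjs, float dt)
{
    // Find the object owning this block (linear scan; fine for a few objects).
    int o = 0;
    while (o + 1 < numObjs && (int)blockIdx.x >= objs[o + 1].firstBlock) ++o;
    ObjectDesc d = objs[o];

    int i = ((int)blockIdx.x - d.firstBlock) * (int)blockDim.x + (int)threadIdx.x;
    if (i >= d.count) return;

    // The tag is uniform within a block, so this branch does not diverge
    // inside a warp.
    switch (d.kind) {
    case EULER_STEP: d.x[i] += dt * d.v[i];   break;  // toy explicit Euler update
    case DAMPING:    d.v[i] *= (1.0f - dt);   break;  // toy velocity damping
    }
}

int main()
{
    const int n0 = 1000, n1 = 300, threads = 128;
    const int blocks0 = (n0 + threads - 1) / threads;
    const int blocks1 = (n1 + threads - 1) / threads;

    float *x0, *v0, *x1, *v1;
    cudaMalloc(&x0, n0 * sizeof(float)); cudaMemset(x0, 0, n0 * sizeof(float));
    cudaMalloc(&v0, n0 * sizeof(float)); cudaMemset(v0, 0, n0 * sizeof(float));
    cudaMalloc(&x1, n1 * sizeof(float)); cudaMemset(x1, 0, n1 * sizeof(float));
    cudaMalloc(&v1, n1 * sizeof(float)); cudaMemset(v1, 0, n1 * sizeof(float));

    // Object 0 owns blocks [0, blocks0); object 1 owns the blocks after that.
    ObjectDesc h_objs[2] = {
        { EULER_STEP, 0,       n0, x0, v0 },
        { DAMPING,    blocks0, n1, x1, v1 },
    };
    ObjectDesc* d_objs;
    cudaMalloc(&d_objs, sizeof(h_objs));
    cudaMemcpy(d_objs, h_objs, sizeof(h_objs), cudaMemcpyHostToDevice);

    // One launch covers both objects.
    superKernel<<<blocks0 + blocks1, threads>>>(d_objs, 2, 0.01f);
    cudaError_t err = cudaDeviceSynchronize();
    printf("superKernel: %s\n", cudaGetErrorString(err));

    cudaFree(d_objs);
    cudaFree(x0); cudaFree(v0); cudaFree(x1); cudaFree(v1);
    return err == cudaSuccess ? 0 : 1;
}
```

The sketch also shows where the difficulty lies: every model's code has to live in one kernel, all objects share a single block size and register budget, and each new model means another case in the switch.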