cudaMalloc vs mapped VBO

hufo · February 22, 2007, 6:04pm

Hi,

Now that the CUDA <-> OpenGL API is out, I’m adapting my previous code to use it instead of transferring the data back to the CPU and then to OpenGL.

My question is: Is there any overhead in using a mapped VBO instead of allocating the same memory using cudaMalloc?

To give a little context: the application I’m working on is a generic deformable body simulator supporting multiple models (mass-springs, FEM, SPH fluids, …) as well as different integration algorithms (explicit Euler, RK4, implicit, …). To do this, the system relies on a set of data vectors holding the current state as well as temporary values. My problem is that I don’t know in advance which vector will contain the final state to be rendered. So I can either:

1. copy the final state to an OpenGL VBO (using a device-to-device memcpy)
2. allocate all vectors as VBOs (either as separate VBOs or a single large VBO)
3. change the design so that the final state is always stored in the same vector

Obviously the first solution adds the overhead of one additional copy per frame, while the third might require lots of changes in the code. So the second one would be the easiest and most efficient solution, but I’m not sure if the API would handle it well…

PS: the result of this will be released as open-source in the SOFA simulation framework, hopefully within the next few weeks :)

Simon_Green · February 23, 2007, 11:39am

There shouldn’t be any performance differences between using a mapped VBO/PBO and memory allocated by CUDA - they’re both allocated as linear GPU memory.

Assuming the amount of data you’re transferring is relatively small, copying the state using a device to device memcpy is probably the easiest solution to your problem.
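For reference, a minimal sketch of such a device-to-device copy into a mapped VBO. It is not taken from the original posts: it assumes the later `cudaGraphics*` interop calls of the CUDA runtime (the early interop API used `cudaGLRegisterBufferObject` and `cudaGLMapBufferObject` instead), a current OpenGL context, and a VBO that has already been created and sized. The helper names `registerVbo` and `copyStateToVbo` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>  // also provides the GL types (GLuint)
#include <cstddef>

// Register the VBO with CUDA once, after it has been created in OpenGL.
// WriteDiscard hints that CUDA overwrites the whole buffer every frame.
cudaError_t registerVbo(GLuint vbo, cudaGraphicsResource_t* resource) {
    return cudaGraphicsGLRegisterBuffer(resource, vbo,
                                        cudaGraphicsRegisterFlagsWriteDiscard);
}

// Per frame: map the VBO, copy whichever device vector holds the final
// state into it, then unmap so OpenGL can render from it.
cudaError_t copyStateToVbo(cudaGraphicsResource_t resource,
                           const void* finalState, size_t bytes) {
    cudaError_t err = cudaGraphicsMapResources(1, &resource, 0);
    if (err != cudaSuccess) return err;

    void* vboPtr = nullptr;
    size_t vboBytes = 0;
    err = cudaGraphicsResourceGetMappedPointer(&vboPtr, &vboBytes, resource);
    if (err == cudaSuccess) {
        if (bytes > vboBytes) {
            err = cudaErrorInvalidValue;  // source does not fit in the VBO
        } else {
            err = cudaMemcpy(vboPtr, finalState, bytes,
                             cudaMemcpyDeviceToDevice);
        }
    }

    // Always unmap, even if the copy failed.
    cudaError_t unmapErr = cudaGraphicsUnmapResources(1, &resource, 0);
    return (err != cudaSuccess) ? err : unmapErr;
}
```

With this layout the simulation vectors stay ordinary `cudaMalloc` allocations, and only the one vector that ends up holding the final state is copied each frame.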

Your simulation work sounds interesting, let us know what your results are like!

hufo · March 2, 2007, 12:26pm

Thanks for your response.
I will go for the solution of copying the result for now. I’m using OGRE 3D for the rendering part, so I need to figure out how to adapt its hardware buffer abstraction to link to CUDA, but it shouldn’t be too difficult.

I don’t have any definite results yet (as many kernels are not optimized), but I obtained speed-ups of up to 16X in the first tests, which is quite encouraging.
However, the main drawback currently is that I can only process one object at a time on the GPU, as CUDA (or the GPU hardware?) only supports one kernel in flight. Since each object in my simulation can be of a different nature, each requires its own datasets and kernels, and it would be difficult to unify them in a “super-kernel” where thread groups process different objects in parallel…
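To make the “super-kernel” idea concrete, here is a rough sketch under simplifying assumptions, not the simulator's actual code: every thread block reads a small task descriptor telling it which object slice it owns and which model to apply. The `BlockTask` struct, the two toy "models" (`stepScale`, `stepShift`), and the `runSuperKernel` helper are all hypothetical, and all objects are assumed to share one flat state array.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One descriptor per thread block: which model to run on which slice.
struct BlockTask {
    int kind;   // 0 = scale model, 1 = shift model (toy stand-ins)
    int first;  // index of the first element of this block's slice
    int count;  // number of elements in the slice (<= threads per block)
};

__device__ void stepScale(float* s, int i) { s[i] *= 2.0f; }
__device__ void stepShift(float* s, int i) { s[i] += 1.0f; }

__global__ void superKernel(const BlockTask* tasks, float* state) {
    // The task is the same for every thread of a block, so the switch
    // below does not diverge within a block.
    const BlockTask t = tasks[blockIdx.x];
    if (threadIdx.x >= t.count) return;
    const int i = t.first + threadIdx.x;
    switch (t.kind) {
        case 0: stepScale(state, i); break;
        case 1: stepShift(state, i); break;
    }
}

// Host helper: upload state and tasks, launch one block per task,
// and return the updated state. Error checking omitted for brevity.
std::vector<float> runSuperKernel(std::vector<float> state,
                                  const std::vector<BlockTask>& tasks,
                                  int threadsPerBlock) {
    float* dState = nullptr;
    BlockTask* dTasks = nullptr;
    const size_t stateBytes = state.size() * sizeof(float);
    const size_t taskBytes = tasks.size() * sizeof(BlockTask);
    cudaMalloc(&dState, stateBytes);
    cudaMalloc(&dTasks, taskBytes);
    cudaMemcpy(dState, state.data(), stateBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dTasks, tasks.data(), taskBytes, cudaMemcpyHostToDevice);

    superKernel<<<static_cast<unsigned>(tasks.size()), threadsPerBlock>>>(
        dTasks, dState);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(state.data(), dState, stateBytes, cudaMemcpyDeviceToHost);
    cudaFree(dState);
    cudaFree(dTasks);
    return state;
}
```

For example, calling `runSuperKernel({1, 2, 3, 4}, {{0, 0, 2}, {1, 2, 2}}, 2)` scales the first two elements and shifts the last two in a single launch. The difficulty described above remains: real models need different data layouts and per-object parameters, which this toy dispatch table glosses over.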
