CUDA/OpenGL interoperability efficiency problem

I wonder why the way a VBO is initialized makes such a big difference in fps when it interacts with CUDA. When I create the VBO there are two possibilities:

  • the VBO only reserves memory of the given size (in this case the particle positions are written to the VBO for the first time inside the kernel and modified there afterwards):
gl.glBufferData(GL3.GL_ARRAY_BUFFER, n_particles * 4 * Sizeof.FLOAT, null, GL3.GL_DYNAMIC_DRAW);
  • the VBO reserves memory of the given size and also gets some initial data (the particle positions; of course these values are later modified in the kernel):
gl.glBufferData(GL3.GL_ARRAY_BUFFER, n_particles * 4 * Sizeof.FLOAT, FloatBuffer.wrap(particlesPositions), GL3.GL_DYNAMIC_DRAW);
  1. ~408 fps

  2. ~75 fps

You can reproduce this behaviour with the Simple OpenGL example from the NVIDIA GPU Computing SDK.

I assume that in both cases the VBO stores the vertices in GPU memory. Can anyone explain what's going on?
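For reference, the CUDA side of my setup looks roughly like this (a simplified sketch using the JCuda runtime bindings; the variable names and the commented-out kernel launch are placeholders, not the actual code):

    // Requires: jcuda.Pointer, jcuda.Sizeof, jcuda.runtime.JCuda,
    // jcuda.runtime.cudaGraphicsResource, jcuda.runtime.cudaGraphicsRegisterFlags, GL3.

    // Create the VBO (here: variant 1, size only, no initial data).
    int[] vboArray = new int[1];
    gl.glGenBuffers(1, vboArray, 0);
    int vbo = vboArray[0];
    gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vbo);
    gl.glBufferData(GL3.GL_ARRAY_BUFFER, n_particles * 4 * Sizeof.FLOAT, null, GL3.GL_DYNAMIC_DRAW);
    gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, 0);

    // Register the VBO with CUDA once, right after creation.
    cudaGraphicsResource vboResource = new cudaGraphicsResource();
    JCuda.cudaGraphicsGLRegisterBuffer(vboResource, vbo,
            cudaGraphicsRegisterFlags.cudaGraphicsRegisterFlagsNone);

    // Every frame: map, get the device pointer, run the kernel, unmap.
    JCuda.cudaGraphicsMapResources(1, new cudaGraphicsResource[]{vboResource}, null);
    Pointer positions = new Pointer();
    long[] size = new long[1];
    JCuda.cudaGraphicsResourceGetMappedPointer(positions, size, vboResource);
    // launchParticleKernel(positions, n_particles);   // placeholder for the actual kernel launch
    JCuda.cudaGraphicsUnmapResources(1, new cudaGraphicsResource[]{vboResource}, null);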

I believe it is because a GL buffer gets allocated in a lazy fashion: its backing memory is not allocated until the buffer is actually accessed.

If I understand your code correctly, in the first case, where you get 408 fps, it might be CUDA that allocates the memory as soon as you map the buffer to CUDA.

In the second case, it is OpenGL that allocates the memory.

So it seems to make a difference whether OpenGL or CUDA allocates the memory. Maybe the driver has to copy the data between the two contexts? Try varying the size (e.g. the number of particles) and see whether that affects the mapping cost.
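One way to probe the lazy-allocation theory (a hedged suggestion, not something I have measured): allocate with null as in your first variant, but then force OpenGL to touch the data store before CUDA ever maps the buffer, e.g. by uploading zeros with glBufferSubData, and see which of your two frame rates you get.

    // Hypothetical experiment: variant-1 allocation, then force OpenGL to
    // initialize the data store before the buffer is registered/mapped by CUDA.
    gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vbo);
    gl.glBufferData(GL3.GL_ARRAY_BUFFER, n_particles * 4 * Sizeof.FLOAT, null, GL3.GL_DYNAMIC_DRAW);
    FloatBuffer zeros = FloatBuffer.allocate(n_particles * 4);   // all zeros
    gl.glBufferSubData(GL3.GL_ARRAY_BUFFER, 0, n_particles * 4 * Sizeof.FLOAT, zeros);
    gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, 0);
    // If fps now drops to the level of case 2 (~75), the cost is tied to OpenGL
    // owning/initializing the store; if it stays around 408, the initial-data
    // upload path itself makes the difference.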

When I change the number of particles the difference stays similar: ~710 vs. ~160 fps, and ~1050 vs. ~280 fps.

I assume that the higher the frame rate, the lower the number of particles. Take the ratios of your two variants:

low number of particles: 1050/280 = 3.75
medium number of particles: 710/160 ≈ 4.44
high number of particles: 408/75 = 5.44

The more particles you use, the larger the performance gap becomes. A hidden copy by the driver could explain why the ratios are not identical.

Can you try timing just the mapping, i.e. run the CUDA kernel only once and then keep the data unchanged?
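For example, something along these lines (a rough sketch, assuming the JCuda runtime bindings and a vboResource as in the question; the surrounding synchronization is only there so pending GPU work is not attributed to the mapping):

    // Rough per-frame timing of just the map/unmap step (no kernel launch).
    JCuda.cudaDeviceSynchronize();                    // drain pending GPU work first
    long t0 = System.nanoTime();

    JCuda.cudaGraphicsMapResources(1, new cudaGraphicsResource[]{vboResource}, null);
    Pointer positions = new Pointer();
    long[] size = new long[1];
    JCuda.cudaGraphicsResourceGetMappedPointer(positions, size, vboResource);
    JCuda.cudaGraphicsUnmapResources(1, new cudaGraphicsResource[]{vboResource}, null);

    JCuda.cudaDeviceSynchronize();                    // make sure the unmap has finished
    long t1 = System.nanoTime();
    System.out.println("map/unmap took " + (t1 - t0) / 1000 + " us");

If the per-frame mapping cost already differs between your two initialization variants, that would point at the buffer ownership/copy issue rather than at the kernel itself.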