questions about transferring/mapping framebuffer textures to cuda space

hi fellow cuda users

this is a follow-up question in my ongoing quest for a fast deferred shading algorithm: http://forums.nvidia.com/index.php?showtopic=163071

i am wondering what the fastest method is to map framebuffer textures into cuda space and what speeds i can expect. first, the bandwidth numbers for my machine:

./bandwidthTest Starting...

Running on...

Device 0: GeForce 9600M GT
 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			2102.3

Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			1524.4

Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			14926.6

&&&& TEST PASSED

now, i want to transmit two framebuffer textures using PBOs:

...
cutilSafeCall(cudaThreadSynchronize());
timer.start();

// read both color attachments into the pack PBOs
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, mPBO[0]);
glReadPixels(0, 0, mVPWidth, mVPHeight, GL_RGBA, GL_FLOAT, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

glReadBuffer(GL_COLOR_ATTACHMENT1);
glBindBuffer(GL_PIXEL_PACK_BUFFER, mPBO[1]);
glReadPixels(0, 0, mVPWidth, mVPHeight, GL_RGBA, GL_FLOAT, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

glFinish(); // make sure gl is done
cutilSafeCall(cudaThreadSynchronize());

// gpu events bracket only the map calls
cutilSafeCall(cudaEventRecord(mCudaEventStart, 0));

#ifdef NEW_STYLE
cudaStream_t cuda_stream;
cutilSafeCall(cudaStreamCreate(&cuda_stream));
cutilSafeCall(cudaGraphicsMapResources(4, mCudaResources, cuda_stream));
#else
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevStartPixels, mPBO[0]));
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevStartSymbols, mPBO[1]));
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevResultPixels, mPBO[6]));
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevResult2Pixels, mPBO[7]));
#endif

cutilSafeCall(cudaEventRecord(mCudaEventStop, 0));
cutilSafeCall(cudaEventSynchronize(mCudaEventStop));
cutilSafeCall(cudaThreadSynchronize());
timer.stop();
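for reference, the PBOs above are created and registered before the render loop roughly like this. just a sketch: the loop assumes all eight buffers hold RGBA float at viewport size, which may differ from my actual app.

GLuint mPBO[8];
glGenBuffers(8, mPBO);
for (int i = 0; i < 8; ++i) {
	glBindBuffer(GL_PIXEL_PACK_BUFFER, mPBO[i]);
	// one RGBA float pixel = 4 channels * 4 bytes = 16 bytes
	glBufferData(GL_PIXEL_PACK_BUFFER, mVPWidth * mVPHeight * 4 * sizeof(float), 0, GL_STREAM_READ);
	// old-style registration, matching the #else branch above;
	// the NEW_STYLE path would use cudaGraphicsGLRegisterBuffer instead
	cutilSafeCall(cudaGLRegisterBufferObject(mPBO[i]));
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);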

timings with the cpu timer:

  • ~6ms on average for two 640x480 RGBA float textures

  • ~14ms on average for two 1024x768 RGBA float textures

important remarks:

  • it does not matter whether i use the old cudaGLMapBufferObject or the new cudaGraphicsMapResources method

  • the gpu timer confirms it: the map calls themselves are fast; almost all of the time is spent in glReadPixels!

doing a little math: 1024x768 * 4 (rgba channels) * 4 (bytes per float) * 2 (two textures) = 24mb

as all of this happens on-device, ideally we should get 14926/24 ≈ 620 transfers per second → around 2ms

my questions:

  • in the 1024x768 case, why do i lose ~10ms? is it simply overhead? is it just the well-feared slowness of glReadPixels?

  • with cuda3, i should be able to map the framebuffer textures directly as cuda textures without PBOs, using cudaGraphicsGLRegisterImage and cuda arrays, but i simply cannot get it to work; cudaBindTextureToArray always throws “invalid argument”. does anybody have working sample code for that?

TIA,

simon

update:

i repeated the measurements on a Quadro 4800, which has roughly four times the device-to-device bandwidth of the 9600M (58541mb/s vs 14926mb/s).

also, fortunately, cuda3 came out yesterday with updated examples, so i managed to get rid of the PBOs for the framebuffer textures and replace them with cudaGraphicsGLRegisterImage and cudaGraphicsSubResourceGetMappedArray. i am now able to map framebuffer textures directly as cuda arrays and bind them to cuda textures for access in the kernel (btw: mapping renderbuffers does not work!).

...
// timer start
cutilSafeCall(cudaGraphicsMapResources(2, mCudaResources, 0)); // ~3.6ms
cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&startPixArray, mCudaResources[0], 0, 0)); // <0.2ms
cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&startRuleArray, mCudaResources[1], 0, 0)); // <0.2ms
// thread sync && timer stop -> ~4ms @ 1k
...

mapping two 1024x768 (640x480) rgba framebuffer textures with cudaGraphicsMapResources/cudaGraphicsSubResourceGetMappedArray now takes about 4ms (2.5ms), so there is no big gain compared to the PBO path.
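in case someone else runs into the “invalid argument” from cudaBindTextureToArray i mentioned above, here is a minimal sketch of the sequence that works for me now. the texref and GL handle names (gStartPixTex, glTexId0, glTexId1) are made up for the example; the important part is that the GL texture has a float internal format like GL_RGBA32F_ARB, so it matches the float4 texref.

// file-scope texture reference in the .cu file
texture<float4, 2, cudaReadModeElementType> gStartPixTex;

// one-time setup: register the framebuffer *textures* (not renderbuffers!)
cutilSafeCall(cudaGraphicsGLRegisterImage(&mCudaResources[0], glTexId0, GL_TEXTURE_2D, cudaGraphicsMapFlagsReadOnly));
cutilSafeCall(cudaGraphicsGLRegisterImage(&mCudaResources[1], glTexId1, GL_TEXTURE_2D, cudaGraphicsMapFlagsReadOnly));

// per frame: map, fetch the cuda array for mip level 0 and bind it
cudaArray* startPixArray;
cutilSafeCall(cudaGraphicsMapResources(2, mCudaResources, 0));
cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&startPixArray, mCudaResources[0], 0, 0));
// "invalid argument" here usually means the GL internal format does not match the texref
cutilSafeCall(cudaBindTextureToArray(gStartPixTex, startPixArray));
// ... launch a kernel that reads tex2D(gStartPixTex, x, y) ...
cutilSafeCall(cudaGraphicsUnmapResources(2, mCudaResources, 0));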

so, did i finally hit the hardware/driver limit? 2ms for a 1k texture seems high… what do other people doing deferred shading measure?

and: i noticed that most of the time is spent inside cudaGraphicsMapResources (see code above). why is that?

note: i could reproduce this behavior in the postProcessGL sdk example. so maybe that’s just how it is.

any comment on these numbers is appreciated,

simon

With regards to glReadPixels, see my reply in this thread:

http://forums.nvidia.com/index.php?showtopic=93135&view=findpost&p=542216

thanks.

in the meantime, i managed to avoid glReadPixels by mapping the framebuffer textures directly, and i also noticed that my timing was wrong: the apparent slowness of cudaGraphicsMapResources was actually due to the framebuffer still being busy with the first render pass. adding a glFinish before the first cuda call revealed this. so i’m back to optimizing the usual vbo render stuff, and i already hit the next problem: http://www.opengl.org/discussion_boards/ub…4525#Post274525
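for the record, the corrected timing pattern is simply this (a sketch):

glFinish();    // wait until gl has really finished the render pass ...
timer.start(); // ... otherwise the first cuda call absorbs the leftover gl work and looks slow
cutilSafeCall(cudaGraphicsMapResources(2, mCudaResources, 0));
// ... get arrays, bind textures, launch the kernel ...
cutilSafeCall(cudaGraphicsUnmapResources(2, mCudaResources, 0));
cutilSafeCall(cudaThreadSynchronize()); // make sure cuda is done before stopping the cpu timer
timer.stop();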

cheers,

simon