questions about transferring/mapping framebuffer textures to cuda space

hi fellow cuda users

this is a follow-up question in my ongoing quest for a fast deferred shading algorithm: http://forums.nvidia.com/index.php?showtopic=163071

i am wondering what the fastest method is to map framebuffer textures into cuda space and what speeds i can expect. first, the bandwidth numbers for my machine:

./bandwidthTest Starting...

Running on...

Device 0: GeForce 9600M GT
 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			2102.3

Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			1524.4

Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			14926.6

&&&& TEST PASSED

now, i want to transmit two framebuffer textures using PBOs:

...
cutilSafeCall(cudaThreadSynchronize());
timer.start();

// read both color attachments into the pack PBOs
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, mPBO[0]);
glReadPixels(0, 0, mVPWidth, mVPHeight, GL_RGBA, GL_FLOAT, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

glReadBuffer(GL_COLOR_ATTACHMENT1);
glBindBuffer(GL_PIXEL_PACK_BUFFER, mPBO[1]);
glReadPixels(0, 0, mVPWidth, mVPHeight, GL_RGBA, GL_FLOAT, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

glFinish(); // make sure gl is done
cutilSafeCall(cudaThreadSynchronize());

// gpu events bracket only the map calls
cutilSafeCall(cudaEventRecord(mCudaEventStart, 0));

#ifdef NEW_STYLE
cudaStream_t cuda_stream;
cutilSafeCall(cudaStreamCreate(&cuda_stream));
cutilSafeCall(cudaGraphicsMapResources(4, mCudaResources, cuda_stream));
#else
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevStartPixels, mPBO[0]));
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevStartSymbols, mPBO[1]));
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevResultPixels, mPBO[6]));
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevResult2Pixels, mPBO[7]));
#endif

cutilSafeCall(cudaEventRecord(mCudaEventStop, 0));
cutilSafeCall(cudaEventSynchronize(mCudaEventStop));
cutilSafeCall(cudaThreadSynchronize());
timer.stop();
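for reference, the PBOs above are created and registered before the render loop roughly like this. just a sketch: the loop assumes all eight buffers hold RGBA float at viewport size, which may differ from my actual app.

GLuint mPBO[8];
glGenBuffers(8, mPBO);
for (int i = 0; i < 8; ++i) {
	glBindBuffer(GL_PIXEL_PACK_BUFFER, mPBO[i]);
	// one RGBA float pixel = 4 channels * 4 bytes = 16 bytes
	glBufferData(GL_PIXEL_PACK_BUFFER, mVPWidth * mVPHeight * 4 * sizeof(float), 0, GL_STREAM_READ);
	// old-style registration, matching the #else branch above;
	// the NEW_STYLE path would use cudaGraphicsGLRegisterBuffer instead
	cutilSafeCall(cudaGLRegisterBufferObject(mPBO[i]));
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);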

timings with the cpu timer:

  • ~6ms on average for two 640x480 RGBA float textures

  • ~14ms on average for two 1024x768 RGBA float textures

important remarks:

  • it does not matter whether i use the old cudaGLMapBufferObject or the new cudaGraphicsMapResources method

  • the gpu timer confirms it: the map calls themselves are fast; almost all of the time is spent in glReadPixels!

doing a little math: 1024x768 * 4 (rgba channels) * 4 (bytes per float) * 2 (two textures) = 24mb

as all of this happens on-device, ideally we should get 14926/24 ≈ 620 transfers per second → around 2ms

my questions:

  • in the 1024x768 case, why do i lose ~10ms? is it simply overhead? is it just the well-feared slowness of glReadPixels?

  • with cuda3, i should be able to map the framebuffer textures directly as cuda textures without PBOs, using cudaGraphicsGLRegisterImage and cuda arrays, but i simply cannot get it to work; cudaBindTextureToArray always throws “invalid argument”. does anybody have working sample code for that?

TIA,

simon

update:

i repeated the measurements on a Quadro 4800, which has roughly four times the device-to-device bandwidth of the 9600M (58541mb/s vs 14926mb/s).

also, fortunately, cuda3 came out yesterday with updated examples, so i managed to get rid of the PBOs for the framebuffer textures and replace them with cudaGraphicsGLRegisterImage and cudaGraphicsSubResourceGetMappedArray. i am now able to map framebuffer textures directly as cuda arrays and bind them to cuda textures for access in the kernel (btw: mapping renderbuffers does not work!).

...
// timer start
cutilSafeCall(cudaGraphicsMapResources(2, mCudaResources, 0)); // ~3.6ms
cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&startPixArray, mCudaResources[0], 0, 0)); // <0.2ms
cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&startRuleArray, mCudaResources[1], 0, 0)); // <0.2ms
// thread sync && timer stop -> ~4ms @ 1k
...

mapping two 1024x768 (640x480) rgba framebuffer textures with cudaGraphicsMapResources/cudaGraphicsSubResourceGetMappedArray now takes about 4ms (2.5ms), so there is no big gain compared to the PBO path.
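in case someone else runs into the “invalid argument” from cudaBindTextureToArray i mentioned above, here is a minimal sketch of the sequence that works for me now. the texref and GL handle names (gStartPixTex, glTexId0, glTexId1) are made up for the example; the important part is that the GL texture has a float internal format like GL_RGBA32F_ARB, so it matches the float4 texref.

// file-scope texture reference in the .cu file
texture<float4, 2, cudaReadModeElementType> gStartPixTex;

// one-time setup: register the framebuffer *textures* (not renderbuffers!)
cutilSafeCall(cudaGraphicsGLRegisterImage(&mCudaResources[0], glTexId0, GL_TEXTURE_2D, cudaGraphicsMapFlagsReadOnly));
cutilSafeCall(cudaGraphicsGLRegisterImage(&mCudaResources[1], glTexId1, GL_TEXTURE_2D, cudaGraphicsMapFlagsReadOnly));

// per frame: map, fetch the cuda array for mip level 0 and bind it
cudaArray* startPixArray;
cutilSafeCall(cudaGraphicsMapResources(2, mCudaResources, 0));
cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&startPixArray, mCudaResources[0], 0, 0));
// "invalid argument" here usually means the GL internal format does not match the texref
cutilSafeCall(cudaBindTextureToArray(gStartPixTex, startPixArray));
// ... launch a kernel that reads tex2D(gStartPixTex, x, y) ...
cutilSafeCall(cudaGraphicsUnmapResources(2, mCudaResources, 0));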

so, did i finally hit the hardware/driver limit? 2ms for a 1k texture seems high… what do other people doing deferred shading measure?

and: i noticed that most of the time is spent inside cudaGraphicsMapResources (see code above). why is that?

note: i could reproduce this behavior in the postProcessGL sdk example. so maybe that’s just how it is.

any comment on these numbers is appreciated,

simon

With regards to glReadPixels, see my reply in this thread:

http://forums.nvidia.com/index.php?showtopic=93135&view=findpost&p=542216

thanks.

in the meantime, i managed to avoid glReadPixels by mapping the framebuffer textures directly, and i also noticed that my timing was wrong: the apparent slowness of cudaGraphicsMapResources was actually due to the framebuffer still being busy with the first render pass. adding a glFinish before the first cuda call revealed this. so i’m back to optimizing the usual vbo render stuff, and i already hit the next problem: http://www.opengl.org/discussion_boards/ub…4525#Post274525
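for the record, the corrected timing pattern is simply this (a sketch):

glFinish();    // wait until gl has really finished the render pass ...
timer.start(); // ... otherwise the first cuda call absorbs the leftover gl work and looks slow
cutilSafeCall(cudaGraphicsMapResources(2, mCudaResources, 0));
// ... get arrays, bind textures, launch the kernel ...
cutilSafeCall(cudaGraphicsUnmapResources(2, mCudaResources, 0));
cutilSafeCall(cudaThreadSynchronize()); // make sure cuda is done before stopping the cpu timer
timer.stop();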

cheers,

simon