hi fellow cuda users
this is a follow-up question for my ongoing quest for a fast deferred shading algorithm: http://forums.nvidia.com/index.php?showtopic=163071
i am wondering what the fastest method is to map framebuffer textures into cuda space and what speeds i can expect. first of all, the timings of my machine:
./bandwidthTest Starting...
Running on...
Device 0: GeForce 9600M GT
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2102.3
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1524.4
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 14926.6
&&&& TEST PASSED
now, i want to transmit two framebuffer textures using PBOs:
...
cutilSafeCall(cudaThreadSynchronize());
timer.start();
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, mPBO[0]);
glReadPixels(0, 0, mVPWidth, mVPHeight, GL_RGBA, GL_FLOAT, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
glReadBuffer(GL_COLOR_ATTACHMENT1);
glBindBuffer(GL_PIXEL_PACK_BUFFER, mPBO[1]);
glReadPixels(0, 0, mVPWidth, mVPHeight, GL_RGBA, GL_FLOAT, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
glFinish(); // make sure gl is done
cutilSafeCall(cudaThreadSynchronize());
cutilSafeCall( cudaEventRecord( mCudaEventStart, 0 ) );
#ifdef NEW_STYLE
cudaStream_t cuda_stream;
cutilSafeCall(cudaStreamCreate(&cuda_stream));
cutilSafeCall(cudaGraphicsMapResources(4, mCudaResources, cuda_stream));
#else
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevStartPixels, mPBO[0]));
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevStartSymbols, mPBO[1]));
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevResultPixels, mPBO[6]));
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevResult2Pixels, mPBO[7]));
#endif
cutilSafeCall( cudaEventRecord( mCudaEventStop, 0 ) );
cutilSafeCall( cudaEventSynchronize( mCudaEventStop ) );
cutilSafeCall(cudaThreadSynchronize());
timer.stop();
timings with the cpu timer:
- 6ms on average for two 640x480 RGBA textures
- 14ms on average for two 1024x768 RGBA textures
important remarks:
- it does not matter if i use the old cudaGLMapBufferObject or the new cudaGraphicsMapResources method
- the gpu timer confirms: almost all time is used in glReadPixels!
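doing a little math: 1024x768 pixels * 4 channels (rgba) * 4 bytes (one float per channel) * 2 (two textures) = 24 MB
as all of this happens on-device, ideally we should get close to the device to device bandwidth measured above: 14926/24 ≈ 620 such transfers per second → a bit under 2ms per frame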
my questions:
- in the 1024x768 case, why do i lose ~10ms? is it simply overhead, or is it just the well-feared slowness of glReadPixels?
- in cuda 3, i should be able to map the framebuffer textures directly as cuda textures without PBOs, using cudaGraphicsGLRegisterImage and cuda arrays, but i simply cannot get it to work: cudaBindTextureToArray always fails with "invalid argument". does anybody have working sample code for that?
TIA,
simon