cuda 3: cudaGraphicsMapResources performance issue linux 32bit, driver 195.30, macbookpro nvidia 960

shaegler · March 16, 2010, 1:55pm

hi cuda users

i have a deferred shading setup with the following steps:

1. render (GLSL) to FBO with multiple renderbuffers attached
2. transfer some of the renderbuffers via PBO/TBO and cudaGraphicsMapResources to CUDA
3. process pixels with CUDA
4. transfer back to texture using PBO, also map the non-cuda-processed pbo to texture
5. render result (GLSL) using TBOs and usual textures

my problem is that cudaGraphicsMapResources in step 2 takes 20 ms (MacBook Pro, NVIDIA 9600M) to map 4 PBOs (2 read / 2 write) as cuda pointers. i expected this to take ~1 ms or less. is my expectation wrong?

related question: i read in the programming guide for 3.0 that renderbuffers can be mapped directly with cuda (avoid pbo) using cudaGraphicsGLRegisterBuffer but i cannot get it to work. does anybody have some example code for that (RBO → cuda texture)???

note: i use a cpu timer and do a cudaThreadSynchronize before/after each start/stop of the timer. i render with 1024x768 resolution
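For scale, the amount of data involved per frame can be estimated from the numbers above. This is a minimal sketch, assuming each PBO holds one full 1024x768 GL_RGBA32F image (four 32-bit floats per pixel), i.e. that `bufferMemSize` is width x height x 16 bytes. The post does not show how `bufferMemSize` is computed, and the function names here are illustrative only:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats, as in GL_RGBA32F");

// Bytes for one RGBA float image (assumption: 4 channels x 4 bytes per pixel).
constexpr std::size_t rgba32fBytes(std::size_t width, std::size_t height) {
    return width * height * 4 * sizeof(float);
}

// Average throughput in MiB/s needed to move `bytes` within `milliseconds`.
double requiredMiBPerSecond(std::size_t bytes, double milliseconds) {
    assert(milliseconds > 0.0);
    const double mib = static_cast<double>(bytes) / (1024.0 * 1024.0);
    return mib / (milliseconds / 1000.0);
}

// Prints the estimate for the resolution and timing quoted in the post.
void printEstimate() {
    const std::size_t perBuffer = rgba32fBytes(1024, 768);  // 12 MiB
    const std::size_t twoReadPbos = 2 * perBuffer;          // 24 MiB
    std::printf("per buffer: %zu bytes\n", perBuffer);
    std::printf("2 read PBOs in 20 ms: %.0f MiB/s\n",
                requiredMiBPerSecond(twoReadPbos, 20.0));
}
```

Under these assumptions each buffer is 12 MiB, so if the 20 ms were spent filling the two read PBOs rather than in the map call itself, that would correspond to roughly 1200 MiB/s.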

btw: my code is inspired by this nice blog post: http://www.rauwendaal.net/blog/howtousecud…tyapiwithopengl

here are the relevant code sections, first initialisation:

void init() {
    // setup FBO
    glBindFramebuffer(GL_FRAMEBUFFER, mFBO[0]);
    glBindRenderbuffer(GL_RENDERBUFFER, mRBO[0]);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA32F, mOSWidth, mOSHeight);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, mRBO[0]);
    // … more renderbuffers here …

    // setup PBOs
    glBindBuffer(GL_PIXEL_PACK_BUFFER, mPBO[0]);
    glBufferData(GL_PIXEL_PACK_BUFFER, bufferMemSize, NULL, GL_DYNAMIC_COPY);
    // … more read PBOs …
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, mPBO[6]);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, bufferMemSize, NULL, GL_DYNAMIC_COPY);
    // … more write PBOs …

    // create TBO to map PBOs to cuda and register them
    glBindTexture(GL_TEXTURE_BUFFER_EXT, mTBO[0]);
    glTexBufferEXT(GL_TEXTURE_BUFFER_EXT, GL_RGBA32F, mPBO[0]);
    glBindTexture(GL_TEXTURE_BUFFER_EXT, 0);
    // … more …

    cutilSafeCall(cudaGraphicsGLRegisterBuffer(&mCudaResources[0], mPBO[0], cudaGraphicsMapFlagsNone));
    cutilSafeCall(cudaGraphicsGLRegisterBuffer(&mCudaResources[1], mPBO[1], cudaGraphicsMapFlagsNone));
    cutilSafeCall(cudaGraphicsGLRegisterBuffer(&mCudaResources[2], mPBO[6], cudaGraphicsMapFlagsNone));
    cutilSafeCall(cudaGraphicsGLRegisterBuffer(&mCudaResources[3], mPBO[7], cudaGraphicsMapFlagsNone));

    cudaStream_t cuda_stream;
    cutilSafeCall(cudaStreamCreate(&cuda_stream));
    cutilSafeCall(cudaGraphicsMapResources(4, mCudaResources, cuda_stream));

    mapStartPix(mCudaResources[0]); // uses cudaGraphicsResourceGetMappedPointer to get the pointers
    mapStartRules(mCudaResources[1]);
    mapResultData1(mCudaResources[2]);
    mapResultData2(mCudaResources[3]);

    cutilSafeCall(cudaGraphicsUnmapResources(4, mCudaResources, cuda_stream));
    cutilSafeCall(cudaStreamDestroy(cuda_stream));

    // create some textures for render step 5.
    // …
}

and the render loop:

void render() {
    // STEP 1
    // … just render to FBO …

    // STEP 2: read back renderbuffers using pbo
    glReadBuffer(GL_COLOR_ATTACHMENT0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, mPBO[0]);
    glReadPixels(0, 0, mOSWidth, mOSHeight, GL_RGBA, GL_FLOAT, 0);
    // … repeat for other RBOs …
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);

    // HERE IS THE SLOWDOWN
    cudaStream_t cuda_stream;
    cutilSafeCall(cudaStreamCreate(&cuda_stream));
    cutilSafeCall(cudaGraphicsMapResources(4, mCudaResources, cuda_stream));

    // STEP 3: launch cuda kernel
    // …

    cutilSafeCall(cudaGraphicsUnmapResources(4, mCudaResources, cuda_stream));
    cutilSafeCall(cudaStreamDestroy(cuda_stream));

    // STEP 4: bind textures to pbos which are not already bound to TBOs
    glBindTexture(GL_TEXTURE_2D, mResultTexID[2]);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, mVPWidth, mVPHeight, 0, GL_RGBA, GL_FLOAT, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, mPBO[2]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, mVPWidth, mVPHeight, GL_RGBA, GL_FLOAT, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    // … repeat …

    // STEP 5: render to screen
    // … standard glsl pass using the TBOs written by cuda and the normal textures transferred from the renderbuffers by pbo
}

i appreciate any help!

kind regards,

simon

shaegler · March 17, 2010, 12:55pm

hi again

i have investigated some more and also switched back to the old interop API. there i noticed that only the first of my four cudaGLMapBufferObject calls takes too long (~19 ms); the subsequent calls take 0.4 ms. i used the following code to measure the time:

[codebox]
cutilSafeCall( cudaEventRecord( mCudaEventStart, 0 ) );
cutilSafeCall(cudaGLMapBufferObject((void**)&mCudaDevStartPixels, mPBO[0]));
cutilSafeCall( cudaEventRecord( mCudaEventStop, 0 ) );
cutilSafeCall( cudaEventSynchronize( mCudaEventStop ) );
[/codebox]

so, did anybody experience a similar behavior?

best,

simon


raflegan · March 17, 2010, 3:08pm

Maybe the delay is due to OpenGL not having finished up. According to the function documentation, ‘cudaGraphicsMapResources’ guarantees that all previously issued graphics calls are finished before any subsequent CUDA work starts.

So I suppose it calls glFinish() or something similar internally, and maybe this is what stalls the operation. Have you tried calling glFinish() just before the call to cudaGraphicsMapResources()? That way you could test whether the stall is due to OpenGL not having finished its work.
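The test suggested here can be sketched as a small timing helper that measures the wait for queued work separately from the stage of interest. This is only a sketch in plain C++ with a CPU clock; `SplitTiming` and `timeSplit` are hypothetical names, and both callables are assumed to block until their work has really completed:

```cpp
#include <chrono>

struct SplitTiming {
    double drainMs;  // time spent waiting for previously queued work
    double stageMs;  // time spent in the stage being investigated
};

// Runs `drain` first, then `stage`, and times each one separately.
// If either callable returns before its work is done, the split is meaningless.
template <typename Drain, typename Stage>
SplitTiming timeSplit(Drain drain, Stage stage) {
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;

    const auto t0 = Clock::now();
    drain();
    const auto t1 = Clock::now();
    stage();
    const auto t2 = Clock::now();

    return {Ms(t1 - t0).count(), Ms(t2 - t1).count()};
}
```

In the render loop, `drain` would be a lambda calling glFinish(), and `stage` a lambda calling cudaGraphicsMapResources(4, mCudaResources, cuda_stream) followed by cudaThreadSynchronize(). A large `drainMs` with a small `stageMs` would point at the preceding OpenGL work rather than at the map call.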

regards,

Paul

shaegler · March 19, 2010, 2:15pm

hi paul

you were right: adding an explicit glFinish/glFlush showed that the slowdown is caused by glReadPixels etc. i opened a new topic about what performance to expect ideally:

http://forums.nvidia.com/index.php?showtopic=163479

thanks for your help,

simon
