Slow OpenGL interoperability with texture memory (memcpyDtoA)


I’m doing an OpenGL post-processing effect using CUDA, but the cudaMemcpyToArray call (shown as memcpyDtoA in the CUDA Profiler) takes 50% of the execution time, and I really don’t understand why. In the SDK sample (postProcessGL), the same call takes almost no time. The result is a ridiculously low frame rate.

Here is what I’m doing:

CUDA registering:

cutilSafeCall(cudaGraphicsGLRegisterImage(&texResourceInput, idTextureInput, GL_TEXTURE_2D, cudaGraphicsMapFlagsReadOnly));

cutilSafeCall(cudaGraphicsGLRegisterImage(&texResourceOutput, idTextureOutput, GL_TEXTURE_2D, cudaGraphicsMapFlagsWriteDiscard));

Allocation of one buffer (done only once):

cutilSafeCall(cudaMalloc((void **)&tabTexture, width * sizeof(uchar4) * height));


cutilSafeCall(cudaGraphicsMapResources(1, &texResourceInput, 0));


cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&arrayCudaInput, texResourceInput, 0, 0));

cutilSafeCall(cudaBindTextureToArray(tex, arrayCudaInput));

//launch kernel

cutilSafeCall(cudaGraphicsUnmapResources(1, &texResourceInput, 0));

cutilSafeCall(cudaGraphicsMapResources(1, &texResourceOutput, 0));


cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&arrayCudaOutput, texResourceOutput, 0, 0));

cutilSafeCall(cudaMemcpyToArray(arrayCudaOutput, 0, 0, tabTexture,
                                width * sizeof(uchar4) * height,
                                cudaMemcpyDeviceToDevice));

cutilSafeCall(cudaGraphicsUnmapResources(1, &texResourceOutput, 0));

The texture declaration:

texture<uchar4, 2, cudaReadModeElementType> tex;

What I’m doing is simple: I read the content of idTextureInput (an OpenGL texture) and write the result to idTextureOutput (another OpenGL texture).

So why is my copy so slow?

For comparison, here is the profiler output:

                        GPU time   CPU time   Mem transfer size
My code:  memcpyDtoA    1956.26    2571.19    5.24288e+06        0
SDK code: memcpyDtoA    214.432    2.189      1.04858e+06        0

I’m transferring 5 times as much data, but the GPU time is about 9 times higher. And the CPU time is more than 1000 times higher.
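As a sanity check on those factors, here is a minimal C++ sketch that recomputes them from the two profiler rows. The struct and function names are made up for illustration, and it assumes both rows are reported in the same units, so only the ratios matter.

```cpp
#include <cassert>
#include <cstdio>

// One memcpyDtoA row from the profiler output quoted above.
// Units are whatever the profiler reported; only ratios are used here.
struct ProfilerRow {
    double gpuTime;
    double cpuTime;
    double bytes;
};

// Values copied from the post.
const ProfilerRow kMyCode  = {1956.26, 2571.19, 5.24288e+06};
const ProfilerRow kSdkCode = {214.432, 2.189,   1.04858e+06};

// Hypothetical helper: how many times larger `mine` is than `reference`.
inline double ratio(double mine, double reference) {
    assert(reference != 0.0);
    return mine / reference;
}

// Prints the three factors on one line.
inline void printFactors(const ProfilerRow& mine, const ProfilerRow& reference) {
    std::printf("GPU time x%.1f, CPU time x%.1f, size x%.1f\n",
                ratio(mine.gpuTime, reference.gpuTime),
                ratio(mine.cpuTime, reference.cpuTime),
                ratio(mine.bytes, reference.bytes));
}
```

Calling printFactors(kMyCode, kSdkCode) should give roughly 9x for GPU time, well over 1000x for CPU time, and 5x for the transfer size.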



Update: I have a new piece of information. When I call

cutilSafeCall(cudaMemcpyToArray(arrayCudaOutput, 0, 0, (void*)tabTexture,
                                width * height * sizeof(uchar4),
                                cudaMemcpyDeviceToDevice));

width * height * sizeof(uchar4) is 50 000, but the profiler shows a copy of about 5 000 000!

If I replace width * height * sizeof(uchar4) with 1, it still copies 5 000 000!
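One possible reading is that the profiler figure is the size of a whole texture rather than the byte count passed to cudaMemcpyToArray. The C++ sketch below only checks which texture sizes would produce the two sizes reported by the profiler. The helper names are hypothetical, and both texture sizes are assumptions, since no dimensions are given in this post: 512 x 512 for the SDK sample and 1280 x 1024 here are simply sizes that happen to match.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// uchar4 has four 8-bit channels, so 4 bytes per texel.
const std::size_t kBytesPerUchar4 = 4;

// Size in bytes of a tightly packed 2D image.
inline std::size_t imageBytes(std::size_t width, std::size_t height,
                              std::size_t bytesPerTexel) {
    return width * height * bytesPerTexel;
}

// True if a rounded profiler figure matches an exact byte count
// within a relative tolerance (the profiler prints 6 significant digits).
inline bool matchesReported(double reportedBytes, std::size_t exactBytes,
                            double relTolerance = 1e-4) {
    assert(relTolerance >= 0.0);
    const double exact = static_cast<double>(exactBytes);
    return std::fabs(reportedBytes - exact) <= relTolerance * exact;
}
```

For example, matchesReported(5.24288e+06, imageBytes(1280, 1024, kBytesPerUchar4)) and matchesReported(1.04858e+06, imageBytes(512, 512, kBytesPerUchar4)) both hold, while the 50 000 bytes actually requested matches neither. Other dimensions with the same product would match equally well, so this is a hint and not a proof.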

Do you have any ideas?


I found where the problem comes from. There is this comment in the SDK sample: “Why not use RGBA8? CUDA cannot consistently work with this format, at least for now.”