Hello,
I’m doing a post-processing OpenGL effect using CUDA, but the cudaMemcpyToArray call takes 50% of the execution time (memcpyDtoA in the CUDA Profiler) and I really don’t understand why! In the SDK sample (postProcessGL), the same copy takes almost no time. This leads to a ridiculously low frame rate.
Here is what I’m doing:
CUDA registration:
cutilSafeCall(cudaGraphicsGLRegisterImage(&texResourceInput, idTextureInput, GL_TEXTURE_2D, cudaGraphicsMapFlagsReadOnly));
cutilSafeCall(cudaGraphicsGLRegisterImage(&texResourceOutput, idTextureOutput, GL_TEXTURE_2D, cudaGraphicsMapFlagsWriteDiscard));
Allocation of one buffer (done only once):
cutilSafeCall(cudaMalloc((void **)&tabTexture, width * sizeof(uchar4) * height));
Main (executed every frame):
// Map the input texture and bind its CUDA array to the texture reference
cutilSafeCall(cudaGraphicsMapResources(1, &texResourceInput, 0));
cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&arrayCudaInput, texResourceInput, 0, 0));
cutilSafeCall(cudaBindTextureToArray(tex, arrayCudaInput));
// launch kernel (the result is written into tabTexture)
cutilSafeCall(cudaGraphicsUnmapResources(1, &texResourceInput, 0));
// Map the output texture and copy the kernel result into its CUDA array
cutilSafeCall(cudaGraphicsMapResources(1, &texResourceOutput, 0));
cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&arrayCudaOutput, texResourceOutput, 0, 0));
cutilSafeCall(cudaMemcpyToArray(arrayCudaOutput, 0, 0, tabTexture,
                                width * sizeof(uchar4) * height,
                                cudaMemcpyDeviceToDevice));
cutilSafeCall(cudaGraphicsUnmapResources(1, &texResourceOutput, 0));
The texture declaration:
texture<uchar4, 2, cudaReadModeElementType> tex;
What I’m doing is simple: I’m reading the content of idTextureInput (an OpenGL texture) and writing the result into idTextureOutput (another OpenGL texture).
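For context, the kernel itself just reads each pixel through the texture reference and writes the processed value into tabTexture. It is roughly of this form (a simplified sketch; the launch configuration and the actual per-pixel processing are omitted, and postProcessKernel is only an illustrative name):

__global__ void postProcessKernel(uchar4 *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    // read the input pixel from the OpenGL texture bound to 'tex'
    uchar4 pixel = tex2D(tex, x, y);

    // ... the actual post-processing of 'pixel' happens here ...

    // write the result into the intermediate buffer (tabTexture)
    out[y * width + x] = pixel;
}

It would be launched as postProcessKernel<<<grid, block>>>(tabTexture, width, height);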
So why is my copy so slow?
Just to compare, here is the relevant profiler line for my code and for the SDK sample:
Method                  GPU time (µs)   CPU time (µs)   Mem transfer (bytes)
My code  : memcpyDtoA   1956.26         2571.19         5.24288e+06    0
SDK code : memcpyDtoA   214.432         2.189           1.04858e+06    0
I’m transferring more data, but the GPU time is roughly 10x higher (1956.26 / 214.432 ≈ 9) for only 5x the data (5.24288e+06 / 1.04858e+06 = 5), and the CPU time is more than 1000x higher (2571.19 / 2.189 ≈ 1175). That also works out to about 5.24288e+06 bytes in ~1.96 ms, i.e. only around 2.7 GB/s for a device-to-device copy.
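To cross-check the profiler, I can also time just that copy with CUDA events. A minimal sketch (the event variable names are only for illustration):

cudaEvent_t copyStart, copyStop;
float elapsedMs = 0.0f;
cutilSafeCall(cudaEventCreate(&copyStart));
cutilSafeCall(cudaEventCreate(&copyStop));

cutilSafeCall(cudaEventRecord(copyStart, 0));
cutilSafeCall(cudaMemcpyToArray(arrayCudaOutput, 0, 0, tabTexture,
                                width * sizeof(uchar4) * height,
                                cudaMemcpyDeviceToDevice));
cutilSafeCall(cudaEventRecord(copyStop, 0));
cutilSafeCall(cudaEventSynchronize(copyStop));
cutilSafeCall(cudaEventElapsedTime(&elapsedMs, copyStart, copyStop));
// elapsedMs should come out around 2 ms if the profiler’s ~1956 µs figure is right

cutilSafeCall(cudaEventDestroy(copyStart));
cutilSafeCall(cudaEventDestroy(copyStop));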
Thanks.