CUDA / OpenGL Interoperability : Questions about speed

(I posted this question before but got no reply, probably because the people who might know the answer didn't see it, so I am summarizing it here again.)

I want to use the results rendered by OpenGL in CUDA. For this application I have read the PostProcessGL example and a lot of material about CUDA/OpenGL interoperability. However, I still have a speed problem. I hope someone can give me some pointers; many thanks.

First, I used glReadPixels to read the depth/image into HOST memory, and then used cudaMalloc/cudaMemcpy to place it in DEVICE memory. It sounds trivial, and it takes less than 1.5 ms each for me. (This depends on the resolution I define; let's just treat it as an example.)

Then I tried CUDA/OpenGL interoperability to pass the depth/image from OpenGL (where it is already in DEVICE memory) to CUDA directly, without going through HOST memory. In theory, this should be much quicker than 1.5 ms each. Please see the code below. It works fine.

But this took about 2 ms each for me, which is even longer than DEVICE>HOST>DEVICE, and that doesn't make sense to me. I hope someone can point me in the right direction.

– Code – (I use reading the depth buffer as the example; reading the image is similar.)

///// openGL set up /CUDA set up and cudaGLSetGLDevice command

/////

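///// some variables
GLuint gl_buffer_depth;
cudaGraphicsResource_t cudaResourceDepth;
float *depth_map_ptr;
// size is the number of floats (presumably width * height).
// Note: depth_map_ptr is overwritten by the mapped pointer further down.
cudaMalloc((void **) &depth_map_ptr, sizeof(float) * size);
size_t size_depth = 0; // filled in by cudaGraphicsResourceGetMappedPointer
/////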

///// create pbo, and register CUDA<>openGL
glGenBuffers(1, &gl_buffer_depth);
glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_buffer_depth);
glBufferData(GL_PIXEL_PACK_BUFFER, sizeof(float) * size, NULL, GL_DYNAMIC_COPY); // size in bytes, not elements
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
cudaGraphicsGLRegisterBuffer(&cudaResourceDepth, gl_buffer_depth, cudaGraphicsRegisterFlagsNone);
/////

///// take depth to pbo
glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_buffer_depth);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, 0); // this still takes me 2 ms
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0); // unbind before CUDA maps the buffer
///// map to CUDA
cudaGraphicsMapResources(1, &cudaResourceDepth, 0);
cudaGraphicsResourceGetMappedPointer((void **)&depth_map_ptr, &size_depth, cudaResourceDepth);

///// CUDA kernel (runs here: depth_map_ptr is only valid while the resource is mapped)

cudaGraphicsUnmapResources(1, &cudaResourceDepth, 0);

///// END