(I have posted this question before but got no reply; probably the people who might know the answer didn't see it, so I am summarizing it here again.)
I want to use results rendered by OpenGL in CUDA. For this application, I have read the PostProcessGL example and a lot of material about CUDA/OpenGL interoperability. However, I still have a speed issue. I hope someone can give me some pointers; many thanks.
First, I used glReadPixels to copy the depth/image into HOST memory, and then cudaMemcpy to transfer it into a cudaMalloc'd buffer in DEVICE memory. This is the trivial approach, and it takes less than 1.5 ms per frame for me. (The exact time depends on the resolution I use; treat this as an example.)
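For reference, the DEVICE > HOST > DEVICE path I timed looks roughly like this (a minimal sketch; `width`, `height`, and the buffer names are placeholders, and the OpenGL context set-up is omitted):

```
// Read the depth buffer back into HOST memory ...
float *h_depth = (float *)malloc(sizeof(float) * width * height);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, h_depth);

// ... then upload it to a cudaMalloc'd DEVICE buffer.
float *d_depth = NULL;
cudaMalloc((void **)&d_depth, sizeof(float) * width * height);
cudaMemcpy(d_depth, h_depth, sizeof(float) * width * height,
           cudaMemcpyHostToDevice);
```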
Then I tried to use CUDA/OpenGL interoperability to pass the depth/image from OpenGL (which is already in DEVICE memory) to CUDA directly, without going through HOST memory. In theory, this should be much faster than 1.5 ms per frame. Please see the code below; it works fine.
But the thing is, this path takes about 2 ms per frame, which is even longer than DEVICE > HOST > DEVICE, and that doesn't make sense to me. I hope someone can give me some direction.
– Code – (I use reading back the depth as an example; reading back the image is similar.)
///// OpenGL set-up / CUDA set-up, including the cudaGLSetGLDevice command
///// some variables
float *depth_map_ptr = NULL; // filled in later by cudaGraphicsResourceGetMappedPointer; no cudaMalloc needed
size_t size_depth = 0;       // out-parameter for the mapped buffer size
///// create the PBO, and register it for CUDA<>OpenGL interop
glGenBuffers(1, &gl_buffer_depth);
glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_buffer_depth);
glBufferData(GL_PIXEL_PACK_BUFFER, sizeof(float) * width * height, NULL, GL_DYNAMIC_COPY);
cudaGraphicsGLRegisterBuffer(&cudaResourceDepth, gl_buffer_depth, cudaGraphicsMapFlagsNone);
///// read the depth into the PBO (the last argument is an offset into the bound pack buffer, not a host pointer)
glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_buffer_depth);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, 0); // this alone still takes me ~2 ms
///// map to CUDA and get a device pointer
cudaGraphicsMapResources(1, &cudaResourceDepth, 0);
cudaGraphicsResourceGetMappedPointer((void **)&depth_map_ptr, &size_depth, cudaResourceDepth);
///// CUDA kernel runs here, on depth_map_ptr, while the resource is still mapped
cudaGraphicsUnmapResources(1, &cudaResourceDepth, 0);
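For completeness, the kernel step looks roughly like this (a minimal sketch; the kernel body and the output buffer `d_out` are placeholders, not my actual processing code):

```
// Placeholder kernel: invert the normalized depth values.
__global__ void process_depth(const float *depth, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 1.0f - depth[i];
}

// Launched between map and unmap, while depth_map_ptr is valid:
int n = width * height;
process_depth<<<(n + 255) / 256, 256>>>(depth_map_ptr, d_out, n);
cudaDeviceSynchronize(); // kernel launches are asynchronous; sync before stopping any timer
```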