Doing device->host copy in parallel

At the moment I’m spending about 20% of the time just doing device->host memcpys. Has anybody had any success doing this copy in parallel, i.e., reading back the results from launch N while launch N+1 is in flight?

So far the only approach I’ve tried that has gotten anywhere is to cudaMemcpy the result to a PBO, glMapBuffer the PBO, and then use another thread to memcpy from that pointer. However, if I read the docs right, I have to call cudaGLRegisterBufferObject and cudaGLUnregisterBufferObject every frame for this to be safe, and that wipes out any performance gains.
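For readers trying to picture the host side of this, here is a minimal host-only sketch of the overlap, not the code from the app in question. It assumes two staging buffers used alternately; plain std::vector storage stands in for the pointer returned by glMapBuffer, and produce_frame and run_pipeline are hypothetical names invented for illustration. No CUDA or OpenGL calls appear, so it says nothing about whether the registration shortcut is safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical stand-in for "run launch N and expose its result through a
// mapped pointer". In the setup described above this would be the kernel
// launch, the cudaMemcpy into the PBO and the glMapBuffer call.
static void produce_frame(int frame, float* staging, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        staging[i] = static_cast<float>(frame) * 1000.0f + static_cast<float>(i);
}

// Runs `frames` iterations with two staging buffers. The copy of frame N runs
// on a worker thread while the main thread produces frame N+1.
std::vector<float> run_pipeline(int frames, std::size_t n) {
    assert(frames >= 0);
    std::vector<float> results(static_cast<std::size_t>(frames) * n);
    std::vector<float> staging[2] = {std::vector<float>(n), std::vector<float>(n)};
    std::thread copier[2];

    for (int f = 0; f < frames; ++f) {
        const int slot = f % 2;
        // Each slot is reused every second frame, so wait for its previous
        // copy to finish before overwriting the staging data.
        if (copier[slot].joinable()) copier[slot].join();

        produce_frame(f, staging[slot].data(), n);

        float* dst = results.data() + static_cast<std::size_t>(f) * n;
        const float* src = staging[slot].data();
        // Hand the copy to a worker thread and move straight on to the next frame.
        copier[slot] = std::thread([dst, src, n] {
            std::memcpy(dst, src, n * sizeof(float));
        });
    }

    // Drain any copies still in flight before returning the results.
    for (auto& t : copier)
        if (t.joinable()) t.join();
    return results;
}
```

Calling run_pipeline(4, 8) returns 32 floats, with frame f occupying elements f*8 through f*8+7. With only two staging buffers the main thread stalls whenever a copy takes longer than producing the next frame, which may be part of why only some of the transfer cost gets hidden.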

Currently, data transfers between host and device cannot be done in parallel with kernel execution.
Future releases might support this though.

That’s why I’m trying to use the CUDA-OpenGL interface to hack this. The approach I mentioned above does generate a speedup, and I appear to be getting correct results. All I need to know is whether I’m going to get bitten by a race condition later on from calling glMapBuffer on a buffer that is registered (but not mapped) with CUDA.

Thanks

Bruce

We’re also very, very interested in this.

Bruce, how much success have you had with this using the CUDA-OpenGL interface? Would you be willing to share some examples of what you’ve done?

Thanks,

Jamie

This is being developed for a proprietary app, so unfortunately I’m not in a position to post code samples. The short answer is that doing it as I described, while playing fast and loose with the buffer registration (leaving the buffer always registered; I’m still trying to find out whether that is legal with glMapBuffer), amortises about half the cost of the transfer on a dual-core CPU. Registering and unregistering the buffer every frame has enough overhead that it becomes a net loss.

If you find any other approaches that look promising I’d be interested too.

Interesting technique, but I’m surprised this is faster.

The intent of cudaGLRegisterBufferObject is that you should only have to do it once when you create the buffer object, not every frame.

Unfortunately there’s a bug in the current release that means you need to do this every frame (at least if you’re using more than one buffer object). We’re hoping to get this fixed for the next release.

We’re continuing to work on optimizing memory transfers.