Doing device->host copy in parallel

bmerry · April 11, 2007, 7:39am

At the moment I’m spending about 20% of the time just doing device->host memcpys. Has anybody had any success in doing this copy in parallel, i.e. reading back the results from launch N while launch N+1 is in flight?

So far the only approach I’ve tried that has gotten anywhere is to cudaMemcpy the result to a PBO, glMapBuffer the PBO, and then use another thread to memcpy from that pointer. However, if I read the docs right, I have to call cudaGLRegisterBufferObject and cudaGLUnregisterBufferObject every frame for this to be safe, and that wipes out any performance gains.
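For readers trying to picture the steps above, here is a rough sketch of one frame of that readback, not the poster's actual code. It assumes an OpenGL context is current on the calling thread, that GLEW supplies the buffer-object entry points, and that the PBO was created elsewhere with room for `bytes` bytes. The helper name `readbackViaPbo` and its parameters are hypothetical, and `std::thread` is a modern stand-in for whatever threading library is in use. It shows the "safe" variant that registers and unregisters on every call.

```cuda
#include <GL/glew.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <cstring>
#include <thread>

// Hypothetical helper: moves `bytes` of kernel output from d_result (device)
// to h_out (host) by way of a pixel buffer object.
// Assumes a current GL context and an already allocated PBO of >= bytes.
void readbackViaPbo(GLuint pbo, const void* d_result, void* h_out, size_t bytes)
{
    // Give CUDA access to the PBO and copy the result into it on the device.
    cudaGLRegisterBufferObject(pbo);
    void* d_pbo = NULL;
    cudaGLMapBufferObject(&d_pbo, pbo);
    cudaMemcpy(d_pbo, d_result, bytes, cudaMemcpyDeviceToDevice);
    cudaGLUnmapBufferObject(pbo);
    cudaGLUnregisterBufferObject(pbo);

    // Map the same buffer through OpenGL to obtain a host-visible pointer.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    const void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    if (src != NULL) {
        // A second thread drains the mapped pointer into host memory.
        std::thread worker([=] { std::memcpy(h_out, src, bytes); });

        // The next kernel launch would be issued here, while the worker copies.

        worker.join();
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```

Moving the register call to buffer creation and dropping the unregister call gives the always-registered variant, whose safety alongside glMapBuffer is the open question here.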

Cyril_Zeller · April 12, 2007, 5:42am

Currently, data transfers between host and device cannot be done in parallel with kernel execution.
Future releases might support this though.
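For illustration only: later CUDA releases expose this kind of overlap through streams and cudaMemcpyAsync, neither of which is available in the release under discussion. The sketch below assumes such a release and a device that can copy and execute concurrently; the `compute` kernel and the buffer sizes are made up for the example. It double-buffers so that the readback of launch N can proceed while launch N+1 runs in the other stream.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-launch work.
__global__ void compute(float* out, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scale * i;
}

int main()
{
    const int n = 1 << 20;
    const int launches = 4;
    const size_t bytes = n * sizeof(float);

    float* d_buf[2];
    float* h_buf[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], bytes);
        cudaMallocHost(&h_buf[b], bytes);  // page-locked memory, required for async copies
        cudaStreamCreate(&stream[b]);
    }

    for (int k = 0; k < launches; ++k) {
        int b = k % 2;

        // Before reusing this buffer pair, wait for launch k-2 and consume its result.
        cudaStreamSynchronize(stream[b]);
        if (k >= 2) printf("launch %d: h[1] = %f\n", k - 2, h_buf[b][1]);

        // Kernel and readback are queued in the same stream, so the copy waits
        // for its own kernel but may overlap the kernel in the other stream.
        compute<<<(n + 255) / 256, 256, 0, stream[b]>>>(d_buf[b], n, float(k + 1));
        cudaMemcpyAsync(h_buf[b], d_buf[b], bytes, cudaMemcpyDeviceToHost, stream[b]);
    }

    // Drain the last two launches in order.
    for (int k = launches - 2; k < launches; ++k) {
        int b = k % 2;
        cudaStreamSynchronize(stream[b]);
        printf("launch %d: h[1] = %f\n", k, h_buf[b][1]);
    }

    for (int b = 0; b < 2; ++b) {
        cudaStreamDestroy(stream[b]);
        cudaFreeHost(h_buf[b]);
        cudaFree(d_buf[b]);
    }
    return cudaGetLastError() == cudaSuccess ? 0 : 1;
}
```

Whether the copy and the kernel truly run at the same time depends on the hardware; on a device without that capability the program still produces the same results, just serially.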

bmerry · April 12, 2007, 11:12am

That’s why I’m trying to use the CUDA-OpenGL interface to hack this. The approach I mentioned above does generate a speedup and I appear to be getting correct results; all I need to know is whether I’m going to get bitten by a race condition later on due to using glMapBuffer on a buffer that is registered (but not mapped) with CUDA.

Thanks

Bruce

e.ping · April 12, 2007, 2:02pm

We’re also very, very interested in this.

Bruce, how much success have you had using the CUDA-OpenGL interface for this? Would you be willing to share some examples of what you’ve done?

Thanks,

Jamie

bmerry · April 12, 2007, 2:12pm

This is being developed for a proprietary app, so unfortunately I’m not in a position to post code samples. The short answer is that doing it as I described and playing fast and loose with the buffer registration (leaving the buffer always registered, which I’m still trying to find out is legal with glMapBuffer) amortises about half the cost of the transfer on a dual-core CPU. Registering and unregistering the buffer every frame has enough overhead that it becomes a net loss.

If you find any other approaches that look promising I’d be interested too.

Simon_Green · April 12, 2007, 2:40pm

Interesting technique, but I’m surprised this is faster.

The intent of cudaGLRegisterBufferObject is that you should only have to do it once when you create the buffer object, not every frame.

Unfortunately there’s a bug in the current release that means you need to do this every frame (at least if you’re using more than one buffer object). We’re hoping to get this fixed for the next release.

We’re continuing to work on optimizing memory transfers.
