Best way to get result back to the host?


If I have a kernel that operates on a group of pixels and generates a single 32-bit result, what is the best way to get it back to the host? I’m currently doing the following:

  1. Allocate a 32-bit block of global device memory.
  2. Run the kernel giving it the address of the allocated memory.
  3. Memcpy the result from the device back to the host.
  4. Free the 32-bit block of device memory.
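
For reference, the four steps above can be sketched roughly as follows. The kernel `sumPixels` and the wrapper `runAndFetch` are hypothetical names made up for illustration, and the single-thread reduction is only a placeholder for whatever the real kernel computes.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical kernel: reduces a group of pixels to one 32-bit value.
// A single thread does the work here purely to keep the sketch short.
__global__ void sumPixels(const uint8_t* pixels, int n, uint32_t* result)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        uint32_t sum = 0;
        for (int i = 0; i < n; ++i) sum += pixels[i];
        *result = sum;
    }
}

// d_pixels is assumed to already be in device memory.
cudaError_t runAndFetch(const uint8_t* d_pixels, int n, uint32_t* h_result)
{
    uint32_t* d_result = nullptr;

    // 1. Allocate a 32-bit block of global device memory.
    cudaError_t err = cudaMalloc(&d_result, sizeof(uint32_t));
    if (err != cudaSuccess) return err;

    // 2. Run the kernel, giving it the address of the allocated memory.
    sumPixels<<<1, 1>>>(d_pixels, n, d_result);
    err = cudaGetLastError();

    // 3. Copy the result back; this blocking copy also waits for the kernel.
    if (err == cudaSuccess)
        err = cudaMemcpy(h_result, d_result, sizeof(uint32_t),
                         cudaMemcpyDeviceToHost);

    // 4. Free the 32-bit block of device memory.
    cudaFree(d_result);
    return err;
}
```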

I can pass parameters from the host to the device kernel, so is there some simpler way to get a small result back?


Even for returning a single number, what you’re doing is about as good as it gets. There are cases where I’ve done exactly that.

Other options:

  1. Reuse your input buffer, clobbering part of it to store the output. You still cudaMemcpy but you save one alloc/free.
  2. Store the result to a global device variable and use cudaMemcpyFromSymbol.
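
A minimal sketch of option 2, assuming a single `__device__` variable is enough (so only one result is in flight at a time). The names `g_result`, `countAbove` and `fetchCount` are invented for the example; the runtime call that reads a device symbol is `cudaMemcpyFromSymbol`.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical global device variable that holds the kernel's result.
__device__ uint32_t g_result;

// Hypothetical kernel: counts pixels brighter than a threshold.
__global__ void countAbove(const uint8_t* pixels, int n, uint8_t threshold)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        uint32_t count = 0;
        for (int i = 0; i < n; ++i)
            if (pixels[i] > threshold) ++count;
        g_result = count;   // no output pointer needed
    }
}

// d_pixels is assumed to already be in device memory.
cudaError_t fetchCount(const uint8_t* d_pixels, int n, uint8_t threshold,
                       uint32_t* h_result)
{
    countAbove<<<1, 1>>>(d_pixels, n, threshold);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Blocking device-to-host copy from the symbol; no cudaMalloc/cudaFree.
    return cudaMemcpyFromSymbol(h_result, g_result, sizeof(uint32_t));
}
```

This saves the alloc/free pair, but the copy itself is still there.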

Zero-copy would probably do very well here…
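
Roughly what that could look like, as a sketch only: the kernel writes its result directly into mapped pinned host memory, so there is no explicit memcpy. `maxPixel` and `maxViaZeroCopy` are hypothetical names. This assumes the device supports mapped host memory; on older setups you may also need `cudaSetDeviceFlags(cudaDeviceMapHost)` before any other CUDA call, which is not shown here.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical kernel: finds the brightest pixel.
__global__ void maxPixel(const uint8_t* pixels, int n, uint32_t* result)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        uint32_t best = 0;
        for (int i = 0; i < n; ++i)
            if (pixels[i] > best) best = pixels[i];
        *result = best;   // lands directly in mapped host memory
    }
}

// d_pixels is assumed to already be in device memory.
cudaError_t maxViaZeroCopy(const uint8_t* d_pixels, int n, uint32_t* out)
{
    uint32_t* h_result = nullptr;   // host view of the pinned word
    uint32_t* d_alias  = nullptr;   // device view of the same word

    cudaError_t err = cudaHostAlloc((void**)&h_result, sizeof(uint32_t),
                                    cudaHostAllocMapped);
    if (err != cudaSuccess) return err;

    err = cudaHostGetDevicePointer((void**)&d_alias, h_result, 0);
    if (err == cudaSuccess) {
        maxPixel<<<1, 1>>>(d_pixels, n, d_alias);
        // The host must wait for the kernel before reading the result.
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) *out = *h_result;

    cudaFreeHost(h_result);
    return err;
}
```

Allocating and freeing the pinned word on every call, as done here for brevity, is the expensive part; in practice you would keep it around.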

Thank you James K and tmurray,

I kind of had the feeling this was the case. I’m trying to set up a general framework for offloading certain types of image processing to a CUDA device without the caller knowing too much about the platform (might not be possible, but I thought I’d try). As part of this I would like to be able to create an asynchronous kernel that can have multiple instances spawned off in separate streams. Because of this I cannot necessarily create a device symbol and use cudaMemcpyFromSymbol, since I will need an indeterminate number of them. I was also starting to play around with zero-copy (while waiting for my GTX285 to arrive), but discovered that allocating pinned memory is very expensive, so much so that it is very inefficient to alloc/free for all the possible one-off cases I might have. I might look into allocating a block of them to be used as needed and, if more are requested, either abort or add another block to a deque of blocks.
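
Something along these lines is what I have in mind for the block of result slots. This is only a rough sketch: `ResultSlotPool` and `writeValue` are names I made up, it assumes mapped pinned memory is available, it is not thread-safe, and it takes the "abort" route (returns -1) when the block is exhausted rather than growing a deque of blocks.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Hypothetical pool: one pinned, mapped allocation carved into 32-bit slots,
// so the expensive cudaHostAlloc is paid once instead of once per kernel.
class ResultSlotPool {
public:
    explicit ResultSlotPool(int capacity)
    {
        if (cudaHostAlloc((void**)&host_, capacity * sizeof(uint32_t),
                          cudaHostAllocMapped) != cudaSuccess) {
            host_ = nullptr;
            return;
        }
        if (cudaHostGetDevicePointer((void**)&device_, host_, 0) != cudaSuccess) {
            cudaFreeHost(host_);
            host_ = nullptr;
            return;
        }
        // Push indices in reverse so slot 0 is handed out first.
        for (int i = capacity - 1; i >= 0; --i) free_.push_back(i);
    }
    ~ResultSlotPool() { if (host_) cudaFreeHost(host_); }
    ResultSlotPool(const ResultSlotPool&) = delete;
    ResultSlotPool& operator=(const ResultSlotPool&) = delete;

    bool ok() const { return host_ != nullptr; }

    // Returns a slot index, or -1 if every slot is in use.
    int acquire()
    {
        if (free_.empty()) return -1;
        int slot = free_.back();
        free_.pop_back();
        return slot;
    }
    void release(int slot) { free_.push_back(slot); }

    uint32_t* hostPtr(int slot) const   { return host_ + slot; }
    uint32_t* devicePtr(int slot) const { return device_ + slot; }

private:
    uint32_t* host_ = nullptr;     // host view of the pinned block
    uint32_t* device_ = nullptr;   // device view of the same block
    std::vector<int> free_;        // indices of unused slots
};

// Stand-in kernel: a real one would compute its result from the pixels.
__global__ void writeValue(uint32_t* result, uint32_t value)
{
    *result = value;
}
```

Each asynchronous instance would acquire a slot, launch with `devicePtr(slot)` in its own stream, read `*hostPtr(slot)` after `cudaStreamSynchronize` on that stream, and then release the slot.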

Thanks again,