If I have a kernel that operates on a group of pixels and generates a single 32-bit result, what is the best way to get it back to the host? I’m currently doing the following:
- Allocate a 32-bit block of global device memory.
- Run the kernel giving it the address of the allocated memory.
- Memcpy the result from the device back to the host.
- Free the 32-bit block of device memory.
I can pass parameters from the host to the device kernel, so is there some simpler way to get a small result back?
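For reference, the four steps above can be sketched roughly as follows. The kernel `sumPixels` and the wrapper `sumPixelsOnDevice` are hypothetical names made up for illustration (the real kernel is not shown in this thread), the pixels are assumed to already be on the device with `n > 0`, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: reduces a group of 8-bit pixels to one 32-bit sum.
// (unsigned int is 32 bits wide in CUDA.)
__global__ void sumPixels(const unsigned char* pixels, int n, unsigned int* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(result, (unsigned int)pixels[i]);
}

// Host wrapper that follows the four steps from the list above.
unsigned int sumPixelsOnDevice(const unsigned char* d_pixels, int n)
{
    unsigned int* d_result = nullptr;
    unsigned int  h_result = 0;

    cudaMalloc((void**)&d_result, sizeof(unsigned int));       // 1. allocate 32 bits
    cudaMemset(d_result, 0, sizeof(unsigned int));             //    start the sum at zero
    sumPixels<<<(n + 255) / 256, 256>>>(d_pixels, n, d_result); // 2. run the kernel
    cudaMemcpy(&h_result, d_result, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);                        // 3. copy back (waits for the kernel)
    cudaFree(d_result);                                        // 4. free
    return h_result;
}
```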
Even for returning a single number, what you’re doing is about as good as it gets. There are cases where I’ve done exactly that. Two alternatives:
- Reuse your input buffer, clobbering part of it to store the output. You still cudaMemcpy but you save one alloc/free.
- Store the result in a global `__device__` variable and read it back with cudaMemcpyFromSymbol.
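A minimal sketch of the device-variable option, under the same assumptions as before (pixels already on the device, `n > 0`, no error checking). The names `d_result`, `sumPixelsToSymbol` and `sumPixelsViaSymbol` are illustrative only.

```cuda
#include <cuda_runtime.h>

// Result lives in a statically declared device variable: no cudaMalloc/cudaFree.
__device__ unsigned int d_result;

// Hypothetical kernel: accumulates the pixel sum directly into d_result.
__global__ void sumPixelsToSymbol(const unsigned char* pixels, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&d_result, (unsigned int)pixels[i]);
}

unsigned int sumPixelsViaSymbol(const unsigned char* d_pixels, int n)
{
    unsigned int h_result = 0;
    cudaMemcpyToSymbol(d_result, &h_result, sizeof(h_result));   // reset to zero
    sumPixelsToSymbol<<<(n + 255) / 256, 256>>>(d_pixels, n);
    cudaMemcpyFromSymbol(&h_result, d_result, sizeof(h_result)); // copy back (waits for the kernel)
    return h_result;
}
```

Note that there is only one `d_result`, so two launches running concurrently would write to the same location.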
Zero-copy would probably do very well here…
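A rough sketch of what zero-copy could look like for a single 32-bit result. It assumes a device that supports mapped pinned memory, and a runtime where mapping is enabled by default (older toolkits needed `cudaSetDeviceFlags(cudaDeviceMapHost)` before the context was created). The kernel and wrapper names are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Hypothetical single-block kernel: reduces in shared memory, then thread 0
// writes the one 32-bit result straight into mapped host memory.
__global__ void sumPixelsMapped(const unsigned char* pixels, int n, unsigned int* result)
{
    __shared__ unsigned int partial[256];   // must match the launch block size
    unsigned int sum = 0;
    for (int i = threadIdx.x; i < n; i += blockDim.x) sum += pixels[i];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) *result = partial[0];
}

unsigned int sumPixelsZeroCopy(const unsigned char* d_pixels, int n)
{
    unsigned int* h_result = nullptr;   // host view of the pinned word
    unsigned int* d_alias  = nullptr;   // device view of the same word

    cudaHostAlloc((void**)&h_result, sizeof(unsigned int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_alias, h_result, 0);

    sumPixelsMapped<<<1, 256>>>(d_pixels, n, d_alias);
    cudaDeviceSynchronize();            // no memcpy; just wait for the kernel to finish

    unsigned int value = *h_result;
    cudaFreeHost(h_result);
    return value;
}
```

The pinned allocation is done per call here only to keep the sketch short; in practice it would be allocated once and reused.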
Thank you James K and tmurray,
I kind of had the feeling this was the case. I’m trying to set up a general framework for offloading certain types of image processing to a CUDA device without the caller knowing too much about the platform (might not be possible, but I thought I’d try). As part of this I would like to be able to create an asynchronous kernel that can have multiple instances spawned off in separate streams. Because of this I cannot necessarily create a device symbol and use cudaMemcpyFromSymbol, since I will need an indeterminate number of them. I was also starting to play around with zero-copy (while waiting for my GTX285 to arrive), but discovered that allocating pinned memory is very expensive, so much so that it is very inefficient to alloc/free for all the possible one-off cases I might have. I might look into allocating a block of result slots to be used as needed, and if more are requested, either abort or add another block to a deque of blocks.
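A minimal sketch of that block-of-slots idea, assuming 32-bit results and the "add another block" policy rather than aborting. `PinnedSlotPool` and its members are hypothetical names for illustration; the class is not thread-safe and does no error reporting beyond returning `nullptr`.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <deque>
#include <vector>

// Hands out 32-bit pinned (mapped) result slots so that the expensive
// cudaHostAlloc call happens once per block instead of once per kernel launch.
class PinnedSlotPool {
public:
    explicit PinnedSlotPool(size_t slotsPerBlock) : slotsPerBlock_(slotsPerBlock) {}
    PinnedSlotPool(const PinnedSlotPool&) = delete;
    PinnedSlotPool& operator=(const PinnedSlotPool&) = delete;
    ~PinnedSlotPool() { for (unsigned int* block : blocks_) cudaFreeHost(block); }

    // Returns a free slot, growing the pool by one block if none is left.
    // Returns nullptr if a new block cannot be allocated.
    unsigned int* acquire() {
        if (freeSlots_.empty() && !grow()) return nullptr;
        unsigned int* slot = freeSlots_.back();
        freeSlots_.pop_back();
        return slot;
    }

    // Gives a slot back once its stream has finished and the value was read.
    void release(unsigned int* slot) { freeSlots_.push_back(slot); }

    size_t blockCount() const { return blocks_.size(); }

private:
    bool grow() {
        if (slotsPerBlock_ == 0) return false;
        unsigned int* block = nullptr;
        if (cudaHostAlloc((void**)&block, slotsPerBlock_ * sizeof(unsigned int),
                          cudaHostAllocMapped) != cudaSuccess)
            return false;
        blocks_.push_back(block);
        for (size_t i = 0; i < slotsPerBlock_; ++i) freeSlots_.push_back(block + i);
        return true;
    }

    size_t slotsPerBlock_;
    std::deque<unsigned int*> blocks_;      // owned pinned allocations
    std::vector<unsigned int*> freeSlots_;  // slots currently available
};
```

Each asynchronous launch would acquire a slot, pass the pointer from `cudaHostGetDevicePointer` to its kernel, call `cudaStreamSynchronize` on its stream before reading the slot, and then release it.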