Device to host data transfers: optimum method to transfer scalars?

When transferring data between the host and device, it’s obvious that much of the latency is in setting up the transfer, independent of the amount of data being sent. So in general, transferring a large amount of data is more efficient than transferring a small amount, since the setup overhead is fairly constant.

Also, sending a single value or “scalar” to the device can be done as a kernel argument, but there seems to be no efficient way to return a single value since all kernel functions must have a “void” return type.

What is the most efficient method to return a scalar value from the device to the host?

I don’t think that there is an efficient method. If you are also copying back an array, appending the scalar to the end of the array will keep you from having to pay the setup cost twice.
I know the arguments to a kernel call end up stored in the shared memory, though I don’t know whether they are bumped through global memory first; I would guess they are. Either way, CUDA doesn’t give you access to the reverse functionality, not that it would necessarily be any faster.
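
To make the "append the scalar to the array" idea concrete, here is a minimal sketch. The kernel name scaleAndSum, the host helper runScaleAndSum, and the choice of a sum as the scalar are all made up for illustration. It also assumes a device that supports atomicAdd on floats (compute capability 2.0 or later), and it leaves out error checking to stay short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each input element and also accumulates the
// sum of the doubled values into the extra slot out[n] at the end of the
// output buffer.
__global__ void scaleAndSum(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = 2.0f * in[i];
        out[i] = v;
        atomicAdd(&out[n], v);  // float atomicAdd: compute capability 2.0+
    }
}

// Hypothetical host helper: runs the kernel, then fetches the array AND the
// scalar with a single device-to-host copy. The array ends up in
// hostOut[0..n-1]; the scalar is the return value.
float runScaleAndSum(const std::vector<float>& hostIn,
                     std::vector<float>& hostOut)
{
    const int n = static_cast<int>(hostIn.size());
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, (n + 1) * sizeof(float));  // one extra slot for the scalar

    cudaMemcpy(dIn, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut + n, 0, sizeof(float));      // zero the scalar slot

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleAndSum<<<blocks, threads>>>(dIn, dOut, n);

    // One transfer brings back n array elements plus the trailing scalar.
    hostOut.resize(n + 1);
    cudaMemcpy(hostOut.data(), dOut, (n + 1) * sizeof(float),
               cudaMemcpyDeviceToHost);

    const float scalar = hostOut[n];
    hostOut.resize(n);  // drop the scalar slot from the returned array

    cudaFree(dIn);
    cudaFree(dOut);
    return scalar;
}
```

The only difference from an ordinary array copy is the one extra element in the allocation and in the cudaMemcpy size, so the scalar comes along without a second transfer.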

Alternatively, plan to do enough computation on the GPU that setting up the communication takes much less time than the computation itself, so that communication costs can be neglected. For example, make several passes over different input data and then return an array of scalars (if the passes can be done independently, of course).
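
A rough sketch of the "several passes, then return an array of scalars" suggestion, under some assumptions: the data sets are independent, all have the same length, and are packed back to back in one buffer. The names sumEachSet, sumAllSets, and kThreads are hypothetical, the per-set scalar is just a sum, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block; must be a power of two

// Hypothetical kernel: block b reduces the b-th data set (len floats) to a
// single sum and writes it to results[b].
__global__ void sumEachSet(const float* data, float* results, int len)
{
    __shared__ float partial[kThreads];
    const float* set = data + static_cast<size_t>(blockIdx.x) * len;

    // Each thread accumulates a strided slice of this block's data set.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < len; i += blockDim.x)
        acc += set[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        results[blockIdx.x] = partial[0];
}

// Hypothetical host helper: one launch handles every data set, and one
// device-to-host copy returns all the scalars together.
std::vector<float> sumAllSets(const std::vector<float>& hostData,
                              int numSets, int len)
{
    float* dData = nullptr;
    float* dResults = nullptr;
    cudaMalloc(&dData, hostData.size() * sizeof(float));
    cudaMalloc(&dResults, numSets * sizeof(float));
    cudaMemcpy(dData, hostData.data(), hostData.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    sumEachSet<<<numSets, kThreads>>>(dData, dResults, len);

    std::vector<float> hostResults(numSets);
    cudaMemcpy(hostResults.data(), dResults, numSets * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(dData);
    cudaFree(dResults);
    return hostResults;
}
```

With this layout the transfer setup cost is paid once for numSets answers instead of once per answer, and the extra blocks also give the GPU more threads to keep busy.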

Sometimes it also makes sense to run a slow kernel on the GPU (one that gives no benefit over computing the same thing on the CPU) just so that you don’t need this transfer at all.

Yes, CUDA really forces you to “think different” :-)

Thanks, very good advice!

Currently, I am adding complexity to the algorithm so that it returns an array of answers. In this case, that is the best approach anyway, since I was a bit light on the number of threads I was using…

With FPGAs, data can be read in “burst” mode across the PCI bus, or registers can be memory mapped for quick access to scalars. I was hoping there was a memory-mapped register equivalent in the GPU realm.