Is there an alternative way to copy a single result from the GPU to the host?
For example, after doing a reduction on the GPU, I want to copy the result back. I always use cudaMemcpy, but then I have to set up allocations (cudaMalloc) for both the host and the GPU, and it gets tedious doing that for several single-value results. Is there a shorter way to copy a value back to the host?
*pokes head up and looks around*
I’d just use Thrust, tbh.
thrust::reduce returns the value directly to the host, so it hides the copy and synchronization behind its interface. Keep in mind that this interface is not zero-cost.
To get a result from the GPU back to the host in a way that you want, you will have to invoke a hard synchronization mechanism and have a copy from the device to the host. This stuff isn’t free so don’t blame Thrust if this becomes problematic from a performance perspective.
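As a minimal sketch of what that looks like (the function name sum_on_device and the input data are made up for illustration), thrust::reduce hands the result back as an ordinary return value, with the copy and synchronization happening inside the call:

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

// Hypothetical example function: sums n copies of `value` on the GPU
// and returns the total to the host.
int sum_on_device(int n, int value)
{
    // Allocated and filled on the device; no explicit cudaMalloc needed.
    thrust::device_vector<int> d_data(n, value);

    // The device-to-host copy of the result and the synchronization it
    // requires happen inside thrust::reduce before it returns.
    return thrust::reduce(d_data.begin(), d_data.end(), 0, thrust::plus<int>());
}
```

Calling sum_on_device(1000, 2) from host code gives you a plain int; there is no result buffer to manage yourself.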
Note that you can also wrap your common calls in host-side functions yourself, the way Thrust does for you. That gives you more control over, and a clearer picture of, what each call costs.
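A rough sketch of such a wrapper, assuming the result already lives in device memory. The names read_scalar_from_device, write_answer and demo_read_scalar are hypothetical, not part of any library. Note that the host side only needs an ordinary variable here; cudaMalloc is required for the device buffer only.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>

// Hypothetical helper: copies one value of type T from device memory
// into an ordinary host variable and returns it.
template <typename T>
T read_scalar_from_device(const T* d_ptr)
{
    T h_value;  // plain host variable, no host-side allocation call needed
    cudaError_t err = cudaMemcpy(&h_value, d_ptr, sizeof(T), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        throw std::runtime_error(cudaGetErrorString(err));
    }
    return h_value;
}

// Illustrative kernel standing in for the final step of a reduction.
__global__ void write_answer(int* out) { *out = 42; }

// Hypothetical usage: one device allocation, one kernel, one wrapped copy.
int demo_read_scalar()
{
    int* d_result = nullptr;
    cudaMalloc((void**)&d_result, sizeof(int));
    write_answer<<<1, 1>>>(d_result);
    // cudaMemcpy on the default stream waits for the kernel above to finish.
    int value = read_scalar_from_device(d_result);
    cudaFree(d_result);
    return value;
}
```

The copy and the synchronization are still there, they are just in one place, so you can see and measure them.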
CUDA unified memory eliminates some of the tedium on supported platforms.
You can also use zero-copy techniques. Both are described in the CUDA Programming Guide, and there are various sample codes for each.
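A minimal sketch of both approaches, assuming a device and driver that support managed memory and mapped pinned memory (older setups may need cudaSetDeviceFlags(cudaDeviceMapHost) before the zero-copy path). The kernel and function names are made up for illustration, and error checking is left out for brevity:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel that writes a single result value.
__global__ void store_value(int* out, int v) { *out = v; }

// Unified memory: one pointer is valid on both host and device.
int result_via_managed_memory(int v)
{
    int* result = nullptr;
    cudaMallocManaged((void**)&result, sizeof(int));
    store_value<<<1, 1>>>(result, v);
    cudaDeviceSynchronize();  // the host must wait before reading the result
    int value = *result;      // no explicit cudaMemcpy
    cudaFree(result);
    return value;
}

// Zero-copy: the kernel writes straight into pinned, mapped host memory.
int result_via_zero_copy(int v)
{
    int* h_result = nullptr;
    int* d_alias = nullptr;
    cudaHostAlloc((void**)&h_result, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_alias, h_result, 0);
    store_value<<<1, 1>>>(d_alias, v);
    cudaDeviceSynchronize();  // still needed so the write is complete and visible
    int value = *h_result;
    cudaFreeHost(h_result);
    return value;
}
```

Neither version removes the synchronization; they only remove the explicit copy call and the second buffer.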
I tried Thrust, but I need to convert raw pointers to Thrust pointers, and I took a performance hit from that.
As for unified memory: how much of a performance hit is cudaDeviceSynchronize() compared to a regular cudaMemcpy?
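For reference, a small sketch of the two usual ways to hand a raw cudaMalloc pointer to Thrust (sum_raw_buffer is a hypothetical name). thrust::device_pointer_cast only wraps the pointer and does not copy the data, so if a slowdown shows up it may be coming from somewhere else, for example temporary allocations inside the algorithm or the synchronization when the result is returned:

```cuda
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>

// Hypothetical example: fills a raw device buffer with ones and sums it.
int sum_raw_buffer(int n)
{
    int* d_raw = nullptr;
    cudaMalloc((void**)&d_raw, n * sizeof(int));

    // Option 1: wrap the raw pointer in a thrust::device_ptr (no data is copied).
    thrust::device_ptr<int> d_begin = thrust::device_pointer_cast(d_raw);
    thrust::fill(d_begin, d_begin + n, 1);

    // Option 2: pass the raw pointers directly and tell Thrust they are
    // device pointers through the thrust::device execution policy.
    int total = thrust::reduce(thrust::device, d_raw, d_raw + n, 0);

    cudaFree(d_raw);
    return total;
}
```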