I was wondering: if my code is embarrassingly parallel, so that threads do not need to communicate, can I save GPU shared memory by having each thread copy its output (a single float value) to RAM? Will the I/O (passing through GPU memory) block, slow down, or corrupt the work of other threads? How can I achieve that? With an asynchronous memory copy?
Use zero-copy to store results directly to host memory if your intention is to save memory on the device. Allocate memory on the host using `cudaHostAlloc()` with the `cudaHostAllocMapped` flag (optionally set `cudaHostAllocWriteCombined` as well if you only ever write to the memory from the GPU side). Then obtain a device pointer for the host memory via `cudaHostGetDevicePointer()`. This device pointer can then be used like any other device pointer, even though the memory resides on the host.
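Roughly, the steps above could look like the sketch below. It is only an illustration under some assumptions: the kernel `compute`, the helper `run_zero_copy`, the placeholder formula `2.0f * i`, and the launch configuration are made up for the example, and the device is assumed to support mapped pinned memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread produces one float and writes it
// straight into mapped host memory through the device pointer.
__global__ void compute(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * i;  // placeholder for the real per-thread result
}

// Hypothetical helper: returns the number of wrong results, or -1 on a CUDA error.
int run_zero_copy(int n)
{
    float *h_out = nullptr;  // host pointer to the pinned, mapped buffer
    float *d_out = nullptr;  // device-side alias of the same buffer

    // Older setups need mapping enabled before the context is created.
    // The result is ignored here on purpose; any error is cleared.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    // Add "| cudaHostAllocWriteCombined" only if the CPU rarely reads the buffer.
    if (cudaHostAlloc((void **)&h_out, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess)
        return -1;

    if (cudaHostGetDevicePointer((void **)&d_out, h_out, 0) != cudaSuccess) {
        cudaFreeHost(h_out);
        return -1;
    }

    int threads = 256;  // assumed block size
    int blocks = (n + threads - 1) / threads;
    compute<<<blocks, threads>>>(d_out, n);

    // No cudaMemcpy is needed, but the host must wait for the kernel
    // to finish before reading the results.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFreeHost(h_out);
        return -1;
    }

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * i)
            ++bad;

    cudaFreeHost(h_out);
    return bad;
}

int main()
{
    int bad = run_zero_copy(1 << 20);
    if (bad < 0) {
        printf("CUDA error\n");
        return 1;
    }
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Each thread writes only its own element, so the writes from different threads do not interfere with each other. The only synchronization in this sketch is the `cudaDeviceSynchronize()` call before the host reads the buffer.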