I was wondering: if my code is embarrassingly parallel, such that threads do not need to communicate, can I save GPU shared memory by directing each thread to copy its output (a single float value) to RAM? Will the I/O (passing through GPU memory) block, slow, or corrupt the work of other threads? How can I achieve that? With an asynchronous memory copy?
While kernel instances 1, 2 and 4 are still computing, copy the result of kernel_3 to host RAM.
Yes, it is possible. Use streams and asynchronous copies.
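A minimal sketch of what streams plus asynchronous copies could look like, assuming each kernel launch produces a single float. The kernel name computeOne, the stream count, and the buffer names are made up for illustration and do not come from the original question. The host buffer is pinned with cudaMallocHost, since copies from pageable memory are generally not truly asynchronous.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each launch writes one float result.
__global__ void computeOne(float *out, int id)
{
    *out = 2.0f * id;
}

int main()
{
    const int kNumStreams = 4;           // assumed number of independent launches
    cudaStream_t streams[kNumStreams];
    float *dOut = nullptr;               // one result slot per launch, on the device
    float *hOut = nullptr;               // pinned host buffer for the results

    cudaMalloc(&dOut, kNumStreams * sizeof(float));
    cudaMallocHost(&hOut, kNumStreams * sizeof(float));

    for (int i = 0; i < kNumStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        // Kernel and copy are queued in the same stream, so the copy starts
        // as soon as this launch finishes, regardless of the other streams.
        computeOne<<<1, 1, 0, streams[i]>>>(dOut + i, i);
        cudaMemcpyAsync(hOut + i, dOut + i, sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    // Wait for all streams before reading the results on the host.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < kNumStreams; ++i) {
        printf("result[%d] = %f\n", i, hOut[i]);
        cudaStreamDestroy(streams[i]);
    }

    cudaFreeHost(hOut);
    cudaFree(dOut);
    return 0;
}
```

Note that this overlaps copies with separate kernel launches. It does not let one thread of a running kernel trigger a copy by itself.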
I just read more about streams and asynchronous copies. I think my original question was misleading.
What I meant is having a single kernel call, with one of the instances of that kernel copying its result back to RAM.
so calling KernelX goes like:
and since one of the instances is done computing, can it copy its result to RAM?
No, I do not think that is possible.
Maybe writing to mapped host memory would solve your problem?
Use zero-copy to store results directly to host memory if your intention is saving memory on the device. Allocate memory on the host using cudaHostAlloc() with the cudaHostAllocMapped flag (optionally set cudaHostAllocWriteCombined as well if you only ever write to the memory from the GPU side). Then obtain a device pointer for the host memory via cudaHostGetDevicePointer(). This device pointer can then be used like any other device pointer, even though the memory resides on the host.
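The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the kernel writeResults, the element count, and the launch configuration are hypothetical, and the cudaSetDeviceFlags(cudaDeviceMapHost) call is included because older setups require it before mapped allocations can be used (on newer systems with unified addressing it may be unnecessary).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread stores its single float result
// straight into mapped host memory through the device pointer.
__global__ void writeResults(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 0.5f * i;
    }
}

int main()
{
    const int n = 1024;      // assumed number of results
    float *hOut = nullptr;   // host-side pointer to the mapped allocation
    float *dOut = nullptr;   // device-side alias of the same memory

    // Assumption: enable host memory mapping before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate pinned, mapped host memory, then get its device pointer.
    cudaHostAlloc((void **)&hOut, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dOut, hOut, 0);

    writeResults<<<(n + 255) / 256, 256>>>(dOut, n);

    // The host must wait for the kernel before reading the results.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("first = %f, last = %f\n", hOut[0], hOut[n - 1]);

    cudaFreeHost(hOut);
    return 0;
}
```

No device-side result buffer is allocated here, which is the memory saving the reply refers to. The trade-off is that every write goes over the bus to the host, so it tends to suit small, write-once outputs such as one float per thread.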