Someone else asked this. They didn’t report back after trying it, but the answer is “maybe”. In theory it should work if you set up two streams and keep issuing async memcpys on the second one. In practice, the runtime might not be designed for this. What it is designed for is not a “continuous” workflow but a pipelined one: you execute the kernel repeatedly, and using the streams API you simultaneously memcpy the data for the next kernel execution (not the current one).
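A minimal sketch of that pipelined pattern, assuming a hypothetical `process` kernel and chunked input (this is not the OP’s code, just an illustration; overlap also requires pinned host memory and a device that supports it):

```cpp
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;  // stand-in for the real work
}

int main() {
    const int N = 1 << 20, CHUNKS = 4;
    float *h;
    // Pinned host memory is required for cudaMemcpyAsync to actually overlap.
    cudaHostAlloc(&h, (size_t)CHUNKS * N * sizeof(float), cudaHostAllocDefault);
    float *d[2];
    cudaMalloc(&d[0], N * sizeof(float));
    cudaMalloc(&d[1], N * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Prime the pipeline: transfer chunk 0 before the loop starts.
    cudaMemcpyAsync(d[0], h, N * sizeof(float), cudaMemcpyHostToDevice, s[0]);

    for (int c = 0; c < CHUNKS; ++c) {
        int cur = c % 2, nxt = (c + 1) % 2;
        // Kernel consumes the chunk whose transfer was enqueued earlier
        // on the SAME stream, so ordering is guaranteed...
        process<<<(N + 255) / 256, 256, 0, s[cur]>>>(d[cur], N);
        // ...while the copy engine moves the NEXT chunk on the other stream.
        if (c + 1 < CHUNKS)
            cudaMemcpyAsync(d[nxt], h + (size_t)(c + 1) * N, N * sizeof(float),
                            cudaMemcpyHostToDevice, s[nxt]);
        cudaStreamSynchronize(s[cur]);  // wait before this buffer is reused
    }
    cudaDeviceReset();
    return 0;
}
```

The key point is that each memcpy feeds a *future* kernel launch, never the one currently running.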
What do you mean ‘maybe’ and ‘in theory’? This is a documented feature of CUDA, and it is designed to work on the vast majority of CUDA-capable cards currently released.
kernel writes buffer1, DMA buffer0 back to host – this works just fine
kernel writes buffer1, DMA buffer1 back to host while the kernel is still writing it – please don’t do that; it is a bad idea even if it seems like it might work
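The two cases above can be sketched as a double-buffer ping-pong (the `fill` kernel and buffer names here are assumed for illustration; only the ordering matters):

```cpp
#include <cuda_runtime.h>

__global__ void fill(float *out, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = v;
}

int main() {
    const int N = 1 << 20;
    float *d_buf[2], *h_out;
    cudaMalloc(&d_buf[0], N * sizeof(float));
    cudaMalloc(&d_buf[1], N * sizeof(float));
    cudaHostAlloc(&h_out, N * sizeof(float), cudaHostAllocDefault);  // pinned for async DMA

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // Fill buffer0 and wait, so it is definitely complete.
    fill<<<(N + 255) / 256, 256, 0, compute>>>(d_buf[0], N, 0.0f);
    cudaStreamSynchronize(compute);

    // Safe case: kernel writes buffer1 while DMA reads the FINISHED buffer0.
    fill<<<(N + 255) / 256, 256, 0, compute>>>(d_buf[1], N, 1.0f);
    cudaMemcpyAsync(h_out, d_buf[0], N * sizeof(float),
                    cudaMemcpyDeviceToHost, copy);

    // Unsafe case (the second bullet): copying d_buf[1] here, while the
    // kernel on `compute` may still be writing it, would read
    // indeterminate data. Don't do that.
    cudaDeviceReset();
    return 0;
}
```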
The OP said the kernel was operating on ‘previously sent’ (i.e., past tense: the transfer has completed) data.
Of course, the programming guide never guarantees, nor even mentions, anything about the ordering of DMA transfers relative to the execution of concurrent kernels – so the latter is not a documented feature.
I ran some tests on CUDA streams, just as crroush expects. My test used 8 streams, with two operations in each stream: 1) a memcpy from host to device, and 2) a kernel that counts over the data transferred by all the streams. The results show that in some streams the counting ran before the memory transfers had finished; in other words, the count obtained with streams was less than the count obtained without streams.
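A rough reconstruction of that test layout (kernel name, sizes, and data are assumed, not the poster’s actual code). Nothing orders stream i’s kernel after stream j’s copy for j ≠ i, so a kernel that reads all slices can run before other streams’ transfers finish, which would produce exactly the short count reported:

```cpp
#include <cuda_runtime.h>

__global__ void countOnes(const int *data, int total, unsigned *result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total && data[i] == 1) atomicAdd(result, 1u);
}

int main() {
    const int NSTREAMS = 8, SLICE = 1 << 16, TOTAL = NSTREAMS * SLICE;
    int *h, *d;
    unsigned *d_count;
    cudaHostAlloc(&h, TOTAL * sizeof(int), cudaHostAllocDefault);
    for (int i = 0; i < TOTAL; ++i) h[i] = 1;
    cudaMalloc(&d, TOTAL * sizeof(int));
    cudaMalloc(&d_count, sizeof(unsigned));
    cudaMemset(d_count, 0, sizeof(unsigned));

    cudaStream_t s[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < NSTREAMS; ++i) {
        // Each stream copies only its own slice...
        cudaMemcpyAsync(d + i * SLICE, h + i * SLICE, SLICE * sizeof(int),
                        cudaMemcpyHostToDevice, s[i]);
        // ...but the kernel reads ALL of `d`. Only this stream's copy is
        // guaranteed to precede it; the other slices may not have arrived.
        countOnes<<<(TOTAL + 255) / 256, 256, 0, s[i]>>>(d, TOTAL, d_count);
    }
    cudaDeviceSynchronize();
    return 0;
}
```

This matches the point made earlier in the thread: inter-stream ordering between copies and kernels is simply not guaranteed.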