Is it possible to continuously DMA data to global memory while one or more kernels are operating on previously sent data?
Someone else asked this and never reported back after trying it, but the answer is maybe. In theory, it should work if you set up two streams and keep issuing async memcpys on the second one. In practice, the runtime might not be designed for this. What it is designed for is not a “continuous” workflow but a pipelined one: you execute the kernel repeatedly, and using the streams API you simultaneously memcpy the data for the next kernel execution (not the current one).
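A minimal sketch of that pipelined pattern, assuming two device buffers and two streams. The kernel `process`, the helper `runPipeline`, and the chunk sizes are made up for illustration; they are not from the original question. Each chunk is copied in, processed, and copied back on its own stream, so the copy for one chunk can overlap the kernel for the previous one on hardware that supports it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s\n", cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical stand-in kernel: adds 1.0f to every element of one chunk.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Runs the pipeline and returns how many elements did not come back as 1.0f.
int runPipeline(int chunks, int chunkElems) {
    const size_t chunkBytes = (size_t)chunkElems * sizeof(float);
    const size_t total = (size_t)chunks * chunkElems;

    // Pinned host memory is needed for copies to be truly asynchronous.
    float *host = nullptr;
    CHECK(cudaMallocHost(&host, chunks * chunkBytes));
    for (size_t i = 0; i < total; ++i) host[i] = 0.0f;

    float *dev[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        CHECK(cudaMalloc(&dev[b], chunkBytes));
        CHECK(cudaStreamCreate(&stream[b]));
    }

    for (int c = 0; c < chunks; ++c) {
        int b = c % 2;  // alternate buffers and streams
        float *h = host + (size_t)c * chunkElems;
        // Work within one stream runs in issue order, so the kernel never
        // touches a buffer whose copy is still in flight.
        CHECK(cudaMemcpyAsync(dev[b], h, chunkBytes,
                              cudaMemcpyHostToDevice, stream[b]));
        process<<<(chunkElems + 255) / 256, 256, 0, stream[b]>>>(dev[b], chunkElems);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h, dev[b], chunkBytes,
                              cudaMemcpyDeviceToHost, stream[b]));
    }
    CHECK(cudaDeviceSynchronize());

    int bad = 0;
    for (size_t i = 0; i < total; ++i)
        if (host[i] != 1.0f) ++bad;

    for (int b = 0; b < 2; ++b) {
        CHECK(cudaStreamDestroy(stream[b]));
        CHECK(cudaFree(dev[b]));
    }
    CHECK(cudaFreeHost(host));
    return bad;
}

int main() {
    printf("mismatched elements: %d\n", runPipeline(8, 1 << 20));
    return 0;
}
```

Reusing a buffer two chunks later is safe here only because the reuse is issued on the same stream as the earlier work on that buffer.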
What do you mean ‘maybe’ and ‘in theory’? This is a documented feature of CUDA, and it is designed to work on the vast majority of CUDA-capable cards currently released.
Refer to the Programming Guide section on asynchronous concurrent execution.
You two are answering different questions.
Kernel writes buffer1, DMA buffer0 back to host: this works just fine.
Kernel writes buffer1, DMA buffer1 back to host while the kernel is still writing it: please don’t do that, because it is a bad idea even if it seems like it might work.
The OP said the kernel was operating on ‘previously sent’ data (i.e., past tense: the transfer has completed).
Of course, the programming guide never guarantees nor even mentions anything relating to the order of DMA transfers and how it relates to the execution of any concurrent kernels - thus the latter is not a documented feature.
I ran some tests on CUDA streams along the lines of what crroush expects. My test used 8 streams with two operations in each stream: (1) a memory copy from host to device, and (2) a count over the data transferred by all of the streams. The results show that the counting operations ran before the memory transfers in some of the other streams had finished; that is, the count obtained with streams was lower than the count obtained without streams.
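One way to avoid that kind of undercount is to make the counting kernel wait explicitly for every stream's copy. The sketch below is an assumption about how such a test could be structured, not the code that was actually run: `countOnes` and `countWithEvents` are hypothetical names, and it relies on `cudaEventRecord` plus `cudaStreamWaitEvent`, which need a runtime recent enough to provide them.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s\n", cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical counting kernel: counts the elements equal to 1.
__global__ void countOnes(const int *data, int n, unsigned int *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == 1) atomicAdd(count, 1u);
}

// Copies one slice per stream, then counts over ALL slices in a separate
// stream that waits on every copy. Returns the count seen by the kernel.
unsigned int countWithEvents(int numStreams, int sliceElems) {
    const int n = numStreams * sliceElems;
    int *host = nullptr, *dev = nullptr;
    unsigned int *devCount = nullptr, hostCount = 0;

    CHECK(cudaMallocHost(&host, n * sizeof(int)));  // pinned for async copies
    for (int i = 0; i < n; ++i) host[i] = 1;
    CHECK(cudaMalloc(&dev, n * sizeof(int)));
    CHECK(cudaMalloc(&devCount, sizeof(unsigned int)));
    CHECK(cudaMemset(dev, 0, n * sizeof(int)));
    CHECK(cudaMemset(devCount, 0, sizeof(unsigned int)));
    CHECK(cudaDeviceSynchronize());

    std::vector<cudaStream_t> stream(numStreams);
    std::vector<cudaEvent_t> copied(numStreams);
    cudaStream_t countStream;
    CHECK(cudaStreamCreate(&countStream));

    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaStreamCreate(&stream[s]));
        CHECK(cudaEventCreateWithFlags(&copied[s], cudaEventDisableTiming));
        CHECK(cudaMemcpyAsync(dev + s * sliceElems, host + s * sliceElems,
                              sliceElems * sizeof(int),
                              cudaMemcpyHostToDevice, stream[s]));
        // Mark the end of this stream's copy and make countStream wait for it.
        CHECK(cudaEventRecord(copied[s], stream[s]));
        CHECK(cudaStreamWaitEvent(countStream, copied[s], 0));
    }

    countOnes<<<(n + 255) / 256, 256, 0, countStream>>>(dev, n, devCount);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(&hostCount, devCount, sizeof(hostCount),
                          cudaMemcpyDeviceToHost, countStream));
    CHECK(cudaStreamSynchronize(countStream));

    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaEventDestroy(copied[s]));
        CHECK(cudaStreamDestroy(stream[s]));
    }
    CHECK(cudaStreamDestroy(countStream));
    CHECK(cudaFree(devCount));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    return hostCount;
}

int main() {
    const int streams = 8, slice = 1 << 18;
    printf("counted %u of %d elements\n",
           countWithEvents(streams, slice), streams * slice);
    return 0;
}
```

Without the event waits, a kernel in one stream has no ordering relationship with copies issued in the other streams, which is consistent with the lower count reported above.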
Overlapping DMA transfers with kernel execution is a hardware feature and is available only on select hardware.
Check the device properties of your hardware to see whether it is enabled before testing.
Refer to the Programming Guide to see which property holds that value.
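A small sketch of such a check, assuming a toolkit that provides `cudaDeviceGetAttribute` and the `cudaDevAttrAsyncEngineCount` attribute (older toolkits exposed a similar flag as the `deviceOverlap` field of `cudaDeviceProp`). The helper name `asyncEngineCount` is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the number of asynchronous copy engines on the given device,
// or -1 if the query fails. A value of 0 means copies cannot overlap kernels.
int asyncEngineCount(int device) {
    int engines = 0;
    if (cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device)
            != cudaSuccess) {
        return -1;
    }
    return engines;
}

int main() {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        printf("no CUDA device found\n");
        return 0;
    }
    for (int d = 0; d < deviceCount; ++d) {
        printf("device %d: async copy engines = %d\n", d, asyncEngineCount(d));
    }
    return 0;
}
```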
I ran the test last month and will retry it later, since I am a little busy right now.
Anyway, thanks for your advice.