I was playing with the addWithCuda program (the sample code that's generated when you create a new CUDA project in Visual Studio). I modified it a bit so that it looks like this:
addWithCuda(...) {
    // cudaMalloc device buffers
    for (int i = 0; i < 100; i++) {
        cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);  // vector A HtoD
        cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);  // vector B HtoD
        addKernel<<<1, size>>>(dev_c, dev_a, dev_b);         // launch kernel
        Sleep(30);
        cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);  // result DtoH
    }
}
Then I looked at it in Nsight Systems. The launch latency of the kernel was about 30 ms, which is more or less the Sleep duration. I find this behaviour odd: I would expect the CPU to launch the kernel and then go to sleep, with the GPU working on the kernel while the CPU sleeps.
So I would like to understand this mechanism a bit better: how does it work, and why does it behave this way?
My system specs, in case this is platform-specific:
Windows 10 x64
GeForce RTX 3090 / Quadro P2000
CUDA 11.3 / 10.2
You're running into WDDM command batching. Under WDDM, the CUDA driver queues work into a command buffer and only submits that buffer to the GPU when it fills up or when something forces a flush (such as the blocking DtoH copy after your Sleep), so the kernel does not actually reach the GPU until then. You cannot switch your RTX 3090 out of WDDM mode, but it may be possible to switch your Quadro P2000 to TCC mode (using nvidia-smi). In that case this observation should mostly disappear.
I have a Quadro P2000 running in TCC mode alongside a Quadro RTX 4000 running in WDDM mode. The obvious requirement is that the Quadro P2000 is not driving a display. I only use Quadro GPUs, so I cannot say whether mixing Quadro GPUs and consumer GPUs would cause an issue, but I cannot think of a reason why it should.
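For reference, the switch is done with nvidia-smi from an administrator shell (the GPU index below is an assumption; check it with the query first, and note that a reboot is required and the GPU must not be driving a display):

```shell
# Query the current driver model (WDDM or TCC) of each GPU.
nvidia-smi --query-gpu=index,name,driver_model.current --format=csv

# Switch GPU 1 (assumed here to be the Quadro P2000) to TCC.
# 1 = TCC, 0 = WDDM; takes effect after a reboot.
nvidia-smi -i 1 -dm 1
```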