Hi,
I am trying to develop an application (a renderer) in which the kernel results do not matter to the CPU at all, meaning they are never read back.
But it looks like CUDA kernel calls still take some time to start and return on the CPU,
so I was thinking of delegating all CUDA calls to a worker thread, so that the device can continuously receive new tasks from the CPU while kernels are executing.
Is this doable/useful? Also, I thought this was already possible in CUDA with streams, but it looks like they only parallelize execution on the device…
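
To make the worker-thread idea concrete, here is a rough sketch of what I have in mind. It is plain C++ with no CUDA calls in it: the names `LaunchWorker`, `submit` and `run_demo` are made up for illustration, and the lambdas are placeholders for where the real kernel launches would go. I have not measured whether this actually helps, it only shows the structure.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: one worker thread that would own all device calls.
// The render thread only enqueues tasks and never waits for them.
class LaunchWorker {
public:
    LaunchWorker() : worker_([this] { run(); }) {}

    ~LaunchWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();  // the worker drains the remaining tasks, then exits
    }

    // Called from the render thread; returns as soon as the task is queued.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reachable once stopping_ is set
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // assumption: in the renderer this would be a kernel launch
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};

// Placeholder workload: counts executed tasks instead of launching kernels.
int run_demo(int n) {
    assert(n >= 0);
    std::atomic<int> executed{0};
    {
        LaunchWorker worker;
        for (int i = 0; i < n; ++i)
            worker.submit([&executed] { ++executed; });
    }  // destructor drains the queue and joins the worker
    return executed.load();
}
```

In the real application each submitted lambda would wrap one kernel launch, so the render thread would only pay for the cost of a queue push.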