I have read reports that launching an empty CUDA kernel costs about 10 to 40 microseconds, which is quite a lot.
As far as I can tell, these reports did not look at CUDA streams, where multiple kernels could be pipelined.
So my question is (which I will probably test out later):
Will CUDA streams reduce kernel launch overhead when multiple sequential/staged kernels are enqueued on one stream?
Kernel A has to execute first.
Kernel B will execute as soon as Kernel A has finished.
Kernel C will execute as soon as Kernel B has finished.
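Something like this is what I have in mind (a sketch only; `kernelA`/`kernelB`/`kernelC` and `d_data` are placeholder names). As far as I know, kernels launched into the same stream execute in launch order, so the dependency chain comes for free:

```cuda
__global__ void kernelA(float *d) { /* stage 1 */ }
__global__ void kernelB(float *d) { /* stage 2 */ }
__global__ void kernelC(float *d) { /* stage 3 */ }

void runPipeline(float *d_data, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three launches return to the host immediately (asynchronously);
    // the GPU should run them back to back, in this order, on the stream.
    kernelA<<<(n + 255) / 256, 256, 0, stream>>>(d_data);
    kernelB<<<(n + 255) / 256, 256, 0, stream>>>(d_data);
    kernelC<<<(n + 255) / 256, 256, 0, stream>>>(d_data);

    cudaStreamSynchronize(stream);  // block the host until C has finished
    cudaStreamDestroy(stream);
}
```

If that is right, the host pays the launch cost for all three kernels up front instead of waiting for each one to finish before launching the next.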
Perhaps enqueuing the kernels on a stream will “ready” them so each can run immediately after the previous one finishes?
I think it’s probably also possible to add “sync” commands to the stream, so it waits until a kernel is done executing… or perhaps that’s automatic.
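From what I understand (please correct me if I am wrong), kernel-to-kernel ordering within a single stream is automatic, and explicit synchronization is only needed on the host side, e.g.:

```cuda
// Within one stream, kernelB never starts before kernelA finishes;
// no sync command between kernels is needed. Explicit sync is for the host:
cudaStreamSynchronize(stream);   // wait for everything queued on this stream
// or, heavier:
cudaDeviceSynchronize();         // wait for all outstanding work on the device
```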
Only the last kernel has to copy data from the device back to the host, so I am also hoping these kernels can simply keep working on device memory without requiring any intermediate copies. Is that actually possible?
(Or does each kernel invocation require a host input copy and a host output copy?)
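The memory pattern I am hoping is possible (again a sketch; `h_in`/`h_out`/`d_buf`, `blocks`/`threads`, and the kernel names are placeholders): copy the input once, chain the kernels on the same device buffer, and copy the result back once after the last kernel.

```cuda
float *d_buf;
cudaMalloc(&d_buf, n * sizeof(float));
cudaMemcpyAsync(d_buf, h_in, n * sizeof(float),
                cudaMemcpyHostToDevice, stream);

// Each kernel reads and writes d_buf in place; no host round trips between stages.
kernelA<<<blocks, threads, 0, stream>>>(d_buf);
kernelB<<<blocks, threads, 0, stream>>>(d_buf);
kernelC<<<blocks, threads, 0, stream>>>(d_buf);

// Only one device-to-host copy, after the last kernel:
cudaMemcpyAsync(h_out, d_buf, n * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
cudaFree(d_buf);
```

If kernels really do require per-invocation host copies, this whole plan falls apart, so I would appreciate confirmation either way.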