I have 4 kernels in 4 different streams (I want them to run concurrently), and I used the Visual Profiler to check whether they actually run in parallel. Their execution in the Visual Profiler looks like this:
That means that they are concurrent (correct me if I am wrong). The duration of each kernel is ~650 ms for the first, ~700 ms for the second, ~820 ms for the third, and ~890 ms for the last one. If I launch only one kernel, its duration is 246 ms.
Why is there such a large difference between the duration of a single kernel running alone and the duration of each of the four kernels? Are they really concurrent after all?
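For reference, the setup described above can be sketched roughly as follows. This is only a minimal illustration, not my actual code: the kernel `busyKernel`, the problem size `n`, the iteration count `iters`, and the launch configuration are all hypothetical stand-ins. Each kernel is launched in its own stream and timed with a pair of CUDA events recorded in that same stream.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernels: each thread performs
// a fixed amount of arithmetic on one element of its buffer.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k)
            x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

int main()
{
    const int kNumStreams = 4;
    const int n = 1 << 20;     // assumed problem size
    const int iters = 10000;   // assumed amount of work per thread

    cudaStream_t streams[kNumStreams];
    cudaEvent_t start[kNumStreams], stop[kNumStreams];
    float *buffers[kNumStreams];

    // One stream, one buffer, and one pair of timing events per kernel.
    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaEventCreate(&start[s]);
        cudaEventCreate(&stop[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        cudaMemset(buffers[s], 0, n * sizeof(float));
    }

    const int threads = 256;                        // assumed block size
    const int blocks = (n + threads - 1) / threads;

    // Launch each kernel in its own stream; the events bracket the
    // kernel within that stream.
    for (int s = 0; s < kNumStreams; ++s) {
        cudaEventRecord(start[s], streams[s]);
        busyKernel<<<blocks, threads, 0, streams[s]>>>(buffers[s], n, iters);
        cudaEventRecord(stop[s], streams[s]);
    }
    cudaDeviceSynchronize();

    // Report the elapsed time measured in each stream.
    for (int s = 0; s < kNumStreams; ++s) {
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start[s], stop[s]);
        printf("stream %d: %.1f ms\n", s, ms);
    }

    for (int s = 0; s < kNumStreams; ++s) {
        cudaFree(buffers[s]);
        cudaEventDestroy(start[s]);
        cudaEventDestroy(stop[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

Changing `kNumStreams` to 1 gives the single-kernel case I compare against. Error checking is omitted to keep the sketch short.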