Visual Profiler: concurrent kernels and kernel duration

Hello everyone,

I have 4 kernels in 4 different streams (I want them to run concurrently), and I used the Visual Profiler to check whether they really run in parallel. Their execution in the Visual Profiler timeline looks like this:

AAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDD

That means that they are concurrent (correct me if I am wrong). The duration of each kernel is ~650 ms for the first, ~700 ms for the second, ~820 ms for the third, and ~890 ms for the last one. If I keep only one kernel, its duration is 246 ms.

Why is there so much difference between the duration of a single kernel and the duration of each one of the four kernels? Are they concurrent after all?

One possibility:

4 concurrent kernels that compete for the same resources (for example, available memory bandwidth) may each run slower than any one of the kernels running by itself.
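
As a rough sanity check, the timings from the question can be compared against a back-to-back run. This is only a sketch in Python. It assumes all four kernels are similar, so each would take about 246 ms alone (the post gives the standalone time for one kernel only), and it assumes they all start at about the same time, as the timeline suggests. The variable names are made up for illustration.

```python
# Durations reported in the question, in milliseconds.
single_ms = 246                       # one kernel running alone
concurrent_ms = [650, 700, 820, 890]  # kernels A, B, C, D when overlapped

# ASSUMPTION: every kernel would take ~246 ms alone, so a fully
# serialized run would take about four times that.
serial_total_ms = len(concurrent_ms) * single_ms

# ASSUMPTION: all four start together, so the overlapped wall time
# is roughly the duration of the longest kernel.
overlap_wall_ms = max(concurrent_ms)

# Overall gain from overlapping, and per-kernel slowdown vs. running alone.
speedup = serial_total_ms / overlap_wall_ms
slowdowns = [round(t / single_ms, 2) for t in concurrent_ms]

print("serial estimate (ms):", serial_total_ms)
print("overlapped wall time (ms):", overlap_wall_ms)
print("overall speedup:", round(speedup, 2))
print("per-kernel slowdown:", slowdowns)
```

Under those assumptions the overlapped run finishes only slightly sooner than a serialized one would, while each kernel takes roughly 2.6x to 3.6x longer than it does alone. That pattern fits kernels that do overlap but share a resource that a single kernel already uses heavily.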

OK, thank you!

Apart from available memory bandwidth, what else could cause this overhead when I use streams?