On a rather complex CUDA framework we are working on, I have migrated our code to use streams.
We achieve very good concurrency, but the execution time of the concurrent kernels increases dramatically, negating the benefit of the streamification effort.
As can be seen in the following Nsight timeline, concurrency is good, but each kernel takes around 400 ms.
In the serial case, there is a gap between kernel invocations (that is where the stream synchronization is actually performed, with all kernels queued in stream 1), but each kernel executes in only ~100 ms.
The kernels have the exact same configuration (except the stream argument) in both scenarios.
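For reference, the two launch patterns look roughly like this (a simplified sketch, not our actual code; `traverseBVH`, `grid`, `block`, and the data arguments are placeholder names):

```cuda
// Serial case: every launch goes to the same stream, with a
// synchronization between invocations -- each kernel runs alone.
for (int i = 0; i < numBatches; ++i) {
    traverseBVH<<<grid, block, 0, stream1>>>(bvh, rays[i], hits[i]);
    cudaStreamSynchronize(stream1);
}

// Streamed case: identical launch configuration, only the stream
// argument differs, so kernels from different streams may overlap.
for (int i = 0; i < numBatches; ++i) {
    traverseBVH<<<grid, block, 0, streams[i % numStreams]>>>(bvh, rays[i], hits[i]);
}
for (int s = 0; s < numStreams; ++s) {
    cudaStreamSynchronize(streams[s]);
}
```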
The kernel is quite memory-intensive (it traverses a BVH), so my hypothesis is that the shared cache is thrashed when the kernels execute concurrently.
This is on a dual-GPU system with a Quadro K6000 and a Tesla K40, using driver 353.62 and CUDA 7.0.
Has anyone observed similar behavior? Is there another explanation for this effect?