Executing kernels from different host threads

Hi, I see that we can now execute kernels independently from different host threads.
I have implemented a test project that launches several host threads, each of which calls the same set of kernels. The kernels are safe to run concurrently.

However, the speed improvement over executing the set of kernels in order using just one host thread isn’t apparent.

Is there anything we should be careful about when we do this?
Do we need to create different streams for each host thread so their executions can overlap?
I tried this but it does not improve speed noticeably.

I am using a Tesla C2070 card.

Thank you for your advice :)

Is there significant GPU idle time in the single-threaded case in the first place? Simply launching from multiple threads isn't going to increase performance. Multiple streams may, but that's orthogonal to whether you're using more than one host thread.
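
To illustrate the streams point, here is a minimal sketch of issuing independent kernels into separate streams from a single host thread. The kernel `scale`, the helper `run_in_streams`, and the buffer sizes are made up for the example and are not from the original test project. Whether the kernels actually overlap depends on the device having spare resources: if each kernel already fills the GPU, extra streams will not help much.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the independent kernels.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs the same kernel on n_streams independent buffers, one stream each,
// all issued from the calling host thread. Returns true if every CUDA call
// succeeded and every buffer holds the expected values.
bool run_in_streams(int n_streams) {
    const int n = 1 << 16;                 // elements per buffer (assumed size)
    const size_t bytes = n * sizeof(float);
    std::vector<cudaStream_t> streams(n_streams);
    std::vector<float *> bufs(n_streams, nullptr);
    std::vector<float> host(n, 1.0f);
    bool ok = true;

    for (int s = 0; s < n_streams; ++s) {
        ok &= cudaStreamCreate(&streams[s]) == cudaSuccess;
        ok &= cudaMalloc(&bufs[s], bytes) == cudaSuccess;
        ok &= cudaMemcpy(bufs[s], host.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (!ok) return false;

    // Kernels in different non-default streams have no implicit ordering
    // between them, so the device is free to overlap them if it can.
    for (int s = 0; s < n_streams; ++s) {
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(bufs[s], n, 2.0f);
    }
    ok &= cudaGetLastError() == cudaSuccess;
    ok &= cudaDeviceSynchronize() == cudaSuccess;

    // Copy each buffer back and check that every element was doubled.
    for (int s = 0; s < n_streams; ++s) {
        std::vector<float> out(n, 0.0f);
        ok &= cudaMemcpy(out.data(), bufs[s], bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; i < n; ++i) ok &= (out[i] == 2.0f);
        cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return ok;
}
```

Calling `run_in_streams(4)` from `main` and comparing its wall-clock time with `run_in_streams(1)` repeated four times gives a rough idea of whether the device has any idle capacity to fill. A profiler timeline is a more reliable way to see whether the kernels really overlap.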