Hi, I see that we can now launch kernels independently from different host threads.
I have written a test project that starts several host threads, each of which calls the same set of kernels. The kernels are safe to run concurrently.
However, I see no apparent speedup over running the same set of kernels in order from a single host thread.
Is there anything we should be careful about when doing this?
Do we need to create a separate stream for each host thread so that their kernel executions can overlap?
I tried this, but it did not improve speed noticeably.
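For reference, the one-stream-per-host-thread setup can be sketched roughly as below. This is a simplified illustration and not the actual test project: the kernel `scale`, the helpers `worker` and `launch_from_host_threads`, the sizes, and the use of `std::thread` are all assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Hypothetical stand-in for one of the concurrency-safe kernels.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Work done by one host thread: create a private stream, run the kernel in it,
// and record (as 0/1 in *ok) whether the result came back as expected.
void worker(int n, float factor, int* ok) {
    *ok = 0;
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return;

    float* d_data = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return;
    }
    std::vector<float> h_data(n, 1.0f);

    // Pageable host memory keeps the sketch short; copies from it are staged
    // and may not overlap. Pinned memory (cudaMallocHost) is needed for
    // truly asynchronous copies.
    cudaMemcpyAsync(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice, stream);
    // The fourth launch parameter places the kernel in this thread's stream
    // instead of the default stream.
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, factor);
    cudaMemcpyAsync(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost, stream);

    // Wait only for this thread's stream, then verify every element.
    bool good = (cudaStreamSynchronize(stream) == cudaSuccess);
    for (int i = 0; good && i < n; ++i) good = (h_data[i] == factor);

    cudaFree(d_data);
    cudaStreamDestroy(stream);
    *ok = good ? 1 : 0;
}

// Starts num_threads host threads, each with its own stream, and returns
// true only if every thread produced the expected result.
bool launch_from_host_threads(int num_threads, int n) {
    std::vector<int> ok(num_threads, 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(worker, n, 2.0f, &ok[t]);
    for (auto& th : threads) th.join();
    for (int v : ok) if (!v) return false;
    return true;
}
```

Calling `launch_from_host_threads(4, 1 << 20)` from `main` would run four host threads, each with its own stream. Note that kernels in different streams can only overlap if each one leaves enough of the GPU's resources free for the others, so a grid that already fills the device leaves little room for a second kernel to run beside it.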
I am using a Tesla C2070 card.
Thank you for your advice :)