I am working on a fairly complex multi-stream application that mostly uses Thrust for CUDA processing.
I have implemented a custom kernel for front-end processing that successfully executes concurrently in separate streams with demo cuFFT calls.
However, when I try to run it concurrently with Thrust functors in separate streams, everything executes serially.
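For context, here is a minimal sketch of the launch pattern I am describing (the kernel, functor, and buffer names are placeholders, not my actual code). It issues the custom kernel on one stream and a Thrust algorithm on another via `thrust::cuda::par.on()`:

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/execution_policy.h>

// Placeholder front-end kernel.
__global__ void frontEndKernel(float *data, int n) { /* ... */ }

// Placeholder Thrust functor.
struct scaleFunctor {
    float s;
    __host__ __device__ float operator()(float x) const { return s * x; }
};

void launchBoth(float *d_front, thrust::device_vector<float> &d_vec, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Custom kernel on stream s1.
    frontEndKernel<<<(n + 255) / 256, 256, 0, s1>>>(d_front, n);

    // Thrust algorithm on stream s2. Note: par.on() only sets the
    // launch stream; some Thrust algorithms still synchronize
    // internally (e.g. for temporary allocations or result copies),
    // which can serialize the two streams.
    thrust::transform(thrust::cuda::par.on(s2),
                      d_vec.begin(), d_vec.end(), d_vec.begin(),
                      scaleFunctor{2.0f});

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```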
So far, I have only run it on a Quadro M1200 GPU (asyncEngineCount = 1) under Windows 7 with CUDA 9.1.
The NVIDIA Visual Profiler shows cudaStreamSynchronize executing concurrently with the Thrust functor blocks, and the custom kernel only executes between the depicted cudaStreamSynchronize calls.
I need to understand what steps I can take to get more concurrent execution of this custom kernel with the Thrust functor executions. Would using a P100 (asyncEngineCount = 2) help, presuming there are cudaMemcpyAsync calls interspersed with the Thrust functor calls?
The Thrust functors include calls to cudaMemcpyAsync involving device memory that is not pinned (allocated with cudaMalloc), each followed by cudaStreamSynchronize(). Would using pinned memory everywhere in the functors help with concurrency?
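To illustrate what I mean by "pinned memory everywhere", a sketch of the change I am considering (`h_buf`, `d_buf`, `stream`, and `n` are placeholder names): cudaMemcpyAsync can only overlap with kernel execution when the host side of the copy is pinned; with pageable host memory the copy is staged through a driver buffer and behaves synchronously with respect to the host.

```cpp
// Pinned host allocation instead of a pageable malloc() buffer.
float *h_buf = nullptr;
cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocDefault);

// With a pinned destination this copy is truly asynchronous and can
// overlap with kernels running in other streams.
cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float),
                cudaMemcpyDeviceToHost, stream);

// Synchronize only the one stream that owns the copy.
cudaStreamSynchronize(stream);

cudaFreeHost(h_buf);
```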
Thanks in advance.