Please help me understand some issues regarding concurrent kernel execution

I am working on a fairly complex multi streaming application which mostly uses thrust for CUDA processing.

I have implemented a custom kernel for front end processing that successfully executes concurrently in separate streams with demo cuFFT function calls.

However, when trying to run it concurrently with thrust functors in separate streams, it only executes serially.

So far, I have only run it on a Quadro M1200 GPU (with asyncEngineCount of 1) in Windows 7 with SDK 9.1.

The NV Profiler shows cudaStreamSynchronize executing concurrently with the thrust functor blocks, and the custom kernel only executes between the depicted cudaStreamSynchronize calls.

I need to understand what steps might be taken to get more concurrent execution of this custom kernel with the thrust functor executions. Would using a P100 with an asyncEngineCount of 2 help, presuming there are cudaMemcpyAsync calls interspersed with the thrust functor calls?

The thrust functors include calls to cudaMemcpyAsync involving device memory that is not pinned (allocated with cudaMalloc), followed by cudaStreamSynchronize(). Would using pinned memory everywhere in the functors help with concurrency?
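For context, here is roughly the pattern I'm asking about, with the pinned alternative I have in mind (buffer names and sizes are illustrative, not the actual code):

```cpp
// Rough shape of the current pattern (illustrative):
// a pageable host buffer means cudaMemcpyAsync cannot truly overlap
float *h_buf = (float *)malloc(nbytes);   // pageable host memory
cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);

// Pinned alternative: cudaHostAlloc returns page-locked host memory,
// which is required for cudaMemcpyAsync to overlap with kernel execution
float *h_pinned;
cudaHostAlloc(&h_pinned, nbytes, cudaHostAllocDefault);
cudaMemcpyAsync(d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice, stream);
// ... later: cudaFreeHost(h_pinned);
```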

Thanks in advance.

Based on your text, I wonder if you’re using the word functor correctly. What you’re saying is confusing to me. A thrust functor is a function-object that is passed to a thrust algorithm. For instance, a functor could be passed to thrust::sort to indicate to sort from high-to-low or alternatively from low-to-high.

If you are calling thrust algorithms in a CUDA 9.1 regime and you wish to use stream behavior, you need to use specific thrust execution policies. Are you doing that?
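For reference, this is a minimal sketch of what I mean by an execution policy (the functor, stream, and data here are illustrative):

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/execution_policy.h>

// a functor in the thrust sense: a function object passed to an algorithm
struct scale_by_two
{
    __host__ __device__ float operator()(float x) const { return 2.0f * x; }
};

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    thrust::device_vector<float> d_vec(1 << 20, 1.0f);

    // thrust::cuda::par.on(stream) asks thrust to issue its kernels
    // into the given stream instead of the default stream
    thrust::transform(thrust::cuda::par.on(stream),
                      d_vec.begin(), d_vec.end(),
                      d_vec.begin(), scale_by_two());

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```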

Sorry Robert, I was using "functor" to refer to any thrust-based functions. I didn't write any of the thrust code and am completely new to using thrust. The application is a mix of standard thrust functions and custom ones, which are referred to as Functors in the code.

As for execution policy, it appears that thrust::cuda::par.on is used everywhere.

Ok, I see now that it is the host memory which needs to be pinned, and that appears to be the case here, so it is a complete mystery to me why the thrust functions are forcing serial execution of my custom kernel.

Kernel concurrency can be hard to witness in practice. If the thrust kernels are "large enough" in terms of resource usage (blocks, threads per block, registers, shared memory), they may effectively prevent concurrent execution of another kernel.
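A quick way to see this effect (a sketch with illustrative launch sizes, using a busy-wait kernel rather than your real code): launch a small kernel into two streams, then grow the grid until the overlap disappears in the profiler.

```cpp
// Busy-wait kernel so the launches are long enough to see in the profiler
__global__ void spin_kernel(clock_t cycles)
{
    clock_t start = clock();
    while (clock() - start < cycles) { /* spin */ }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Few blocks: both launches fit on the GPU simultaneously,
    // so the profiler should show them overlapping.
    spin_kernel<<<2, 64, 0, s1>>>(1000000);
    spin_kernel<<<2, 64, 0, s2>>>(1000000);
    cudaDeviceSynchronize();

    // Many blocks: the first launch occupies every SM, so the second
    // effectively waits -- the profiler shows them serialized even
    // though they are in different streams.
    spin_kernel<<<1024, 256, 0, s1>>>(1000000);
    spin_kernel<<<1024, 256, 0, s2>>>(1000000);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```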

That could be. Especially on my wimpy Quadro M1200.

So I guess I could test that hypothesis by increasing the size of my dummy cuFFT calls until they no longer execute concurrently with my custom kernel?