Please help me understand some issues regarding concurrent kernel execution

I am working on a fairly complex multi streaming application which mostly uses thrust for CUDA processing.

I have implemented a custom kernel for front end processing that successfully executes concurrently in separate streams with demo cuFFT function calls.

However, when trying to run it concurrently with thrust functors in separate streams, it only executes serially.

So far, I have only run it on a Quadro M1200 GPU (with asyncEngineCount of 1) in Windows 7 with SDK 9.1.

The NV Profiler shows cudaStreamSynchronize executing concurrently with the thrust functor blocks, and the custom kernel only executes between the depicted cudaStreamSynchronize calls.

I need to understand what steps might be taken to get more concurrent execution of this custom kernel with the thrust functor executions. Would using a P100 with an asyncEngineCount of 2 help, presuming there are cudaMemcpyAsync calls interspersed with the thrust functor calls?

The thrust functors include calls to cudaMemcpyAsync involving device memory that is not pinned (allocated with cudaMalloc), followed by cudaStreamSynchronize(). Would using pinned memory everywhere in the functors help with concurrency?
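For context, here is roughly the pattern I'm asking about, with the pinned alternative I have in mind (buffer names and sizes are illustrative, not the actual code):

```cpp
// Rough shape of the current pattern (illustrative):
// a pageable host buffer means cudaMemcpyAsync cannot truly overlap
float *h_buf = (float *)malloc(nbytes);   // pageable host memory
cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);

// Pinned alternative: cudaHostAlloc returns page-locked host memory,
// which is required for cudaMemcpyAsync to overlap with kernel execution
float *h_pinned;
cudaHostAlloc(&h_pinned, nbytes, cudaHostAllocDefault);
cudaMemcpyAsync(d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice, stream);
// ... later: cudaFreeHost(h_pinned);
```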

Thanks in advance.

Based on your text, I wonder if you’re using the word functor correctly. What you’re saying is confusing to me. A thrust functor is a function-object that is passed to a thrust algorithm. For instance, a functor could be passed to thrust::sort to indicate to sort from high-to-low or alternatively from low-to-high.

If you are calling thrust algorithms in a CUDA 9.1 regime and you wish to use stream behavior, you need to use specific thrust execution policies. Are you doing that?
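For reference, this is a minimal sketch of what I mean by an execution policy (the functor, stream, and data here are illustrative):

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/execution_policy.h>

// a functor in the thrust sense: a function object passed to an algorithm
struct scale_by_two
{
    __host__ __device__ float operator()(float x) const { return 2.0f * x; }
};

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    thrust::device_vector<float> d_vec(1 << 20, 1.0f);

    // thrust::cuda::par.on(stream) asks thrust to issue its kernels
    // into the given stream instead of the default stream
    thrust::transform(thrust::cuda::par.on(stream),
                      d_vec.begin(), d_vec.end(),
                      d_vec.begin(), scale_by_two());

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```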

Sorry Robert, I was using "functor" to refer to any thrust-based functions. I didn't write any of the thrust code and am completely new to using thrust. The application is a mix of standard thrust functions and custom ones, which are referred to as Functors in the code.

As for execution policy, it appears that thrust::cuda::par.on is used everywhere.

Ok, I see now that it is the host memory which needs to be pinned, and that appears to be the case here, so it is a complete mystery to me why the thrust functions are forcing serial execution of my custom kernel.

Kernel concurrency can be hard to witness in practice. If the thrust kernels are "large enough" in terms of resource usage (blocks, threads per block, registers, shared memory), they may effectively prevent concurrent execution of another kernel.
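A quick way to see this effect (a sketch with illustrative launch sizes, using a busy-wait kernel rather than your real code): launch a small kernel into two streams, then grow the grid until the overlap disappears in the profiler.

```cpp
// Busy-wait kernel so the launches are long enough to see in the profiler
__global__ void spin_kernel(clock_t cycles)
{
    clock_t start = clock();
    while (clock() - start < cycles) { /* spin */ }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Few blocks: both launches fit on the GPU simultaneously,
    // so the profiler should show them overlapping.
    spin_kernel<<<2, 64, 0, s1>>>(1000000);
    spin_kernel<<<2, 64, 0, s2>>>(1000000);
    cudaDeviceSynchronize();

    // Many blocks: the first launch occupies every SM, so the second
    // effectively waits -- the profiler shows them serialized even
    // though they are in different streams.
    spin_kernel<<<1024, 256, 0, s1>>>(1000000);
    spin_kernel<<<1024, 256, 0, s2>>>(1000000);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```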

That could be. Especially on my wimpy Quadro M1200.

So I guess I could test that hypothesis by increasing the size of my dummy cuFFT calls until they no longer execute concurrently with my custom kernel?