Kernel launch concurrency

charliemarquez · December 10, 2014, 6:00pm

Hello,

I have an optimization problem with my code.
With the advice you gave me I managed to optimize my kernel.
Now I’m struggling with optimizing the kernel launches.

I set up several streams; the data is managed with cudaMallocManaged.

I have a loop that selects the arrays that have to be passed to the first kernel.
The first kernel produces a temporary array that is summed up afterwards by another kernel.

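for (unsigned int i=0;i<N;i+=NSTREAMS)
{
	// First pass: one kernel1 launch per stream, each producing a temporary array
	for (unsigned int w=0;w<NSTREAMS;w++)
	{
		kernel1<<<blocks,threads,0,stream[w]>>>(parms);
	}

	// Second pass: kernel2 sums each temporary array, in the same stream as its kernel1
	for (unsigned int w=0;w<NSTREAMS;w++)
	{
		kernel2<<<blocks,threads,0,stream[w]>>>(parms);
	}

}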

I obviously obtained a great improvement compared to the “streamless” version of this code, but the profiler still says
that “the multiprocessors of the GPU are almost idle”.
And actually they are.

Only in very few cases (depending on the input data) do the kernels overlap.
Can you point out what I’m doing wrong here?

Robert_Crovella · December 10, 2014, 6:23pm

Concurrent kernel execution requires a number of conditions to be satisfied, and can be difficult to achieve in practice. You haven’t shown any kernel code or kernel launch parameters, but if, for example, your kernel launches consist of a large number of blocks, these will typically “fill” the GPU and prevent any significant concurrency.

You might want to read the relevant section of the programming guide:
Programming Guide :: CUDA Toolkit Documentation

and also test things out with the concurrent kernels sample code:
http://docs.nvidia.com/cuda/cuda-samples/index.html#concurrent-kernels
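
One of those requirements can be checked directly: the device has to report support for concurrent kernels. A minimal sketch, assuming the GPU in question is visible as a CUDA device (the helper name `supportsConcurrentKernels` is made up for illustration and is not taken from the sample):

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device reports concurrent kernel
// support, 0 if it does not, and -1 if the property query itself fails.
int supportsConcurrentKernels(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.concurrentKernels ? 1 : 0;
}
```

Calling `supportsConcurrentKernels(0)` covers the first device. A result of 1 only means overlap is possible; whether it actually happens still depends on the resource limits of each launch.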

charliemarquez · December 10, 2014, 8:11pm

The first kernel takes three vectors of size n1 and three vectors of size n2.
The processing is done at warp level with the shfl function, and each kernel uses at most 192 blocks.
The actual number of blocks is min(n1/16,192), as there are 512 threads per block.
The second kernel takes the output vector from the first kernel (192 elements at most)
and sums them up.
I’m using an 8-multiprocessor card, so I thought that by using 8 streams the card would be
100% occupied.
What am I missing?

Robert_Crovella · December 10, 2014, 8:29pm

A kernel launch of 192 blocks can easily fill up an 8 SM GPU, preventing concurrency.

Programming Guide :: CUDA Toolkit Documentation

(8 SMs * 16 resident blocks per SM = 128 blocks)

There is no connection between streams and SMs.
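
The 128-block estimate can be checked against the actual device. A minimal sketch, assuming a placeholder kernel (`kernel1Stub` is hypothetical, since the real kernel1 was not posted) and using the occupancy API from the CUDA runtime to ask how many blocks of a given size fit on one SM:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel1; the real kernel's register and
// shared memory usage would change the numbers reported below.
__global__ void kernel1Stub(float *out)
{
    if (threadIdx.x == 0) {
        out[blockIdx.x] = 0.0f;
    }
}

// Number of kernel1Stub blocks the whole current device can hold at once,
// or -1 if any CUDA query fails.
int residentBlockCapacity(int threadsPerBlock)
{
    int device = 0;
    cudaDeviceProp prop;
    int blocksPerSM = 0;

    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    // Resident blocks per SM for this kernel, block size, and no dynamic shared memory.
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, kernel1Stub, threadsPerBlock, 0) != cudaSuccess) {
        return -1;
    }
    return blocksPerSM * prop.multiProcessorCount;
}
```

If `residentBlockCapacity(512)` comes back at or below the 192 blocks of a single launch, that one launch already occupies every SM and later launches have to wait, no matter how many streams are in use. With 512 threads per block, the per-SM thread limit may allow fewer resident blocks than the 16 used in the estimate.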

charliemarquez · December 10, 2014, 8:32pm

Ok, now I understand.
So kernel2, which sums up at most 192 elements and uses just one block, can be highly parallelized, right?

njuffa · December 10, 2014, 8:37pm

What kind of GPU are you using? With older GPUs (generally speaking pre-Kepler), there can be an issue with false dependencies as all kernel launches go through a single work queue. The following article describes this in the context of CUDA Fortran, but it applies to CUDA C just the same:

http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-fortran/

Make sure your code does not use any blocking API calls. Make sure you do not accidentally set the environment variable CUDA_LAUNCH_BLOCKING=1. If you are on one of the newer GPUs, you may want to increase the number of concurrent streams supported by the driver by setting environment variable CUDA_DEVICE_MAX_CONNECTIONS to a higher value than the default (which is 8, I think). The hardware maximum is 32.
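
If setting the variable in the shell is inconvenient, it can also be set from inside the program. A minimal sketch, assuming a POSIX system (`setenv`) and assuming the driver reads the variable when the CUDA context is created, so it has to be set before the first CUDA call. The helper name `initWithMaxConnections` is hypothetical:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: sets CUDA_DEVICE_MAX_CONNECTIONS and then forces
// context creation. Must run before any other CUDA call in the process
// (assumption: the value is only read at context creation).
bool initWithMaxConnections(const char *connections)
{
    // POSIX setenv; the last argument (1) overwrites an existing value.
    if (setenv("CUDA_DEVICE_MAX_CONNECTIONS", connections, 1) != 0) {
        return false;
    }
    // cudaFree(0) is a common idiom to trigger context creation.
    return cudaFree(0) == cudaSuccess;
}
```

Calling `initWithMaxConnections("32")` at the top of `main` requests the hardware maximum mentioned above.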

Robert_Crovella · December 10, 2014, 8:45pm

It certainly should be possible to run multiple kernels in parallel (concurrently) if each of those kernels consists of only a single threadblock. After all the necessary prerequisites for concurrent kernel launch are dealt with (such as the necessity to use streams, cc2.x or higher, etc.), there are a number of resource limits that must all be satisfied at the same time in order to observe concurrency. Many of those resource limits are covered in the table 11/12 that I linked.
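
A small test case makes this easy to look at in the profiler. The following is a sketch under stated assumptions, not the original poster's code: `sumPartials` is a hypothetical stand-in for kernel2 (one block summing at most 192 partial results), and the stream count of 8 is taken from the thread.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define NSTREAMS  8     // stream count mentioned in the thread
#define NPARTIALS 192   // upper bound on partial results, from the thread
#define THREADS   256   // power of two >= NPARTIALS (assumption of this sketch)

#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical stand-in for kernel2: a single block sums n <= THREADS values.
__global__ void sumPartials(const float *partials, int n, float *result)
{
    __shared__ float s[THREADS];
    s[threadIdx.x] = (threadIdx.x < n) ? partials[threadIdx.x] : 0.0f;
    __syncthreads();
    // Tree reduction in shared memory.
    for (int stride = THREADS / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) *result = s[0];
}

// Launches one single-block kernel per stream and checks every sum.
// Returns false on any CUDA error (cleanup on error is omitted for brevity).
bool runSingleBlockKernels()
{
    cudaStream_t streams[NSTREAMS];
    float *partials[NSTREAMS];
    float *results[NSTREAMS];
    std::vector<float> ones(NPARTIALS, 1.0f);

    for (int w = 0; w < NSTREAMS; w++) {
        CHECK(cudaStreamCreate(&streams[w]));
        CHECK(cudaMalloc(&partials[w], NPARTIALS * sizeof(float)));
        CHECK(cudaMalloc(&results[w], sizeof(float)));
        CHECK(cudaMemcpy(partials[w], ones.data(), NPARTIALS * sizeof(float),
                         cudaMemcpyHostToDevice));
    }

    // One block per launch, each launch in its own stream.
    for (int w = 0; w < NSTREAMS; w++) {
        sumPartials<<<1, THREADS, 0, streams[w]>>>(partials[w], NPARTIALS, results[w]);
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (int w = 0; w < NSTREAMS; w++) {
        float h = 0.0f;
        CHECK(cudaMemcpy(&h, results[w], sizeof(float), cudaMemcpyDeviceToHost));
        ok = ok && (h == (float)NPARTIALS);   // 192 ones sum to 192
        cudaFree(partials[w]);
        cudaFree(results[w]);
        cudaStreamDestroy(streams[w]);
    }
    return ok;
}
```

The function only verifies the results. Whether the eight launches actually overlap has to be read off the profiler timeline, and a kernel this short may finish before the next launch is even issued, so overlap is not guaranteed.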

charliemarquez · December 10, 2014, 9:13pm

Thank you

neoideo · December 10, 2014, 10:23pm

charliemarquez,

Can you show us how you declared and initialized the streams? There is a small chance that some setting is keeping the streams from launching concurrently.

charliemarquez · December 11, 2014, 7:25am

Like this:

cudaStream_t working_streams[NWORKINGSTREAMS];

for (unsigned int i=0;i<NWORKINGSTREAMS;i++)
    {
        cudaStreamCreate(&working_streams[i]);
    }

neoideo · December 11, 2014, 11:20am

Charlie, try creating the streams this way:

cudaStream_t working_streams[NWORKINGSTREAMS];
for (unsigned int i=0;i<NWORKINGSTREAMS;i++){
    //cudaStreamCreate(&working_streams[i]);
    cudaStreamCreateWithFlags(&working_streams[i], cudaStreamNonBlocking);
}

And test again with the profiler.
