Kernel launch concurrency

Hello,

I have an optimization problem with my code.
With the advice you gave me I managed to optimize my kernel.
Now I’m struggling with the optimization of the kernel launch

I set up different streams, data is managed with cudaMallocManaged

I have a loop that selects the arrays that has to be passed to the first kernel
The first kernel produce a temporary array that is summed up afterwards by another kernel

for (unsigned i=0;i<N;i+=NSTREAMS)
{
	for (unsigned int w=0;w<NSTREAMS;w++)
	{
		kernel1<<<blocks,threads,0,stream[w]>>>(parms)
	}

	for (unsigned int w=0;w<NSTREAMS;w++)
	{
		kernel2<<<blocks,threads,0,stream[w]>>>(parms)
	}

}

I obviously obtained a great improvement compared to the “streamless” version of this code but the profiler still says
that “the multiprocessor of the GPU are almos idle”.
And actually they are

Just in very few cases (depending on the input data) the kernels are overlapping.
Can you point me out what I’m doing wrong here?

concurrent kernel execution requires a number of requirements to be satisfied, and can be difficult to achieve in practice. You haven’t shown any kernel code or kernel launch parameters, but if, for example, your kernel launches are consisting of a large number of blocks, these will typically “fill” the GPU and prevent any significant concurrency.

You might want to read the relevant section of the programming guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution

and also test things out with the concurrent kernels sample code:
http://docs.nvidia.com/cuda/cuda-samples/index.html#concurrent-kernels

The first kernel takes three vectors n1 sized and three vectors n2 sized.
The processing is done at warp level with shfl function, and each kernel uses at most 192 blocks.
The actual number of blocks is min(n1/16,192) as there are 512 threads x block.
The second kernel takes the output vector from the first kernel (192 elements at most),
and sums them up.
I’m using a 8-multiprocessor card, so I thought that by using 8 streams the card would be
100% occupied.
What am I missing?

A kernel launch of 192 blocks can easily fill up an 8 SM GPU, preventing concurrency.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

(8*16 = 128 blocks)

There is no connection between streams and SMs.

Ok, now I understand.
So, kernel2, that sums up at most 192 elements and just uses one block can be highly parallelized, right?

What kind of GPU are you using? With older GPUs (generally speaking pre-Kepler), there can be an issue with false dependencies as all kernel launches go through a single work queue. The following article describes this in the context of CUDA Fortran, but it applies to CUDA C just the same:

http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-fortran/

Make sure your code does not use any blocking API calls. Make sure you do not accidentally set the environment variable CUDA_LAUNCH_BLOCKING=1. If you are on one of the newer GPUs, you may want to increase the number of concurrent streams supported by the driver by setting environment variable CUDA_DEVICE_MAX_CONNECTIONS to a higher value than the default (which is 8, I think). The hardware maximum is 32.

It certainly should be possible to run multiple kernels in parallel (concurrently), if each of those kernels consist of only a single threadblock. After all the necessary prerequisites for concurrent kernel launch are dealt with (such as the necessity to use streams, cc2.x or higher, etc.), then there are a number of resource limits that must be satisfied, concurrently, in order to observe concurrency. Many of those resource limits are covered in the table 11/12 that I linked.

Thank you

charliemarquez,

Can you show us how you initializaed and declared the streams?? There is a small chance that some setting made the streams not launch concurrently.

Like this

cudaStream_t working_streams[NWORKINGSTREAMS];

for (unsigned int i=0;i<NWORKINGSTREAMS;i++)
    {
        cudaStreamCreate(&working_streams[i]);
    }

Charlie, try creating the streams this way:

cudaStream_t working_streams[NWORKINGSTREAMS];
for (unsigned int i=0;i<NWORKINGSTREAMS;i++){
    //cudaStreamCreate(&working_streams[i]);
    cudaStreamCreateWithFlags(&working_streams[i], cudaStreamNonBlocking);
}

And test again with the profiler.