Concurrent executions of streams

mahmood.nt · December 14, 2022, 10:15am

Hi
I have written a simple program with two streams which should be run concurrently.

  dim3 grid_size( M ); dim3 block_size( N );
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  increment<<< grid_size, block_size, 0, s1 >>>( d_a, N );
  increment<<< grid_size, block_size, 0, s2 >>>( d_a, N );
  cudaMemcpy( h_a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost );
  cudaFree( d_a ); 
  cudaDeviceReset();

The build command is

nvcc -default-stream per-thread -o addv -arch=sm_86 -Xptxas -O3,-v addv.cu

However, when I look into the nsight system report, I see serial launches of streams.

What should I do exactly to fix that?

Robert_Crovella · December 14, 2022, 3:15pm

Things you haven’t shown matter, such as the actual increment code and the values of M and N, as well as the actual GPU you are running on. CUDA provides a concurrent kernels sample code, you may want to start by studying that.

kernels that use up execution resources won’t run concurrently, regardless of what other code you write, or what steps you take. There are numerous questions already on these forums discussing these concepts.

I also notice that the duration of your kernels is approximately 2-3 microseconds. It will be basically impossible to observe concurrency of those kernels. The CUDA launch overhead is on the order of 5 microseconds, minimum.

njuffa · December 14, 2022, 8:03pm

In a recent forum thread we established that minimum kernel launch overhead is down to 3 microseconds with recent hardware and software, but the problem remains that trying to observe concurrency with kernels whose runtime is on the order of the kernel launch latency would be “challenging”.

Robert_Crovella · December 14, 2022, 8:07pm

If the code and the profiler output are aligned, it suggests an actual launch latency for this particular case at ~8-9 microseconds.

mahmood.nt · December 19, 2022, 11:30am

In the following code, I am intending to create two streams on a 256 element vector and each stream does an operation on some of the elements, e.g. A[0:127]++ and A[128:255]=0.

__global__ void increment( int *a, int N )
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if ( i < N/2 )
    a[ i ] = a[ i ] + 1;
  else if ( i >= N/2 && i <N )
    a[ i ] = 0;
}
...
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  increment<<< 1, 256, 0, s1 >>>( d_a, 256 );
  increment<<< 1, 256, 0, s2 >>>( d_a, 256 );

So, the kernel is called from two streams each stream has 256 threads (array length). There are two problems with that:
1- The concurrency which you said the kernel is small. So, I was wondering if I call the streams with 100K threads, do I see concurrency or not.
2- Beside the concurrency, the output is:
Original:
0 1 2 … 254 255
Output:
2 3 4 … 0 0
With that code, it is obvious that both kernel calls operate on the entire vector. So, 0->+±>+±>2. I was wondering how to specify the stream number for the threads inside the kernel. How can I fix that?

Robert_Crovella · December 19, 2022, 2:42pm

You could try running the profiler.

You could pass another kernel argument. Like 1 for s1 and 2 for s2. Then condition the code based on that additional argument.

I don’t like working with incomplete test cases, so I probably won’t respond to further inquiries of this type.

Greg · December 19, 2022, 4:22pm

The compute work distributor (CWD) will distribute all work from the highest priority grid before switching to the next grid. If you launch a grid that has enough thread blocks to saturate all SMs then concurrency will only occur at the tail of the grid as thread blocks complete and SM resources are available for the next grid.

Topic		Replies	Views
Cuda Streams for Concurrent Kernel Calls CUDA Programming and Performance	1	2270	October 26, 2016
Kernel launch concurrency CUDA Programming and Performance	10	1851	December 11, 2014
Cannot see concurrent kenrel execution by stream CUDA Programming and Performance	2	548	November 16, 2017
Why kernel executions in different streams are not parallel? CUDA Programming and Performance	4	2908	April 29, 2019
My streams are not running concurrently CUDA Programming and Performance	7	1827	March 6, 2018
Separate kernel grids do not execute concurrent CUDA Programming and Performance	1	3257	December 18, 2009
Two Concurrent Kernels on Different Streams Take 2x Time — Why No Overlap? CUDA Programming and Performance cuda , kernel	3	50	July 30, 2025
No performance improvement using CUDA stream DRIVE AGX Xavier General status_wip , driveos-cuda	11	1673	March 22, 2022
Cannot force kernels to concurrent execution CUDA Programming and Performance	8	5582	April 28, 2012
cudaStream performance CUDA Programming and Performance	7	1658	June 21, 2016

Concurrent executions of streams

Related topics