Concurrent executions of streams

I have written a simple program with two streams which should be run concurrently.

  dim3 grid_size( M ); dim3 block_size( N );
  cudaStream_t s1, s2;
  increment<<< grid_size, block_size, 0, s1 >>>( d_a, N );
  increment<<< grid_size, block_size, 0, s2 >>>( d_a, N );
  cudaMemcpy( h_a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost );
  cudaFree( d_a ); 

The build command is

nvcc -default-stream per-thread -o addv -arch=sm_86 -Xptxas -O3,-v

However, when I look into the nsight system report, I see serial launches of streams.

What should I do exactly to fix that?

Things you haven’t shown matter, such as the actual increment code and the values of M and N, as well as the actual GPU you are running on. CUDA provides a concurrent kernels sample code, you may want to start by studying that.

kernels that use up execution resources won’t run concurrently, regardless of what other code you write, or what steps you take. There are numerous questions already on these forums discussing these concepts.

I also notice that the duration of your kernels is approximately 2-3 microseconds. It will be basically impossible to observe concurrency of those kernels. The CUDA launch overhead is on the order of 5 microseconds, minimum.

1 Like

In a recent forum thread we established that minimum kernel launch overhead is down to 3 microseconds with recent hardware and software, but the problem remains that trying to observe concurrency with kernels whose runtime is on the order of the kernel launch latency would be “challenging”.

If the code and the profiler output are aligned, it suggests an actual launch latency for this particular case at ~8-9 microseconds.

In the following code, I am intending to create two streams on a 256 element vector and each stream does an operation on some of the elements, e.g. A[0:127]++ and A[128:255]=0.

__global__ void increment( int *a, int N )
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if ( i < N/2 )
    a[ i ] = a[ i ] + 1;
  else if ( i >= N/2 && i <N )
    a[ i ] = 0;
  cudaStream_t s1, s2;
  increment<<< 1, 256, 0, s1 >>>( d_a, 256 );
  increment<<< 1, 256, 0, s2 >>>( d_a, 256 );

So, the kernel is called from two streams each stream has 256 threads (array length). There are two problems with that:
1- The concurrency which you said the kernel is small. So, I was wondering if I call the streams with 100K threads, do I see concurrency or not.
2- Beside the concurrency, the output is:
0 1 2 … 254 255
2 3 4 … 0 0
With that code, it is obvious that both kernel calls operate on the entire vector. So, 0->+±>+±>2. I was wondering how to specify the stream number for the threads inside the kernel. How can I fix that?

You could try running the profiler.

You could pass another kernel argument. Like 1 for s1 and 2 for s2. Then condition the code based on that additional argument.

I don’t like working with incomplete test cases, so I probably won’t respond to further inquiries of this type.

The compute work distributor (CWD) will distribute all work from the highest priority grid before switching to the next grid. If you launch a grid that has enough thread blocks to saturate all SMs then concurrency will only occur at the tail of the grid as thread blocks complete and SM resources are available for the next grid.