Things you haven’t shown matter, such as the actual increment code and the values of M and N, as well as the actual GPU you are running on. CUDA provides a concurrent kernels sample code, you may want to start by studying that.
kernels that use up execution resources won’t run concurrently, regardless of what other code you write, or what steps you take. There are numerous questions already on these forums discussing these concepts.
I also notice that the duration of your kernels is approximately 2-3 microseconds. It will be basically impossible to observe concurrency of those kernels. The CUDA launch overhead is on the order of 5 microseconds, minimum.
In a recent forum thread we established that minimum kernel launch overhead is down to 3 microseconds with recent hardware and software, but the problem remains that trying to observe concurrency with kernels whose runtime is on the order of the kernel launch latency would be “challenging”.
In the following code, I am intending to create two streams on a 256 element vector and each stream does an operation on some of the elements, e.g. A[0:127]++ and A[128:255]=0.
__global__ void increment( int *a, int N )
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
if ( i < N/2 )
a[ i ] = a[ i ] + 1;
else if ( i >= N/2 && i <N )
a[ i ] = 0;
}
...
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
increment<<< 1, 256, 0, s1 >>>( d_a, 256 );
increment<<< 1, 256, 0, s2 >>>( d_a, 256 );
So, the kernel is called from two streams each stream has 256 threads (array length). There are two problems with that:
1- The concurrency which you said the kernel is small. So, I was wondering if I call the streams with 100K threads, do I see concurrency or not.
2- Beside the concurrency, the output is:
Original:
0 1 2 … 254 255
Output:
2 3 4 … 0 0
With that code, it is obvious that both kernel calls operate on the entire vector. So, 0->+±>+±>2. I was wondering how to specify the stream number for the threads inside the kernel. How can I fix that?
The compute work distributor (CWD) will distribute all work from the highest priority grid before switching to the next grid. If you launch a grid that has enough thread blocks to saturate all SMs then concurrency will only occur at the tail of the grid as thread blocks complete and SM resources are available for the next grid.