Cuda Streams for Concurrent Kernel Calls

I am currently learning how to use cuda streams and I am trying to create a simple example. In this example I have a simple kernel that squares the elements in an array. I also created two streams and an input array for each stream. Then on each stream I copy the data to the device and launch the kernel. Here is my code

#include <stdio.h>

__global__ void square_array(float *a, int N) {
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < N)
		a[idx] = a[idx] * a[idx];

int main(void) {
	// Pointer to host & device arrays
	float *h_a1, *d_a1;
	float *h_a2, *d_a2;

	// Number of elements in arrays
	const int N = 100000;

	int block_size = 4;
	int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);

	cudaStream_t s1;

	cudaStream_t s2;

	// Allocate array on host
	h_a1 = (float *) malloc(N * sizeof(float));
	h_a2 = (float *) malloc(N * sizeof(float));

	// Allocate array on device using pinned memory
	cudaMallocHost((void **) &d_a1, N * sizeof(float));
	cudaMallocHost((void **) &d_a2, N * sizeof(float));

	// Initialize host array and copy it to CUDA device
	for (int i = 0; i < N; i++) {
		h_a1[i] = (float) i;
		h_a2[i] = (float) i;

	cudaMemcpyAsync(d_a1, h_a1, N * sizeof(float), cudaMemcpyHostToDevice, s1);
	cudaMemcpyAsync(d_a2, h_a2, N * sizeof(float), cudaMemcpyHostToDevice, s2);

	// Do calculation on device:
	square_array<<<n_blocks, block_size, 0, s1>>>(d_a1, N);
	square_array<<<n_blocks, block_size, 0, s2>>>(d_a2, N);

	// Retrieve result from device and store it in host array
	cudaMemcpyAsync(h_a1, d_a1, sizeof(float) * N, cudaMemcpyDeviceToHost, s1);
	cudaMemcpyAsync(h_a2, d_a2, sizeof(float) * N, cudaMemcpyDeviceToHost, s2);

// Cleanup



I am trying to make it so the operations in both streams (i.e. the memory transfers and kernel launches) occur concurrently. However when I profile this code with nvvp I see that the overlap in kernel execution is very minimal. Specifically, the kernel on the first stream runs from 71.93ms to 72.717ms and the kernel from the second stream runs from 72.694ms to 73.473ms. So my question is shouldn’t the execution be overlapping more? If so, is there something wrong with my code?

First of all concurrent kernel execution is fairly hard to witness. There is a concurrent kernels cuda sample code, you might want to start with that and be sure you see the expected behavior there.

There is an underlying assumption that you have made (implicitly) that any 2 kernels can be run concurrently. This is not correct. Kernels will only run concurrently if a number of conditions are met.

One of those requirements is that there must be enough computation resources available for the second kernel, after the first is launched.

In your case we could focus specifically on the blocks per SM limitation of all CUDA devices. This is listed in table 13 of the CUDA programming guide. Once an SM has a full complement of threadblocks, it cannot accept any more, until some retired. You are launching 100000/4 threadblocks, or 25000 threadblocks. That is enough to fill up the SMs of any currently available GPU. Therefore, when the second kernel launch comes along, the SMs are already full, and can’t accept any threadblocks from the second kernel.

In fact this state persists until nearly all the threadblocks from the first launch are consumed and “retired”, at which point you are witnessing the “tail effect”. Once the remaining threadblocks from the first kernel launch drop to a small number, then there is now “room” for some of the threadblocks from the second kernel launch to begin executing. And so you witness a small amount of overlap of the “tail” of one kernel with the “head” of the other.

Before you say “OK, I will modify this code to just launch 4 blocks” (or something like that), be aware that very short kernels are also hard to witness overlap. In the typical few microseconds from one kernel launch to the next, a very small kernel can completely finish executing, so once again no overlap is witnessed.

Amongst other requirements, concurrent kernels require long enough execution times to be able to witness overlap, with low enough resource requirements to permit co-residency. This is not a typical coding outcome, so to create a synthetic example may require some careful thought. Which is why you may wish to start by studying the aforementioned cuda sample code.