Unable to achieve concurrency in kernel launches

Platform : Jetson TK1

I am unable to achieve kernel concurrency with the following code :-

kernel :

__global__ void add_matrix(char *m1, char *m2, char* m3, char * outm, int *grayWidthStep, int * offset)

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	if(i >= 960)
	if(j >= 540)

	const int gray_tid  = j * (*grayWidthStep) + (i*3) + (*offset);
	outm[gray_tid] = (m1[gray_tid]) * ( m2[gray_tid]) * ( m3[gray_tid]);
	outm[gray_tid + 1] = (m1[gray_tid + 1]) * (m2[gray_tid + 1]) * (m2[gray_tid + 1]);
	outm[gray_tid + 2] = (m1[gray_tid + 2]) * (m2[gray_tid + 2]) * (m2[gray_tid + 1]);


Two streams :

for (int i = 0; i < stream_count; ++i)

Memory allocations :

//cudaMalloc for each matrix . Showing just one here.
cudaMalloc((char **)&imageMain_d,input_size);

Memory copies :

//cudaMemcpyAsync for each matrix.
cudaMemcpyAsync(imageMain_d, imageMain.ptr(), input_size, cudaMemcpyHostToDevice,stream[0]);

Grid Size

dim3 block(32, 16, 1);
dim3 grid(((imageMain.cols/2) + block.x - 1)/block.x, ((imageMain.rows/2) + block.y - 1)/block.y);

Kernel Launches

add_matrix<<<grid,block,0,stream[0]>>>(imageMain_d,imageLogo_d,image3_d,imageout_d,image_step_d, offset_d);


Following by cudaDeviceSynchronize();

The Visual profiler shows both as occuring one after the other on alternate streams
same for cudaMemcpyAsync, one after the other on alternate streams.

  1. There was a CUDA bug affecting concurrent kernel usage on Jetson TK1. I can see you’ve commented on other threads that mention this so you are already aware of it. So make sure you have the latest firmware for Jetson TK1 loaded on your machine.

  2. Assuming a reasonable size image, you may be launching 100 or more blocks per kernel. Jetson has a single Kepler SMX, so 4 blocks of 512 threads each will fill that SMX, mostly preventing any concurrency with the next kernel launch.

In general, concurrent kernel execution is somewhat hard to witness. You must have kernels that have relatively small resource usage (so they can share the GPU resources) but also take relatively long to run (so you have an opportunity to witness concurrency, in spite of the inevitable launch difference time e.g. 3-5us minimum).

you are not shown profile output, so may be you just expect too much:

  1. desktop/mobile GPUs has only 1 async copy engine, so next data copying operation started only when previous one is finished
  2. as txbob already mentioned, the next kernel starts to execute after all invocations of previous kernel were started, so you only can overlay a few last invocations of first kernel with a first few invocations of the second kernel. note that, if two kernels run in the same stream, GPU first finishes all invocations of first kernel prior to starting any invocations of second one, so with two streams you can have substantial speed improvement only when kernels are pretty small and the “tail effect” significantly hurts the performance

so, two async copying operations cannot overlap, and two kernels can overlap only their heads/tails. the most significant benefit of multiple streams, therefore, is overlapping of async copying and kernel execution, i.e. first stream can copy data and at the same time second stream execute kernel. if you have it, and your kernels are sufficiently large, it’s all you can get.