Unable to achieve concurrency in kernel launches

Platform: Jetson TK1

I am unable to achieve kernel concurrency with the following code:

Kernel:

__global__ void add_matrix(char *m1, char *m2, char *m3, char *outm, int *grayWidthStep, int *offset)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	// Bounds check against the 960x540 processing region
	if (i >= 960 || j >= 540)
		return;

	const int gray_tid = j * (*grayWidthStep) + (i * 3) + (*offset);

	// Per-channel product of the three inputs. Note: the original code
	// reused m2 where m3 was presumably intended on the next two lines.
	outm[gray_tid]     = m1[gray_tid]     * m2[gray_tid]     * m3[gray_tid];
	outm[gray_tid + 1] = m1[gray_tid + 1] * m2[gray_tid + 1] * m3[gray_tid + 1];
	outm[gray_tid + 2] = m1[gray_tid + 2] * m2[gray_tid + 2] * m3[gray_tid + 2];
}

Two streams:

for (int i = 0; i < stream_count; ++i)
	cudaStreamCreate(&stream[i]);

Memory allocations:

// cudaMalloc for each matrix; showing just one here.
cudaMalloc((char **)&imageMain_d, input_size);

Memory copies:

// cudaMemcpyAsync for each matrix.
cudaMemcpyAsync(imageMain_d, imageMain.ptr(), input_size, cudaMemcpyHostToDevice, stream[0]);

Grid size:

dim3 block(32, 16, 1);
dim3 grid(((imageMain.cols/2) + block.x - 1)/block.x, ((imageMain.rows/2) + block.y - 1)/block.y);

Kernel launches:

add_matrix<<<grid, block, 0, stream[0]>>>(imageMain_d, imageLogo_d, image3_d, imageout_d, image_step_d, offset_d);

add_matrix<<<grid, block, 0, stream[1]>>>(imageMain_d2, imageLogo_d2, image3_d2, imageout_d2, image_step_d2, offset_d2);

Followed by cudaDeviceSynchronize();

The Visual Profiler shows the two kernels executing one after the other on alternate streams; the same is true of the cudaMemcpyAsync calls, which run one after the other on alternate streams.

  1. There was a CUDA bug affecting concurrent kernel usage on the Jetson TK1. I can see you’ve commented on other threads that mention this, so you are already aware of it. Make sure you have the latest firmware for the Jetson TK1 loaded on your machine.

  2. Assuming a reasonably sized image, you may be launching 100 or more blocks per kernel. The Jetson TK1 has a single Kepler SMX, and 4 blocks of 512 threads each will fill that SMX, mostly preventing any concurrency with the next kernel launch.

In general, concurrent kernel execution is somewhat hard to witness. You need kernels with relatively small resource usage (so they can share the GPU's resources) that also take relatively long to run (so you have an opportunity to witness concurrency in spite of the inevitable launch-time difference, e.g. 3-5 us minimum).
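As a minimal sketch (not your code) of a case where overlap is actually visible in the profiler: two launches of a kernel that uses only one small block but busy-waits for a long time, issued into separate streams.

```cuda
#include <cuda_runtime.h>

// One small block that spins for ~cycles GPU clocks: tiny resource
// footprint, long runtime - the easiest case to see overlap.
__global__ void spin(clock_t cycles)
{
    clock_t start = clock();
    while (clock() - start < cycles) { }   // busy-wait
}

int main()
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // One block of 32 threads leaves the SMX almost empty, so the
    // second launch is not blocked waiting for resources.
    spin<<<1, 32, 0, s0>>>(100000000);
    spin<<<1, 32, 0, s1>>>(100000000);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}
```

If these two launches do not overlap in the profiler timeline, the platform itself (driver/firmware) is preventing concurrency; if they do, the earlier resource argument explains your original kernels.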

You haven't shown the profiler output, so maybe you just expect too much:

  1. Desktop/mobile GPUs have only one async copy engine, so the next copy operation starts only when the previous one has finished.
  2. As txbob already mentioned, the next kernel starts to execute only once all blocks of the previous kernel have been started, so you can only overlap the last few blocks of the first kernel with the first few blocks of the second kernel. Note that if two kernels run in the same stream, the GPU finishes all blocks of the first kernel before starting any block of the second one, so with two streams you get a substantial speed improvement only when the kernels are fairly small and the “tail effect” significantly hurts performance.

So two async copy operations cannot overlap, and two kernels can overlap only at their heads/tails. The most significant benefit of multiple streams, therefore, is overlapping async copies with kernel execution: the first stream can copy data while the second stream executes a kernel. If you have that, and your kernels are sufficiently large, it's all you can get.
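That copy/compute overlap pattern can be sketched as follows (hypothetical buffer and kernel names, not your code). One detail worth checking in your program: cudaMemcpyAsync only overlaps with kernel execution when the host buffer is pinned (cudaMallocHost / cudaHostAlloc); with an ordinary pageable buffer such as a default cv::Mat, the copies behave essentially synchronously.

```cuda
#include <cuda_runtime.h>

__global__ void process(char *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;                   // placeholder work
}

int main()
{
    const int n = 1 << 20;
    char *h_buf[2], *d_buf[2];
    cudaStream_t stream[2];

    for (int s = 0; s < 2; ++s) {
        cudaMallocHost(&h_buf[s], n);   // pinned host memory: required
                                        // for truly asynchronous copies
        cudaMalloc(&d_buf[s], n);
        cudaStreamCreate(&stream[s]);
    }

    for (int s = 0; s < 2; ++s) {
        // The H2D copy in one stream can overlap the kernel running
        // in the other stream.
        cudaMemcpyAsync(d_buf[s], h_buf[s], n,
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
        cudaMemcpyAsync(h_buf[s], d_buf[s], n,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    return 0;
}
```

In the profiler you should then see stream 0's kernel executing underneath stream 1's host-to-device copy, which is the overlap that multiple streams reliably buy you.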