low concurrency and low kernel utilization, but kernels are filled.

I have written the following code, and I cannot understand why nvvp tells me that my cuda kernels have low concurrency and low kernel utilization.

I launch a kernel, always close to 1024 threads with block size not exceeding 256 in count.

The kernels basically multiply two arrays and write in one third, where they write alone, without atomic or any other form of utilization.

What am I doing so wrong ?
Thanks in advance.

max_buffer = (int) (maxThreadsPerBlock/(N_cr*M_cr*colour_t));

	while(nimgf_tmp > 0)
	{
		/*------------ Kernel Arg Config ------------*/

		buffer = 0;

		while(buffer<max_buffer && buffer<nimgf_tmp) buffer++;

		nimgf_tmp -= buffer;

		dim3 bCINNum(N_cd, M_cd, colour_b);
		dim3 tCINNum(buffer*N_cr, M_cr, colour_t);		

		compute_matrix
		<<<bCINNum, tCINNum, 0, cmptStreams[strm]>>>
		(in_gpu, N_in, M_in, cr_gpu, N_cr, M_cr,
                 ci_gpu, N_ci, M_ci, prv_in, prv_ci, nimgf);
		
		_CUDA(cudaGetLastError());

		prv_in += buffer*N_in*M_in*colour_t;
		prv_ci += buffer*N_ci*M_ci*colour_t;

		strm ++;
	}

	for(int i=0; i<strm; i++)
		cudaStreamSynchronize(cmptStreams[i]);

It may be that you’re not understanding the terminology that nvvp uses.

concurrency as used by nvvp (and most other cuda usages that I am aware of) refer to the idea that you have either:

  1. concurrent kernels

or

  1. one or more kernels executing at the same time as host<->device copying.

Compute utilization refers to the following measurement:

the cumulative duration of the application timeline in which one or more kernels are running

divided by

the overall timeline duration

Many ordinary applications written in CUDA don’t score particularly high on either of these measurements.

If you want to improve these measurements, you should:

  1. investigate copy/compute overlap
  2. seek to make it so that your application spends most of its time running GPU kernels.

https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

This explains a lot and thank you very much for your quick response.

If I have understood you correctly:

  1. Concurrency is not achievable between highly utilized GPU kernels, that use more than 32 blocks with over 800 threads each never exceeding 1024.
    What I could change is probably, using streams in order to cpy files and than mask the following copies with alot of compute, but not with multiple heavy kernel calls.

  2. Utilization is not achievable because nvvp divides the time it took to run all the kernels with the end-to-end execution time of the program.
    I should probably count it my self with cudaEvents.

So the use of streams is for copies and compute concurrently or multiple copies but not multiple computes.

Yes, kernels that fully saturate the machine (and therefore prevent kernel concurrency) are usually preferable.

Overlap of copy and compute is still possible with such kernels.

Perfect.

There are four other things that I would like to ask you:

Is it a better practice to launch many kernels with thread sizes very close to 32 (ex. 29) or less kernels very close to 1024 (ex. 980).
What I am practically asking is if the gpgpu is clever enough to stack them in a warp if they are smaller so they utilize the card better, theoretically speaking, 29x35 = 1015x2 = 2030/2048 = 99,1% achieving which is more than 980x2 = 1960/2048 = 95,7% occupancy?

Lets say I have 32 arrays of size 32x32 and I want to duplicate every point in it.

Is it better to

<<<(2, 2, 8), (2, 16, 32)>>> -> <<<32, 1024>>

(for every position a thread which will basically compute one multiplication)

or is it better to

<<<(2, 2, 8), (2, 16, 1)>>> -> <<<32, 32>>

(every thread has to compute 32 positions)

Should I try and make bigger grids (with more blocks) so that I have less kernel calls, or smaller grids with more kernel calls

Should my thread count be a multiple of 32 (so even if lets say I need 1000 threads I use 1024) ?
But than every kernel call has to have an if condition to terminate the excess threads.

Thank you very much for your time.

1 and 3 seem similar. You want threadblocks that are a multiple of 32 and usually at least 64 or 128 threads minimum, in order to maximize occupancy. The problem with really small threadblock sizes, e.g. 32 threads, is that most GPUs have a limit of 16 or 32 threadblocks maximum per SM, so with these limits, a kernel with really small threadblocks of say 32 threads cannot deposit 2048 threads on a single SM, to maximize occupancy. We generally prefer large grids vs. multiple small grids, for a number of reasons, one of them being kernel launch overhead.

  1. A threadblock size of 32 is not a good choice. See the above description. But if you were comparing total thread counts (e.g. instead varying number of blocks) or if you were comparing two larger threadblock sizes, e.g. 128 and 256, then there usually isn’t a lot of difference. Once your kernel has enough threads to satisfy occupancy, it may be slightly advantageous to not increase the total number of threads, but instead to add loops (because work or exposed parallelism per thread is also a figure of merit). However, to write forward looking code (e.g. that works well on Pascal but also on Volta) you may want to have larger kernels in terms of threads, so that you can saturate future GPUs. All of this can be calculated at run-time if you wish, to attempt to satisfy both competing objectives.

  2. We normally like to think of this across an entire grid, not a single threadblock. In that case, for large grids, choosing a threadblock size of 1024 is usually more efficient than 1000. This answer gives an example why:

https://stackoverflow.com/questions/26611241/why-launch-a-multiple-of-32-number-of-threads-in-cuda/26611959#26611959

The occupancy calculator spreadsheet included with the CUDA toolkit is also a good learning tool.

Thank you very much for you prompt and informative answers

See you soon, probably with many more questions, waiting for answers !