Best general alignment practices for kernel launches

iantsiakkas · November 19, 2018, 9:08pm

From cudaDeviceProp I use maxThreadsPerBlock as a limit to find the highest number of threads I can use per block to saturate the card while staying a multiple of 32.

But what about the size of grids? Should I align them with the warp size too, since small sizes are not sufficient to saturate the GPU? I can launch a lot of small kernels, but what is a good kernel size?

Robert_Crovella · November 19, 2018, 9:22pm

maxThreadsPerBlock is not the right value to use for saturation/occupancy calculations.

Each grid at a minimum should be large enough to saturate the GPU, i.e. maximum number of resident threads per multiprocessor * number of multiprocessors. Don’t choose grid sizes smaller than this if it is convenient for you to meet this goal.

Note that the introduction of the Turing architecture (cc 7.5) has changed the long-constant (cc 3.0 to 7.0) maximum of 2048 threads per SM to 1024 threads per SM:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability

Also, due to kernel launch overhead (if for no other reason), there is no benefit to breaking the work of a single large kernel into smaller kernels solely for the purpose of having smaller kernels.

iantsiakkas · November 19, 2018, 10:01pm

So the equation is: [...] * (prop.multiProcessorCount) = good grid size?

Robert_Crovella · November 19, 2018, 10:31pm

No, I would do:

prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount

as a desirable minimum
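
As a rough sketch of that minimum, here is the calculation written as a plain host-side C++ helper. The name `saturatingThreadCount` and its parameters are made up for illustration and are not part of the CUDA runtime API; in a real program the two inputs would come from the `maxThreadsPerMultiProcessor` and `multiProcessorCount` fields of a `cudaDeviceProp` filled in by `cudaGetDeviceProperties`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a CUDA API call): the smallest total thread count
// that can occupy every resident-thread slot on every multiprocessor.
// maxThreadsPerSM stands in for prop.maxThreadsPerMultiProcessor,
// smCount stands in for prop.multiProcessorCount.
std::int64_t saturatingThreadCount(int maxThreadsPerSM, int smCount)
{
    assert(maxThreadsPerSM > 0 && smCount > 0);
    // Widen before multiplying so the product cannot overflow int.
    return static_cast<std::int64_t>(maxThreadsPerSM) * smCount;
}
```

For example, a hypothetical device with 40 multiprocessors and 1024 resident threads per multiprocessor would give `saturatingThreadCount(1024, 40) == 40960`. This counts thread slots only; register or shared memory usage can lower what a particular kernel actually achieves.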

iantsiakkas · November 19, 2018, 10:57pm

Thank you very much!

iantsiakkas · November 20, 2018, 12:05am

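So after a little thought I have come up with the following:

void configurator(cudaDeviceProp *prop, int dimx, int dimy, int *thrx, int *thry, int *thrz, int *blcx, int *blcy, int *blcz)
{
	// Total resident threads across all multiprocessors: the minimum
	// number of threads a grid needs in order to saturate the GPU.
	int maxThreadsPerGrid = prop->maxThreadsPerMultiProcessor * prop->multiProcessorCount;
	int maxThreadsPerBlock = prop->maxThreadsPerBlock;

	// Smallest number of blocks that saturates the GPU when every block
	// uses maxThreadsPerBlock threads.
	int smlBlockSize = maxThreadsPerGrid / maxThreadsPerBlock;

	...

}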

Which in my case means:
maxThreadsPerBlock = 1024
maxThreadsPerMultiProcessor = 2048

maxThreadsPerGrid = 28 * 2048 = 57344

smlBlockSize = 57344 / 1024 = 56

So a minimum of 56 blocks should be used if each block has 1024 threads.

Correct?

Robert_Crovella · November 20, 2018, 1:36am

yes
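
The arithmetic in this thread happens to divide evenly, but other block sizes may not, so a minimal sketch of the same calculation with a rounded-up division might look like the following. `minBlocksToSaturate` is a hypothetical helper, not something from the thread or the CUDA API, and its three parameters stand in for `prop->maxThreadsPerMultiProcessor`, `prop->multiProcessorCount`, and whatever block size is chosen.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks whose combined thread count
// reaches maxThreadsPerSM * smCount for the given block size.
int minBlocksToSaturate(int maxThreadsPerSM, int smCount, int threadsPerBlock)
{
    assert(maxThreadsPerSM > 0 && smCount > 0 && threadsPerBlock > 0);
    // Widen before multiplying so the product cannot overflow int.
    const long long residentThreads =
        static_cast<long long>(maxThreadsPerSM) * smCount;
    // Ceiling division: a partially needed final block still counts as a block.
    return static_cast<int>((residentThreads + threadsPerBlock - 1) / threadsPerBlock);
}
```

With the figures quoted above, `minBlocksToSaturate(2048, 28, 1024)` gives 56, and the same device with 256-thread blocks gives 224. The helper only compares thread counts; it does not check whether a given block size can actually fill a multiprocessor once per-block resource limits are taken into account.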
