Best general alignment practices for kernel launches

From cudaDeviceProp I use maxThreadsPerBlock as a limit to find the highest number of threads I can use per block to saturate the card and be a multiple of 32.

But what about the size of the grid? Should I align it with the warp size too, given that small sizes are not sufficient to saturate the GPU? I can launch a lot of small kernels, but what is a good kernel size?
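For the block-size part of the question, a minimal host-side sketch of rounding a requested block size down to a multiple of the warp size, while respecting maxThreadsPerBlock, could look like this. The helper name alignedBlockSize and its parameters are made up for illustration; in a real program warpSize and maxThreadsPerBlock would be read from cudaDeviceProp.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of the CUDA API): returns the largest block
// size that is a multiple of warpSize and exceeds neither the requested size
// nor the device limit maxThreadsPerBlock.
int alignedBlockSize(int requested, int warpSize, int maxThreadsPerBlock) {
    assert(requested > 0 && warpSize > 0 && maxThreadsPerBlock >= warpSize);
    int capped = std::min(requested, maxThreadsPerBlock);
    int aligned = (capped / warpSize) * warpSize;  // round down to a warp multiple
    return std::max(aligned, warpSize);            // never go below one full warp
}
```

With a warp size of 32 and a limit of 1024, a request for 1000 threads is rounded down to 992.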

maxThreadsPerBlock is not the right value to use for saturation/occupancy calculations.

Each grid at a minimum should be large enough to saturate the GPU, i.e. maximum number of resident threads per multiprocessor * number of multiprocessors. Don’t choose grid sizes smaller than this if it is convenient for you to meet this goal.

Note that the introduction of the Turing architecture (cc 7.5) changed the long-constant (cc 3.0 to 7.0) maximum of 2048 threads per SM to 1024 threads per SM:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability

Also, because of kernel launch overhead (if for no other reason), there is no benefit to breaking the work of a single large kernel into smaller kernels solely for the purpose of having smaller kernels.

So the equation is: (…) * (prop.multiProcessorCount) = good grid size?

No, I would do:

prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount

as a desirable minimum
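As a rough sketch of that product, here is a host-only version that does not need a GPU to run. The struct DeviceLimits and the function saturatingThreadCount are hypothetical names introduced for this example; the struct only stands in for the two cudaDeviceProp fields named above, which a real program would fill in via cudaGetDeviceProperties.

```cpp
#include <cassert>

// Hypothetical stand-in for the two cudaDeviceProp fields used here.
struct DeviceLimits {
    int maxThreadsPerMultiProcessor;  // max resident threads per SM
    int multiProcessorCount;          // number of SMs on the device
};

// Thread count that fills every SM with its maximum number of resident
// threads: the "desirable minimum" total number of threads in a grid.
long long saturatingThreadCount(const DeviceLimits& d) {
    assert(d.maxThreadsPerMultiProcessor > 0 && d.multiProcessorCount > 0);
    return static_cast<long long>(d.maxThreadsPerMultiProcessor) *
           d.multiProcessorCount;
}
```

Taking 28 SMs as an example, a device with 2048 threads per SM needs 57344 threads to fill it, while a device with 1024 threads per SM needs 28672.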

Thanks a lot!

So after a little thought I have come up with the following:

void configurator(cudaDeviceProp *prop, int dimx, int dimy,
                  int *thrx, int *thry, int *thrz,
                  int *blcx, int *blcy, int *blcz)
{
	// Threads needed to fill every SM (a minimum for the grid, despite the name)
	int maxThreadsPerGrid = prop->maxThreadsPerMultiProcessor * prop->multiProcessorCount;
	int maxThreadsPerBlock = prop->maxThreadsPerBlock;

	// Smallest number of full-size blocks that reaches that thread count
	int smlBlockSize = maxThreadsPerGrid / maxThreadsPerBlock;

	...

}

Which in my case means:
maxThreadsPerBlock = 1024
maxThreadsPerMultiProcessor = 2048

maxThreadsPerGrid = 28 * 2048 = 57344

smlBlockSize = 57344 / 1024 = 56

So a minimum of 56 blocks should be used if each block has 1024 threads.

Correct?

yes
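The arithmetic above can be sketched as a small host-side function. The name minBlocksToSaturate and its parameter list are assumptions for this example; it rounds up rather than down, so that block sizes which do not divide the thread count evenly still reach the minimum. It only counts threads, so per-block register or shared memory usage can still lower the occupancy actually achieved.

```cpp
#include <cassert>

// Hypothetical helper: minimum number of blocks of threadsPerBlock threads
// needed to reach maxThreadsPerMultiProcessor * multiProcessorCount threads.
int minBlocksToSaturate(int maxThreadsPerMultiProcessor,
                        int multiProcessorCount,
                        int threadsPerBlock) {
    assert(maxThreadsPerMultiProcessor > 0 && multiProcessorCount > 0);
    assert(threadsPerBlock > 0);
    long long threads = static_cast<long long>(maxThreadsPerMultiProcessor) *
                        multiProcessorCount;
    // Ceiling division, so a partial last block still counts as a block.
    return static_cast<int>((threads + threadsPerBlock - 1) / threadsPerBlock);
}
```

For the numbers in this thread (2048 threads per SM, 28 SMs), this gives 56 blocks of 1024 threads, or 224 blocks of 256 threads.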