Find the right set of blocks, threads

Hello,

I have to write a kernel that adds two vectors.
On specific hardware (e.g. the TX2), is there a defined procedure for finding the right combination of
B blocks and T threads?

Assuming the vector has N elements, I can run:
kernel_name <<<1,N>>> (argument list) or:
kernel_name <<<N,1>>> (argument list) or:
kernel_name <<<B,T>>> (argument list) if B*T = N.
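
For concreteness, a minimal sketch of the kernel I have in mind (all names are placeholders, and I guard the index in case B*T overshoots N):

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: B*T may be larger than N
        c[i] = a[i] + b[i];
}

// launched so that B*T >= N:
int T = 256;
int B = (N + T - 1) / T;             // round up
vec_add<<<B, T>>>(d_a, d_b, d_c, N);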

Can I use cudaGetDeviceProperties to find the combination that will give the best performance?

If I understand correctly, shared memory is not used in this kernel, because there is only one operation (+) on the data in global memory. Am I right?

Thank you,
Zvika

No, the properties struct only has information on the capabilities of the current device. From it you can deduce the limits within which a launch is legal, but not which configuration performs best. What you CAN use to help you find an optimal launch configuration for your kernel are a few occupancy API functions. See here: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__OCCUPANCY.html
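
As a quick sketch of the runtime-API counterpart (assuming the vec_add kernel from your question):

int minGridSize = 0, blockSize = 0;
// asks the runtime for the block size that maximizes occupancy for this kernel
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vec_add, 0, 0);

// then cover the N elements with that block size
int gridSize = (N + blockSize - 1) / blockSize;
vec_add<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);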

Since I myself have asked this question a million times, I would say the “easiest” way is to first understand occupancy and how it affects performance. Read here, it really helps: https://devblogs.nvidia.com/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/

This discussion is also good and I found myself reading it over and over. Pay special attention to txbob’s answer: https://devtalk.nvidia.com/default/topic/1026825/cuda-programming-and-performance/how-to-choose-how-many-threads-blocks-to-have-/

You will find out yourself that a block size (number of threads) like 64, 128 or 256 will generally provide better occupancy (you need to launch a multiple of 32 threads to fill whole warps). The grid size (number of blocks) determines the utilization of the device. For example, in a program I am writing right now, a kernel launch of <<< 200, 256 >>> yielded ~89% utilization according to the NVIDIA Visual Profiler; changing to <<< 2000, 256 >>> bumped it to ~98%. There is no definitive answer here, as it depends on the specific problem you are solving, but make sure you launch 64 or more threads per block and enough blocks to keep the device as busy as possible (while respecting the device’s capacities).
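
One common pattern that decouples the grid size from N is a grid-stride loop; a sketch, reusing the hypothetical vec_add names from above:

__global__ void vec_add_gs(const float *a, const float *b, float *c, int n)
{
    // each thread handles multiple elements, stepping by the total number of
    // threads in the grid, so any <<<B, T>>> covers all n elements and B can
    // be tuned for utilization independently of n
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        c[i] = a[i] + b[i];
}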
Have a look at the occupancy calculator spreadsheet, as it shows how parameters such as block size, registers per thread and shared memory per block affect occupancy.

No. Dynamic shared memory is sized by the (optional) 3rd launch parameter, which is the allocation size in bytes; the memory itself is declared as an extern array in your kernel function (which we don’t have here). If there is no 3rd parameter in your launch configuration, shared memory can still be declared statically in the kernel function (with an array length known at compile time, rather than an allocation size).
Read here: https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
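
To illustrate the two forms (hypothetical block-reverse kernels, not your vector add):

// static: array length fixed at compile time, no 3rd launch parameter
__global__ void reverse_static(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                 // threads read each other's elements below
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// dynamic: size in BYTES comes from the 3rd launch parameter
__global__ void reverse_dynamic(const float *in, float *out)
{
    extern __shared__ float tile[];  // size set at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// (assumes the grid exactly covers the input)
// reverse_static <<<B, 256>>>(d_in, d_out);
// reverse_dynamic<<<B, 256, 256 * sizeof(float)>>>(d_in, d_out);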

Hi saulocpp,

Thank you very much for the detailed answer.

Best regards,
Zvika