Find the right set of blocks, threads

Hello,

I have to write a kernel that adds two vectors.
On specific hardware (e.g. the TX2), is there a defined procedure for finding the right combination of
B blocks and T threads?

Assuming the vector has N elements, I can run:
kernel_name <<<1,N>>> (argument list) or:
kernel_name <<<N,1>>> (argument list) or:
kernel_name <<<B,T>>> (argument list) if B*T = N.
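
For concreteness, a minimal sketch of the kernel I have in mind (all names are placeholders, and I guard the index in case B*T overshoots N):

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: B*T may be larger than N
        c[i] = a[i] + b[i];
}

// launched so that B*T >= N:
int T = 256;
int B = (N + T - 1) / T;             // round up
vec_add<<<B, T>>>(d_a, d_b, d_c, N);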

Can I use cudaGetDeviceProperties to find the combination that will give the best performance?

If I understand correctly, shared memory is not used in this kernel, because there is only one operation (+) on the data in global memory. Am I right?

Thank you,
Zvika

No, the properties struct only has information on the capabilities of the current device. From it you can deduce the limits within which a launch is legal, but not which configuration performs best. What you CAN use to help you find an optimal launch configuration for your kernel are a few occupancy API functions. See here: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__OCCUPANCY.html
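
As a quick sketch of the runtime-API counterpart (assuming the vec_add kernel from your question):

int minGridSize = 0, blockSize = 0;
// asks the runtime for the block size that maximizes occupancy for this kernel
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vec_add, 0, 0);

// then cover the N elements with that block size
int gridSize = (N + blockSize - 1) / blockSize;
vec_add<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);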

Since I myself have asked this question a million times, I would say the “easiest” way is to first understand occupancy and how it affects performance. Read here, it really helps: https://devblogs.nvidia.com/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/

This discussion is also good and I found myself reading it over and over. Pay special attention to txbob’s answer: https://devtalk.nvidia.com/default/topic/1026825/cuda-programming-and-performance/how-to-choose-how-many-threads-blocks-to-have-/

You will find out yourself that a block size (number of threads) like 64, 128 or 256 will generally provide better occupancy (you need to launch a multiple of 32 threads to fill whole warps). The grid size (number of blocks) determines the utilization of the device. For example, in a program I am writing right now, a kernel launch of <<< 200, 256 >>> yielded ~89% utilization according to the NVIDIA Visual Profiler; changing to <<< 2000, 256 >>> bumped it to ~98%. There is no definitive answer here, as it depends on the specific problem you are solving, but make sure you launch 64 or more threads per block and enough blocks to keep the device as busy as possible (while respecting the device’s capacities).
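
One common pattern that decouples the grid size from N is a grid-stride loop; a sketch, reusing the hypothetical vec_add names from above:

__global__ void vec_add_gs(const float *a, const float *b, float *c, int n)
{
    // each thread handles multiple elements, stepping by the total number of
    // threads in the grid, so any <<<B, T>>> covers all n elements and B can
    // be tuned for utilization independently of n
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        c[i] = a[i] + b[i];
}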
Have a look at the occupancy calculator spreadsheet, as it shows how parameters such as block size, registers per thread and shared memory per block affect occupancy.

No. Dynamic shared memory is sized by the (optional) 3rd launch parameter, which is the allocation size in bytes; the memory itself is declared as an extern array in your kernel function (which we don’t have here). If there is no 3rd parameter in your launch configuration, shared memory can still be declared statically in the kernel function (with an array length known at compile time, rather than an allocation size).
Read here: https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
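
To illustrate the two forms (hypothetical block-reverse kernels, not your vector add):

// static: array length fixed at compile time, no 3rd launch parameter
__global__ void reverse_static(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                 // threads read each other's elements below
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// dynamic: size in BYTES comes from the 3rd launch parameter
__global__ void reverse_dynamic(const float *in, float *out)
{
    extern __shared__ float tile[];  // size set at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// (assumes the grid exactly covers the input)
// reverse_static <<<B, 256>>>(d_in, d_out);
// reverse_dynamic<<<B, 256, 256 * sizeof(float)>>>(d_in, d_out);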

Hi saulocpp,

Thank you very much for the detailed answer.

Best regards,
Zvika