Suppose I have n similar (they execute the same code) and independent (no inter-thread communication needed) tasks. How do I choose a proper grid size?
#1 Make the grid size equal to the number of SMs. Assign tasks 1 → n/grid_size to block 1, tasks n/grid_size + 1 → 2 * n/grid_size to block 2, and so on.
Then each thread in a given block is assigned n / grid_size / block_size tasks, and each thread does:
```
for i in task_set:
    task(i)
```
#2 Assign every task to a unique thread. Namely, the thread at global index blockIdx.x * blockDim.x + threadIdx.x is assigned task(blockIdx.x * blockDim.x + threadIdx.x).
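A minimal CUDA sketch of the two strategies above, assuming a hypothetical device function task(int i) and (for strategy #1) that n divides evenly among blocks and threads:

```cuda
// Hypothetical device function representing one task; not from the original post.
__device__ void task(int i);

// Strategy #1: grid size = number of SMs; each thread loops over its own
// contiguous chunk of tasks.
__global__ void chunked(int n)
{
    int tasks_per_block  = n / gridDim.x;                 // assumes n % gridDim.x == 0
    int tasks_per_thread = tasks_per_block / blockDim.x;  // assumes even division
    int first = blockIdx.x * tasks_per_block + threadIdx.x * tasks_per_thread;
    for (int i = first; i < first + tasks_per_thread; ++i)
        task(i);
}

// Strategy #2: one thread per task.
__global__ void one_per_thread(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)   // guard in case the grid overshoots n
        task(i);
}
```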
As far as I know, the number of SMs is fixed for a given GPU. Does that mean that to get the best performance, I only need to keep the grid size larger than the number of SMs?
(1) Each thread is responsible for producing one output element
(2) Choose between 128 and 256 threads per thread block (a multiple of 32)
(3) Make a 1D grid that comprises enough blocks that the total number of threads covers all output elements
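Put together, steps (1)–(3) look like this (a sketch only; the kernel name and per-element work are placeholders, not from the original answer):

```cuda
__global__ void produce(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // (1) one output element per thread
    if (i < n)                                      // the grid may overshoot n; guard it
        out[i] = in[i] * 2.0f;                      // placeholder per-element work
}

// Host-side launch configuration:
int block = 256;                      // (2) a multiple of 32, in the 128-256 range
int grid  = (n + block - 1) / block;  // (3) enough blocks to cover all n elements
produce<<<grid, block>>>(out, in, n);
```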
This has been covered in these forums multiple times. Of course, numerous variants and modifications are possible depending on the details of the processing. For example, 2D grids may be more naturally suited to processing 2D images, where each thread block produces one tile of pixels in the image.
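For the 2D-image variant mentioned above, a sketch might look like the following (the 16×16 tile size, kernel name, and pixel operation are illustrative assumptions):

```cuda
__global__ void process(uchar4 *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column within the image
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row within the image
    if (x < width && y < height)                    // image need not be tile-aligned
        img[y * width + x] = make_uchar4(255, 0, 0, 255);  // placeholder pixel op
}

// Host side: each 16x16 block produces one 16x16 tile of pixels.
dim3 block(16, 16);
dim3 grid((width + 15) / 16, (height + 15) / 16);
process<<<grid, block>>>(img, width, height);
```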