Getting the best performance

Hi,
The code below is typical before a kernel invocation.

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
increment_a_2D_array[blockspergrid, threadsperblock](an_array)

In my case, on a Jetson Nano:

  1. What is the maximum number of threads that can execute at one time?
  2. Does block size have an effect on kernel performance, i.e. kernel execution time? In my experience, when I reduced the block size the kernel crashed, while increasing the block size made the kernel finish faster.

Thanks,

  1. The maximum grid and block dimensions that you can pass to a kernel launch are listed in the deviceQuery output and in the programming guide, and are covered in numerous forum questions. If you’re asking about the maximum instantaneous capacity of your GPU, that number depends to some degree on the code you are running, but its upper bound is the maximum number of threads per SM times the number of SMs in your GPU. Both numbers are also available in the deviceQuery output and in the programming guide. The table numbers for this data vary by guide version, but in the most recent guide they are in the vicinity of table 17/18.

  2. Yes, some of the choices you make can impact performance. Many kernels run well with a block size of 512 threads per block (assuming the grid is chosen appropriately around that), and often also with other block sizes such as 256 or 128. Extreme choices, such as 32 or fewer threads per block and occasionally sizes larger than 512, may be an issue, to some degree depending on the GPU. This is an involved topic with many possible sub-questions, but with a bit of searching you can find other forum questions that cover many of these sub-topics. Here is a list of possibly related questions/answers, for example. If a kernel “crashes”, it’s entirely possible that a block size choice is the cause, but I wouldn’t start there. I would use basic debugging techniques, such as isolating the issue with compute-sanitizer. Do as you wish, of course.
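
To make the arithmetic in both points concrete, here is a minimal sketch in plain Python (no GPU required). It computes the upper bound from point 1 (threads per SM times number of SMs) and shows how the ceiling division from the question changes the grid as the block shape changes. The helper names `max_resident_threads` and `launch_config`, the array shape, and the candidate block shapes are all hypothetical. The values `SM_COUNT = 1` and `MAX_THREADS_PER_SM = 2048` are assumptions for a Jetson Nano and should be checked against your own deviceQuery output.

```python
import math


def max_resident_threads(sm_count, max_threads_per_sm):
    """Upper bound on threads resident on the GPU at one instant."""
    return sm_count * max_threads_per_sm


def launch_config(shape, threadsperblock):
    """Grid size for a 2D array, using the same ceiling division as the question.

    Returns (blockspergrid, threads_per_block, total_threads_launched).
    """
    blockspergrid = (
        math.ceil(shape[0] / threadsperblock[0]),
        math.ceil(shape[1] / threadsperblock[1]),
    )
    threads_per_block = threadsperblock[0] * threadsperblock[1]
    total_threads = threads_per_block * blockspergrid[0] * blockspergrid[1]
    return blockspergrid, threads_per_block, total_threads


# Assumed device figures, for illustration only: read the real ones
# from deviceQuery on your own board.
SM_COUNT = 1
MAX_THREADS_PER_SM = 2048
capacity = max_resident_threads(SM_COUNT, MAX_THREADS_PER_SM)
print("upper bound on resident threads:", capacity)

# Hypothetical array shape and candidate block shapes to compare.
shape = (1000, 600)
for tpb in [(8, 4), (16, 8), (16, 16), (32, 16)]:
    grid, per_block, total = launch_config(shape, tpb)
    # Threads that fall outside the array because of rounding up.
    excess = total - shape[0] * shape[1]
    print(f"block {tpb}: grid {grid}, {per_block} threads/block, "
          f"{total} threads launched, {excess} outside the array")
```

Because the grid is rounded up, some launched threads usually fall outside the array, so the kernel needs a bounds check on its indices before it touches `an_array`. The sketch only covers the launch arithmetic. Which block size is actually fastest has to be measured on the device.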