Hi,
The code below is typical before a kernel invocation:
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
increment_a_2D_array[blockspergrid, threadsperblock](an_array)
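To make the arithmetic concrete, here is a small pure-Python sketch of the same grid calculation (no GPU needed). The helper name `launch_config` and the 100x50 array shape are my own assumptions for illustration only; they are not part of my actual code.

```python
import math

def launch_config(shape, threadsperblock=(16, 16)):
    """Hypothetical helper: return (blockspergrid, total threads launched)."""
    # Round up in each dimension so the grid covers the whole array.
    blockspergrid = tuple(
        math.ceil(n / t) for n, t in zip(shape, threadsperblock)
    )
    # Every block launches the full threadsperblock[0] * threadsperblock[1] threads.
    total_threads = math.prod(blockspergrid) * math.prod(threadsperblock)
    return blockspergrid, total_threads

# Assumed example shape: 100 x 50 elements.
blocks, total = launch_config((100, 50))
print(blocks, total)  # (7, 4) 7168
```

So for 5000 elements, 7168 threads are launched, and the surplus threads fall outside the array unless the kernel checks its indices against the array shape.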
In my case, on a Jetson Nano:
- What is the maximum number of threads that can execute at one time?
- Does the block size have an effect on the performance of the kernel, i.e. on the kernel execution time? In my experience, when I reduced the block size the kernel crashed, while increasing the block size made the kernel finish faster.
Thanks,