Can cudaOccupancyMaxPotentialBlockSize(…) guarantee that the block size it calculates will launch the kernel without any resource-related failure?
I am using launch bounds and I don't know if it is worth it. I am following the CUDA best practices examples:
I would be surprised if the returned maximum potential block size were not a multiple of the warp size. If you want to be sure, round it down to a multiple of the warp size. That also keeps your program correct on any future architecture.
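A minimal sketch of that round-down step, in case it helps. The helper name `roundDownToWarpMultiple` is made up for illustration, and the warp size is passed in as a parameter; I am assuming you read it from `cudaDeviceProp::warpSize` rather than hard-coding 32.

```cuda
// Hypothetical helper (not part of the CUDA runtime): returns the largest
// multiple of warpSize that is <= blockSize. If blockSize is already smaller
// than one warp (or warpSize is invalid), it is returned unchanged, because
// there is nothing to round down to.
inline int roundDownToWarpMultiple(int blockSize, int warpSize)
{
    if (warpSize <= 0 || blockSize < warpSize) {
        return blockSize;
    }
    return (blockSize / warpSize) * warpSize;
}
```

You would call it on the `blockSize` value written by the occupancy function, before computing the grid size.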
There could be an exception to the warp-multiple rule when the last parameter (blockSizeLimit) is a small number. I am not sure it is a good idea to pass ArraySize there. Try calling cudaOccupancyMaxPotentialBlockSize with 0 (the default) for that parameter and use ArraySize only to calculate gridSize.
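Roughly what I mean, as a sketch only. `scaleKernel`, `gridSizeFor` and `launchScale` are placeholder names I made up, since I have not seen your actual kernel. Checking `cudaGetLastError()` right after the launch is how a resource-related launch failure would show up.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one (assumption).
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;
    }
}

// Number of blocks needed to cover n elements (ceiling division).
inline int gridSizeFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Picks a block size with the occupancy helper, then launches.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t launchScale(float *d_data, int arraySize)
{
    int minGridSize = 0;
    int blockSize = 0;

    // dynamicSMemSize = 0 and blockSizeLimit = 0: no extra limit is imposed.
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, scaleKernel, 0, 0);
    if (err != cudaSuccess) {
        return err;
    }
    if (arraySize <= 0 || blockSize <= 0) {
        return cudaErrorInvalidValue;
    }

    // ArraySize is used only here, to size the grid.
    int gridSize = gridSizeFor(arraySize, blockSize);
    scaleKernel<<<gridSize, blockSize>>>(d_data, arraySize);

    // Reports launch-configuration errors from the launch above.
    return cudaGetLastError();
}
```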
@Robert_Crovella @Curefab Will cudaOccupancyMaxPotentialBlockSize() take into account the launch bounds set on the kernel? It seems that this function becomes obsolete once launch bounds are used. Since launch bounds restrict register usage, I assume that the block size for maximum occupancy will increase.
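One way to check this empirically is to compare what `cudaFuncGetAttributes` reports for the kernel with what the occupancy function suggests. This is only a sketch: `boundedKernel`, the bound of 256, and the helper names are made up for illustration, and the output should be read as an observation on one device and compiler version rather than a documented guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with an assumed launch bound of 256 threads per block.
__global__ void __launch_bounds__(256) boundedKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

// True when the suggested block size is larger than the per-kernel limit.
inline bool exceedsKernelLimit(int blockSize, int maxThreadsPerBlock)
{
    return blockSize > maxThreadsPerBlock;
}

// Prints the per-kernel limits next to the occupancy suggestion.
cudaError_t reportBoundedKernel()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, boundedKernel);
    if (err != cudaSuccess) {
        return err;
    }

    int minGridSize = 0;
    int blockSize = 0;
    err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, boundedKernel, 0, 0);
    if (err != cudaSuccess) {
        return err;
    }

    printf("registers per thread:        %d\n", attr.numRegs);
    printf("kernel maxThreadsPerBlock:   %d\n", attr.maxThreadsPerBlock);
    printf("suggested block size:        %d\n", blockSize);
    printf("suggestion exceeds limit:    %s\n",
           exceedsKernelLimit(blockSize, attr.maxThreadsPerBlock) ? "yes" : "no");
    return cudaSuccess;
}
```

Building the same kernel with and without `__launch_bounds__` and comparing the printed values shows whether the bound changes the register count and the suggested block size in your case.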