cudaOccupancyMaxPotentialBlockSize

Hello,

Can cudaOccupancyMaxPotentialBlockSize(…) guarantee that the block size it calculates will launch the kernel without any resource-related failure?
I am using launch bounds and I don't know whether they are worth it. I am following this example from the CUDA best practices:

#define THREADS_PER_BLOCK 256
#if __CUDA_ARCH__ >= 200
    #define MY_KERNEL_MAX_THREADS (2 * THREADS_PER_BLOCK)
    #define MY_KERNEL_MIN_BLOCKS 3
#else
    #define MY_KERNEL_MAX_THREADS THREADS_PER_BLOCK
    #define MY_KERNEL_MIN_BLOCKS 2
#endif
// Device code
__global__ void __launch_bounds__(MY_KERNEL_MAX_THREADS, MY_KERNEL_MIN_BLOCKS) MyKernel(...) { ... }

Thanks

Abdoulaye

Yes, the calculated size should be launchable without resource related failures.

Hi @Robert_Crovella

What should I do if this function gives me a block size that is not a multiple of the warp size (32) and I use shuffle-down within the kernel for the partial-sum reduction?

I would not expect the returned maximum potential block size to be anything other than a multiple of the warp size. If you want to be sure, round it down to a multiple of the warp size. That would also keep your program correct on any future architecture.

A partial (partially active) warp consumes the same resources as a full warp. There would be no reason to return a block size that included a partial warp.
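The rounding suggested above can be sketched as a small host-side helper. This is only an illustration: the name `roundToWarpMultiple` and the hard-coded default of 32 are assumptions, not part of the CUDA API, and real code could read the warp size from `cudaDeviceProp::warpSize` instead. The helper is plain C++ and can be used unchanged in a .cu file.

```cpp
// Hypothetical helper (not part of the CUDA runtime): round a block size
// down to a multiple of the warp size, but never below one full warp.
inline int roundToWarpMultiple(int blockSize, int warpSize = 32)
{
    // Integer division drops the partial warp, e.g. 92 -> 2 warps -> 64.
    const int rounded = (blockSize / warpSize) * warpSize;

    // A block smaller than one warp is bumped up to a full warp. This relies
    // on the assumption that a partial warp occupies the same resources as a
    // full one.
    return rounded < warpSize ? warpSize : rounded;
}
```

For example, a block size of 92 becomes 64, and 768 is left unchanged. If the block size is adjusted this way, the grid size has to be computed from the adjusted value, not from the one the occupancy function returned.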

Unfortunately, yes! Here is my code:

int blockSize, minGridSize, gridSize, ArraySize = 92;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, (void*)Kernel, 0, ArraySize);
gridSize = (ArraySize + blockSize - 1) / blockSize;

Results:

blockSize = 92;
minGridSize = 80;
gridSize = 1;

There could be an exception to the warp-multiple rule when the last parameter (the block size limit) is small. I am not sure that parameter is a good place to put ArraySize. Try calling cudaOccupancyMaxPotentialBlockSize with 0 (the default) and use ArraySize only to calculate gridSize.

I got blockSize = 768 with the default limit of 0. So the block size had been capped by the limit of 92 that I passed in.
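The approach that worked here can be sketched as follows: leave the block size limit at 0 and let the problem size influence only the grid size. This is a minimal sketch, not the poster's actual program. The kernel body, the `float` element type, and the variable names are assumptions added so that the example is complete.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the poster's `Kernel`; its body is an
// assumption made only so that the sketch compiles.
__global__ void Kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the last block may be partly idle
        data[i] *= 2.0f;
}

int main()
{
    const int arraySize = 92;
    float *d_data = nullptr;
    cudaMalloc(&d_data, arraySize * sizeof(float));
    cudaMemset(d_data, 0, arraySize * sizeof(float));

    int minGridSize = 0, blockSize = 0;
    // dynamicSMemSize = 0, blockSizeLimit = 0 (no user-imposed cap).
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, Kernel, 0, 0);
    if (err != cudaSuccess) {
        printf("occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The problem size only determines how many blocks are needed.
    int gridSize = (arraySize + blockSize - 1) / blockSize;

    Kernel<<<gridSize, blockSize>>>(d_data, arraySize);
    err = cudaGetLastError();  // reports launch-configuration failures
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    printf("blockSize=%d minGridSize=%d gridSize=%d status=%s\n",
           blockSize, minGridSize, gridSize, cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

The printed values depend on the device and on how the kernel was compiled. For an array as small as 92 elements a single block is launched and most of its threads are idle, which is why the bounds check inside the kernel matters.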

Thanks

@Robert_Crovella @Curefab Will cudaOccupancyMaxPotentialBlockSize() take into account the launch bounds configured on the kernel? It seems that this function becomes obsolete once launch bounds are used. Since launch bounds restrict register usage, I assume that the block size for maximum occupancy will increase.
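One way to investigate this question is to compare what the runtime reports for the same kernel compiled with and without launch bounds. The sketch below is an assumption-laden experiment, not an authoritative answer: the kernel names, the kernel body, and the bounds `(256, 2)` are made up for illustration. It uses `cudaFuncGetAttributes` to print `maxThreadsPerBlock` and `numRegs` next to the block size suggested by the occupancy function. The expectation it lets you check is that the suggested block size does not exceed the `maxThreadsPerBlock` reported for the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two hypothetical kernels with identical bodies; only the second one
// carries launch bounds (values chosen arbitrarily for illustration).
__global__ void PlainKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void __launch_bounds__(256, 2) BoundedKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Print the compiled kernel's attributes and the suggested block size.
template <typename KernelT>
void report(const char *name, KernelT kernel)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, kernel);
    if (err != cudaSuccess) {
        printf("%s: %s\n", name, cudaGetErrorString(err));
        return;
    }

    int minGridSize = 0, blockSize = 0;
    err = cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                             kernel, 0, 0);
    if (err != cudaSuccess) {
        printf("%s: %s\n", name, cudaGetErrorString(err));
        return;
    }

    printf("%s: maxThreadsPerBlock=%d numRegs=%d -> blockSize=%d minGridSize=%d\n",
           name, attr.maxThreadsPerBlock, attr.numRegs, blockSize, minGridSize);
}

int main()
{
    report("PlainKernel", PlainKernel);
    report("BoundedKernel", BoundedKernel);
    return 0;
}
```

With a kernel this small the register counts may well be identical, so a kernel with heavier register use is likely needed to see launch bounds change the result.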
