How to find parameters for __launch_bounds__(?,?)

I am using a 1650, for which __CUDA_ARCH__ should be larger than 200.

Following this link, I think I should choose __launch_bounds__ as (256*3, 3):
#define THREADS_PER_BLOCK 256
#if __CUDA_ARCH__ >= 200
#define MY_KERNEL_MAX_THREADS (2 * THREADS_PER_BLOCK)
#define MY_KERNEL_MIN_BLOCKS 3
#else
#define MY_KERNEL_MAX_THREADS THREADS_PER_BLOCK
#define MY_KERNEL_MIN_BLOCKS 2
#endif

// Device code
__global__ void
__launch_bounds__(MY_KERNEL_MAX_THREADS, MY_KERNEL_MIN_BLOCKS)
MyKernel(…)
{

}

When I compare it with (256, 2), I find this is faster!?
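For context on what the two numbers ask of the compiler: as I understand the Programming Guide, the compiler derives a per-thread register limit from the bounds so that the requested number of blocks of the requested size can be resident on one SM. Below is a rough sketch of that arithmetic. The helper `regBudgetPerThread` and the candidate list are made up for illustration, and the estimate ignores register allocation granularity, so it is not the compiler's exact rule.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): registers per thread that would let
// `minBlocks` blocks of `maxThreads` threads share one SM's register file.
// Ignores allocation granularity, so treat the result as an estimate only.
static int regBudgetPerThread(int regsPerSM, int maxThreads, int minBlocks)
{
    return regsPerSM / (maxThreads * minBlocks);
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no usable CUDA device found\n");
        return 1;
    }

    // Candidate (maxThreadsPerBlock, minBlocksPerMultiprocessor) pairs.
    const int candidates[][2] = { {256, 2}, {512, 3}, {768, 3} };

    for (const auto &c : candidates) {
        int resident = c[0] * c[1];  // threads the bounds ask to keep resident per SM
        std::printf("(%d, %d): %d resident threads requested (SM limit %d), "
                    "about %d registers per thread\n",
                    c[0], c[1], resident, prop.maxThreadsPerMultiProcessor,
                    regBudgetPerThread(prop.regsPerMultiprocessor, c[0], c[1]));
    }
    return 0;
}
```

Assuming 65536 registers per SM (the program reads the real value from the device), (256, 2) leaves about 128 registers per thread while (512, 3) leaves about 42, so the larger bound asks the compiler to fit the kernel into far fewer registers. It is also worth checking whether the requested resident thread count exceeds the SM limit that the program prints.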

I am running a basic matrix multiply: 7000*7000 times 7000*7000 => 7000*7000.

I have two questions: 1. How do I choose optimal parameters? 2. Is adding __launch_bounds__ always better than not adding it?

Thank you!!!

These days, it is rarely necessary to use __launch_bounds__() at all. If you are fairly new to CUDA, ignore it completely. This is an expert-level feature and its use can easily be counterproductive in the hands of a novice.

In the past, roughly 10+ years ago, the hardware capabilities of GPUs were more limited, and __launch_bounds__() was fairly frequently needed to squeeze performance out of CUDA code.
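If you do want to experiment, one way to see whether a bound helps or hurts is to compare what the compiler generated with and without it. The sketch below is only an illustration: the kernel names, the naive matrix multiply body, and the (256, 2) bound are placeholders I picked, not a recommendation. It uses cudaFuncGetAttributes() to print the register count and local memory size of each variant. If the bounded variant shows local memory that the plain one does not, the bound has probably forced register spills, which is one way it can become counterproductive.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive C = A * B for square n x n row-major matrices (illustrative body).
__device__ __forceinline__ void matmulBody(const float *a, const float *b,
                                           float *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float sum = 0.0f;
    for (int k = 0; k < n; ++k)
        sum += a[row * n + k] * b[k * n + col];
    c[row * n + col] = sum;
}

// Variant without launch bounds: the compiler uses its default heuristics.
__global__ void matmulPlain(const float *a, const float *b, float *c, int n)
{
    matmulBody(a, b, c, n);
}

// Variant with launch bounds; (256, 2) is just an example choice.
__global__ void __launch_bounds__(256, 2)
matmulBounded(const float *a, const float *b, float *c, int n)
{
    matmulBody(a, b, c, n);
}

// Print the compiler-reported resource usage of one kernel.
static bool report(const char *name, const void *kernel)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess) {
        std::printf("%s: could not query attributes\n", name);
        return false;
    }
    std::printf("%s: %d registers/thread, %zu bytes local memory/thread, "
                "max %d threads/block\n",
                name, attr.numRegs, attr.localSizeBytes,
                attr.maxThreadsPerBlock);
    return true;
}

int main()
{
    bool ok = report("matmulPlain", (const void *)matmulPlain);
    ok = report("matmulBounded", (const void *)matmulBounded) && ok;
    return ok ? 0 : 1;
}
```

The attributes only tell you what the compiler did. Whether it is actually faster still has to be measured by timing both variants on your own problem size.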
