I am using a GTX 1650, for which __CUDA_ARCH__ should be larger than 200.
Following this link, I think I should choose __launch_bounds__ as (256*3, 3):
#define THREADS_PER_BLOCK 256
#if __CUDA_ARCH__ >= 200
#define MY_KERNEL_MAX_THREADS (2 * THREADS_PER_BLOCK)
#define MY_KERNEL_MIN_BLOCKS 3
#else
#define MY_KERNEL_MAX_THREADS THREADS_PER_BLOCK
#define MY_KERNEL_MIN_BLOCKS 2
#endif
// Device code
__global__ void
__launch_bounds__(MY_KERNEL_MAX_THREADS, MY_KERNEL_MIN_BLOCKS)
MyKernel(…)
{
…
}
When I compare it with (256, 2), I find that this is faster!?
I am running a basic matrix multiply: 7000x7000 times 7000x7000 => 7000x7000.
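For reference, a minimal sketch of how such a comparison could be set up. Everything here is an assumption for illustration, not the code actually benchmarked: the naive `matmul` kernel, the helper `time_variant`, the reduced size `n = 1024`, and the two (max threads, min blocks) pairs (512, 3) and (256, 2). The launch bounds are passed as template parameters so the same kernel body is compiled once per setting, each variant is timed with CUDA events, and the occupancy API reports how many blocks per SM the compiled variant can keep resident.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256

// Naive C = A * B for square n x n matrices, one thread per output element.
// The launch bounds are template parameters (hypothetical setup), so each
// instantiation is compiled under its own register budget.
template <int MAX_THREADS, int MIN_BLOCKS>
__global__ void __launch_bounds__(MAX_THREADS, MIN_BLOCKS)
matmul(const float* A, const float* B, float* C, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n * n) return;
    int row = idx / n, col = idx % n;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += A[row * n + k] * B[k * n + col];
    C[idx] = acc;
}

// Times one launch of the given variant in milliseconds (after a warm-up run)
// and prints how many blocks per SM the occupancy API reports for it.
template <int MAX_THREADS, int MIN_BLOCKS>
float time_variant(const float* A, const float* B, float* C, int n)
{
    int blocks = (n * n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    int resident = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &resident, matmul<MAX_THREADS, MIN_BLOCKS>, THREADS_PER_BLOCK, 0);
    printf("(%d, %d): %d resident blocks per SM\n",
           MAX_THREADS, MIN_BLOCKS, resident);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup costs are not measured.
    matmul<MAX_THREADS, MIN_BLOCKS><<<blocks, THREADS_PER_BLOCK>>>(A, B, C, n);

    cudaEventRecord(start);
    matmul<MAX_THREADS, MIN_BLOCKS><<<blocks, THREADS_PER_BLOCK>>>(A, B, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1024;  // assumed size, smaller than 7000 to keep the run short
    const size_t bytes = sizeof(float) * n * n;

    float *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    // The values do not matter for timing, so the inputs are simply zeroed.
    cudaMemset(A, 0, bytes);
    cudaMemset(B, 0, bytes);

    float t1 = time_variant<2 * THREADS_PER_BLOCK, 3>(A, B, C, n);
    float t2 = time_variant<THREADS_PER_BLOCK, 2>(A, B, C, n);
    printf("(512, 3): %.3f ms\n(256, 2): %.3f ms\n", t1, t2);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return err == cudaSuccess ? 0 : 1;
}
```

Both variants are launched with 256 threads per block, which stays within either bound. Compiling with `nvcc -Xptxas -v` additionally prints the register count and any spilling per variant, which is usually where the difference between two launch-bound settings shows up.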
I have two questions:
1. How should I choose the optimal parameters?
2. Is adding __launch_bounds__ always better than not adding it?
Thank you!!!