I’m quite surprised to see a kernel take much longer to execute when allocating more shared memory, even if the memory is never used. Can anyone explain what’s going on? Is this expected and is it a hardware limitation?
Shared memory is a limited resource. Multiple blocks can run concurrently on the same SM, but only if the SM has enough resources for all of them (shared memory, registers, and total warp count included). By requesting more shared memory per block, you reduce the number of blocks that can be resident on an SM at once, so your occupancy drops. With too few resident warps, the SM cannot hide latency and its compute resources sit idle. Note that the shared memory is reserved at launch whether or not the kernel ever touches it, which is why even unused allocations slow things down.
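You can see this effect directly with the CUDA occupancy API. The sketch below (an illustration, not your kernel; the trivial `dummyKernel` and the 256-thread block size are assumptions) queries how many blocks can be resident per SM as the dynamic shared-memory request grows:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel that never touches shared memory; the dynamic
// shared-memory request is made entirely at launch time.
__global__ void dummyKernel() { }

int main() {
    // Ask the runtime how many 256-thread blocks fit on one SM for
    // increasing dynamic shared-memory requests per block.
    for (size_t smem = 0; smem <= 48 * 1024; smem += 16 * 1024) {
        int numBlocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks, dummyKernel, /*blockSize=*/256, smem);
        printf("shared mem %5zu B -> %d resident blocks per SM\n",
               smem, numBlocks);
    }
    return 0;
}
```

On most GPUs you should see the resident-block count fall as the per-block request rises, matching the slowdown you observed even though the memory is never read or written.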
Tuning the block size and shared-memory request is part of designing your kernel's launch configuration. The CUDA programming guide discusses this at length, and NVIDIA provides an occupancy calculator (and an occupancy API) to help you find a good balance.