I’m curious what happens to shared memory in a kernel where the total shared memory allocated across the grid exceeds what is available on the device. What happens to it after you do a cg::sync(grid)? I’m guessing the only way to resolve this is either some under-the-hood transfer of shared memory to global memory, or a restriction on the amount of shared memory allocated across the whole grid.
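A (proper) cooperative grid launch essentially implies that all threadblocks are resident on SMs at the same time, which means the shared memory of every block is instantiated on an SM for the whole lifetime of the kernel. So it is the second of your two guesses: nothing is spilled to global memory, and the grid size is restricted instead. If the grid has more blocks than can be co-resident on the device (shared memory per block is one of the resources that limits this), cudaLaunchCooperativeKernel fails with cudaErrorCooperativeLaunchTooLarge rather than running the kernel.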
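To make the restriction concrete, here is a minimal sketch of how the co-residency limit can be queried and respected. The kernel name `coopKernel`, the block size of 256 threads, and the per-block shared memory size are made up for illustration; only the runtime calls are real API. It assumes a device that reports `cooperativeLaunch` support.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each block fills its dynamic shared memory,
// then the whole grid synchronizes before the data is read back.
__global__ void coopKernel(float *out)
{
    extern __shared__ float tile[];
    cg::grid_group grid = cg::this_grid();

    tile[threadIdx.x] = static_cast<float>(blockIdx.x);
    grid.sync();  // every block is still resident, so tile[] is intact here

    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    const int dev = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    if (!prop.cooperativeLaunch) {
        printf("device does not support cooperative launch\n");
        return 0;
    }

    const int threads = 256;                           // illustrative choice
    const size_t smemBytes = threads * sizeof(float);  // dynamic shared memory per block

    // How many blocks of this kernel fit on one SM, given its register
    // and shared memory usage at this block size?
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, coopKernel,
                                                  threads, smemBytes);
    const int maxBlocks = blocksPerSm * prop.multiProcessorCount;
    printf("max co-resident blocks: %d\n", maxBlocks);

    float *out = nullptr;
    cudaMalloc(&out, static_cast<size_t>(maxBlocks + 1) * threads * sizeof(float));
    void *args[] = { &out };

    // A grid that fits: all blocks (and their shared memory) are resident at once.
    cudaError_t err = cudaLaunchCooperativeKernel((void *)coopKernel,
                                                  dim3(maxBlocks), dim3(threads),
                                                  args, smemBytes, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("launch with %d blocks: %s\n", maxBlocks, cudaGetErrorString(err));

    // One block too many: the launch should be rejected up front.
    err = cudaLaunchCooperativeKernel((void *)coopKernel,
                                      dim3(maxBlocks + 1), dim3(threads),
                                      args, smemBytes, 0);
    printf("launch with %d blocks: %s\n", maxBlocks + 1, cudaGetErrorString(err));
    cudaGetLastError();  // clear the error state left by the rejected launch

    cudaFree(out);
    return 0;
}
```

Raising `smemBytes` in this sketch should lower `blocksPerSm`, and with it the largest grid that a cooperative launch accepts, which is how the shared memory limit shows up in practice. Depending on the CUDA version, grid synchronization may also require compiling with `-rdc=true`.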