In current NVIDIA GPU architectures and drivers, the local memory (LMEM) allocation is context-wide and is calculated as MAX_THREADS_PER_SM x SMs x LOCAL_MEM_PER_THREAD. The allocation is shared by all grid launches in the context. If a grid launch requires more local memory than is currently allocated, the LMEM allocation has to be resized, and the resize can result in a synchronization.
This is briefly discussed in the documentation for cuCtxCreate (under the CU_CTX_LMEM_RESIZE_TO_MAX flag) and for cuCtxSetLimit.
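
To get a feel for the sizes involved, the formula above can be worked through on the host. This is only a rough sketch: the function names are made up for illustration, the inputs are hypothetical values rather than numbers queried from a real device, and the driver's actual accounting may differ (rounding, alignment, and so on).

```cpp
#include <cassert>
#include <cstdint>

// Context-wide local memory footprint, following the formula
// MAX_THREADS_PER_SM x SMs x LOCAL_MEM_PER_THREAD.
// All three inputs are assumptions supplied by the caller; real values
// depend on the GPU and on the kernel's per-thread local memory usage.
std::uint64_t lmem_allocation_bytes(std::uint64_t max_threads_per_sm,
                                    std::uint64_t sm_count,
                                    std::uint64_t local_mem_per_thread) {
    assert(max_threads_per_sm > 0 && sm_count > 0);
    // 64-bit arithmetic so large per-thread sizes do not overflow.
    return max_threads_per_sm * sm_count * local_mem_per_thread;
}

// A launch triggers a resize only when it needs more than the
// allocation the context currently holds.
bool launch_needs_resize(std::uint64_t current_allocation_bytes,
                         std::uint64_t required_allocation_bytes) {
    return required_allocation_bytes > current_allocation_bytes;
}
```

For example, with a hypothetical device of 80 SMs and 2048 resident threads per SM, a kernel using 1024 bytes of local memory per thread gives 2048 x 80 x 1024 = 167,772,160 bytes (160 MiB). A later launch that needs only 512 bytes per thread fits in that allocation and would not require a resize, while the reverse order would.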