I am somewhat confused about the CUDA architecture. If a thread/warp/block needs a slow access to device memory, the latency can be masked by executing threads from another block. This seems to require pre-emption of the current warp. But what about the shared memory? Is it part of the block's context, and is it pre-empted at the same time?
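
To make the question concrete, here is a minimal sketch of the kind of kernel I have in mind (the kernel name and the tile size are just for illustration, not from any particular codebase):

```cuda
// Hypothetical kernel: each block stages a tile of the input in
// __shared__ memory, which is allocated per block on the SM.
__global__ void tileCopy(const float *in, float *out, int n)
{
    __shared__ float tile[256];          // per-block shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];       // slow load from device memory

    __syncthreads();                     // all warps of the block wait here

    if (i < n)
        out[i] = tile[threadIdx.x];      // fast read from shared memory
}
```

While a warp of this block is stalled on the load from `in`, and the SM switches to running warps from a different block, what happens to `tile[]`, which belongs to this block? Is it saved and restored somewhere, or does it stay resident on the SM?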