__shared__ and __device__ memory specifier clarification

Hi everyone. I need some clarification on using the __device__ and __shared__ memory specifiers together. If a variable is declared both __shared__ and __device__, are its contents copied to global memory implicitly by the device once the shared memory ceases to exist, i.e., once the threads have completed execution? Or does the programmer have to write their own routine to save and restore the shared memory?

I ask because I want to use shared memory accesses in a function I am writing, but I also need the data to persist for the duration of the application. My assumption was that the device would initialize the shared memory, run the function, and on return implicitly save the shared memory to global memory. Once initialized, the data would not need to be reallocated; it would simply be reloaded from global memory into the on-chip shared memory of the thread block the next time the kernel is called. I would also expect this to be more efficient than requiring the programmer to emulate the process in software.

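No. Shared memory has the lifetime of a thread block; its values are not preserved between blocks, let alone between kernel launches. Combining __device__ with __shared__ does not change this: the variable still resides in shared memory. It would even be difficult to define useful semantics for such persistence: which block would you expect to receive the shared memory contents of the block that just finished?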

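So you need to copy the data from global (device) memory into shared memory yourself at the start of the block, and write it back to global memory before the block finishes if the changes should persist. If you are concerned about this being inefficient, do more work per block to amortize the cost of the transfers.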
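To make the manual approach concrete, here is a minimal sketch of the load, compute, write-back pattern. It is only an illustration under assumptions: the names g_state, s_state, step and run_demo, the sizes TILE and NUM_BLOCKS, and the toy update rule are all hypothetical and not taken from your code. The persistent copy lives in a __device__ global array, and each block stages its own slice through __shared__ memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 256;       // elements per block (hypothetical size)
constexpr int NUM_BLOCKS = 4;   // blocks per launch (hypothetical size)

// Persistent state: global memory, lives for the duration of the application.
__device__ float g_state[NUM_BLOCKS * TILE];

// Launch with exactly NUM_BLOCKS blocks of TILE threads.
__global__ void step(int iterations)
{
    // Scratch copy: shared memory, lives only as long as this block.
    __shared__ float s_state[TILE];
    const int gid = blockIdx.x * TILE + threadIdx.x;

    // 1. Load this block's slice from global into shared memory.
    s_state[threadIdx.x] = g_state[gid];
    __syncthreads();

    // 2. Do as much work as possible on the shared copy to amortize the
    //    two transfers. Toy update: neighbour's value plus one.
    for (int i = 0; i < iterations; ++i) {
        float next = s_state[(threadIdx.x + 1) % TILE] + 1.0f;
        __syncthreads();            // all reads done before any write
        s_state[threadIdx.x] = next;
        __syncthreads();            // all writes done before the next read
    }

    // 3. Write the result back so the next launch can pick it up.
    g_state[gid] = s_state[threadIdx.x];
}

// Host helper: zero the state, launch the kernel several times, and check
// that the values carried over between launches.
bool run_demo(int launches, int iterations)
{
    static float host[NUM_BLOCKS * TILE];
    for (int i = 0; i < NUM_BLOCKS * TILE; ++i) host[i] = 0.0f;
    if (cudaMemcpyToSymbol(g_state, host, sizeof(host)) != cudaSuccess) return false;

    for (int l = 0; l < launches; ++l) step<<<NUM_BLOCKS, TILE>>>(iterations);
    if (cudaDeviceSynchronize() != cudaSuccess) return false;

    if (cudaMemcpyFromSymbol(host, g_state, sizeof(host)) != cudaSuccess) return false;
    const float expected = static_cast<float>(launches * iterations);
    for (int i = 0; i < NUM_BLOCKS * TILE; ++i)
        if (host[i] != expected) return false;
    return true;
}

int main()
{
    std::printf("state persisted across launches: %s\n",
                run_demo(3, 10) ? "yes" : "no");
    return 0;
}
```

The point of the sketch is that the two global memory accesses per thread (one load, one store) are paid once per launch, while the loop in the middle runs entirely out of shared memory. The more iterations each block performs between load and store, the less the transfers matter.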