What happens to shared memory on block preemption?

Dear all,
I’m fairly new to GPGPU so take it easy on me… thanks.

I have an input buffer i

  • in global memory, of which I load a subset i[0~63] into shared memory for processing by a block of 64 threads. Once this sub-array is loaded into shared memory, can the current block be preempted so that another block can execute?
    If it is not allowed, is it then safe to assume that at any given time only one block executes its kernel without being interrupted, and that shared memory only needs to accommodate the needs of the block’s 64 threads?
    If it is allowed, what happens to the content of the shared memory used before preemption?

    “Programming Massively Parallel Processors” says that “threads are assigned to execution resources on a block-by-block basis. In the current generation of hardware, the execution resources are organized into streaming multiprocessors (SMs)”, which seems to mean that blocks aren’t preempted once assigned for execution onto an SM, but I’m not sure about it.
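In code, the pattern I have in mind is roughly this (a sketch; the data type and the doubling step are just placeholders for my real processing):

```cuda
__global__ void kernel(const float *i_global, float *out, int n)
{
    __shared__ float tile[64];               // one element per thread in the block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        tile[threadIdx.x] = i_global[idx];   // stage the subset in shared memory
    __syncthreads();                         // make all 64 loads visible to the whole block

    // ... process tile[] here ...
    if (idx < n)
        out[idx] = tile[threadIdx.x] * 2.0f; // placeholder processing
}
```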


  • pre-emption is a loaded term, and probably someone will come along and correct me if I use it. But it is accurate to say that once a threadblock gets launched on a particular SM, it will remain on that SM (only) until it is finished executing. Multiple threadblocks can be “resident” (launched) on a single SM. If threadblocks use shared memory, then one limiter on the number of threadblocks that can be “resident” on that SM is the amount of available shared memory divided by the shared memory used per threadblock.
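That shared-memory limiter can be queried directly with the CUDA occupancy API. A sketch, assuming a hypothetical kernel myKernel launched with 64-thread blocks and 64 floats of dynamic shared memory:

```cuda
#include <cstdio>

__global__ void myKernel(float *data)
{
    extern __shared__ float tile[];          // dynamic shared memory
    tile[threadIdx.x] = data[threadIdx.x];
}

int main()
{
    int numBlocks = 0;
    // How many blocks of myKernel can be resident on one SM,
    // given this block size and dynamic shared-memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, myKernel,
        /*blockSize=*/64,
        /*dynamicSMemSize=*/64 * sizeof(float));
    printf("resident blocks per SM: %d\n", numBlocks);
    return 0;
}
```

Note the returned number also reflects the other occupancy limiters (registers, warp slots), not just shared memory.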

    So even if a threadblock is not currently “executing”, i.e. occupying specific SM execution resources (perhaps, for example, because all warps in that threadblock are waiting on global memory transactions), whatever “footprint” it has in shared memory will remain there until the block is finished executing.

    txbob is right - on compute capabilities < 3.2 blocks are never preempted.
    On compute capability 3.2+ the only two instances in which blocks can be preempted are device-side kernel launch (dynamic parallelism) and single-GPU debugging. In both cases the shared memory state is taken care of by the driver, so from the view of the block, shared memory does not change. (The ID of the SMX/SMM the block is running on may change, but that is only exposed through (inline) PTX, not in the C or C++ interfaces.)
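For completeness, this is how that SM ID can be read through inline PTX; %smid is the special register referred to above:

```cuda
__device__ unsigned int smid(void)
{
    unsigned int id;
    // %smid is only exposed at the PTX level; its value may change
    // if the block is preempted and resumed on a different SM.
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}
```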

    Note that “Programming Massively Parallel Processors” predates compute capability 3.2+ devices, so “the current generation of hardware” excludes those cases.
    Nevertheless, it is part of the CUDA programming model that shared memory contents are preserved for the lifetime of a block, so even on later devices you do not need to worry about it.

    I see, thanks for the clarification