Is resident space always fully available to the kernel?

The CUDA documentation/hardware specifications mention:

MaxResidentThreads and MaxResidentBlocks.

Is this space always fully available to a kernel?

Or is it possible that this space is divided in the case of multiple (concurrent) kernel launches/executions?

(Could some of this space also be used up by the graphics system, OpenGL/DirectX?)

If the answer is yes, then perhaps CUDA kernel code could be changed to use this information to compute a “resident index” for the data item to be processed, hopefully making the index calculations more efficient (reducing the number of multiplications, additions and so on per work-item index calculation). In exchange, the kernel would run a while loop that processes a batch of items and then increments this resident index to process the next batch of work items, as the resident data is processed and new resident space becomes available to hold/load/work on the new data.

(The launch parameters could be filled with resident space information, or perhaps the ISA has special instructions to retrieve this information, or it could be passed via C parameters too; the work size could then be passed via C parameters instead of via the launch parameters. The launch parameters themselves would be configured to process the resident space as efficiently as possible, which would mean an exact mirror of these numbers in some thread/block/grid tuple.)
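A minimal host-side sketch of this idea, assuming a hypothetical kernel named process (it exists here only so the occupancy API has something to query): cudaOccupancyMaxPotentialBlockSize fills the launch parameters with a block/grid pair that mirrors the resident space, and the work size is left to be passed as a C parameter.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel used only to size the launch here; its body (the
// "resident index" loop) is sketched further below.
__global__ void process(float *data, int n) { (void)data; (void)n; }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Ask the runtime for the block size that maximizes occupancy for this
    // kernel, plus the minimum grid size needed to fill the device at that occupancy.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, process, 0, 0);

    // minGridSize equals (resident blocks per SM) * (number of SMs), i.e. an
    // "exact mirror" of the resident space as suggested above.
    printf("SMs: %d, block size: %d, grid size: %d\n",
           prop.multiProcessorCount, blockSize, minGridSize);

    // The work size would travel as a plain C parameter, e.g.:
    //   process<<<minGridSize, blockSize>>>(d_data, n);
    return 0;
}
```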

One extra little piece of information is probably also required: “MaxResidentProcessors”, i.e. the number of multiprocessors (SMs).

So once this information is known, the kernel can assume that the resident space is the virtual computation power that is actually available:

MaxResidentThreads * MaxResidentBlocks * MaxResidentProcessors.

Dividing the workload across this resident space should then be optimal.
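For reference, the runtime API already reports the per-SM counterparts of these three numbers directly (a minimal sketch; the maxBlocksPerMultiProcessor field assumes a reasonably recent CUDA toolkit). Note that maxThreadsPerMultiProcessor already counts the threads of all resident blocks on an SM, so the total resident thread capacity of the device is typically taken as maxThreadsPerMultiProcessor * multiProcessorCount:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Per-SM resident limits and the SM count ("MaxResidentProcessors").
    printf("Max resident threads per SM : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident blocks  per SM : %d\n", prop.maxBlocksPerMultiProcessor);
    printf("Multiprocessors (SMs)       : %d\n", prop.multiProcessorCount);

    // Total resident thread capacity of the device.
    printf("Total resident threads      : %d\n",
           prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount);
    return 0;
}
```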

The kernel itself can then advance its work index based on these 3 pieces of information in, hopefully, a more efficient manner than computing an index N directly from how the current system works.
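This is essentially what the well-known grid-stride loop pattern does: each thread computes its index once and then repeatedly advances it by the total number of threads in flight. A minimal sketch, filling in the body of the hypothetical process kernel from the host-side sketch above:

```cpp
// Body of the hypothetical process kernel from the host-side sketch above.
__global__ void process(float *data, int n)
{
    // "Resident index": computed once per thread from the launch configuration.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Total number of threads in flight; with the launch sized to mirror the
    // resident space, this is the device's resident thread capacity.
    int stride = gridDim.x * blockDim.x;

    // Process one item, then advance the resident index by the stride, until
    // the whole work size n (passed as a C parameter) has been covered.
    while (idx < n) {
        data[idx] *= 2.0f;   // placeholder per-item work
        idx += stride;
    }
}
```

With the launch sized as in the host-side sketch, each resident thread simply loops over successive batches instead of the program relying on one launched thread per data item.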

(For multi-GPU cards it could even be extended to MaxResidentGPU, and then for data centers to MaxResidentDataCenter! :) For maximum distribution efficiency? ;))

MaxResidentThreads and MaxResidentBlocks apply to all threads or blocks which may be occupying an SM. It does not matter which kernel they come from. The limit across all kernels (i.e. all users of that SM) is the stated limit.

Therefore, if a kernel uses 512 threads on a particular SM and another concurrent kernel can begin executing, the 2nd concurrent kernel may be able to use up to 1536 of the 2048 max resident threads on that SM (since 512 slots are already occupied).

This is one of several limits which may determine achievable occupancy.
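For completeness, the occupancy calculator API can report how many of those resident slots a particular kernel configuration can actually claim on one SM. A small sketch, reusing the hypothetical process kernel from above with a 512-thread block size:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel, as in the sketches above.
__global__ void process(float *data, int n) { (void)data; (void)n; }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 512;
    int blocksPerSM = 0;

    // How many blocks of `process`, at 512 threads each and 0 bytes of dynamic
    // shared memory, can be resident on one SM at the same time.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, process, blockSize, 0);

    // Theoretical occupancy = resident threads claimed / max resident threads per SM.
    float occupancy = (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("%d blocks of %d threads per SM -> %.0f%% theoretical occupancy\n",
           blocksPerSM, blockSize, occupancy * 100.0f);
    return 0;
}
```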

If the graphics system (OpenGL/DirectX) requires the use of compute resources, it will, under current CUDA behavior, require a context switch. The context switch currently involves draining the GPU of all currently executing kernels, such that the SMs are “empty”, just as if you were going to begin executing kernels from a separate CPU process (which also involves a context switch). This “draining” means that the kernels must finish, i.e. terminate normally.

Compute preemption is probably coming, so this will likely change in the future. Compute preemption simply means that it's not necessary to drain the SMs during a context switch; instead, the compute processes can be suspended in some fashion, so that the other context can use the compute resources.