The CUDA documentation lists two hardware limits:
MaxResidentThreads and MaxResidentBlocks.
Is this space always fully available to a kernel?
Or can it be divided among multiple concurrent kernel launches/executions?
(Could some of it also be used up by the graphics system, i.e. OpenGL/DirectX?)
If the answer is yes, then perhaps CUDA kernels could use this information to compute a “resident index” for the data item to be processed. The hope is cheaper index calculations (fewer multiplications and additions per work item) in exchange for a while loop: each thread processes an item, then increments its resident index to reach the next batch of work items, as the resident data is finished and resident space becomes available to hold/load/work with new data.
(The resident space information could be filled into the launch parameters, retrieved through special ISA instructions if those exist, or passed via C parameters. The work size could be passed via C parameters instead of the launch parameters. The launch parameters themselves would then be configured to cover the resident space as efficiently as possible, which would mean mirroring these numbers exactly in some thread/block/grid tuple.)
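To make the idea concrete, here is a minimal sketch of what I mean, written as what is usually called a grid-stride loop. The kernel name `scale` and the helpers `residentGridSize` and `runScale` are made up for illustration, and the block size of 256 is an untuned assumption. The grid is sized with `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which as far as I understand reports a per-kernel upper limit and does not account for other kernels running at the same time.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread computes its start index once, then advances
// by the total number of launched threads until the whole array is covered.
__global__ void scale(float *data, size_t n, float factor)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= factor;
}

// Hypothetical helper: number of blocks of `scale` that can be resident on
// the current device at once (SM count times resident blocks per SM).
int residentGridSize(int blockSize)
{
    int device = 0, smCount = 0, blocksPerSm = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, scale, blockSize, 0);
    return smCount * blocksPerSm;
}

// Hypothetical helper: doubles n ones on the device and checks the result.
// The work size n travels as a kernel argument, not as a launch parameter.
bool runScale(size_t n, int blockSize)
{
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int gridSize = residentGridSize(blockSize);
    if (gridSize < 1) { cudaFree(dev); return false; }
    scale<<<gridSize, blockSize>>>(dev, n, 2.0f);

    // The copy back waits for the kernel and also reports launch errors.
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (float v : host)
        if (v != 2.0f) return false;
    return true;
}

int main()
{
    const int blockSize = 256;  // assumed block size, not tuned
    std::printf("grid size: %d blocks of %d threads\n", residentGridSize(blockSize), blockSize);
    std::printf("%s\n", runScale(1 << 20, blockSize) ? "ok" : "mismatch");
    return 0;
}
```

Inside the loop the only index arithmetic per item is one addition of `stride`, which is the saving I am hoping for. Whether it is actually faster than one thread per item would have to be measured.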
One little bit of extra information required is probably: “MaxResidentProcessors”.
So once this information is known the kernel can assume that the resident space is the virtual computation power that is actually available:
MaxResidentThreads * MaxResidentBlocks * MaxResidentProcessors.
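For reference, a small sketch of how these numbers can be read at run time through `cudaGetDeviceProperties`. The helper `residentThreadLimit` is a made-up name, and the `maxBlocksPerMultiProcessor` field assumes CUDA 11 or later. One caveat, as far as I understand the documentation: the thread limit and the block limit are both per multiprocessor and act as separate caps, so the sketch computes the device-wide thread bound as threads per multiprocessor times the multiprocessor count, without multiplying in the block limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on simultaneously resident threads for
// the whole device (per-SM thread limit times number of SMs).
long long residentThreadLimit(const cudaDeviceProp &p)
{
    return (long long)p.maxThreadsPerMultiProcessor * p.multiProcessorCount;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    std::printf("multiprocessors:             %d\n", prop.multiProcessorCount);
    std::printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    // maxBlocksPerMultiProcessor assumes CUDA 11 or later.
    std::printf("max resident blocks per SM:  %d\n", prop.maxBlocksPerMultiProcessor);
    std::printf("resident thread upper bound: %lld\n", residentThreadLimit(prop));
    return 0;
}
```

A particular kernel may fit fewer threads than this bound, since registers and shared memory per block also limit how many blocks stay resident.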
Dividing the workload across this resident space should then be optimal.
The kernel itself can then advance its work index based on these 3 pieces of information, hopefully more efficiently than computing an index N directly the way the current system works.
(For multi-GPU cards it could even be extended to MaxResidentGPU, and then for data centers MaxResidentDataCenter! :) For maximum distribution efficiency? ;))