No answer, so I guess there is nothing immediately available.
The neatest solution for fixed size global memory with a lifetime of the block might be to make auto global work (“bmem”?) - then it would be nicely transparent. This is a bug of the NYI variety:
__global__ void
kernel2(float* a, float** c)
{
__device__ float* b[32];
b[threadIdx.x] = a;
*c = b[threadIdx.x + 1];
}
Compiles to:
.entry kernel2
{
.reg .u32 $r1;
.reg .u64 $rd1,$rd2,$rd3,$rd4,$rd5,$rd6,$rd7,$rd8,$rd9,
$rd10;
.param .u64 __cudaparm_a;
.param .u64 __cudaparm_c;
.shared .align 8 .b8 b$0[256];
.loc 12 12 0
$LBB1_kernel2:
mov.u64 $rd1, b$0; //
.loc 12 16 0
cvt.u64.u16 $rd2, %tid.x; //
ld.param.u64 $rd3, [__cudaparm_a]; // id:11 __cudaparm_a+0x0
mul.lo.u64 $rd4, $rd2, 8; //
add.u64 $rd5, $rd1, $rd4; //
st.shared.u64 [$rd5+0], $rd3; // id:12 b$0+0x0
.loc 12 17 0
cvt.u32.u64 $r1, $rd2; //
cvt.u64.u32 $rd6, $r1; //
mul.lo.u64 $rd7, $rd6, 8; //
add.u64 $rd8, $rd1, $rd7; //
ld.shared.u64 $rd9, [$rd8+8]; // id:13 b$0+0x0
ld.param.u64 $rd10, [__cudaparm_c]; // id:14 __cudaparm_c+0x0
st.global.u64 [$rd10+0], $rd9; // id:15
exit; //
} // kernel2
You will notice that the device auto data has been put in shared. Not much use in expanding shared memory!
There is no coalescing problem with putting auto globals into a per block area. This also solves the local problem as we can build our own structs indexed at the bottom level by logical tid. Just need align(128) to work for warp wide coalescing of 32 bit items or align(256) for 64 bit items. The current local is really only useful for plain registers where access is fully coalesced.
Of course once auto globals work then they need overlaying just like auto shared while there is no stack.
A runningBlockId() and a way of calculating its maximum are still required for runtime global memory allocation on the device on a per block basis if one is running significantly more than the concurrent block capacity of the device.
An even more general solution that I like is to have a hardware base register for each block and map the global memory at that address into the address space again at a fixed address (like shared should be) then we launch a kernel with another parameter for the amount of per block device memory to allocate (extern block?), above what appears in auto decls, just like shared. There would be a limit based on how much address space was dedicated, however it could be a reasonable size like 1Mb. This could be emulated by the compiler/loader/exec without the hardware base register.
This space is where the stack should be eventually, filling from the top of the extern area specified in the kernel call. Current hardware coalescing would need to be improved not to take such a hit from misalignment, for a stack to be useful. Such an improvement would reduce the requirement to use align(256).
Perhaps shared memory should just be a cache for per block memory - costs more transistors for the hardware to manage it, rather than putting it back on the programmer, as at present.
Then if the register file is mapped into the main address space as well there can be a tradeoff between registers and real local memory that is addressable. Useful if the register file is increased in future versions of the hardware. More work for the compiler.
Just my thoughts to normalise the architecture and make it more useful for general computing - no doubt I have missed some important problem. :)
Eric