Extending shared & fixing local: physical IDs required

Is there a secret const extern for the current physical block ID?
(unique among running blocks and in range 0-N where N is small)

The reason I ask is to enable efficient extension of a block's shared memory into global memory. Currently all one can use is (blockIdx.y * gridDim.x + blockIdx.x), which can be very large when only a small number of blocks run concurrently. Given that a block is always scheduled in the one place from start to completion, this seems safe. It is not specified up front, but the guide does say later that blocks are run in "pipeline fashion".
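
For concreteness, here is a minimal sketch of the pattern I mean (my own illustration, names are mine, not an NVIDIA API): each block takes a scratch area in global memory indexed by the linear block id, so the host has to allocate one area for every block in the grid even though only a handful are ever resident at once.

// Sketch only (illustrative names): per-block scratch in global memory,
// indexed by the linear block id.  The host must allocate
// gridDim.x * gridDim.y areas of perBlockFloats floats each, even though
// only a few blocks are resident at any one time - hence the wish for a
// small physical block id.
__global__ void scratchKernel(float* scratch, int perBlockFloats)
{
    unsigned linearBlock = blockIdx.y * gridDim.x + blockIdx.x;
    float* myScratch = scratch + (size_t)linearBlock * perBlockFloats;

    // Use myScratch[0 .. perBlockFloats-1] as block-lifetime storage.
    for (int i = threadIdx.x; i < perBlockFloats; i += blockDim.x)
        myScratch[i] = 0.0f;
}

A physical (running) block id in the range 0..N-1 would let the allocation be N areas instead of one per launched block.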

Or can we get at “%physid” through another secret const extern, which is the unique physical identifier for a thread and presumably used to implement local? Would %physid mod blocksize be what is useful?

If we could access %physid then we could implement local stuff that was properly coalesced and get better performance for large arrays and big structs.

If the new cudaGetDeviceProps() also provided number of MPs and max blocks/MP then we could calculate the range for a physical block/thread ID when allocating storage.
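
Something like this host-side calculation is what I have in mind (a sketch, assuming cudaGetDeviceProperties() reports a multiProcessorCount field; the maximum resident blocks per MP is not reported, so I assume a constant):

#include <cuda_runtime.h>

// Assumption: 8 resident blocks per multiprocessor; this is not reported
// by the runtime, which is part of the request.
#define MAX_BLOCKS_PER_MP 8

// Upper bound on the range of a physical/running block id, used to size
// a per-block scratch allocation on the host.
size_t maxConcurrentBlocks(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    return (size_t)prop.multiProcessorCount * MAX_BLOCKS_PER_MP;
}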

Thanks,
Eric

I thought of this too, but be careful: multiple blocks can run at once on one multiprocessor in a time-sharing fashion, so the hardware id might not be as useful as you think.

No answer, so I guess there is nothing immediately available.

The neatest solution for fixed-size global memory with the lifetime of a block might be to make auto global work ("bmem"?) - then it would be nicely transparent. This is a bug of the NYI (not yet implemented) variety:

__global__ void
kernel2(float* a, float** c)
{
    __device__ float* b[32];
    b[threadIdx.x] = a;
    *c = b[threadIdx.x + 1];
}

Compiles to:

        .entry kernel2
        {
        .reg .u32 $r1;
        .reg .u64 $rd1,$rd2,$rd3,$rd4,$rd5,$rd6,$rd7,$rd8,$rd9,$rd10;
        .param .u64 __cudaparm_a;
        .param .u64 __cudaparm_c;
        .shared .align 8 .b8 b$0[256];
        .loc    12      12      0
$LBB1_kernel2:
        mov.u64         $rd1, b$0;              //
        .loc    12      16      0
        cvt.u64.u16     $rd2, %tid.x;           //
        ld.param.u64    $rd3, [__cudaparm_a];   //  id:11 __cudaparm_a+0x0
        mul.lo.u64      $rd4, $rd2, 8;          //
        add.u64         $rd5, $rd1, $rd4;       //
        st.shared.u64   [$rd5+0], $rd3;         //  id:12 b$0+0x0
        .loc    12      17      0
        cvt.u32.u64     $r1, $rd2;              //
        cvt.u64.u32     $rd6, $r1;              //
        mul.lo.u64      $rd7, $rd6, 8;          //
        add.u64         $rd8, $rd1, $rd7;       //
        ld.shared.u64   $rd9, [$rd8+8];         //  id:13 b$0+0x0
        ld.param.u64    $rd10, [__cudaparm_c];  //  id:14 __cudaparm_c+0x0
        st.global.u64   [$rd10+0], $rd9;        //  id:15
        exit;                                   //
        } // kernel2

You will notice that the __device__ auto data has been put in shared memory. Not much use for expanding shared memory!

There is no coalescing problem with putting auto globals into a per-block area. This also solves the local problem, as we can build our own structs indexed at the bottom level by logical tid (see the sketch below). We just need align(128) to work for warp-wide coalescing of 32-bit items, or align(256) for 64-bit items. The current local is really only useful for plain registers, where access is fully coalesced.
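
A sketch of the layout I mean (my own illustration, assuming a per-block region in global memory; all names are mine): each logical per-thread variable becomes an array indexed by the logical tid, so a warp's accesses to the same variable fall on consecutive, aligned addresses and coalesce.

// Hypothetical per-block "auto global" record (illustrative names): each
// per-thread variable becomes an array indexed by the logical thread id,
// so the warp's 32 accesses to one variable are contiguous.  align(128)
// keeps the 32-bit members on warp-wide boundaries.
#define BLOCK_SIZE 256

struct __align__(128) BlockLocals {
    float x[BLOCK_SIZE];   // 32-bit per-thread value, coalesced via threadIdx.x
    int   n[BLOCK_SIZE];
};

__global__ void kernel3(BlockLocals* locals, const float* in, float* out)
{
    // blockIdx stands in for the running/physical block id discussed above.
    unsigned blk = blockIdx.y * gridDim.x + blockIdx.x;
    BlockLocals* mine = &locals[blk];
    unsigned tid = threadIdx.x;
    unsigned gid = blk * blockDim.x + tid;

    mine->x[tid] = in[gid];      // coalesced store
    mine->n[tid] = 0;
    // ... addressable per-thread work on mine->x[tid] ...
    out[gid] = mine->x[tid];     // coalesced load
}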

Of course, once auto globals work they will need overlaying, just like auto shared, while there is no stack.

A runningBlockId() and a way of calculating its maximum are still required for runtime global memory allocation on the device on a per-block basis, if one is running significantly more blocks than the concurrent block capacity of the device.
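
For what it is worth, a runningBlockId() could be emulated in software with a slot pool, assuming global atomics are available (they are not on the current hardware, which is why a built-in would be preferable). The names and slot count below are mine; the pool must be at least as large as the maximum number of concurrently resident blocks, and slotFree must be initialised to all ones on the host.

// Sketch (illustrative names): emulate a small "running block id" with a
// slot pool in global memory.  Requires global atomics, which compute 1.0
// hardware does not have.
#define NSLOTS 128   // assumed >= maximum number of concurrently resident blocks

__device__ int acquireSlot(int* slotFree)
{
    // Spin until a free slot is claimed; at most (residents - 1) slots can
    // be held while we spin, so one is always free or about to be released.
    for (;;)
        for (int i = 0; i < NSLOTS; ++i)
            if (atomicExch(&slotFree[i], 0) == 1)
                return i;
}

__global__ void kernel4(int* slotFree, float* scratch, int perBlockFloats)
{
    __shared__ int slot;
    if (threadIdx.x == 0)
        slot = acquireSlot(slotFree);     // thread 0 claims a slot for the block
    __syncthreads();

    float* myScratch = scratch + (size_t)slot * perBlockFloats;
    // ... block-lifetime storage, reused by whichever block next takes the slot ...

    __syncthreads();
    if (threadIdx.x == 0)
        atomicExch(&slotFree[slot], 1);   // release the slot on block exit
}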

An even more general solution that I like is to have a hardware base register for each block and to map the global memory at that address into the address space again at a fixed address (as shared should be). Then we would launch a kernel with another parameter giving the amount of per-block device memory to allocate (extern block?), above what appears in auto declarations, just like shared. There would be a limit based on how much address space was dedicated, but it could be a reasonable size like 1 MB. This could be emulated by the compiler/loader/exec without the hardware base register.
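
The emulation could look roughly like this on the host side (a sketch; the per-block amount becomes an ordinary kernel argument rather than a real launch parameter, and the names are hypothetical):

#include <cuda_runtime.h>

// Sketch of the software emulation (illustrative names): allocate one region
// per concurrent-block slot and pass the base pointer and per-block size into
// the kernel, in lieu of a true per-launch "extern block" parameter and a
// hardware base register.
void launchWithBlockMem(dim3 grid, dim3 block, size_t perBlockBytes,
                        size_t concurrentSlots)
{
    char* blockMem = 0;
    cudaMalloc((void**)&blockMem, concurrentSlots * perBlockBytes);
    // myKernel<<<grid, block>>>(blockMem, perBlockBytes /*, ... */);
    // Inside the kernel each block offsets by its slot id * perBlockBytes.
    cudaFree(blockMem);
}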

This space is where the stack should eventually go, filling from the top of the extern area specified in the kernel call. For a stack to be useful, the current hardware coalescing would need to be improved so that it does not take such a hit from misalignment. Such an improvement would also reduce the need to use align(256).

Perhaps shared memory should just be a cache for per-block memory - it costs more transistors to have the hardware manage it, rather than putting that burden back on the programmer as at present.

Then, if the register file is mapped into the main address space as well, there can be a tradeoff between registers and real, addressable local memory. This would be useful if the register file is enlarged in future versions of the hardware, at the cost of more work for the compiler.

Just my thoughts on normalising the architecture and making it more useful for general computing - no doubt I have missed some important problem. :)

Eric

osiris, did I understand you right here? You want a piece of global memory automatically allocated for each running block, reused by the next block when the former block finishes, to save on per-block temporary device memory?

If I understood you correctly, local memory is what you are looking for. It was dropped from the manual a while ago, but still works in 0.9. See this older post.

Peter

Yes, your description is correct; however local (now undefined in 1.0) is per thread, NOT per CTA (or block) … I did post in the topic you referred to!

Now local still exists for the compiler to put anything addressable declared auto into, but we cannot declare our own. It is not very useful anyhow.

Nvidia seem to have forgotten this type of memory - there is nothing in the PTX spec, only shared; however, there need to be 2 types of shared: on-chip and off-chip. If there is only 1, then our current shared is a cache for the other (which could be a sensible future development - see above - line size = archz * warpz?).

I notice the 1.0 guide now says __device__ may only be used at global scope - sidestepping the above bug? Still need something.

Eric

Hi Eric,

ah, OK, I missed the "per block" part.

Yes, local does not work in 1.0 anymore. It is only used for the "auto local" induced by register spill. :(

Peter