Why are grid dimensions limited to 65535x65535x1, and not 65536x65536x1?
Because the largest number representable by a 16-bit unsigned short is 65535. Shorts are used to store the grid dimensions in shared memory.
Why shorts and not ints?
I’ve just always accepted this limit and not questioned it. However, reading this:
made me think.
65535 is indeed the largest representable number in unsigned short. However, zero is also representable. Since block IDs begin at zero, shouldn’t the maximum dimension therefore be 65536? Having the limit as 65535 allows for block IDs [0, 65534], so we’re losing one ID.
I guess the grid dimensions aren’t zero-based (just the IDs), because that’s not very intuitive. However, it wouldn’t be difficult to take user input and convert it?
Uh, when 65,535 is said to be the limit, that means you can use 65,535.
Was that a response to what I said? I think you’re missing my point. 65,535 is the maximum grid dimension. The limitation is imposed so that unsigned shorts can be used for grid dim and (I’m assuming) block ID variables. But there aren’t 65,535 numbers representable by an unsigned short, there are 65,536. Basically 0 is being wasted in regards to grid dimensions, because a grid with a zero dimension makes no sense.
I know it’s splitting hairs, and I’m basically just thinking out loud, but it seems like it’d be possible to squeeze a one-size-larger dimension out of a grid. That would be nice for keeping things a power of two.
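To make the idea concrete, here is a minimal sketch of the "store the dimension minus one" trick in plain C++. It is only an illustration of the arithmetic, not how the hardware or driver actually stores grid dimensions; the function names `encodeGridDim`, `decodeGridDim`, and `maxBlockId` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical encoding (an assumption for illustration, not the real layout):
// store (dimension - 1), so the 16-bit value 0 means a dimension of 1
// and 65535 means a dimension of 65536.
std::uint16_t encodeGridDim(std::uint32_t dim) {
    assert(dim >= 1 && dim <= 65536);  // a zero-sized grid is meaningless
    return static_cast<std::uint16_t>(dim - 1);
}

// Recover the real dimension from the stored 16-bit value.
std::uint32_t decodeGridDim(std::uint16_t stored) {
    return static_cast<std::uint32_t>(stored) + 1;
}

// Largest block ID for an encoded dimension. IDs run over [0, dim - 1],
// and dim - 1 is exactly the stored value, so it always fits in 16 bits.
std::uint16_t maxBlockId(std::uint16_t stored) {
    return stored;
}
```

With this encoding a dimension of 65536 is stored as 65535, and the block IDs [0, 65535] still fit in an unsigned short. The cost is that every consumer of the stored value has to remember to add one.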
I agree the 65535 grid size limit (and the 2D grid) can be annoying sometimes, but I’m afraid that’s just the way the hardware works. It’s not that hard to construct your own 1D block index.
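For reference, a rough sketch of constructing a 1D block index on top of a 2D grid, written as host-side C++ so the arithmetic can be checked without a GPU. In a kernel the same flattening is `blockIdx.y * gridDim.x + blockIdx.x`. The names `Grid2D`, `makeGrid`, and `flatBlockIndex` are hypothetical helpers, not part of any CUDA API.

```cpp
#include <cassert>
#include <cstdint>

// Per-dimension grid limit discussed in this thread.
constexpr std::uint32_t kMaxGridDim = 65535;

struct Grid2D {
    std::uint32_t x;
    std::uint32_t y;
};

// Pick a 2D grid holding at least `numBlocks` blocks, each dimension <= 65535.
Grid2D makeGrid(std::uint64_t numBlocks) {
    assert(numBlocks >= 1 &&
           numBlocks <= std::uint64_t(kMaxGridDim) * kMaxGridDim);
    std::uint32_t x = numBlocks < kMaxGridDim ? std::uint32_t(numBlocks)
                                              : kMaxGridDim;
    std::uint32_t y = std::uint32_t((numBlocks + x - 1) / x);  // round up
    return {x, y};
}

// Flatten a 2D block ID into a 1D index (row-major order).
std::uint64_t flatBlockIndex(std::uint32_t bx, std::uint32_t by,
                             std::uint32_t gridDimX) {
    return std::uint64_t(by) * gridDimX + bx;
}
```

Because the grid is rounded up, it can contain more blocks than requested (100000 blocks become a 65535x2 grid, for example), so a kernel using this scheme should return early when its flat index is not below the requested block count.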
As for why it’s only 16 bits, bits aren’t free in hardware.
I suspect these limitations will be relaxed in future designs (I believe the DirectX 11 Compute limits are higher).
Are you sure? Since it just sits in shared mem, like the parameters, it sounds like you could make blockIdx (unlike threadIdx) anything you’d like. Surely smem initialization is handled by the driver, not dedicated hardware?
Yes, I am sure. The block and thread indices are initialized by the hardware (the compute “rasterizer”). How would the driver do this for every block?
Initialization of shared memory is handled by the hardware, the CPU can’t even write there… And the hardware determines what to place in the first few bytes, the rest (well, like 32*4 bytes of the rest) can be custom specified by the application, as kernel parameters.
I suppose this is in keeping with the GPU being optimized for very short kernels. I was thinking it’s possible a “prekernel” ran that would initialize the shared memory. Using a prekernel to write a dozen bytes (an n-vector block/grid dimension and maybe something else too?) should have very acceptable overhead.
Of course figuring out my indexes manually isn’t a problem for me, and I would have been fine if they were all 1-dimensional. But people seem to really like the N-D indexes, so maybe we should make them happier?