How do I set up per-block shared memory for 3D structured data?

There is data mapped to a grid with dimensions ptdimx × ptdimy × ptdimz.

A kernel is designed to process the data by doing some calculation based on each cell’s value and its neighbors at one‐ and two‐cell padding, for example:

D00–D01–D02–D03–D04
 |   |   |   |   |
D10–D11–D12–D13–D14
 |   |   |   |   |
D20–D21– X –D23–D24
 |   |   |   |   |
D30–D31–D32–D33–D34
 |   |   |   |   |
D40–D41–D42–D43–D44

Currently, my code allocates blocks with a thread‐block size of 8×8×8 to read and compute the data:

__global__ 
void calcdata(int ptdimx, int ptdimy, int ptdimz,
              float* extdata, float* extdata2)
{
    // two data fields per cell; shared window = 12^3 = 1728 cells per field
    __shared__ float datas[2][12][12][12];

    // flatten thread index in [0..511]
    int thlocid = threadIdx.x 
                + threadIdx.y * 8 
                + threadIdx.z * 64;

    // global coordinates of this thread
    int thglox = blockIdx.x * 8 + threadIdx.x;
    int thgloy = blockIdx.y * 8 + threadIdx.y;
    int thgloz = blockIdx.z * 8 + threadIdx.z;

    // origin of the 12^3 read window (block start minus 2)
    int orix = blockIdx.x * 8 - 2;
    int oriy = blockIdx.y * 8 - 2;
    int oriz = blockIdx.z * 8 - 2;

    // up to 4 passes of 512 threads each cover all 12^3 = 1728 cells
    for (int multiread = 0; multiread < 4; ++multiread)
    {
        int idx = thlocid + multiread * 512;
        int full_x = idx % 12;
        int full_y = (idx / 12) % 12;
        int full_z = idx / 144;
        if (full_z >= 12) break;

        int fx = orix + full_x;
        int fy = oriy + full_y;
        int fz = oriz + full_z;

        // periodic wrap in x
        if      (fx == -1)        fx = ptdimx - 1;
        else if (fx == ptdimx)    fx = 0;
        else if (fx == -2)        fx = ptdimx - 2;
        else if (fx == ptdimx+1)  fx = 1;

        // periodic wrap in y
        if      (fy == -1)        fy = ptdimy - 1;
        else if (fy == ptdimy)    fy = 0;
        else if (fy == -2)        fy = ptdimy - 2;
        else if (fy == ptdimy+1)  fy = 1;

        // periodic wrap in z
        if      (fz == -1)        fz = ptdimz - 1;
        else if (fz == ptdimz)    fz = 0;
        else if (fz == -2)        fz = ptdimz - 2;
        else if (fz == ptdimz+1)  fz = 1;

        if (fx < ptdimx && fy < ptdimy && fz < ptdimz)
        {
            int offset = fx + fy * ptdimx + fz * ptdimx * ptdimy;
            datas[0][full_z][full_y][full_x] = extdata[offset];
            datas[1][full_z][full_y][full_x] = extdata2[offset];
        }
    }

    __syncthreads(); // ensure the full 12^3 window is loaded before any thread reads neighbors

    // some calculations and outputs are omitted
}

In summary, this occupies about 13.8 KB of shared memory per block.

However, I’m concerned about poor scalability: if I increase either the number of threads per block (currently 512) or the number of data fields per cell (currently 2), a shared-memory limit may be hit.

Say the GPU is an RTX 4090: what optimizations can I apply to handle more data fields (e.g. 4–5) per kernel launch?

Thank you!

LEE

The RTX 4090 allows up to 99 KB of shared memory per block, which can easily fit 5 × 13.8 KB.

I wonder whether each block is fully exploiting shared memory, or whether the shared-memory usage degrades block scheduling, say by allowing fewer blocks per SM at the same time.

From what I’ve found, each SM partitions its shared memory among all of its resident blocks while the warp schedulers pick from the eligible warps of those blocks, so I may need to profile this more.

Roughly speaking, I suspect my choice to store data in shared memory may not be optimal.

Another option would be the global memory, and much more data can be loaded per kernel launch. I will test it later.

That is a possibility. Usage of shared memory can be a limiter to occupancy, i.e. the number of threads or warps that can be simultaneously resident and executing on a particular SM.

An RTX 4090 is a cc8.9 device which has a maximum of 1536 threads per SM (you can verify with deviceQuery or read the relevant table in the programming guide.)

If your shared-memory usage per block (or, equivalently, per warp or per thread) results in a limit lower than 1536 threads per SM, then it will limit occupancy. For example, if you use 99KB for a single threadblock, then you cannot achieve 1536 threads/SM occupancy. Whether or not that is an actual concern for performance cannot be determined a priori. It requires comparative profiling and consideration of code design options that use more or less shared memory per block.

Using dynamically allocated shared memory and explicitly opting-in to increased shared usage, you should be able to use up to about 32KB for every 512 threads. Given that a cc8.9 has this limit of 1536 threads per SM, I often suggest that folks don’t choose block sizes larger than 512 threads per block on a cc8.6 or cc8.9 device, for this reason (to have a possibility to hit maximum occupancy.)
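For reference, the opt-in mentioned above looks roughly like this (a minimal sketch with a hypothetical kernel name `calcdata_dyn` and placeholder sizes; error checking omitted):

```cpp
#include <cuda_runtime.h>

__global__ void calcdata_dyn(float* out)
{
    extern __shared__ float smem[];   // sized at launch time
    // ... load window into smem, compute, write to out ...
}

int main()
{
    size_t smem_bytes = 5 * 13824;    // e.g. five 12^3 float fields, ~69 KB
    // Required opt-in for more than the default 48 KB of dynamic shared memory:
    cudaFuncSetAttribute(calcdata_dyn,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smem_bytes);
    dim3 grid(4, 4, 4), block(8, 8, 8);
    float* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));
    calcdata_dyn<<<grid, block, smem_bytes>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
}
```

Note that only dynamically allocated (`extern __shared__`) memory can exceed 48 KB per block; statically declared arrays like `datas[...]` cannot.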

I would be more concerned about optimizing the accesses to limit bank conflicts than with maximum occupancy. But perhaps that is already optimized or no issue?

Note that in my kernel, shared memory operations are dominated by reads rather than writes.

Banks aren’t hit strictly in mod-32 order, because padding exists in the layout.

For each block, 12 × 12 × 12 floats per field are stored in shared memory to feed the 8×8×8 threads (each thread can read neighbors up to 2 cells away in x, y, and z).

Take the first 64 elements (two rows of 32 banks) as an example (data flattened in x first, then y, then z):

2 - 8 - 2 - 2 - 8 - 2 - 2 - (6)

(2) - 2 - 2 - 8 - 2 - 2 - 8 - 2 - 2 - (2)

where each 8 is a contiguous data chunk, each 2 is padding, and the parentheses mark a chunk split across the two rows.

Based on the algorithm, the padding of 2 is mandatory.

But I increased the padding beyond 2 to see how it affected bank conflicts.

The results showed that a padding of 2 gives the fewest bank conflicts.

If one warp (32 threads) spans a subgroup of 8x4x1 (xyz) threads in your block, then each single access touches 8 contiguous data elements in x across 4 rows in y.

If you arrange the 12 data items in the x direction without padding and shift each step in the y direction by 8 or by 24 banks, then you no longer have any bank conflicts at all.

The normal shift in y direction would be 12. We make it 24.

So you could define

    __shared__ float datas[12][12][2][12];

y=0, x=0..7: Banks 0..7
y=1, x=0..7: Banks 24..31
y=2, x=0..7: Banks 16..23
y=3, x=0..7: Banks 8..15

On the other hand, you mention periodic wraps in x, y, z. That makes it complicated with a length of 12: you would switch banks when you do periodic wrapping in x.

If your algorithm is dominated by reads, you could store the data twice for wraps in x to make sure there is no bank switching.

    __shared__ float datas[2][12][12][12+2+2+8]; // +2+2 for storing twice, +8 for padding

or

    __shared__ float datas[12][12][2][12+2+2+4]; // +2+2 for storing twice, +4 for padding

Instead of padding, and to save memory, one can rotate the x coordinate depending on the y coordinate instead:

    // datas[16][16][16]:
    datas[z][y][x] -> datas[z][y][(x + y * 8) & 0xF]; // example: introduces a padding-like shift of 8 per step in y

This coordinate transformation has to be applied to both reads and writes, and therefore only changes the banks, not the data accessed. The small local computations (in this case an addition, a multiplication by a power of two, and a logical AND) are often a small price: the shared memory is shared between the 4 SM partitions and often becomes the bottleneck, whereas the arithmetic and logic pipelines often have spare capacity.

Excellent.

As for wrapping in x, y, and z, I don’t see any issues; wrapping only occurs at the global (CUDA grid) level.

Only threads at the global data boundaries perform the wrap; for all interior threads, the 12-cell window already contains the needed neighbors, filled in during the initialization stage (reading from global into shared memory).

This simple data layout change really eliminates all bank conflicts.

Thank you!

A short derivation for people reading this thread who need to find a similar solution:

Mathematically, your x dimension is larger than the 8 elements you access: it is 12 long.
To still get bank-conflict-free accesses, you want it to act as if it were still 8 elements.

So you need a size which is at least 12 and whose greatest common divisor with 32 (the 32 threads of a warp) is 8. That is true for 24 or 40 or 56, …

With 24, instead of padding the 12 elements out to 24 (which we could also do), we can simply reuse the doubled memory for the other data field.

You can get, e.g., a further 17% reduction in shared-memory accesses if each thread processes 4 positions instead of 1:

Normally you read in 12 neighbours per position.

If each thread processes 4 positions side by side in the x direction, then instead of 4*12 = 48 reads,
you need 4*8 = 32 reads for the y and z neighbours and altogether 8 reads for the neighbours in the x direction, which is a reduction from 48 reads to 40 reads.

But it slightly complicates your code. You have to decide for yourself whether it is worth it. And you can of course use more than 4 positions per thread (e.g. 8). The limits are the number of registers per thread (going beyond 64 registers could limit your occupancy) and the length of your function (which could lead to code-cache misses if it gets too long).

The 4 positions could of course also be side-by-side in z direction (instead of x direction) to not reintroduce banking conflicts.

For reuse of reads it would be advantageous if one thread processed not only positions along a line, but a local neighbourhood along all three axes.

If you take this further, most of your 8x8x8 positions end up processed by a single warp. Then you could possibly replace shared memory with shuffle operations, too (though this could get too complicated to actually be faster).

But perhaps you should first find out whether you want to optimize performance further, and where the bottleneck is. E.g. it could be in L2/global device memory speed. Then SM-local optimizations that use fewer shared-memory accesses will not help anyway.
