Shared Memory Persistence

Is there a particular reason that read only data stored in shared memory cannot be retained for subsequent blocks?

I guess there must be, or perhaps it would already be implemented. I’ve a kernel which performs best with a lookup table stored in shared memory, but suspect it would perform better still if it didn’t require reloading the table for every block.

In the CUDA programming model, each block executes independently of other blocks. Each block may move multiple sets of data through shared memory during different phases of kernel execution. Also, different blocks may do entirely different things with shared memory. How would a block “know” whether the data in shared memory is the correct data?

I am facing a similar problem and I seem to have found the answer here.

I am invoking a child kernel which fills the shared memory (declared at file scope) with values computed from global memory.
On return to parent kernel I find that the shared memory is set to 0.

Does the explanation that it cannot be shared across blocks also apply to file-scope shared memory across kernels?

Yes, it cannot be reliably shared across blocks. All shared memory, regardless of how the declaration appears, is of block scope.

Agreed, as per your first quoted sentence.

My thoughts were for the situation where it is purely read-only. I should qualify my situation by saying that while constant memory would seem the obvious place, the LUT is being accessed in a totally random manner on a per-thread basis, and while shared memory suffers from bank conflicts here, random constant access degrades much more.

While constant memory is designed for warp-wise broadcast access, it is usually still appropriate to use if there is no more than 4-way intra-warp address divergence. The switchover point may shift around for different architectures due to differing access latencies for the various types of memory, and I haven’t experimentally confirmed this rule of thumb in a while.

Am short of time now, but will let you know the details, shared vs constant later. If memory serves, I think shared is about 10% quicker.

As I said, the speed of the constant memory access depends on the degree of address divergence, which differs between use cases. In the worst case, address divergence across a warp is 32-fold, requiring 32 replays to finish the constant memory access. Obviously that is not going to be conducive to high performance.

Discussions of the benefits of using constant memory therefore must always be relative to the amount of intra-warp address divergence.
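To make the two serialization rules concrete, here is a host-side sketch (the helper names are illustrative, and it assumes the usual 32-bank, 4-byte-wide shared memory layout): shared memory replays once per extra distinct 4-byte word mapping to the same bank, while constant memory replays once per extra distinct address presented by the warp.

```c
#include <stdint.h>

#define WARP  32
#define BANKS 32

/* Shared memory: an access replays once for every extra distinct 4-byte
   word that maps to the same bank.  With a uint16_t LUT, two entries
   share one bank word, so word = idx / 2; identical words broadcast
   for free.  Returns the worst per-bank degree across the warp. */
int shared_conflict_degree (const uint8_t idx[WARP])
{
    int worst = 0;
    for (int b = 0; b < BANKS; b++) {
        int seen[128] = {0};
        int distinct = 0;
        for (int lane = 0; lane < WARP; lane++) {
            int word = idx[lane] / 2;
            if (word % BANKS == b && !seen[word]) {
                seen[word] = 1;
                distinct++;
            }
        }
        if (distinct > worst) worst = distinct;
    }
    return worst;
}

/* Constant memory: serialized per distinct *address*, regardless of
   banks -- one replay for every extra distinct index in the warp. */
int constant_replays (const uint8_t idx[WARP])
{
    int seen[256] = {0};
    int distinct = 0;
    for (int lane = 0; lane < WARP; lane++) {
        if (!seen[idx[lane]]) { seen[idx[lane]] = 1; distinct++; }
    }
    return distinct;
}
```

Under these assumptions a 256-entry uint16_t table spans only 128 bank words across 32 banks, so shared-memory conflicts top out at 4-way, whereas constant-memory replays can reach 32-fold.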

I realise it’s a “how long’s a piece of string” situation, due to all the variables, but, for no other reason than to give you an update on your rule of thumb :), here are the numbers:

Pascal, 128-thread blocks, LUT is uint16_t x[256], and LUT access is completely random per thread.

With LUT in shared memory, kernel takes 8.49ms.

With LUT in constant memory and shared copy disabled, kernel runs 59.56ms, so about 6 times worse.
Visual Profiler shows the intra-warp address divergence to be 3.5.

Run time increasing by factor of 6 times does not readily make sense to me right now. If there were nothing else happening besides the table lookup, an average of 3.5 addresses per warp presented to the constant memory should result in a slowdown by factor 3.5 due to replay, provided there are enough threads running to cover the latency of the replay, which seems likely. Is the working set of the table lookup small enough to fit into the 8KB of constant cache? Overhead from address arithmetic should be comparable between table in constant memory vs table in shared memory, so it can’t be that.

Am I overlooking something semi-obvious (I had to perform a mental context switch from a Stackoverflow question just now)?

I don’t believe I’ve done anything stupid here. When using shared memory, I copy the table (512 bytes in total) across from constant memory at the beginning of the kernel, so all I’ve done to test the “running from constant” case is comment out the copy and adjust the LUT reference from the shared array to the constant array.

The LUT is an Sbox, and the loop that uses it many times per thread is very short (about 6 instructions), so the table gets thrashed quite hard.
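For flavour, a loop of roughly that shape (an illustrative sketch with a placeholder table and mixing step, not the actual kernel): each round is a couple of logical operations plus one table lookup, so the LUT is hit on nearly every instruction group.

```c
#include <stdint.h>

/* Placeholder 256-entry Sbox; the real table contents don't matter for
   the access pattern being discussed. */
static uint16_t sbox[256];

/* A short, LUT-bound inner loop of the kind described above: each
   round derives the next index from the previous lookup, so accesses
   are effectively random per thread. */
uint32_t substitute (uint32_t state, int rounds)
{
    for (int i = 0; i < rounds; i++) {
        state = (state >> 8) ^ sbox[state & 0xff];
    }
    return state;
}
```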

If I look at the “Warp State Statistics”, short scoreboard stalls dominate:

“On average, each warp of this kernel spends 37.6 cycles being stalled waiting for a scoreboard dependency on an MIO operation (not to TEX or L1). This represents 91.6% of the total average of 41 cycles between issuing 2 instructions.”

EDIT: Where I may be making a false statement is in assuming that the intra-warp address divergence is 3.5.

I inferred this from the shared memory analysis in nvvp, where it said the ideal access should be 1 and the measured was 3.5.

I am working on a little test app right now that allows me to precisely control how many “lanes” in the warp use a random address, whereas the rest of the threads share a common address.

As a first step, I confirmed that the time to access constant memory grows pretty much linearly with the amount of address divergence across the warp. A 32-bit mask controls, for each warp “lane”, whether the address for that lane is taken from the loop counter (shared by all threads in the warp) or from a simple PRNG (initialized with a different seed for each thread). To make the memory access dominant and minimize overhead, 16 sequential loads are performed starting at the array index thus determined.

When looking at the data below, keep in mind that when there are, for example, two lanes using a random index, that’s a total of up to three different addresses across the warp. I verified that various random lane combinations specifying the same number of random lanes also have identical performance, as we would expect (e.g. mask=0x55555555, mask=0xaaaaaaaa, mask=0xcccccccc). Elapsed time is reported in seconds. I ran with CUDA 11.1 on a Quadro RTX 4000 (sm_75).

constant memory LUT (SHARED_TAB=0):

mask=00000000 rnd_lanes=0 elapsed = 0.0020027
mask=80000000 rnd_lanes=1 elapsed = 0.0038806
mask=80008000 rnd_lanes=2 elapsed = 0.0057145
mask=80808080 rnd_lanes=4 elapsed = 0.009453
mask=88888888 rnd_lanes=8 elapsed = 0.0167938
mask=aaaaaaaa rnd_lanes=16 elapsed = 0.0310192
mask=FFFFFFFF rnd_lanes=32 elapsed = 0.0568734

shared memory LUT (SHARED_TAB=1):

mask=00000000 rnd_lanes=0 elapsed = 0.0019839
mask=80000000 rnd_lanes=1 elapsed = 0.0020394
mask=80008000 rnd_lanes=2 elapsed = 0.0021156
mask=80808080 rnd_lanes=4 elapsed = 0.0023051
mask=88888888 rnd_lanes=8 elapsed = 0.0026492
mask=aaaaaaaa rnd_lanes=16 elapsed = 0.0032692
mask=FFFFFFFF rnd_lanes=32 elapsed = 0.0034731
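As a sanity check on the address counts quoted above (two random lanes give at most three distinct addresses across the warp), the lane-select logic can be mirrored on the host. `warp_distinct_addresses` is an illustrative helper, not part of the timed kernel; the xorshift step matches the PRNG in the test program.

```c
#include <stdint.h>

/* The xorshift32 step used per thread in the test program. */
static uint32_t xorshift (uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

/* Distinct table indices one warp presents in a single iteration:
   lanes selected by `mask` use their PRNG value, the rest share the
   loop counter i.  Host-side sanity check only. */
int warp_distinct_addresses (uint32_t mask, uint32_t prng[32], int i)
{
    int seen[256] = {0};
    int distinct = 0;
    for (int lane = 0; lane < 32; lane++) {
        prng[lane] = xorshift (prng[lane]);
        uint32_t idx = ((1u << lane) & mask) ? (prng[lane] & 0xff)
                                             : (uint32_t)(i & 0xff);
        if (!seen[idx]) { seen[idx] = 1; distinct++; }
    }
    return distinct;
}
```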

As expected, for code that is nearly completely bound by LUT accesses, a LUT in shared memory provides more robust performance characteristics compared to a LUT in constant memory in the presence of intra-warp address divergence.

The real-life kernels I looked at in the past featured a bunch of other work in addition to LUT accesses, and I found with those that two-way to four-way address divergence for constant memory LUTs had a negligible performance impact. I no longer have access to those codes; the question I was researching at the time was whether these LUTs could profitably be exchanged for some simple computation, with the finding that this did not have a positive performance impact.

I think the conclusion here is that careful timing and/or profiling with realistic data is indispensable when deciding whether to use LUTs and if so, what memory to place them in.

My little test program is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NUM_BLK  (1024)
#define NUM_THRD (256)
#define NUM_REPS (256)
#define SHARED_TAB (1)

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int32_t popc (uint32_t a)
{
    int32_t c;
    for (c = 0; a; c++) {
        a &= a - 1;
    }
    return c;
}

#if SHARED_TAB == 0
__constant__ uint32_t tab [272]; /* initializer list elided in the original post */
#endif // SHARED_TAB

__global__ void kernel (uint32_t mask, uint32_t *r)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    uint32_t prng = tid;
    uint32_t lane_select, sum, idx;
#if SHARED_TAB == 1
    __shared__ uint32_t tab [272];
    tab [threadIdx.x] = threadIdx.x;
    if (threadIdx.x < 16) tab [threadIdx.x + NUM_THRD] = threadIdx.x;
    __syncthreads();
#endif // SHARED_TAB
    lane_select = (1 << (tid & 0x1f)) & mask; // one-hot
    sum = 0;
    for (int i = 0; i < NUM_REPS; i++) {
        prng ^= prng << 13;
        prng ^= prng >> 17;
        prng ^= prng << 5;
        idx = lane_select ? (prng & 0xff) : i;
        sum += tab [idx+0];
        sum += tab [idx+1];
        sum += tab [idx+2];
        sum += tab [idx+3];
        sum += tab [idx+4];
        sum += tab [idx+5];
        sum += tab [idx+6];
        sum += tab [idx+7];
        sum += tab [idx+8];
        sum += tab [idx+9];
        sum += tab [idx+10];
        sum += tab [idx+11];
        sum += tab [idx+12];
        sum += tab [idx+13];
        sum += tab [idx+14];
        sum += tab [idx+15];
        sum += tab [idx+16];
    }
    r[tid] = sum;
}

int main (void)
{
    uint32_t *res_d = 0;
    uint32_t mask = 0xffffffff;
    double start, stop, elapsed;
    CUDA_SAFE_CALL (cudaMalloc ((void **)&res_d, sizeof (res_d[0]) * NUM_BLK * NUM_THRD));
    /* run twice: the first pass warms up, the second pass is timed */
    for (int k = 0; k < 2; k++) {
        start = second();
        kernel<<<NUM_BLK,NUM_THRD>>>(mask, res_d);
        CHECK_LAUNCH_ERROR();   /* includes synchronization, so the timing is valid */
        stop = second();
        elapsed = stop - start;
    }
    printf ("mask=%08x  rnd_lanes=%d elapsed = %.6g\n",
            mask, popc (mask), elapsed);
    CUDA_SAFE_CALL (cudaFree (res_d));
    return EXIT_SUCCESS;
}

Nicely done and I’ve learnt what is probably a better way to load the shared memory. Thanks for taking the trouble.

This exchange has been mutually beneficial. The lesson I learned is that I should ask to see some actual code before mentioning any rules of thumb :-)

To the best of my recollection, I have never encountered code that is so completely dominated by LUT access as yours apparently is.

An observation, which may be an anomaly due to the extreme nature of my kernel.

In Nsight Compute, in the “GPU Speed of Light” section, the shared memory version shows an SM % of 70.88 and a Memory % of 63.3.

The six times slower constant memory version showed over 97% for both.

I’m therefore somewhat confused as to the purpose of this metric, as my naive understanding (perhaps I haven’t read the docs properly) is that the higher the utilisation, the better the performance.

Here’s my hand-wavy opinion.

I think it’s possible there might be some general correlation like that (sort of like the statement that higher occupancy implies higher performance), but it certainly isn’t always the case (just as the occupancy statement isn’t always true), and I would try to avoid replacing your specific knowledge of code behaviour with a general idea like that.

If we consider a kernel where LUT access dominates kernel activity, and we further posit that the LUT accesses go to constant memory and are on average non-uniform by a factor of 4, then that’s going to result in approximately a 4x increase in the number of issue slots required, because each replay uses up (at least) an issue slot.

So if my shared memory code requires X transactions to the LUT, and the constant memory version requires 4X “transactions”, and the LUT accesses dominate the kernel activity, it stands to reason that the SM will be “busier”. Remember, a replay has no dependencies (by definition), it is always “issuable”. A replay uses an issue slot but never stalls. Therefore it’s the most easily schedulable entity.

That’s going to drive up the apparent utilization. But none of that negates the fact that you now need approximately 4X issue slots to do the same work. So unless your shared kernel issue slot utilization was at or below 25%, the constant version is going to have a negative impact on code throughput.
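Putting numbers on that 25% figure (illustrative arithmetic only, not a profiler metric):

```c
/* Break-even point for swapping a shared-memory LUT for a constant one:
   if replay inflates the required issue slots by `replay_factor`, the
   extra slots only fit into idle issue capacity when the shared
   version's issue-slot utilization is at or below 1/replay_factor. */
double breakeven_utilization (double replay_factor)
{
    return 1.0 / replay_factor;
}

/* Returns 1 if the constant-LUT version can absorb the replays without
   losing throughput, 0 otherwise.  Simplified model. */
int constant_absorbs_replays (double shared_issue_utilization, double replay_factor)
{
    return shared_issue_utilization <= breakeven_utilization (replay_factor);
}
```

With 4-way divergence the break-even point is 0.25; at any higher baseline issue-slot utilization, the 4x replay load must displace useful work.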

Hi Robert,

I agree, and certainly have experience of highly performant, low-occupancy kernels. I think perhaps it’s a matter of presentation: if the SOL metrics were listed like the occupancy ones, the perception (speaking for myself) would be different, as opposed to the prominent bar graphs, which perhaps overemphasize their importance.