Best way to allocate a small lookup table (2KB of data, read only)

I’ve been trying to understand the various memory types in CUDA, but it’s unclear to me which is right for what I want to do. I have a table of 512 floats that I computed in advance. Each thread randomly selects one of them with equal probability and uses it as a seed value for its calculations. I tried using constant memory, but it was surprisingly slow.

Would shared memory be the right way to do this? If so, it’s unclear to me how to allocate it. I understand that it must be allocated from the kernel, but if it’s shared, how do I know which thread should allocate it and which threads should assume it has already been allocated?

Shared memory is per-block only. If you launch only one block, then I guess it’s a nice idea.

Did you try preceding the array’s declaration with the __constant__ keyword, i.e. __constant__ float array[512]?

You declare it outside any function body; it will be cached.
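For example, a minimal sketch of that approach (the table and kernel names here are just placeholders):

__constant__ float d_table[512];   // lives in constant memory, cached on chip

__global__ void mykernel(float *out, const unsigned int *indices)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread reads its (possibly random) entry through the constant cache
    out[tid] = d_table[indices[tid]];
}

// host side: fill the constant array once before launching the kernel
float h_table[512];
/* ... compute the 512 values ... */
cudaMemcpyToSymbol(d_table, h_table, sizeof(h_table));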

Access to constant memory is serialized if not all threads in a warp access the same address. For this reason, a texture is a better choice for a lookup table.
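A minimal sketch of the texture route, using the texture reference API (again, names are placeholders):

texture<float, 1, cudaReadModeElementType> texTable;   // 1D texture reference at file scope

__global__ void mykernel(float *out, const unsigned int *indices)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // tex1Dfetch reads through the texture cache instead of the constant cache
    out[tid] = tex1Dfetch(texTable, indices[tid]);
}

// host side: copy the table to global memory and bind the texture to it
float h_table[512];
/* ... compute the 512 values ... */
float *d_table;
cudaMalloc((void**)&d_table, sizeof(h_table));
cudaMemcpy(d_table, h_table, sizeof(h_table), cudaMemcpyHostToDevice);
cudaBindTexture(NULL, texTable, d_table, sizeof(h_table));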

Shared memory is another option if you can afford it. If you are on a compute capability 2.x device, just leaving the table in global (not constant) memory is also an option as global memory is cached on these devices.
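The global memory variant needs the least code; a sketch, assuming a compute capability 2.x device where the L1/L2 caches service the reads:

__device__ float g_table[512];   // plain global memory, cached on CC 2.x

__global__ void mykernel(float *out, const unsigned int *indices)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = g_table[indices[tid]];
}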

That explains why using constant memory gave such poor performance. I’m basically not using any shared memory right now (something like a couple hundred bytes vs. 16KB on compute capability 1.1), so I think I can spare it. Is there an example somewhere showing how to use it? I found examples for constant memory, but I’m not sure how to allocate shared memory so that there’s just one table per block.

Something like this, if your kernel is called with a block size of (256,1,1):

__device__ float table[512] = {...};   // lookup table in global memory

__global__ void mykernel(...)
{
    __shared__ float s_table[512];

    unsigned int idx = threadIdx.x;
    s_table[idx      ] = table[idx      ];   // load first half of lookup table from global memory
    s_table[idx + 256] = table[idx + 256];   // load second half of lookup table from global memory
    __syncthreads();                         // table must be fully loaded before any thread reads it

    ...
}
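On the host side you would fill the table once and launch with 256 threads per block (a sketch; numBlocks is whatever your problem size requires):

float h_table[512];
/* ... compute the 512 values ... */
cudaMemcpyToSymbol(table, h_table, sizeof(h_table));   // initialize the __device__ array
mykernel<<<numBlocks, 256>>>(...);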

Note that you will not achieve the full 16x speedup over constant memory, as you will have some slowdown due to random bank conflicts. In the worst case (a 16-way bank conflict, i.e. all accesses to the same bank), it will be just as slow as constant memory.

If you can afford to use some more shared memory, you can limit the worst case to an 8-way bank conflict:

__device__ float table[512] = {...};   // lookup table in global memory

__global__ void mykernel(...)
{
    __shared__ float s_table[512][2];   // two interleaved copies of the table

    unsigned int idx = threadIdx.x;
    unsigned int lane = idx & 1;        // even threads use copy 0, odd threads copy 1

    s_table[idx      ][0] = s_table[idx      ][1] = table[idx      ];   // load first half of lookup table from global memory
    s_table[idx + 256][0] = s_table[idx + 256][1] = table[idx + 256];   // load second half of lookup table from global memory
    __syncthreads();

    ...

    value = s_table[index][lane];   // index is the thread's random table index

    ...
}
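To see why this caps the worst case at 8-way (assuming 16 banks of 4-byte words, as on compute capability 1.x hardware): element i of copy lane sits at word offset 2*i + lane, so even-lane threads only ever touch the eight even banks and odd-lane threads only the eight odd banks. Within a half-warp, at most 8 threads share a lane, so at most 8 of them can collide on any one bank.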

Hey tera, texture memory access is not affected by coalescing patterns, right? Are you interested in testing the latency for cached texture memory?

Yes, coalescing and bank conflicts should be of much less importance to textures, as they are cached and there seems to be some kind of broadcast mechanism. I’m not convinced they play no role at all, but I don’t want to invest time investigating that.

Latency for texture access should be around the same as for global memory access (~400 cycles or so), as the Programming Guide states that the latency is the same whether it is a cache hit or a miss. Latency for shared memory access is something like 36 cycles.
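If anyone wants to measure it, a rough sketch of a pointer-chasing latency kernel (assumes chase is a pre-initialized index chain in global memory, so each load depends on the previous one and the loads serialize):

__global__ void latency_kernel(const unsigned int *chase, unsigned int *cycles, unsigned int *sink)
{
    unsigned int idx = 0;
    clock_t start = clock();
    for (int i = 0; i < 512; ++i)
        idx = chase[idx];   // dependent load chain
    clock_t stop = clock();
    *cycles = (unsigned int)(stop - start) / 512;   // average cycles per load
    *sink = idx;            // keep the compiler from optimizing the loop away
}

Launch it with a single thread, and swap the global load for a tex1Dfetch to get the texture figure.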