Shared memory alternative

My application depends heavily on shared memory. However the memory itself is very limited to 48KB maximum.

When the size of the problem increase, 48KB memory per block is still so small. Is there anyway to access to the assembly level to increase shared memory to the maximum 64K ? or is there anyway to access to the L1 or L2 cache programmatically ?

I have tried to fall back to global memory but it is really slow because my app requires read and write multiple time. Texture cache is only for reading so not useful.

I really appreciate any advice.

Perhaps your application allows you to store some or a lot of the data in regsiters? 128 KB isn’t too bad…

Btw, has anyone used surfaces? I noticed they are a bit like textures and you can read and write to them…

__global__ void copyKernel(int width, int height) {

 // Calculate surface coordinates unsigned 

int x = blockIdx.x * blockDim.x + threadIdx.x; 

unsigned int y = blockIdx.y * blockDim.y + threadIdx.y; 

if (x < width && y < height) { uchar4 data; 

// Read from input surface 

surf2Dread(&data, inputSurfRef, x * 4, y); 

// Write to output surface 

surf2Dwrite(data, outputSurfRef, x * 4, y);



Does it offer any performance improvements?

Fermi’s unified address space generic addressing modes allows you to have one type of pointers pointing to both shared and global memory. Maybe that can help you split your data between shared and global memory.

Unfortunately my data need to be shared between threads. I don’t know if there is a way to share register.
I have also given surface memory a thought, however in the CUDA programming guide it said:

“The texture and surface memory is cached (see Section and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global write or a surface write in the same kernel call returns undefined data. In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.”

It really doesn’t help because I need to write and read in the same kernel call.

I did try using both shared memory and global memory like this:

if ( within range of shared) writeToShared/readFromShared
else writeToGlobal/readFromGlobal.

But it reduces the speed by more than 2 times. Firstly because of the divergent, 2ndly because of imbalanced load between threads ( some threads access global more than shared and the other way round).

I’m not sure if the unified address that you mention can tackle this problem, but I will look into how to do this. Really appriciate if you can give me an example of how this could be done.

[Sorry, just realized that Nvidia calls the feature I’m writing about generic addressing modes, not unified address space. Edited previous post to reflect this]

Using pointers instead of conditionals is a bit more efficient because the access happens within the same transaction. Also you might be able to reuse the pointer, reducing the number of conditionals needed. Basically the code you sketched should give something like this:

if ( within range of shared)

    pointer = shared memory address;


    pointer = global memory address;


However for a speed gain all accesses within a warp would have to come either from shared memory or from the L1 cache. If any of the threads reads from L2 cache or device memory, all threads of the warp would have to wait for it, so nothing would be gained.

Hi tera, thanks for the clarification. However I can’t find anything like this in the web nor the CUDA Programming guide, can you point me to something to read ?

Is there anything special when defining the pointer ? As my code with pointer throw a “Warp Invalid Address Space” in cuda-gdb.

The original code that works:

__shared__ float sdata[NUM_ELEMS];

for ({many row}) {

  if (row < NUM_ELEMS)

    atomicAdd(&sdata[row], input);


    atomicAdd(&y[actualRow + row],input);


The new code that doesn’t work:

__shared__ float sdata[NUM_ELEMS];

float *pResult;

for ({many row}) {

  if (row < NUM_ELEMS)

    pResult = &sdata[row];


    pResult = &y[actualRow + row];

  atomicAdd(pResult, input);


I’m using Fermi, CUDA 4.0, compiled as sm_20

up my threads …

Oh, I didn’t know you want to use atomic operations on the memory. According to the PTX manual, generic addressing is not available for atomic operations.