Shared memory alternative

ikidntu · December 2, 2011, 9:41am

My application depends heavily on shared memory. However the memory itself is very limited to 48KB maximum.

When the size of the problem increase, 48KB memory per block is still so small. Is there anyway to access to the assembly level to increase shared memory to the maximum 64K ? or is there anyway to access to the L1 or L2 cache programmatically ?

I have tried to fall back to global memory but it is really slow because my app requires read and write multiple time. Texture cache is only for reading so not useful.

I really appreciate any advice.
Cheers

_constant · December 3, 2011, 4:15pm

Perhaps your application allows you to store some or a lot of the data in regsiters? 128 KB isn’t too bad…

Btw, has anyone used surfaces? I noticed they are a bit like textures and you can read and write to them…

__global__ void copyKernel(int width, int height) {

 // Calculate surface coordinates unsigned 

int x = blockIdx.x * blockDim.x + threadIdx.x; 

unsigned int y = blockIdx.y * blockDim.y + threadIdx.y; 

if (x < width && y < height) { uchar4 data; 

// Read from input surface 

surf2Dread(&data, inputSurfRef, x * 4, y); 

// Write to output surface 

surf2Dwrite(data, outputSurfRef, x * 4, y);

 }

 }

Does it offer any performance improvements?

tera · December 3, 2011, 10:19pm

Fermi’s unified address space generic addressing modes allows you to have one type of pointers pointing to both shared and global memory. Maybe that can help you split your data between shared and global memory.

ikidntu · December 5, 2011, 9:13am

@constant_
Unfortunately my data need to be shared between threads. I don’t know if there is a way to share register.
I have also given surface memory a thought, however in the CUDA programming guide it said:

“The texture and surface memory is cached (see Section 5.3.2.5) and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global write or a surface write in the same kernel call returns undefined data. In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.”

It really doesn’t help because I need to write and read in the same kernel call.

@tera:
I did try using both shared memory and global memory like this:

if ( within range of shared) writeToShared/readFromShared
else writeToGlobal/readFromGlobal.

But it reduces the speed by more than 2 times. Firstly because of the divergent, 2ndly because of imbalanced load between threads ( some threads access global more than shared and the other way round).

I’m not sure if the unified address that you mention can tackle this problem, but I will look into how to do this. Really appriciate if you can give me an example of how this could be done.

tera · December 6, 2011, 12:17am

[Sorry, just realized that Nvidia calls the feature I’m writing about generic addressing modes, not unified address space. Edited previous post to reflect this]

Using pointers instead of conditionals is a bit more efficient because the access happens within the same transaction. Also you might be able to reuse the pointer, reducing the number of conditionals needed. Basically the code you sketched should give something like this:

if ( within range of shared)

    pointer = shared memory address;

else

    pointer = global memory address;

writeViaPointer/readViaPointer;

However for a speed gain all accesses within a warp would have to come either from shared memory or from the L1 cache. If any of the threads reads from L2 cache or device memory, all threads of the warp would have to wait for it, so nothing would be gained.

ikidntu · December 6, 2011, 9:41am

Hi tera, thanks for the clarification. However I can’t find anything like this in the web nor the CUDA Programming guide, can you point me to something to read ?

Is there anything special when defining the pointer ? As my code with pointer throw a “Warp Invalid Address Space” in cuda-gdb.

The original code that works:

__shared__ float sdata[NUM_ELEMS];

for ({many row}) {

  if (row < NUM_ELEMS)

    atomicAdd(&sdata[row], input);

  else

    atomicAdd(&y[actualRow + row],input);

}

The new code that doesn’t work:

__shared__ float sdata[NUM_ELEMS];

float *pResult;

for ({many row}) {

  if (row < NUM_ELEMS)

    pResult = &sdata[row];

  else

    pResult = &y[actualRow + row];

  atomicAdd(pResult, input);

}

I’m using Fermi, CUDA 4.0, compiled as sm_20

ikidntu · December 7, 2011, 1:51pm

up my threads …

tera · December 7, 2011, 2:32pm

Oh, I didn’t know you want to use atomic operations on the memory. According to the PTX manual, generic addressing is not available for atomic operations.

Topic		Replies	Views
No performance inprovement shared mem x global mem CUDA Programming and Performance	5	1144	April 26, 2013
Using Shared Memory in CUDA C/C++ Technical Blog	36	1929	October 8, 2020
Memory coalescing in one thread CUDA Programming and Performance	17	16585	March 31, 2011
Texture and L1 memory bandwidth CUDA Programming and Performance	14	9794	December 14, 2011
Disappointing shared memory performance CUDA Programming and Performance	3	734	September 8, 2011
Suspected wrap size issue with my matrix-related kernel code CUDA Programming and Performance	11	541	December 25, 2017
Memory performance in image processing example CUDA Programming and Performance	9	1600	March 24, 2011
shared memory latency CUDA Programming and Performance	7	5877	May 18, 2011
One question regarding shared memory CUDA Programming and Performance	5	1234	April 24, 2013
Texture Memory / Large Data / Global Memory Advice CUDA Programming and Performance	14	10687	May 18, 2010

Shared memory alternative

Related topics