Whether to use shared memory?

luca · April 14, 2008, 6:21am

I want to perform an operation (read and write) on an array with width = 256 and height = 128, or even bigger. I know that if the input data is too big, the best choice of memory is texture, or sometimes constant. Shared memory is 16 KB, but when I need read/write access and high performance, can I use shared memory?

Suppose I define sharedata[256];
and divide the work as Grid(256,1,1), Block(1,128,1);

jordyvaneijk · April 14, 2008, 1:36pm

Shared and constant memory are among the fastest memory types on the GPU, I think (don’t know about textures, I never use them :P). Where you put your data depends on how big the array is. Like you mentioned yourself, shared memory is 16 KB, while constant memory is 64 KB. Also, constant memory cannot be written to from the device, so if you want to write to it you need to use local, shared or global memory (textures are also read-only from within a kernel). The size of local memory depends on how much video memory you still have free.
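The sizes quoted above (16 KB shared, 64 KB constant) are for the hardware discussed in this thread and differ on other GPUs. A minimal sketch of how to query them at run time with `cudaGetDeviceProperties`, assuming device 0 is the GPU of interest:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU we care about.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    // All three fields are size_t values measured in bytes.
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Total constant memory:   %zu bytes\n", prop.totalConstMem);
    printf("Total global memory:     %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```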

smokescreen · April 14, 2008, 2:49pm

The thing is, in the CUDA memory presentations from the ECE 498 course (Course Websites | The Grainger College of Engineering | UIUC), one of the slides mentions that constant memory performance depends highly on cache locality, so it can vary a lot (from 1 to 100 cycles), whereas, as you know, with shared memory it’ll only take a single cycle.

So in effect, constant memory depends a lot on the GPU’s coalescing and caching mechanism?

MisterAnderson42 · April 14, 2008, 3:21pm

Constant memory always performs best when all threads in a warp access the same element of constant memory. If threads in a warp access different values in constant memory, then shared memory may be a better option.
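The two access patterns can be sketched roughly as follows. The kernel names, the `coeffs` table and the reversed index are hypothetical, chosen only to illustrate the point: in `uniformRead` every thread in a warp reads the same constant element, while in `perThreadRead` each thread needs a different element, so the table is staged in shared memory instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256

__constant__ float coeffs[N];   // read-only in kernels, filled by the host

// All threads read the same element: the case constant memory is good at.
__global__ void uniformRead(float *out, int k) {
    out[threadIdx.x] = coeffs[k];
}

// Each thread reads a different element: stage the table in shared memory.
__global__ void perThreadRead(const float *table, float *out) {
    __shared__ float s[N];
    s[threadIdx.x] = table[threadIdx.x];     // load from global memory
    __syncthreads();                         // whole table must be loaded
    out[threadIdx.x] = s[N - 1 - threadIdx.x];
}

int main() {
    float h[N], result[N];
    for (int i = 0; i < N; ++i) h[i] = (float)i;

    float *d_table, *d_out;
    cudaMalloc(&d_table, sizeof(h));
    cudaMalloc(&d_out, sizeof(h));
    cudaMemcpy(d_table, h, sizeof(h), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeffs, h, sizeof(h));  // host writes constant memory

    int errors = 0;
    uniformRead<<<1, N>>>(d_out, 5);
    cudaMemcpy(result, d_out, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) if (result[i] != h[5]) ++errors;

    perThreadRead<<<1, N>>>(d_table, d_out);
    cudaMemcpy(result, d_out, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) if (result[i] != h[N - 1 - i]) ++errors;

    printf("mismatches: %d\n", errors);
    cudaFree(d_table);
    cudaFree(d_out);
    return errors != 0;
}
```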

smokescreen · April 14, 2008, 4:42pm

Thanks MisterAnderson42, that clears it up nicely.

luca · April 15, 2008, 1:21am

Thank you for your answers.

I understand that constant memory is quick, but it can’t be written to.

Maybe shared memory is a better choice, but it’s only 16 KB.
The question is, can I still use it (for high performance) when the size of the input data is 128 KB or even bigger?

Or am I misunderstanding how shared memory is used?

MisterAnderson42 · April 15, 2008, 2:12am

Shared is 16k, but per block. If you need more read/write memory than that, then your only option is to use device memory, either as global or local. Coalescing will be the most important aspect for performance here.
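As a rough illustration of coalescing, here is a sketch that updates a 256 x 128 array in global memory. The kernel name `scale` and the 64 x 4 block shape are assumptions made for the example. The point is that `threadIdx.x` maps to the fastest-varying (column) index, so neighbouring threads touch neighbouring addresses.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every element of a row-major array in place.
__global__ void scale(float *data, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column, fastest-varying
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)
        data[y * width + x] *= factor;  // consecutive threads, consecutive addresses
}

int main() {
    const int width = 256, height = 128, n = width * height;
    const size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    dim3 block(64, 4);  // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale<<<grid, block>>>(d, width, height, 2.0f);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    int errors = 0;
    for (int i = 0; i < n; ++i) if (h[i] != 2.0f * (float)i) ++errors;
    printf("mismatches: %d\n", errors);

    cudaFree(d);
    free(h);
    return errors != 0;
}
```

Swapping the roles of `x` and `y` in the index (so that neighbouring threads are `width` elements apart) would give the same result but an uncoalesced access pattern.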

luca · April 15, 2008, 3:09am

Then I assume that if I divide the data so that each block of 256 threads works on a piece smaller than 16 KB,

I can use shared memory no matter the total size of the input array, as in the matrix multiplication example, right?

MisterAnderson42 · April 15, 2008, 1:01pm

Yes.
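A minimal sketch of that idea for the 256 x 128 case from the first post, assuming `float` elements (128 KB in total). The kernel `reverseRows` is hypothetical and stands in for whatever read/write operation is actually needed. Each block stages one 256-element row (1 KB) in shared memory, so the per-block limit is never an issue regardless of the total array size.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define WIDTH  256   // elements per row, also threads per block
#define HEIGHT 128   // number of rows, also number of blocks

// Hypothetical kernel: each block reverses one row via shared memory.
__global__ void reverseRows(const float *in, float *out) {
    __shared__ float tile[WIDTH];            // 1 KB per block
    int row = blockIdx.x;
    int col = threadIdx.x;
    tile[col] = in[row * WIDTH + col];       // coalesced load into shared memory
    __syncthreads();                         // wait until the whole row is loaded
    out[row * WIDTH + col] = tile[WIDTH - 1 - col];  // coalesced store
}

int main() {
    const int n = WIDTH * HEIGHT;
    const size_t bytes = n * sizeof(float);  // 128 KB in total
    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    reverseRows<<<HEIGHT, WIDTH>>>(d_in, d_out);  // one block per row
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    int errors = 0;
    for (int r = 0; r < HEIGHT; ++r)
        for (int c = 0; c < WIDTH; ++c)
            if (h_out[r * WIDTH + c] != h_in[r * WIDTH + (WIDTH - 1 - c)])
                ++errors;
    printf("mismatches: %d\n", errors);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return errors != 0;
}
```

For a larger array only `HEIGHT` (the number of blocks) grows; the shared memory used by each block stays at 1 KB.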