Should I use shared memory?

I want to read and write an array with width = 256 and height = 128, or even bigger. I know that when the input data is too big, the best choice of memory is sometimes texture, or maybe constant. Shared memory is only 16 KB, but when I need read/write access and high performance, can I still use shared memory?

For example, suppose I define sharedata[256];
and divide the work as Grid(256,1,1), Block(1,128,1).

Shared and constant memory are among the fastest memory types on the GPU, I think (I don't know about textures, I never use them :P). Where you put your data depends on how big the array is. As you mentioned yourself, shared memory is 16 KB and constant memory is 64 KB. Also, constant memory cannot be written to from the device, so if you want to write to it you need to use local, shared or global memory instead. Local memory resides in video memory, so its size depends on how much video memory you still have.

The thing is, in the CUDA memory presentations from the ECE 498 course (http://courses.ece.uiuc.edu/ece498/al1/), one of the slides mentions that the access time of constant memory depends heavily on cache locality, so it can vary a lot (from 1 to 100 cycles), whereas with shared memory, as you know, it only takes a single cycle.

So in effect, the performance of constant memory depends a lot on the GPU’s coalescing and caching mechanism?

Constant memory always performs best when all threads in a warp access the same element of constant memory. If threads in a warp access different values in constant memory, then shared memory may be a better option.
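To make the two access patterns concrete, here is a rough sketch, not a definitive implementation. The kernel names, the table size N_COEF and the host helper run_constant_demo are all made up for illustration. In scale_uniform every thread of a warp reads the same constant element; in scale_per_thread each thread needs a different entry, so the table is staged in shared memory from a copy in global memory instead.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define N_COEF 256  // hypothetical table size: 256 floats = 1 KB

// Read-only table in constant memory; it can only be filled from the host.
__constant__ float coef[N_COEF];

// All threads read the same element coef[k], the favourable case for constant memory.
__global__ void scale_uniform(float *data, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= coef[k];
}

// Each thread needs a different table entry, so the block first copies
// the table from global memory into shared memory, then reads it there.
__global__ void scale_per_thread(float *data, const float *table, int n)
{
    __shared__ float s_coef[N_COEF];
    for (int j = threadIdx.x; j < N_COEF; j += blockDim.x)
        s_coef[j] = table[j];
    __syncthreads();  // the whole table must be loaded before anyone reads it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s_coef[i % N_COEF];
}

// Hypothetical host helper: runs both kernels and returns the number of wrong results.
int run_constant_demo()
{
    const int n = 1024;
    std::vector<float> h_coef(N_COEF), h_a(n, 1.0f), h_b(n, 1.0f);
    for (int j = 0; j < N_COEF; ++j) h_coef[j] = float(j + 1);

    float *d_a = 0, *d_b = 0, *d_table = 0;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_table, N_COEF * sizeof(float));
    cudaMemcpyToSymbol(coef, h_coef.data(), N_COEF * sizeof(float));
    cudaMemcpy(d_table, h_coef.data(), N_COEF * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_a, h_a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale_uniform<<<n / 256, 256>>>(d_a, n, 3);
    scale_per_thread<<<n / 256, 256>>>(d_b, d_table, n);

    cudaMemcpy(h_a.data(), d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_table);

    int errors = 0;
    for (int i = 0; i < n; ++i) {
        if (h_a[i] != h_coef[3]) ++errors;           // every element scaled by coef[3]
        if (h_b[i] != h_coef[i % N_COEF]) ++errors;  // element i scaled by its own entry
    }
    return errors;
}
```

Which variant is actually faster depends on the hardware and on the real access pattern, so it is worth timing both.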

Thanks MisterAnderson42, that clears it up nicely.

Thank you for your answers.

I understand that constant memory is quick, but it can’t be written to.

Maybe shared memory is a better choice, but it’s only 16 KB.
The question is: can I still use it (for high performance) when the size of the input data is 128 KB or even bigger?

Or am I making a mistake in how I use shared memory?

Shared is 16 KB, but per block. If you need more read/write memory than that, then your only option is to use device memory, either as global or local. Coalescing will be the most important aspect for performance here.
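For what coalescing means in practice, here is a minimal sketch on a 256 x 128 row-major array of floats. The element type, the kernel names and the host helper run_coalescing_demo are assumptions for illustration only. Both kernels compute the same thing; the only difference is which index follows threadIdx.x.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: consecutive threads (threadIdx.x) touch consecutive addresses in a row.
__global__ void add_one_coalesced(float *a, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y;
    if (x < width && y < height) a[y * width + x] += 1.0f;
}

// Strided: consecutive threads walk down a column, so their addresses are
// `width` elements apart and the accesses cannot be combined.
__global__ void add_one_strided(float *a, int width, int height)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    int x = blockIdx.y;
    if (x < width && y < height) a[y * width + x] += 1.0f;
}

// Hypothetical host helper: runs both kernels and returns the number of wrong results.
int run_coalescing_demo()
{
    const int width = 256, height = 128, n = width * height;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = float(i);

    float *d = 0;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // 128 threads per block in both launches; only the thread-to-element mapping differs.
    add_one_coalesced<<<dim3(width / 128, height), 128>>>(d, width, height);
    add_one_strided<<<dim3(height / 128, width), 128>>>(d, width, height);

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != float(i) + 2.0f) ++errors;  // each kernel added 1 to every element
    return errors;
}
```

The results are identical, so any difference shows up only in timing, and how large it is depends on the GPU.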

Then I assume that if I divide the work so that each block has 256 threads and needs less than 16 KB, I can use shared memory no matter how big the whole input array is, as in the Matrix Multiplication example, right?
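A minimal sketch of that per-block scheme, assuming float elements: 256 x 128 floats x 4 bytes = 128 KB in total, but each block stages only 256 floats = 1 KB in shared memory. The kernel reverse_tiles, the constant TILE and the host helper run_tile_demo are hypothetical names; reversing each tile is just a stand-in for any operation where threads need each other's data.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 256  // elements per block; the kernel assumes blockDim.x == TILE

// Each block copies its own 256-element slice of the array into shared memory,
// works on it there, and writes the result back to global memory.
__global__ void reverse_tiles(float *data, int n)
{
    __shared__ float tile[TILE];  // 256 * 4 bytes = 1 KB per block

    int i = blockIdx.x * TILE + threadIdx.x;
    if (i < n) tile[threadIdx.x] = data[i];  // coalesced load
    __syncthreads();                         // tile is complete from here on

    // Number of valid elements in this tile (the last tile may be partial).
    int valid = min(TILE, n - blockIdx.x * TILE);
    if (threadIdx.x < valid)
        data[i] = tile[valid - 1 - threadIdx.x];  // coalesced store
}

// Hypothetical host helper: returns the number of wrong results.
int run_tile_demo()
{
    const int n = 256 * 128;  // 32768 floats = 128 KB
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = float(i);

    float *d = 0;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;  // 128 blocks of 256 threads
    reverse_tiles<<<blocks, TILE>>>(d, n);

    cudaMemcpy(h_out.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int errors = 0;
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < TILE; ++t)
            if (h_out[b * TILE + t] != h_in[b * TILE + TILE - 1 - t]) ++errors;
    return errors;
}
```

The whole array stays in global memory; shared memory only ever holds the slice the current block is working on, which is why the total input size does not matter.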