Shared memory: a simple CUDA code example to explain the concept?

Hi,

Can someone help me understand the concept of shared memory in CUDA programming? I need a very simple code example, say for addition, that uses shared memory.

Thanks for your time

The transpose example in the SDK, together with its accompanying whitepaper, covers just about everything you need to know.
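
If you want something smaller than the transpose sample to start from, here is a minimal sketch of "addition using shared memory", read as summing an array. This is not code from the SDK. The names blockSum, sumOnGpu and kThreadsPerBlock are made up for illustration, the block size of 256 is an assumption, and error checking is left out to keep it short. Each block copies its slice of the input into a __shared__ array, synchronizes, and then adds the values pairwise until one partial sum per block is left.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed block size; the reduction loop below needs a power of two.
constexpr int kThreadsPerBlock = 256;

// Each block sums up to kThreadsPerBlock inputs and writes one partial sum.
__global__ void blockSum(const float* in, float* partial, int n)
{
    // One slot per thread, visible to every thread in this block only.
    __shared__ float tile[kThreadsPerBlock];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage one element per thread from global to shared memory (0 past the end).
    tile[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();  // all stores must land before anyone reads a neighbour's slot

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();  // outside the if, so every thread reaches the barrier
    }

    if (tid == 0)
        partial[blockIdx.x] = tile[0];
}

// Hypothetical host helper: returns the sum of `data`.
float sumOnGpu(const std::vector<float>& data)
{
    int n = static_cast<int>(data.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;

    float* dIn = nullptr;
    float* dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kThreadsPerBlock>>>(dIn, dPartial, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    // Finish the handful of per-block sums on the CPU.
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The two things to notice are the __shared__ declaration and the __syncthreads() calls. Shared memory is per block, so threads in different blocks never see each other's tile, which is why the per-block results are combined on the host at the end. To try it, call sumOnGpu from your own main() with a std::vector<float> and compare the result against a plain CPU loop.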