Shared Memory: a simple CUDA code example to illustrate the concept?


Can someone help me understand the concept of shared memory in CUDA programming? I need a very simple code example, say for addition, that uses shared memory.

The transpose example in the SDK (and the accompanying whitepaper) contains just about everything you need to know.
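
For something smaller than the transpose sample, here is a minimal sketch of "addition using shared memory", reading the request as summing the elements of an array. Plain element-wise vector addition gains nothing from shared memory, because each value is read only once; a sum is about the smallest case where threads in a block actually reuse each other's data. The names blockSum, cache, and THREADS_PER_BLOCK are made up for this illustration, the block size of 256 is an assumption, and this is not code from the SDK.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed block size; must be a power of two for the halving loop below,
// and the kernel must be launched with exactly this many threads per block.
#define THREADS_PER_BLOCK 256

// Each block adds up THREADS_PER_BLOCK input elements and writes one partial sum.
__global__ void blockSum(const float *in, float *partial, int n)
{
    // One slot per thread, visible to every thread in the same block.
    __shared__ float cache[THREADS_PER_BLOCK];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage one element per thread from global memory into shared memory.
    // Threads past the end of the array contribute 0.
    cache[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();  // wait until every slot in cache[] has been written

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();  // all adds of this step must finish before the next
    }

    // Thread 0 now holds the sum for this block.
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}

int main()
{
    const int n = 1000;
    const int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    float *h_in = (float *)malloc(n * sizeof(float));
    float *h_partial = (float *)malloc(blocks * sizeof(float));
    for (int i = 0; i < n; ++i)
        h_in[i] = 1.0f;  // the sum of n ones should be n

    float *d_in = NULL, *d_partial = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, THREADS_PER_BLOCK>>>(d_in, d_partial, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // This copy also waits for the kernel to finish.
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    // Add the per-block partial sums on the host.
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b)
        total += h_partial[b];
    printf("sum = %f (expected %d)\n", total, n);

    cudaFree(d_in);
    cudaFree(d_partial);
    free(h_in);
    free(h_partial);
    return 0;
}
```

The three things to notice are the __shared__ declaration (one copy of the array per block, not per thread), the load from global memory into it, and the __syncthreads() calls that keep any thread from reading a slot before its owner has written it. Shared memory is only visible within a block, which is why the per-block results are combined in a separate step on the host.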