Device memory vs. shared memory

Which one is faster to access within the kernel?

Thanks a lot. :D

“Device” memory is the same as “global” memory. It’s unfortunate that there’s this discrepancy in terminology. I’m sure you know global memory is much slower than shared.

I haven’t figured out how to fit my data into shared memory, so I use device memory as the input and output of the kernel. Comparing the runtime of the CUDA code against CPU code running the same algorithm, performance improved about 4x. That is what prompted my question: what is the performance difference between shared memory and global memory?

Thanks a lot for the answer.

All input to a kernel goes through global memory. Shared memory is filled inside a kernel, and access to it is much faster (almost as fast as accessing a register). Using shared memory is only beneficial, however, when threads have to work together or access the same values from global memory.
You can find plenty of uses for shared memory in the SDK examples.
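
To make that pattern concrete, here is a minimal sketch (illustrative only, not taken from the SDK; the names blockSum, d_in, and d_out are made up): each thread stages one value from global memory into a shared-memory tile, the block synchronizes, and the threads then cooperate entirely in shared memory to compute a block-wise sum.

#include <cuda_runtime.h>

// Each block of 256 threads sums 256 elements. Every thread copies one
// value from (slow) global memory into the (fast) on-chip shared tile;
// after __syncthreads(), the tree reduction touches only shared memory.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;   // one global read per thread
    __syncthreads();                                  // tile is now complete

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                    // one global write per block
}

int main()
{
    const int n = 1 << 20;
    const int blocks = (n + 255) / 256;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    // ... fill d_in with cudaMemcpy ...
    blockSum<<<blocks, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Each input element is read from global memory exactly once, while the reduction reuses values many times at shared-memory speed; that reuse is what makes the staging step worthwhile.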

If the memory you need to access is read-only on the GPU side, you may also want to look at constant memory. It is a region of global memory, and there is more of it than shared memory (64 KB vs. 16 KB), but it is cached; it can be read and written from the host, and read from the device. In one case I used it to great benefit: stuffing the data into constant memory raised my speedup over the CPU from roughly 20x to 680x!

Beyond that, global memory is VERY slow, but whether that really hurts performance depends on your memory access patterns. The idea of having many threads running in parallel on the GPU is to hide memory latencies. In another example, where my amount of “constant” data is so large that it barely fits in global memory, I can still achieve speedups of up to 250x (vs. CPU), because the latencies can be hidden well enough. The CPU codes are optimized as much as possible, so the speedups are quite reasonable (GPU: 9800 GT).
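
For illustration, a minimal sketch of that constant-memory pattern (the coefficient table and the kernel scaleByBlockCoeff are made up for the example): the host fills a __constant__ array with cudaMemcpyToSymbol, and the kernel only reads it. When all threads of a warp read the same entry, the constant cache broadcasts the value, which is the access pattern constant memory handles fastest.

#include <cuda_runtime.h>

// Hypothetical read-only table. __constant__ arrays live in device
// memory but are served through the cached constant memory space
// (64 KB total on these GPUs).
__constant__ float coeffs[256];

__global__ void scaleByBlockCoeff(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        // Every thread in the warp reads the same coeffs entry, so the
        // constant cache can broadcast it in a single transaction.
        out[idx] = in[idx] * coeffs[blockIdx.x % 256];
}

int main()
{
    float hostCoeffs[256];
    for (int i = 0; i < 256; ++i)
        hostCoeffs[i] = 1.0f / (i + 1);

    // The host writes constant memory; the device can only read it.
    cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(hostCoeffs));

    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    scaleByBlockCoeff<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

If different threads in a warp need different entries, the reads serialize through the constant cache, and a lookup table in shared memory may be the better fit.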