Use CUDA Shared memory as a write buffer

Hi there,

I'm stuck on a shared memory question. I was told that shared memory is about 100x faster to access than global memory, so I was wondering if I can use shared memory as a write buffer, because my code has some data dependencies. Here is how my loop runs:


  1. Write iteration N-1's results from shared memory into global memory.
  2. Read the input (computed in iteration N-1) from shared memory and compute the iteration-N result.
  3. Write the iteration-N result into shared memory.
    End loop

I guessed that steps 1 and 2 would be issued simultaneously, because there is no dependency between them, and that this would hide the write latency. But after I implemented it, I found the performance is no better than writing into global memory directly. Why does this happen? Please correct me if I made some fatal error here. Thanks!
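For reference, a minimal sketch of the pattern described above (kernel name, array names, the block size of 256, and the dummy neighbor-averaging math are all invented for illustration):

```cuda
// Hypothetical "shared memory as write buffer" loop, as described in the post.
// Assumes blockDim.x == 256 and a grid that covers the whole input.
__global__ void iterate(float *out, const float *in, int num_iter)
{
    __shared__ float smem[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    smem[tid] = in[gid];                     // seed with iteration-0 input
    __syncthreads();

    for (int n = 1; n <= num_iter; ++n) {
        float prev = smem[tid];              // input computed by iteration n-1
        out[gid] = prev;                     // step 1: flush the n-1 result to global
        float cur = 0.5f * (prev + smem[(tid + 1) % 256]);  // step 2: dummy compute
        __syncthreads();                     // all reads of smem done before rewriting
        smem[tid] = cur;                     // step 3: buffer the iteration-n result
        __syncthreads();
    }
    out[gid] = smem[tid];                    // flush the final iteration
}
```

Note that the global store in step 1 and the arithmetic in step 2 are already independent instructions, so the hardware can overlap them even without the shared-memory buffer.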


shared memory is faster than global memory, but i would think 100x is a bit much (a lot)

and from what you are showing, i do not expect an improvement: you are mostly writing to global memory, and writing to global memory is much faster than reading from it, as writes are cached

shared memory is good for data reuse, and/or for improving global memory access patterns,
and i do not think you currently achieve either
in other words, i doubt that you have redundant global memory reads or inefficient global memory access, so there is little reason to expect a benefit from shared memory in this regard

Thanks Jimmy! I think you are correct; I only considered the writes, which may be less important.

I have another question about your post. I can understand the data reuse. But in the extreme case where all the data is used only once, how can I still exploit shared memory to improve global memory access?

Thanks in advance,

if data is used only once, the data is unique and allows no reuse
whether you then read it into local or shared memory, you still need to read the data from global memory, and there is nothing you can do (via shared memory) to improve or speed up such reads (assuming they are coalesced)

if the same data is used more than once, it may be worthwhile to push it into shared memory - similar to caching the data

if the data is unique, but with a poor access pattern, like a matrix transpose etc, shared memory can perhaps help with this too
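For the poor-access-pattern case, the classic example is a tiled matrix transpose: both the read and the write stay coalesced because the transposition happens inside shared memory. A sketch (the 32x32 tile size and the +1 padding are the conventional choices, not from this thread; assumes a 32x32 thread block):

```cuda
#define TILE 32

// Coalesced transpose: each block stages a TILE x TILE tile in shared
// memory, then writes it out transposed. The +1 padding on the second
// dimension avoids shared-memory bank conflicts on the column reads.
__global__ void transpose(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                      // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```

Without the shared-memory staging, either the read or the write would have to stride through memory, wasting most of each memory transaction.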

Thanks, Jimmy~

Is it possible to load the data from the global memory while the thread is doing some other computation?


“Is it possible to load the data from the global memory while the thread is doing some other computation?”

in a gpu-way, not a cpu-way

global memory access is done via the load/store units, which i am told have buffers
global memory reads are also cached, so the necessary data may or may not be in cache already
hence, it becomes a question of how long the access takes to complete given the underlying conditions, whether the current thread (warp) has additional instructions it can execute before it must wait on its memory accesses, and whether other threads (warps) can jump in, should the thread (warp) need to wait for its accesses to complete

so again: in a gpu-way, not a cpu-way
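One software-level way to exploit this is to issue the next load before consuming the current value, so the load/store unit works in the background while arithmetic proceeds. A hedged sketch of this register double-buffering idea (all names invented; assumes n is a multiple of the total thread count):

```cuda
// Grid-stride loop with a one-element register prefetch: the load for the
// next stride is issued before the math on the current value, giving the
// memory access time to complete while the ALUs are busy.
__global__ void scale_sum(float *out, const float *in, int n, float alpha)
{
    int stride = blockDim.x * gridDim.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    float cur = in[gid];                 // first load
    float acc = 0.0f;
    for (int i = gid; i + stride < n; i += stride) {
        float next = in[i + stride];     // issue the next load early...
        acc += alpha * cur;              // ...while computing on the current value
        cur = next;
    }
    acc += alpha * cur;
    out[gid] = acc;
}
```

That said, the main latency-hiding mechanism on a GPU is simply having enough resident warps that the scheduler can switch to another warp while one waits on memory.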

Jimmy, I really appreciate your valuable answer!

Actually, I ran into a strange situation, which is what brought up all the questions above. I'll try to explain it clearly and hope you can help :)

My code runs multiple iterations. In every iteration, it reads data from and updates data in global memory, and the data each iteration reads has mostly been modified by the previous iteration. So you can see that there is a true data dependency here. However, if I just remove this dependency, i.e. I don't write back to memory after each iteration's computation, the code runs 100x faster than when the dependency exists. Is the compiler smart enough to detect that there is no dependency and then unroll my loop? I don't think so. My loop is as simple as this:

while (iteration < last_iteration) {
    foo<<<blocks, threads>>>(param1, param2, ...);
    iteration++;
}

Given the huge difference, I wondered whether the write-back was taking too much time, so I tried using shared memory as a buffer; there was no improvement at all. Now I know the writes are cached, so that optimization may bring no improvement.

So why is there such a huge speedup then? Please correct me if I made any errors here. Thanks!


Yes, the compiler is smart enough to actually eliminate code that doesn’t affect global state.
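In other words, once the write-back is removed, nothing the kernel computes reaches global state, so the compiler can delete the computation itself. A hedged illustration (kernel body and names invented, not the poster's actual code):

```cuda
// With the global store commented out, the loop below has no observable
// effect on global state, so the compiler's dead-code elimination can
// remove it entirely -- the kernel launch still happens, but the "work"
// is gone, which would explain a ~100x "speedup".
__global__ void foo(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = data[i];
    for (int k = 0; k < 1000; ++k)
        v = v * 1.0001f + 0.5f;      // heavy computation

    // data[i] = v;                  // write-back removed -> loop is dead code
}
```

So the fast version isn't doing the iterations faster; it is most likely not doing them at all. When timing, one common trick is to keep a store but guard it with a condition that is never true at runtime, so the compiler cannot prove the computation is dead.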

Thanks txbob, though that was beyond my expectation…