Use CUDA Shared memory as a write buffer

E_K · May 8, 2015, 6:10am

Hi there,

I'm stuck on shared memory. I was told that shared memory access is about 100x faster than global memory access, so I was wondering whether I can use shared memory as a write buffer, because my code has some data dependencies. Here is how my loop runs:

Loop:
  1. Write iteration N-1's results from shared memory into global memory.
  2. Read the input from shared memory (computed by iteration N-1) and compute the result of iteration N.
  3. Write the result of iteration N into shared memory.
End loop

I guessed the first two steps would be issued simultaneously, since there is no dependency between them, and that this would hide the write latency. But after implementing it, I found the performance is no better than writing to global memory directly. Why does this happen? Please correct me if I made any fatal errors here. Thanks!

EK

little_jimmy · May 8, 2015, 6:32am

shared memory is faster than global memory, but i would think 100x is a bit much (lot)

and from what you are showing, i do not expect an improvement, as you are mostly writing to global memory, and this is much faster than reading global memory, as writes are cached

shared memory is good for data reuse, and/ or improving global memory access
and i do not think you currently manage to achieve either
in other words, i doubt whether you have redundant global memory reads, or inefficient global memory access, so there is little reason to benefit from shared memory, in this regard

E_K · May 8, 2015, 6:51am

Thanks Jimmy! I think you are correct. I only considered the writes, which may be less important.

I have another question regarding your post. I understand the data reuse case. But in the extreme case where all the data is used only once, how can I exploit shared memory to improve global memory access?

Thanks in advance,
EK

little_jimmy · May 8, 2015, 7:18am

if data is used only once, the data is unique and allows no reuse
whether you then read into local or shared memory, you still need to read the data from global memory, and there is nothing you can do (via shared memory) to improve or speed up such reads (assuming they are coalesced)

if the same data is used more than once, it may be worthwhile to push it into shared memory - similar to caching the data

if the data is unique, but with a poor access pattern, like matrix transposes etc, shared memory can perhaps help with this too
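
To illustrate the transpose case, here is a minimal sketch of a tiled transpose that stages data in shared memory so that both the global read and the global write are coalesced. The names `transposeTiled`, `runTranspose` and `TILE`, and the tile size of 32, are assumptions made for illustration; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumed tile width (hypothetical choice)

// Transpose a rows x cols matrix `in` into a cols x rows matrix `out`.
__global__ void transposeTiled(const float* in, float* out, int rows, int cols)
{
    // The +1 padding avoids shared memory bank conflicts on the column-wise read.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in `in`
    int y = blockIdx.y * TILE + threadIdx.y;  // row in `in`
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];  // coalesced read

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    // Swap the block indices so consecutive threads write consecutive addresses.
    x = blockIdx.y * TILE + threadIdx.x;  // column in `out`
    y = blockIdx.x * TILE + threadIdx.y;  // row in `out`
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

// Hypothetical host helper: runs the kernel and checks it against the input.
bool runTranspose(int rows, int cols)
{
    std::vector<float> h_in(rows * cols), h_out(rows * cols, 0.0f);
    for (int i = 0; i < rows * cols; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = h_in.size() * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(d_in, d_out, rows, cols);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    // out[c][r] must equal in[r][c] for every element.
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (h_out[c * rows + r] != h_in[r * cols + c]) return false;
    return true;
}

int main()
{
    printf("transpose %s\n", runTranspose(100, 70) ? "ok" : "FAILED");
    return 0;
}
```

Each element is still read from global memory exactly once, so there is no reuse here. The possible benefit comes only from turning a strided global access into a strided shared memory access.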

E_K · May 8, 2015, 3:16pm

Thanks, Jimmy~

Is it possible to load the data from the global memory while the thread is doing some other computation?

EK

little_jimmy · May 8, 2015, 3:49pm

“Is it possible to load the data from the global memory while the thread is doing some other computation?”

in a gpu-way, not a cpu-way

the global memory access is done via the load/ store units, which have buffers i am told
global memory reads are also cached, so the necessary data may or may not be in cache already
hence, it becomes a question of how long the access would take to complete, given the underlying conditions, whether the current thread (warp) has additional instructions that it can execute before it must commit its memory accesses, and whether there are other threads (warps) that can jump in, should the thread (warp) need to wait for its accesses to complete

so again: in a gpu-way, not a cpu-way
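
A minimal sketch of what this can look like inside a single thread, assuming the compiler keeps the instruction order roughly as written (it is free to reorder): the load is issued first, arithmetic that does not depend on it follows, and the warp only has to wait at the first instruction that uses the loaded value. The names `loadThenCompute` and `runOverlap` are hypothetical, and error checking is omitted.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void loadThenCompute(const float* in, float* out, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float loaded = in[i];  // the load is issued here; no stall is needed yet

    // Independent arithmetic: it does not touch `loaded`, so it can be
    // executed while the memory request is still in flight.
    float acc = static_cast<float>(i);
    for (int k = 0; k < iters; ++k)
        acc = acc * 0.5f + 1.0f;

    // First use of `loaded`: if the data has not arrived, the warp waits here
    // and the scheduler can run other warps in the meantime.
    out[i] = loaded + acc;
}

// Hypothetical host helper: runs the kernel and compares with a CPU reference.
bool runOverlap(int n, int iters)
{
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = 0.25f * static_cast<float>(i % 16);

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    loadThenCompute<<<blocks, threads>>>(d_in, d_out, n, iters);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i) {
        float acc = static_cast<float>(i);
        for (int k = 0; k < iters; ++k) acc = acc * 0.5f + 1.0f;
        if (fabsf(h_out[i] - (h_in[i] + acc)) > 1e-4f) return false;
    }
    return true;
}

int main()
{
    printf("overlap sketch %s\n", runOverlap(1 << 16, 64) ? "ok" : "FAILED");
    return 0;
}
```

Whether this helps in practice depends on how much independent work exists and how many other warps are resident; with enough warps per multiprocessor the scheduler hides most of the latency without any manual arrangement.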

E_K · May 8, 2015, 5:08pm

Jimmy, I really appreciate your valuable answer!

Actually, I ran into a strange situation, which prompted all the questions above. I'll try to explain it clearly and hope you can help :)

My code runs multiple iterations. In every iteration, it reads and updates data in global memory, and most of the data an iteration reads has been modified by the previous iteration. So there is a true data dependency here. However, if I remove this dependency, that is, if I don't write back to memory after each iteration's computation, the code runs 100x faster than with the dependency in place. Is the compiler smart enough to detect that there is no dependency and then unroll my loop? I don't think so. My loop is as simple as this:

while (iteration < last_iteration) {
    foo<<<blocks, threads>>>(param1, param2…);
    iteration++;
}

Because of this huge difference, I wondered whether the write-back takes too much time. So I tried using shared memory as a buffer, but there was no improvement at all. Now I know that writes are cached, so this optimization may not help.

So why is there such a huge speedup then? Please correct me if I made any errors here. Thanks!

EK

Robert_Crovella · May 8, 2015, 5:18pm

Yes, the compiler is smart enough to actually eliminate code that doesn’t affect global state.
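
A minimal sketch of the effect, assuming default nvcc optimization settings: two kernels do the same arithmetic, but only one stores its result. In the second kernel the result never reaches memory, so the compiler is likely to remove the computation entirely, and the kernel then measures little more than launch overhead. The names `heavy`, `withStore`, `withoutStore` and `run` are hypothetical, and error checking is omitted.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__device__ float heavy(float x, int iters)
{
    for (int k = 0; k < iters; ++k) x = x * 0.5f + 1.0f;
    return x;
}

// Result is written back, so the computation affects global state.
__global__ void withStore(float* data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = heavy(data[i], iters);
}

// Result is discarded, so the compiler may treat the computation as dead code.
__global__ void withoutStore(const float* data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r = heavy(data[i], iters);
        (void)r;  // never stored anywhere
    }
}

struct RunResult { float ms; float first; };

// Hypothetical helper: times one launch and returns data[0] afterwards.
RunResult run(bool store, int n = 1 << 20, int iters = 1000)
{
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));  // every element starts at 0.0f

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(t0);
    if (store) withStore<<<blocks, threads>>>(d, n, iters);
    else       withoutStore<<<blocks, threads>>>(d, n, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    RunResult r{0.0f, -1.0f};
    cudaEventElapsedTime(&r.ms, t0, t1);
    cudaMemcpy(&r.first, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(d);
    return r;
}

int main()
{
    run(true);  // warm-up launch so the timings exclude one-time setup cost
    RunResult a = run(true);
    RunResult b = run(false);
    printf("with store:    %.3f ms, data[0] = %f\n", a.ms, a.first);
    printf("without store: %.3f ms, data[0] = %f\n", b.ms, b.first);
    return 0;
}
```

The timings are only a rough indication and will vary by GPU and compiler version. Inspecting the generated code (for example with `cuobjdump -sass`) is a more direct way to confirm whether the loop was removed.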

E_K · May 9, 2015, 2:33am

Thanks txbob, though that is beyond what I expected…
