shared memory latency

Hi all,

As a beginner to CUDA I have a question about using shared memory:
In my code I was seeing high access latency when using global memory, so I decided to buffer some intermediate results in shared memory.
After switching from global memory to shared memory I got almost no performance improvement!
I checked the code again and saw that when a variable that does not depend on the user's input data is written to shared memory, the latency is very small, whereas when the result of some calculation that does depend on the input data is written to shared memory, the write is very slow: nearly as slow as writing to global memory. My question: is this typical behavior for shared memory, or am I doing something wrong?

many thanks

The latency you are seeing is probably that of the load from global memory. Writing to memory has no latency by itself; the following instructions already execute while the data is being written, unless those instructions read from memory again and have to wait for the previous write to finish.
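To illustrate, here is a minimal hypothetical kernel (not your code): the shared-memory store itself does not stall the warp, only a later instruction that depends on memory does.

__global__ void write_then_use(const float *in, float *out)
{
    __shared__ float buf[256];               // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    float v = in[gid];                       // global load: this is where the real latency is
    buf[tid] = v;                            // shared store: issues and completes in the background
    float w  = v * 2.0f + 1.0f;              // independent arithmetic overlaps with the store
    __syncthreads();                         // make every thread's store visible to the block
    out[gid] = w + buf[(tid + 1) % 256];     // a dependent read is what actually has to wait
}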

Thanks for your reply.
I removed the input from global memory.
The remaining part of my code is a loop of, say, 1000 iterations that only writes to some location in shared memory over and over.
When I compare this with exactly the same code writing to global memory instead of shared memory, I again see no performance improvement.
Could this be because subsequent writes to shared memory have to wait until previous writes have finished? If so, is there no difference in this case between using global and shared memory?

Can you show concrete code? A kernel that only ever writes to shared memory should be optimized down to an empty kernel, as it has no observable effect at all.
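For example, a hypothetical kernel like this one has no observable effect: the shared array dies when the block exits, so the compiler is free to reduce it to an empty kernel.

__global__ void shared_only(int iterations)
{
    __shared__ float buf[256];
    for (int i = 0; i < iterations; ++i)
        buf[threadIdx.x] = (float)i;   // result never reaches global memory
    // nothing is written to global memory, so nothing survives the kernel
}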

Sorry, please disregard my previous comment. I had forgotten to comment out some parts of the code!
But the problem with the original code remains…

When I did some tests with shared memory some time ago, I was surprised to find that a kernel consisting of nothing but writes to shared memory did not have its writes removed during optimization. Of course, I was working in PTX with ptxas directly. nvcc, however, does seem to do some optimization of its own before passing the PTX code to ptxas.

@arashloo
Global memory on Fermi cards is cached in L1, so a cache hit can be nearly as fast as shared memory.

The initial memory load may take some time, so to make the effect of shared memory more visible you'll need a greater amount of computation and more reads/writes to shared memory. Also, if you use event timing, the global writes at the end of the kernel need to be taken into account as well, because the event is not recorded until all the work of the preceding kernel has finished. If you use %clock as a timing method, there's no need to worry about the global writes.
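For example, a minimal host-side event-timing sketch (myKernel, d_in and d_out are placeholders, not your code); cudaEventSynchronize only returns after all preceding work in the stream, including the kernel's final global stores, has finished, so those stores are part of the measured time.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(const float *in, float *out)    // placeholder kernel
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = in[gid] * 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<n / 256, 256>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // waits for the kernel, global writes included

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}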

You will benefit from shared memory only when it acts as a cache that keeps you from accessing global memory repeatedly.
It's like data staging.
If you are doing only one pass over your input data, shared memory won't help.
+
Shared memory is subject to bank conflicts… If all threads write to the same location, the writes get serialized and you lose a lot of cycles… It's best for successive threads to read/write successive 32-bit memory locations in shared memory…
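A rough sketch of the staging idea (hypothetical kernel, assuming blockDim.x == 256): one global read per thread, repeated reuse out of shared memory, and successive threads touching successive 32-bit words so there are no bank conflicts.

__global__ void stage_and_reuse(const float *in, float *out, int passes)
{
    __shared__ float tile[256];              // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = in[gid];                     // one global read, staged into shared memory
    __syncthreads();

    float acc = 0.0f;
    for (int p = 0; p < passes; ++p)         // shared memory only pays off with reuse
        acc += tile[(tid + p) % 256];        // successive threads hit successive banks
    out[gid] = acc;                          // single global write of the result
}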