shared memory latency

Hi all,

As a beginner to CUDA I have a question about using shared memory:
In my code I was seeing high access latency when using global memory, so I decided to buffer some intermediate results in shared memory.
After switching from global memory to shared memory I got almost no performance improvement!
I checked the code again and saw that when a variable that does not depend on the user's input data is written to shared memory, the latency is very small, whereas when the result of some calculation that does depend on the input data is written to shared memory, the write is very slow: nearly as slow as writing to global memory. My question: is this typical behavior for shared memory, or am I doing something wrong?

many thanks

The latency you are seeing is probably that of the load from global memory. Writing to memory has no latency by itself; the following instructions already execute while the data is being written, unless those instructions read from memory again and have to wait for the previous write to finish.
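To illustrate, here is a minimal hypothetical kernel (not your code): the shared-memory store itself does not stall the warp, only a later instruction that depends on memory does.

__global__ void write_then_use(const float *in, float *out)
{
    __shared__ float buf[256];               // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    float v = in[gid];                       // global load: this is where the real latency is
    buf[tid] = v;                            // shared store: issues and completes in the background
    float w  = v * 2.0f + 1.0f;              // independent arithmetic overlaps with the store
    __syncthreads();                         // make every thread's store visible to the block
    out[gid] = w + buf[(tid + 1) % 256];     // a dependent read is what actually has to wait
}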

Thanks for your reply.
I removed the input from global memory.
The remaining part of my code is a loop of, say, 1000 iterations that only writes to some location in shared memory over and over.
When I compare this with exactly the same code writing to global memory instead of shared memory, I again see no performance improvement.
Could this be because subsequent writes to shared memory have to wait until previous writes have finished? If so, is there no difference in this case between using global and shared memory?

Can you show concrete code? A kernel that only ever writes to shared memory should be optimized down to an empty kernel, as it has no observable effect at all.
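For example, a hypothetical kernel like this one has no observable effect: the shared array dies when the block exits, so the compiler is free to reduce it to an empty kernel.

__global__ void shared_only(int iterations)
{
    __shared__ float buf[256];
    for (int i = 0; i < iterations; ++i)
        buf[threadIdx.x] = (float)i;   // result never reaches global memory
    // nothing is written to global memory, so nothing survives the kernel
}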

Sorry, please disregard my previous comment. I had forgotten to comment out some parts of the code!
But the problem with the original code remains…

When I did some tests with shared memory some time ago, I was surprised to find that a kernel consisting of nothing but writes to shared memory did not have its writes removed during optimization. Of course, I was working in PTX with ptxas directly. nvcc, however, does seem to do some optimization of its own before passing the PTX code to ptxas.

@arashloo
Global memory on Fermi cards is cached in L1, so a cache hit can be nearly as fast as shared memory.

The initial memory load may take some time, so to make the effect of shared memory more visible you'll need a greater amount of computation and more reads/writes to shared memory. Also, if you use event timing, the global writes at the end of the kernel need to be taken into account as well, because the event is not recorded until all the work of the preceding kernel has finished. If you use %clock as a timing method, there's no need to worry about the global writes.
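For example, a minimal host-side event-timing sketch (myKernel, d_in and d_out are placeholders, not your code); cudaEventSynchronize only returns after all preceding work in the stream, including the kernel's final global stores, has finished, so those stores are part of the measured time.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(const float *in, float *out)    // placeholder kernel
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = in[gid] * 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<n / 256, 256>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // waits for the kernel, global writes included

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}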

You will benefit from shared memory only when it acts as a cache that keeps you from accessing global memory repeatedly.
It's like data staging.
If you are doing only one pass over your input data, shared memory won't help.
+
Shared memory is subject to bank conflicts… If all threads write to the same location, the writes get serialized and you lose a lot of cycles… It's best for successive threads to read/write successive 32-bit memory locations in shared memory…
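A rough sketch of the staging idea (hypothetical kernel, assuming blockDim.x == 256): one global read per thread, repeated reuse out of shared memory, and successive threads touching successive 32-bit words so there are no bank conflicts.

__global__ void stage_and_reuse(const float *in, float *out, int passes)
{
    __shared__ float tile[256];              // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = in[gid];                     // one global read, staged into shared memory
    __syncthreads();

    float acc = 0.0f;
    for (int p = 0; p < passes; ++p)         // shared memory only pays off with reuse
        acc += tile[(tid + p) % 256];        // successive threads hit successive banks
    out[gid] = acc;                          // single global write of the result
}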