As a CUDA beginner, I have a question about using shared memory.
In my code I was seeing high access latency on global memory, so I decided to buffer some intermediate results in shared memory.
After switching from global memory to shared memory, I got almost no performance improvement!
I checked the code again and noticed that when a value that does not depend on the user's input data is written to shared memory, the latency is very small. When the result of a calculation that does depend on the user's input data is written to shared memory, the write is very slow: nearly as slow as a write to global memory.
My question: is this typical behavior for shared memory, or am I doing something wrong?
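To make the two cases concrete, here is a minimal sketch of the pattern I am describing. It is not my real code: the kernel names, the block size `N`, the arithmetic, and the host helpers `run_independent` and `run_dependent` are all made up for illustration, and I am assuming a single block of `N` threads.

```cuda
#include <cuda_runtime.h>

#define N 256  // hypothetical block size, not taken from my real code

// Case 1: the value written to shared memory does not depend on any input.
__global__ void store_independent(float *out)
{
    __shared__ float buf[N];
    buf[threadIdx.x] = 1.0f;                        // this write looks fast
    __syncthreads();
    out[threadIdx.x] = buf[(threadIdx.x + 1) % N];  // read a neighbour's slot
}

// Case 2: the value written to shared memory is computed from input data
// that is read from global memory first.
__global__ void store_dependent(const float *in, float *out)
{
    __shared__ float buf[N];
    buf[threadIdx.x] = in[threadIdx.x] * 2.0f + 1.0f;  // this write looks slow
    __syncthreads();
    out[threadIdx.x] = buf[(threadIdx.x + 1) % N];
}

// Host helper (hypothetical): runs case 1 and returns the first output element.
float run_independent()
{
    float h_out[N];
    float *d_out;
    cudaMalloc(&d_out, N * sizeof(float));
    store_independent<<<1, N>>>(d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out[0];
}

// Host helper (hypothetical): fills the input with `value`, runs case 2,
// and returns the first output element.
float run_dependent(float value)
{
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = value;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    store_dependent<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out[0];
}
```

The two kernels differ only in where the stored value comes from, which is the difference I am asking about. Error checking of the CUDA calls is left out to keep the sketch short.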