Hi again. I want to ask about a strange result I get when I profile my program.
In my program I use a short 128-byte stack. I switched on the local-memory counter, so I know exactly how much local memory is in use.
So, with the stack in local memory I get this (128 bytes of lmem used):
[attachment=6768:attachment]
And this with the stack in shared memory (0 bytes of lmem used):
[attachment=6769:attachment]
The last result puzzled me. I do not use any global or local memory in the last case, so what does “gst uncoalesced” mean there?
The second strange thing is that the local-memory stack runs faster than the shared-memory stack.
Could shared-memory bank conflicts be the cause of that?
“gst uncoalesced” stands for global store uncoalesced, so it counts writes to global memory that are not coalesced.
The local-memory version being faster can happen when the shared-memory version uses so much shared memory that your occupancy drops too far. With higher occupancy, the latency of the local-memory accesses can apparently be hidden well enough.
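To make the occupancy argument concrete, here is a rough back-of-the-envelope sketch in plain C++ (host side only, no GPU needed). The `MpLimits` struct, the `resident_threads` helper and the limit values mentioned below are my own assumptions for illustration, not something taken from your kernel or your card, and register pressure is ignored completely.

```cpp
#include <algorithm>

// Hypothetical per-multiprocessor limits. The values are supplied by the
// caller; check the programming guide for your actual hardware.
struct MpLimits {
    int shared_bytes;  // shared memory available per multiprocessor
    int max_threads;   // maximum resident threads per multiprocessor
    int max_blocks;    // maximum resident blocks per multiprocessor
};

// Estimate how many threads can be resident on one multiprocessor when every
// thread keeps stack_bytes_per_thread bytes of stack in shared memory.
// Pass 0 for the local-memory variant. Register usage is ignored.
int resident_threads(const MpLimits& mp, int threads_per_block,
                     int stack_bytes_per_thread) {
    int shared_per_block = threads_per_block * stack_bytes_per_thread;
    if (shared_per_block > mp.shared_bytes) {
        return 0;  // a single block does not fit at all
    }
    // Blocks allowed by shared memory (unlimited if none is used).
    int by_shared = shared_per_block > 0 ? mp.shared_bytes / shared_per_block
                                         : mp.max_blocks;
    // Blocks allowed by the resident-thread limit.
    int by_threads = mp.max_threads / threads_per_block;
    int blocks = std::min({by_shared, by_threads, mp.max_blocks});
    return blocks * threads_per_block;
}
```

Assuming limits of 16384 bytes of shared memory, 768 threads and 8 blocks per multiprocessor, `resident_threads(mp, 64, 128)` gives 128 threads for the shared-memory stack, while `resident_threads(mp, 64, 0)` gives 512 for the local-memory stack. With that few threads resident there is much less other work available to cover memory latency, which would fit what you measured.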
In CUDA, trying is knowing. It really depends on how many local-memory reads and writes you are doing, on the amount of calculation, and on the number of threads per MP. If the latency can be hidden, using shared memory might not give better performance. But I have been wrong too many times about what would be faster (and vice versa), so the only thing I can say is: try it. It also differs a lot from kernel to kernel.
Yes, that write is not coalesced. Especially in kernels where you write only 1 value per block at the end, you will have uncoalesced writes. That does not necessarily mean that you can do better for your algorithm.
As I said, it does not mean you can do better. Sometimes uncoalesced writes are the best you can do. When you write one value per block, it is always going to be uncoalesced. But it can very well be that writing one value per block is the best for your algorithm.
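For intuition on why a single write per block counts as uncoalesced, here is a deliberately simplified model in plain C++. It only counts how many aligned segments a group of 4-byte stores touches and how much of the moved data was actually requested. The 64-byte segment size and all function names are assumptions for illustration; the real coalescing rules depend on the compute capability and are stricter than this.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumed transaction size in bytes (illustrative only).
constexpr std::uint64_t kSegmentBytes = 64;

// Number of distinct aligned segments touched by the given byte addresses.
std::size_t segments_touched(const std::vector<std::uint64_t>& addrs) {
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addrs) {
        segments.insert(a / kSegmentBytes);
    }
    return segments.size();
}

// Fraction of the moved bytes that were actually requested, assuming each
// address is a distinct 4-byte store.
double store_efficiency(const std::vector<std::uint64_t>& addrs) {
    if (addrs.empty()) {
        return 0.0;
    }
    double requested = 4.0 * addrs.size();
    double moved = static_cast<double>(kSegmentBytes * segments_touched(addrs));
    return requested / moved;
}

// Addresses written when 16 threads store 16 consecutive 4-byte words.
std::vector<std::uint64_t> half_warp_store(std::uint64_t base) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t lane = 0; lane < 16; ++lane) {
        addrs.push_back(base + 4 * lane);
    }
    return addrs;
}

// Address written when only one thread stores the result of its block.
std::vector<std::uint64_t> single_thread_store(std::uint64_t base,
                                               std::uint64_t block) {
    return {base + 4 * block};
}
```

In this model a full half-warp storing consecutive words from an aligned base has an efficiency of 1.0, while the lone store of one value per block has 4/64 = 0.0625. The write is "wasteful" by construction, but since it happens once per block it usually costs very little, which is why it does not have to mean the kernel can be improved.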