Shared stack, local stack, profiler output

Hi again. I want to ask about a strange result I get when I profile my program.

In my program I use a short 128-byte stack. I switched on the local-memory counter, so I know exactly how much local memory is in use.

With the stack in local memory I get this (128 bytes of lmem in use):
And this with the stack in shared memory (0 bytes of lmem in use):

The last result puzzled me. I do not use any global or local memory in the last case, so what does “gst uncoalesced” mean there?

The second strange thing is that the local stack works faster than the shared stack.
Could shared-memory bank conflicts be the cause?

“gst uncoalesced” stands for global store uncoalesced, so that counter counts writes to global memory that are not coalesced.

That can happen when you use so much shared memory that your occupancy drops too far. With higher occupancy, you can apparently hide the latency of the local memory accesses well enough.
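To see how the shared memory per block limits occupancy, you could ask the runtime directly. This is only a minimal sketch: it assumes a toolkit recent enough to have `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, and `sharedStackKernel` and `residentBlocks` are hypothetical names standing in for your own kernel and helper.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: it only touches the
// dynamically sized shared array so the occupancy query has a target.
__global__ void sharedStackKernel(int *out)
{
    extern __shared__ int stack[];
    stack[threadIdx.x] = threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = stack[threadIdx.x];
}

// Returns how many blocks of `threads` threads can be resident on one
// multiprocessor when every block requests `smemBytes` of shared memory.
// Returns 0 if the query fails (for example, when no device is present).
int residentBlocks(int threads, size_t smemBytes)
{
    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, sharedStackKernel, threads, smemBytes);
    return blocks;
}
```

Comparing `residentBlocks(128, 0)` with `residentBlocks(128, 128 * 128)` (a 128-byte stack for each of 128 threads) shows how many resident blocks the shared stack costs on your particular device.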

Thanks, that clears up a lot.

So, another question.

I am thinking about a mixed stack: a small part in shared memory and the rest in local memory. What do you think about that?

Will it give a performance gain compared to the local stack?

But I hardly use global memory writes. I write to global memory only once, at the end of the kernel.

Trying is knowing in CUDA. It really depends on how many local memory reads and writes you do, on the amount of calculation, and on the number of threads per multiprocessor. If the latency can already be hidden, using shared memory might not give better performance. But I have thought too many times that something would be faster when it was not (and vice versa), so the only thing I can say is: try it. It also differs a lot from kernel to kernel.
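If you want to try the mixed stack, a minimal sketch could look like the one below. Everything in it is an assumption for illustration: the names `HybridStack`, `hybridStackDemo` and `runHybridStackDemo`, the block size of 64, and the split of 8 levels in shared memory with the other 24 spilling to a per-thread array. The shared part is interleaved (level `i` of thread `t` sits at index `i * BLOCK_SIZE + t`) so that neighbouring threads touch neighbouring words, which is the usual way to avoid bank conflicts.

```cuda
#include <cuda_runtime.h>

#define STACK_DEPTH  32  // 32 ints = 128 bytes per thread, as in the question
#define SHARED_DEPTH 8   // assumed split: first 8 levels live in shared memory
#define BLOCK_SIZE   64  // assumed block size

struct HybridStack {
    int *shared;                            // this thread's column in the shared array
    int  local[STACK_DEPTH - SHARED_DEPTH]; // deeper levels; indexed dynamically,
                                            // so the compiler will typically place it in local memory
    int  top;                               // number of entries currently on the stack

    // No overflow or underflow checks: the caller must stay within STACK_DEPTH.
    __device__ void push(int v)
    {
        if (top < SHARED_DEPTH) shared[top * BLOCK_SIZE] = v;
        else                    local[top - SHARED_DEPTH] = v;
        ++top;
    }

    __device__ int pop()
    {
        --top;
        return (top < SHARED_DEPTH) ? shared[top * BLOCK_SIZE]
                                    : local[top - SHARED_DEPTH];
    }
};

// Each thread pushes n values (n <= STACK_DEPTH), pops them all again and
// stores the sum; one result per thread keeps the final store coalesced.
__global__ void hybridStackDemo(int *out, int n)
{
    __shared__ int sharedPart[SHARED_DEPTH * BLOCK_SIZE];

    HybridStack s;
    s.shared = sharedPart + threadIdx.x;
    s.top = 0;

    for (int i = 0; i < n; ++i) s.push(i + threadIdx.x);

    int sum = 0;
    while (s.top > 0) sum += s.pop();

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

// Host helper: launches one block and returns the result of thread t
// (0 <= t < BLOCK_SIZE), or -1 if any CUDA call fails.
int runHybridStackDemo(int n, int t)
{
    int *dOut = nullptr;
    int  hOut[BLOCK_SIZE];
    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) return -1;
    hybridStackDemo<<<1, BLOCK_SIZE>>>(dOut, n);
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return err == cudaSuccess ? hOut[t] : -1;
}
```

Changing `SHARED_DEPTH` and profiling each variant is the experiment: if the stack usually stays shallow, most pushes and pops hit shared memory while the shared footprint per block stays small enough to keep occupancy up.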

Yes, that write is not coalesced. Especially in kernels where you write only 1 value per block at the end, you will have uncoalesced writes. That does not necessarily mean that you can do better for your algorithm.

What do you mean by “That does not necessarily mean that you can do better for your algorithm”?

How can I do that?

I am saying that it does not mean you can do better. Sometimes uncoalesced writes are the best you can do. When you write one value per block, it is always going to be uncoalesced. But it can very well be that writing one value per block is the best for your algorithm.
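To make the two store patterns concrete, here is a small sketch. The kernel names, the helper `runStorePerBlock` and the block size of 64 are made up for illustration. The first kernel is the "one value per block" case discussed above, where only thread 0 stores; the second shows the pattern that does coalesce, with consecutive threads writing consecutive words.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64  // assumed block size (a power of two, needed by the reduction)

// One result per block: the block reduces its values in shared memory and
// only thread 0 stores, so every block issues a single lone store.
__global__ void storePerBlock(float *out)
{
    __shared__ float partial[BLOCK];
    partial[threadIdx.x] = 1.0f;  // placeholder for the real per-thread result
    __syncthreads();

    // Tree reduction over the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) out[blockIdx.x] = partial[0];
}

// One result per thread: thread k of the grid writes word k, so neighbouring
// threads hit neighbouring addresses and the stores coalesce.
__global__ void storePerThread(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 1.0f;
}

// Host helper: runs storePerBlock on `blocks` blocks and returns out[b],
// or -1 if any CUDA call fails.
float runStorePerBlock(int blocks, int b)
{
    float *dOut = nullptr;
    float  result = -1.0f;
    if (cudaMalloc(&dOut, blocks * sizeof(float)) != cudaSuccess) return result;
    storePerBlock<<<blocks, BLOCK>>>(dOut);
    if (cudaMemcpy(&result, dOut + b, sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        result = -1.0f;
    cudaFree(dOut);
    return result;
}
```

If your algorithm really produces only one number per block, the first form is what you end up with, and a single store at the end of the kernel is unlikely to dominate the run time.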

OK, thanks for the answers.