Local vs. global memory: is local memory access always coalesced?

I have a matrix like this

double mat[42][13];

per thread:

Currently I am declaring it on a per-thread basis, so it is getting stored in local memory. I did that after reading in the programming guide that local memory accesses are always coalesced :) (page 86 of the 2.2 programming guide).
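For reference, a minimal sketch of what the per-thread version might look like. The kernel name `perThreadMatrix`, the fill values, and the launch configuration are made up for illustration and are not my real kernel; it only assumes a device with double-precision support.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ROWS 42
#define COLS 13

// Hypothetical kernel: each thread fills a private matrix and reduces it
// to one output value. The computation itself is a placeholder.
__global__ void perThreadMatrix(double *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // 42 * 13 doubles is far too large for registers, so the compiler
    // will normally place this array in local memory.
    double mat[ROWS][COLS];

    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            mat[r][c] = (double)(tid + r * COLS + c);

    double sum = 0.0;
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            sum += mat[r][c];

    out[tid] = sum;
}

int main()
{
    const int threadsPerBlock = 128, blocks = 8;   // assumed launch shape
    const int n = threadsPerBlock * blocks;

    double *out = NULL;
    cudaMalloc((void **)&out, sizeof(double) * n);

    perThreadMatrix<<<blocks, threadsPerBlock>>>(out, n);

    double first = 0.0;
    cudaMemcpy(&first, out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", first);

    cudaFree(out);
    return 0;
}
```

Compiling with `nvcc --ptxas-options=-v` should show how much local memory the kernel ends up using, which is a quick way to check where the array actually lives.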

Another way I can do this is to allocate the total space for all the threads in GPU global memory first (from the host side) and then use that allocated memory. Then I would have to deal with the allocation and with writing to it in a coalesced way (which can be done).
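A rough sketch of the global memory alternative, assuming one common way to get coalescing: interleave the per-thread matrices so that element (r, c) of consecutive threads sits at consecutive addresses. The names `matIdx`, `useScratch`, and `scratch` are hypothetical, and the computation is again a placeholder.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ROWS 42
#define COLS 13

// Index of element (r, c) in thread tid's matrix. Interleaving by thread
// means neighbouring threads touch neighbouring addresses for the same
// (r, c), which is the access pattern coalescing needs.
__device__ inline size_t matIdx(int r, int c, int tid, int nThreads)
{
    return (size_t)(r * COLS + c) * nThreads + tid;
}

__global__ void useScratch(double *scratch, double *out, int nThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nThreads) return;

    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            scratch[matIdx(r, c, tid, nThreads)] =
                (double)(tid + r * COLS + c);

    double sum = 0.0;
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            sum += scratch[matIdx(r, c, tid, nThreads)];

    out[tid] = sum;
}

int main()
{
    const int threadsPerBlock = 128, blocks = 8;   // assumed launch shape
    const int nThreads = threadsPerBlock * blocks;

    // One 42 x 13 matrix per thread, allocated once from the host.
    double *scratch = NULL, *out = NULL;
    cudaMalloc((void **)&scratch, sizeof(double) * ROWS * COLS * nThreads);
    cudaMalloc((void **)&out, sizeof(double) * nThreads);

    useScratch<<<blocks, threadsPerBlock>>>(scratch, out, nThreads);

    double first = 0.0;
    cudaMemcpy(&first, out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", first);

    cudaFree(scratch);
    cudaFree(out);
    return 0;
}
```

The cost of this version is the extra allocation and the index arithmetic; what it buys is explicit control over the layout instead of relying on what the compiler does with a per-thread array.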

My question is this: I have seen some threads on this forum where people who are much more experienced and smarter than me in CUDA say to avoid local memory usage as much as possible.

So now I am not sure which is the best way to proceed.

The programming guide doesn't say enough about local memory to make this clear for programmers…

Thanks all for your time…

PS: I have already used up my shared memory, so I can't use it for the above matrix :(

The problem with local memory is that it is not cached, so accesses to local memory are as expensive as accesses to global memory.
Furthermore, local memory is not shared between threads in a block like shared memory is, so you can look at local memory as a very slow register.

N.

Yes, I understand, thanks. But it's better than dealing with global memory, as I don't need inter-thread data communication anyway. Each thread has its own matrix, which is formed during the computation and then used to update another global memory variable, so its scope is per thread only.

I think I will go with local memory and see how it works, as there are a lot of flops in the kernel.

Thanks for your inputs NICO :)

The 2.2 programming guide says local memory accesses are always coalesced.

Yes, I read that too… good to know :)

Thanks a ton, SARNATH