I have a matrix like this:
Currently I am allocating it on a per-thread basis, so it gets stored in local memory. I did that after reading in the programming guide that local memory access is always coalesced :) (page 86 of the 2.2 programming guide).
Another way I can do this is to allocate the total space for all the threads up front in GPU global memory (from the host side) and then have each thread use its own slice of that allocation. Then I would have to deal with the malloc and with indexing the writes so they stay coalesced (which can be done).
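To make the second option concrete, here is a rough sketch of what I had in mind (untested; the names and sizes are made up for illustration). The idea is to interleave the per-thread arrays so that element k of thread t sits at offset `k * totalThreads + t`; that way neighbouring threads touch neighbouring addresses and the accesses coalesce:

```cuda
#include <cuda_runtime.h>

// Hypothetical size: each thread owns a PER_THREAD_ELEMS-element scratch array.
#define PER_THREAD_ELEMS 16

// Interleaved layout: element k of thread t lives at scratch[k * totalThreads + t],
// so consecutive threads access consecutive addresses (coalesced).
__global__ void useScratch(float *scratch, int totalThreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= totalThreads) return;

    for (int k = 0; k < PER_THREAD_ELEMS; ++k)
        scratch[k * totalThreads + t] = (float)(t + k);  // coalesced write
}

int main()
{
    const int threadsPerBlock = 128, blocks = 4;
    const int totalThreads = threadsPerBlock * blocks;

    // One big allocation from the host covering all threads' arrays.
    float *scratch;
    cudaMalloc(&scratch, PER_THREAD_ELEMS * totalThreads * sizeof(float));

    useScratch<<<blocks, threadsPerBlock>>>(scratch, totalThreads);
    cudaDeviceSynchronize();

    cudaFree(scratch);
    return 0;
}
```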
My question is: I have seen threads on this forum where people much more experienced and smarter than me in CUDA often say to avoid local memory usage as much as possible. So now I am not sure which is the best way to proceed.
The programming guide doesn’t say much about local memory, so it isn’t very clear to programmers…
Thanks all for your time…
PS: I have already used up my shared memory, so I can’t use it for the above matrix :(