Local memory performance: using more than 4 KB kills it... why?

They are not that close; about 50% slower. But I suspect that 32-bit integer multiplies might be quite slow on your hardware, and the coalesced variant needs two (three, if the compiler is stupid) more of them per innermost-loop iteration.

You could try using a shift or the special 24-bit integer multiplication (__mul24) and see if it changes anything.
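For example, something like this (a minimal sketch; row, col, width, and log2Width are made-up names, not from your kernel):

__device__ int indexMul32(int row, int col, int width)
{
    return row * width + col;              // full 32-bit multiply
}

__device__ int indexMul24(int row, int col, int width)
{
    // __mul24 uses the fast 24-bit multiplier on compute capability 1.x;
    // it is only exact while both operands fit in 24 bits.
    return __mul24(row, width) + col;
}

__device__ int indexShift(int row, int col, int log2Width)
{
    // If width is a power of two, a shift avoids the multiply entirely.
    return (row << log2Width) + col;
}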

Mostly, though, I think you have far too few blocks, so the GPU has no chance to hide the huge global memory latency. Since even for uncoalesced reads it can still queue the next reads, latency is almost the same for coalesced and uncoalesced access, and without latency hiding, memory bandwidth is not the issue.
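To illustrate the launch-configuration point (a rough host-side sketch; myKernel and blocksPerSM are placeholders, not the posted benchmark):

#include <cuda_runtime.h>

__global__ void myKernel(float *out)
{
    // Trivial stand-in body for the benchmark kernel being discussed.
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Several resident blocks per multiprocessor give the scheduler
    // other warps to run while one warp waits on global memory.
    int blocksPerSM = 3;
    int numBlocks   = prop.multiProcessorCount * blocksPerSM;

    float *out;
    cudaMalloc(&out, numBlocks * 256 * sizeof(float));
    myKernel<<<numBlocks, 256>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}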

Also, you are writing back localArray[0], not localArray(0); I am not sure what effects that might have.

Are you saying that the mere indexing calculations are destroying performance? Wow, I never thought about that. I guess you’re right. That’s huge. Thanks for pointing it out.

I saw that yesterday. Didn’t change anything or stop the driver from crashing.

Ok, I changed to mul24 and took out the modulus.
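Concretely, the change was along these lines (a sketch; stride and N are placeholder names, and the bitmask only works because N is a power of two):

__device__ int oldIndex(int i, int j, int stride, int N)
{
    return i * stride + (j % N);           // 32-bit multiply plus modulus
}

__device__ int newIndex(int i, int j, int stride, int N)
{
    // __mul24 replaces the full multiply; the bitmask replaces the
    // modulus, which is valid only when N is a power of two.
    return __mul24(i, stride) + (j & (N - 1));
}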

Results for 32 threads/block and 1 block/multiprocessor (times in seconds):
lmem: 2.45
uncoalesced gmem: 2.75
coalesced gmem: 1.85

Results for 256 threads/block and 3 blocks/multiprocessor:
None. The driver kept crashing, and the result was 3.7 s each time. I rewrote the kernel as many small launches instead of one multi-second launch, but that didn't fix it. :-/ Edit: neither did installing the latest driver.

Anyway, could someone with a GTX 2x0 please try this so we have results on compute capability 1.3?
Lmem_Test_1.1.rar (331 KB)

Section 5.1.2.2 in the CUDA 2.0 manual

I just started reading the Programming Guide 2.0 and stumbled upon this and it reminded me of this thread.

Local memory should indeed be automatically coalesced, but even coalesced global memory operations have huge latency. They are fast (i.e., they can almost max out your device-to-device bandwidth) but take a few hundred cycles to execute. That is not a good place for an often-used, random-access scratchpad, but I imagine it can be useful in some special cases.
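For context, this is the kind of construct the manual is talking about (a sketch, not the posted test): a large, dynamically indexed per-thread array cannot live in registers, so the compiler places it in local memory, which is really per-thread global memory.

__global__ void scratchKernel(const int *in, int *out, int n)
{
    int localArray[1024];                  // 4 KB per thread -> local memory
    for (int i = 0; i < 1024; ++i)
        localArray[i] = 0;

    // Dynamic (data-dependent) indexing forces the array out of
    // registers; each access below pays full global-memory latency,
    // even though the per-thread layout is coalescing-friendly.
    for (int i = 0; i < n; ++i)
        localArray[in[i] & 1023] += 1;

    out[blockIdx.x * blockDim.x + threadIdx.x] = localArray[0];
}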

The CUDA 1.1 manual doesn't say a thing about this, BTW. It might be new, or it might only affect GPUs of a certain compute capability; I don't know.

My testing on CUDA 2.0 has revealed that local memory is not being coalesced. Or, if it is, its access pattern (with regard to memory channels and all that) is nevertheless inferior to manual coalescing. This was borne out in the code I posted and in the full-fledged kernel I'm working on; the speedup is 50-100%. I don't think the problem is my using an 8600 GT, but if you have a GT200, please try the code I posted and tell us the results. It would be greatly appreciated.
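For anyone trying to reproduce this, the manual coalescing being compared is essentially this layout change (a sketch, not the exact posted code; scratch is assumed to hold size * totalThreads elements):

__device__ int readUncoalesced(const int *scratch, int tid, int i, int size)
{
    // One contiguous slice per thread: consecutive threads in a warp
    // hit addresses 'size' elements apart, so their accesses cannot
    // be combined into one transaction.
    return scratch[tid * size + i];
}

__device__ int readCoalesced(const int *scratch, int tid, int i, int totalThreads)
{
    // Interleaved layout: for a fixed i, consecutive threads hit
    // consecutive addresses, so the warp's accesses coalesce, at the
    // price of the extra multiply per access discussed earlier.
    return scratch[__mul24(i, totalThreads) + tid];
}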