large local variables


When creating large local variables for each thread, say `int a[512]`, I would assume CUDA automatically uses global memory for the storage. But how exactly is it stored? Is it cached? Is the performance the same as any other global memory access?
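For concreteness, here is a minimal sketch of the situation being asked about (the kernel name and shape are made up). Arrays this large, or arrays indexed with a runtime value, cannot live in registers, so the compiler places them in "local" memory, which is carved out of off-chip device memory per thread:

```cuda
// Hypothetical kernel with a large per-thread array.
__global__ void bigLocalArray(int *out, int n)
{
    int a[512];                      // candidate for a local-memory spill
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 512; ++i)    // runtime-indexed accesses
        a[i] = tid + i;

    int sum = 0;
    for (int i = 0; i < 512; ++i)
        sum += a[i];

    if (tid < n)
        out[tid] = sum;
}
```

Compiling with `nvcc --ptxas-options=-v` prints the per-thread register and `lmem` (local memory) usage, which tells you whether the array was actually spilled to device memory.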


Local memory isn’t a cache.

CUDA doesn’t automatically store large variables in global memory; it’s totally controlled by the user. If a single thread in your program needs 512 ints exclusively, then you should probably be splitting those calculations across a few threads …
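One way to follow that advice is to give the 512 ints to a whole thread block instead of a single thread, using on-chip shared memory. A sketch, assuming a block of 128 threads and a reduction as the follow-up work (names and the reduction are illustrative, not from the original post):

```cuda
#define N 512
#define THREADS 128   // each thread handles N/THREADS = 4 elements

// Instead of one thread owning int a[512] in slow local memory, a block
// of 128 threads shares the 512 ints in fast on-chip shared memory.
__global__ void sharedInsteadOfLocal(int *out)
{
    __shared__ int a[N];
    int tid = threadIdx.x;

    // Each thread fills its slice of the shared array.
    for (int i = tid; i < N; i += THREADS)
        a[i] = i;

    __syncthreads();   // make all writes visible to the whole block

    // Example follow-up work: a simple tree reduction over the array.
    for (int stride = N / 2; stride > 0; stride /= 2) {
        for (int i = tid; i < stride; i += THREADS)
            a[i] += a[i + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = a[0];   // sum of 0..511 = 130816
}
```

Note that `__syncthreads()` is safe here because every thread in the block executes the loops with the same trip counts.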

The section does say that the compiler will automatically move large structs out to device memory. Any array HAS to be put into device memory, as registers are not addressable in the usual way. If you take the address of an automatic variable, it gets put into device memory automatically (this should be mentioned in sec).

You need to be more aware of this on the G80, as the penalty is 250 times slower access (in cycle times). It is a really bad idea to put a large array into automatic storage, as accessing individual members will incur an even greater penalty for non-coalesced access. Still waiting on details about the memory subsystem to be able to design here.

Always do your own thing with device memory and plan for coalescing. That is why I said elsewhere that the compiler should not do this behind your back: the penalties are so high.
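"Do your own thing with device memory and plan coalescing" usually means interleaving the per-thread scratch arrays in a single global allocation, so that at each step the threads of a warp touch consecutive addresses. A sketch under those assumptions (kernel and variable names are hypothetical):

```cuda
// Hand-managed per-thread scratch space in global memory, laid out so
// neighbouring threads touch neighbouring addresses: element k of
// thread t lives at scratch[k * totalThreads + t], NOT at
// scratch[t * 512 + k] as a naive per-thread layout would have it.
__global__ void coalescedScratch(int *scratch, int *out, int totalThreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;

    // At each step k, the warp's threads write consecutive addresses,
    // so the accesses coalesce into a small number of transactions.
    for (int k = 0; k < 512; ++k)
        scratch[k * totalThreads + t] = t + k;

    int sum = 0;
    for (int k = 0; k < 512; ++k)
        sum += scratch[k * totalThreads + t];

    out[t] = sum;
}
```

The host would allocate `scratch` once with `cudaMalloc` as `512 * totalThreads * sizeof(int)` bytes; the interleaved indexing is what turns the scattered per-thread accesses into coalesced ones.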