Use of local memory can be a balancing act. All threads in a block share the same local memory (16384 bytes on a Tesla) on an SM (streaming multiprocessor). Registers and shared memory are placed in this on-chip memory unless you use too much, in which case the data gets spilled to global memory.
So where do those local arrays end up? It depends on how many threads you have in a block and how much local memory each SM has. The utility ‘pgaccelinfo’ will show how much local memory is available. The flag “-Mcuda=ptxinfo” will list the number of registers used per thread and the shared memory used by each kernel. This, combined with the number of threads in a block, should help you determine whether you’re spilling to global memory.
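To make that concrete, here is a minimal CUDA Fortran sketch (the module, kernel, and array names are made up) in which one block stages its data through a shared-memory array; compiling it with -Mcuda=ptxinfo prints the register and shared memory figures mentioned above.

    ! Minimal sketch (hypothetical names): one block stages its data in shared
    ! memory and writes it back reversed.  Build with, for example,
    !   pgfortran -Mcuda=ptxinfo reverse.cuf
    ! to see the registers and shared memory each kernel uses.
    module reverse_mod
      use cudafor
    contains
      attributes(global) subroutine static_reverse(d)
        real :: d(:)
        real, shared :: s(64)            ! 64 * 4 bytes of shared memory per block
        integer :: t, tr
        t  = threadIdx%x                 ! 1-based thread index within the block
        tr = size(d) - t + 1             ! mirrored index
        s(t) = d(t)                      ! stage global memory into shared memory
        call syncthreads()               ! wait for every thread in the block
        d(t) = s(tr)                     ! write back in reversed order
      end subroutine static_reverse
    end module reverse_mod

    program test_reverse
      use cudafor
      use reverse_mod
      implicit none
      integer, parameter :: n = 64
      integer :: i
      real :: a(n)
      real, device :: a_d(n)
      a = [(real(i), i = 1, n)]
      a_d = a
      call static_reverse<<<1, n>>>(a_d)   ! one block of n threads
      a = a_d
      print *, a(1), a(n)                  ! expect 64.0 and 1.0
    end program test_reverse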
The reason I ask is that sometimes I find that using shared memory actually slows my algorithm down, or simply doesn’t make any difference.
If you’re using a Fermi card, then using shared memory doesn’t matter as much. Fermi added an L2 cache as well as hardware caching of local memory. Software-managed caching can still help, just not as much.
I was also surprised a few days ago when I decided to make a block of regularly used data (a 100x100 array) sit in constant memory instead of global memory, as my code ended up being slower! Is there a limit to how much constant memory can be cached?
Constant memory is also finite (64K), but your program would have crashed if you had exceeded this limit. Unfortunately, I don’t know why it’s slower for you.
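For reference, here is a minimal sketch of how a constant-memory table can be declared in CUDA Fortran (the module, kernel, and array names are hypothetical); a 100x100 real array takes 40000 bytes, so it fits within the 64K limit.

    ! Minimal sketch (hypothetical names): a 100x100 table held in constant memory.
    module coeff_mod
      use cudafor
      real, constant :: coeff(100,100)   ! 40000 bytes, under the 64K constant space
    contains
      attributes(global) subroutine apply_coeff(a)
        real :: a(:)
        integer :: i
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
        if (i <= size(a)) a(i) = a(i) * coeff(1,1)   ! read the table from constant memory
      end subroutine apply_coeff
    end module coeff_mod

Host code that uses the module can fill the table with an ordinary assignment such as coeff = host_coeff before launching the kernel.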
Lastly, I haven’t got a clue how to use “pinned” memory, so if anyone has an example of how this might be useful, I’d be very grateful.
In order for host data to be copied to and from the device, the data must be placed in ‘pinned’ memory. This is memory at a physical address that cannot be paged out by the OS. The OS can start the DMA transfer and then move on to another task without worrying about the data being swapped out of memory mid-transfer.
By default, host memory is allocated in normal pageable virtual memory. To transfer the data to the device, this memory must first be copied to a pinned buffer. When you use the ‘pinned’ attribute, you save this extra copy since the host array is allocated directly in pinned memory. The caveats are that pinned memory is finite and that the OS does not need to honor the request. Also, it’s worth noting that it’s the CUDA driver that manages this memory. Hence, if you destroy your context, all device and pinned memory will be destroyed as well; normally allocated host memory is not.
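As a small example of the ‘pinned’ attribute in CUDA Fortran (the array names and sizes below are made up), this sketch allocates the host array in pinned memory, checks whether the request was honored, and does the usual transfers by assignment.

    ! Minimal sketch (hypothetical names) of a pinned host allocation in CUDA Fortran.
    program pinned_demo
      use cudafor
      implicit none
      integer, parameter :: n = 1024*1024
      real, allocatable, pinned :: a(:)      ! host array requested in pinned memory
      real, allocatable, device :: a_d(:)    ! device array
      logical :: plog

      ! The OS may decline the request; the optional pinned= specifier reports this.
      allocate(a(n), pinned=plog)
      if (.not. plog) print *, 'pinned allocation was not honored'
      allocate(a_d(n))

      a   = 1.0
      a_d = a        ! host-to-device copy, no intermediate staging buffer needed
      a   = a_d      ! device-to-host copy

      deallocate(a_d, a)
    end program pinned_demo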
Hope this helps,