What part of “not cached” fails to indicate “slowness”? It is true that section 5.2.1 of the performance guidelines does not describe how to get the best performance out of local memory, but since you really have no control over how the compiler uses it, I don’t see that as a big loss. You just have to hope that the compiler generates coalesced memory reads.
The truth of the matter is, if your kernel is using local memory for ANY reason, it is going to be slow. You are better off moving that data into global memory and managing it explicitly, so that you can make sure your access pattern is coalesced. Device memory bandwidth is very precious and should not be wasted on random uncoalesced reads. Better yet, depending on your access pattern, constant memory, shared memory, or even textures may be a better fit.
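To make the "move it to global memory and manage it yourself" advice concrete, here is a rough sketch, not a definitive recipe. The kernel names, the scratch size K, and the index array idx are all made up for illustration. The first kernel uses a per-thread array with a runtime index, which the compiler may decide to put in local memory. The second keeps the same scratch data in a cudaMalloc'ed buffer laid out so that element i of thread tid sits at scratch[i * n + tid], so neighbouring threads touch neighbouring addresses. It assumes n is a multiple of 16 so each row starts on an aligned boundary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define K 16  // hypothetical per-thread scratch size (power of two)

// Version A: per-thread array with a runtime index.
// The compiler MAY place `scratch` in local memory; we do not control that.
__global__ void scratchInLocal(const int *idx, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float scratch[K];
    for (int i = 0; i < K; ++i)
        scratch[i] = (float)(tid + i);

    out[tid] = scratch[idx[tid] & (K - 1)];  // dynamic index
}

// Version B: same data, explicitly placed in global memory.
// Layout is interleaved: element i of thread tid lives at scratch[i * n + tid],
// so for a fixed i, consecutive threads access consecutive words.
__global__ void scratchInGlobal(const int *idx, float *scratch, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    for (int i = 0; i < K; ++i)
        scratch[i * n + tid] = (float)(tid + i);

    // Only a contiguous access when neighbouring threads pick the same i;
    // a scattered idx still produces scattered reads.
    out[tid] = scratch[(idx[tid] & (K - 1)) * n + tid];
}

int main()
{
    const int n = 1024;  // assumed to be a multiple of 16 and of the block size
    static int hIdx[n];
    static float hA[n], hB[n];
    for (int t = 0; t < n; ++t) hIdx[t] = (t * 7) % K;

    int *dIdx;
    float *dScratch, *dA, *dB;
    cudaMalloc((void **)&dIdx, n * sizeof(int));
    cudaMalloc((void **)&dScratch, K * n * sizeof(float));
    cudaMalloc((void **)&dA, n * sizeof(float));
    cudaMalloc((void **)&dB, n * sizeof(float));
    cudaMemcpy(dIdx, hIdx, n * sizeof(int), cudaMemcpyHostToDevice);

    scratchInLocal<<<n / 256, 256>>>(dIdx, dA, n);
    scratchInGlobal<<<n / 256, 256>>>(dIdx, dScratch, dB, n);

    cudaMemcpy(hA, dA, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hB, dB, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Both kernels should compute the same values; only the storage differs.
    int mismatches = 0;
    for (int t = 0; t < n; ++t)
        if (hA[t] != hB[t]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dIdx); cudaFree(dScratch); cudaFree(dA); cudaFree(dB);
    return mismatches != 0;
}
```

Whether version B is actually faster depends on your hardware and on how idx is distributed, so time both. If K floats per thread fit, shared memory with a similar layout is usually the next thing to try.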
The programming guide uses “device” and “global” almost as synonyms. If you want to be absolutely picky about it, “global memory” refers to a pointer to “device memory” allocated by cudaMalloc or to a variable declared `__device__` (see what I mean about synonyms? You declare “global memory” with `__device__`! If you don’t believe me, look at the section on the `__device__` variable qualifier in the CUDA 1.1 guide).
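If it helps, here is a small sketch of the two ways of getting "global memory" side by side. The names staticBuf, dynamicBuf, and fill are made up for this example. One buffer is declared at file scope with `__device__`, the other is allocated with cudaMalloc, and the kernel writes to both in exactly the same way.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256

// Declared with __device__: a statically allocated variable in global memory.
__device__ float staticBuf[N];

// dynamicBuf points to device memory obtained from cudaMalloc on the host.
__global__ void fill(float *dynamicBuf)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        staticBuf[tid]  = 1.0f;  // global memory via the __device__ variable
        dynamicBuf[tid] = 2.0f;  // global memory via the cudaMalloc pointer
    }
}

int main()
{
    float *dynamicBuf;
    cudaMalloc((void **)&dynamicBuf, N * sizeof(float));

    fill<<<1, N>>>(dynamicBuf);

    float a[N], b[N];
    // A __device__ symbol is read back with cudaMemcpyFromSymbol,
    // a cudaMalloc pointer with plain cudaMemcpy.
    cudaMemcpyFromSymbol(a, staticBuf, sizeof(a));
    cudaMemcpy(b, dynamicBuf, sizeof(b), cudaMemcpyDeviceToHost);

    printf("staticBuf[0] = %f, dynamicBuf[0] = %f\n", a[0], b[0]);

    cudaFree(dynamicBuf);
    return 0;
}
```

From inside the kernel the two are accessed identically; the only practical difference is how the host allocates them and copies to or from them.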