There seems to be a performance cliff: accessing more than 2 or 4 kilobytes of local memory severely hurts performance, to the point where moving some data to global memory is a big advantage.
What kind of memory access layout does local memory use, and why is it so inefficient when scaled?
The reason I want to use local memory is that local memory ought to be faster than global memory. In essence, it is guaranteed to be coalesced, which is great in the places where you need per-thread storage. So why am I not seeing this?
Coalescing is not a problem, but it is banked, which means that accesses hitting the same memory bank (at a stride of 16 words, or something like that) can only be serviced one after the other.
To make it more concrete:
a = shared[32 * threadIdx.x];
is about a factor of 8 to 16 slower than
a = shared[threadIdx.x];
Note that even normal (e.g. x86) CPUs can have a major issue with the power-of-two strided access the first one uses (mostly due to cache associativity), so finding literature on how to avoid it should not be too hard.
The difference from global memory is that, e.g.,
a = shared[threadIdx.x + 1];
a = shared[constant];
are fast, too (with the GTX 2xx series these are not that slow for global memory either though).
Note I did not re-check the details, so better read the corresponding programming manual sections.
Note that even in the slow case, shared memory should be faster than global memory, just not by that much; to my knowledge the difference is mostly latency, so it might not help much if you already have loads of threads that can hide the latency.
Local memory is very, very slow; it means death to performance, unless you have enough warps and activity in flight to hide that kind of latency, which is usually difficult to achieve!
I have no idea why it is not fast. All I know is: better not to use it.
Note that most local variables are allocated in registers and are fast! Local memory only comes into the picture when you allocate arrays that are not indexed by compile-time constants.
People here have a strange misconception about local memory. Yes, it's much slower than registers, and it's great if registers are all your kernel needs. But it is optimized compared with raw global memory and should be used whenever the shared nature of global memory isn't needed. As I said, local memory is very similar to global memory, except that the storage is striped by the warp size, so accesses to the 'same' address by all threads are automatically coalesced. Emulating the same thing with global memory is of course possible, but it can get quite hairy.
Now I understand there’ll be an overhead converting the address from per-thread to global, but I don’t think that’s what I’m seeing.
Hm, sorry for confusing shared and local memory earlier. May I ask where you got all that info? The programming guide has very little on local memory, and what I read there sounds like it simply uses global memory, i.e. access to a local variable a becomes an access to a global array element a[threadIdx.x] (assuming a one-dimensional block).
The few times a register spilled into local memory, things got horribly slow for me, and I manually moved the data into shared memory instead (declared as volatile so it cannot be optimized away).
local memory = global memory, but local to each thread and always coalesced for you. So when you need thread-local storage and you are out of registers, it is better/easier to use local memory than global memory. Of course, when you can, you should use shared memory in this case, but that is not always an option.
Right. And what happens when you access an array as a[threadIdx.x]? You get coalesced reads. Which, as you should all know, are very beneficial.
If you were to use global memory, you’d have to do the indexing by threadIdx yourself. Sometimes this isn’t too hard, sometimes it adds a large amount of complexity and mess. Local memory automagically solves the problem. It’s strange how little awareness there is for this aspect of CUDA.
So… does anyone have a clue why local memory performance plummets if you allocate even a modest amount (2 KB+)? The cliffs seem to come at power-of-two boundaries: a big fall at 2 KB, a smaller one at 4 KB (a two-fold slowdown in total between them), and another 20% past 8 KB. (I haven't done specific testing, but that's what I've seen working on my code the past week.)
Yes, exactly. Believe me, I'm maxing out my shared memory as well. Actually, it'd be nice if there were "local shared" memory. It wouldn't provide coalescing automatically, but it would automatically avoid bank conflicts. I emulate it by having an array such as local_shared[blockDim.x]. Each thread has its own local storage, and accessing the "same" location (local_shared[threadIdx.x]) is conflict-free. You don't get much memory this way, but when you need it, it's very useful.
Local memory is sometimes the only option. Just look at e.g. the implementation of sinf(). And it is not slow, it is relatively slow. I can imagine you might bring register usage down quite a bit with using some local mem. Which gives you a higher occupancy & makes it possible to hide the latency of local memory.
Why would it be allocated like this rather than interleaved? And if it is allocated like this, then it won't be indexed by the thread ID alone; it will be indexed in such a way that it is coalesced (elementID * numThreads + threadID should do it, I think… though I have to admit I have very little idea how coalescing works on CC < 1.2).
What’s Doctor Dob? Anyway, if local memory worked the way you just described (localArray[threadIdx.x][i]), then yes, it would be very slow and uncoalesced. But if it worked like localArray[i][threadIdx.x], it would be fast.
To be frank, I don’t have proof local memory works one way or the other. But I know which way it should work. I’m going to go write a test kernel to get to the bottom of this. However in the code I’m working on, I replaced a local memory array with a global memory array and got a huge speedup. That should not have happened in either case!
#include <cstdio>
#include <cutil.h>

unsigned const num_blocks = 4;
unsigned const num_threads_per_block = 32;
unsigned const num_threads = num_blocks * num_threads_per_block;
int const size = 512;
int const iter = 3000000 / size;

__device__ float globalArray[num_threads * size];

#define localArray(i) localArray[i]                                                          // local memory
//#define localArray(i) globalArray[(blockIdx.x*blockDim.x + threadIdx.x) * size + i]        // global, uncoalesced
//#define localArray(i) globalArray[i * num_threads + (blockIdx.x*blockDim.x + threadIdx.x)] // global, coalesced

__global__ void testKernel( void* OutputMem )
{
    float localArray[size];  // per-thread array; the compiler places this in local memory
    int i, j;
    for( j = 0; j < iter; j += 1 )
        for( i = 0; i < size; i += 1 )
            localArray(i) = localArray(i) * localArray((i+1) % size);
    ((float*)OutputMem)[ blockIdx.x*blockDim.x + threadIdx.x ] = localArray(0);
}

// Program main
int main( int argc, char** argv)
{
    void* OutputMem;
    dim3 grid( num_blocks, 1, 1);
    dim3 threads( num_threads_per_block, 1, 1);
    CUT_DEVICE_INIT(argc, argv);
    CUDA_SAFE_CALL( cudaMalloc( &OutputMem, num_threads * sizeof(float)) );
    // execute the kernel (16350 bytes of dynamic shared memory, nearly the
    // full 16 KB per multiprocessor, to keep occupancy at one block per SM)
    testKernel<<< grid, threads, 16350 >>>( OutputMem );
    CUT_CHECK_ERROR("Kernel execution failed");
    float Output[num_threads];
    CUDA_SAFE_CALL( cudaMemcpy( Output, OutputMem, num_threads * sizeof(float),
                                cudaMemcpyDeviceToHost) );
    for( int i = 0; i < 10; i++ )
        printf("%f\n", Output[i]);
    // cleanup memory
    CUDA_SAFE_CALL( cudaFree(OutputMem) );
}
Ok, here are my results:
Using local memory: 2.67s
Using global memory uncoalesced: 3.08s
Using global memory coalesced: 2.06s
Hmm. Hmmmmmmm. Please feel free to play with this yourself. Complete VS project attached. Use timethis.exe for timing.
Frankly, I don’t even know why the uncoalesced and coalesced reads are so close.
Local_memory_test.rar (331 KB)
I just tried it with 256 threads per block. The numbers are similar (i.e., you get 8x the parallelism, but the total runtime and the pattern don’t change much). An issue I kept running into was the driver crashing! Do you see any bugs in my simple code?