Local memory performance Using more than 4kb kills it.. why?

alex_dubinsky · August 26, 2008, 5:43am

There seems to be a performance cliff. Accessing more than 2 or 4 kilobytes of local memory severly hurts performance to the point where moving some data to global memory is a big advantage.

What kind of memory access layout does local memory use, and why is it so inefficient when scaled?

The reason I want to use local memory is because local memory ought to be faster than global. In essence, it is guaranteed to be coallesced, which is great in the places where you need per-thread storage. However, why aren’t I seeing this?

Reimar · August 26, 2008, 6:52am

Coalescing is not a problem, but it is banked, which means that accesses that hit the same memory bank (a stride of 16 or something like that) can only be done one after the other.

To make it more concrete:

a = shared[32 * threadIdx.x];

is about a factor 8 - 16 slower than

a = shared[threadIdx.x];

Note that even normal (e.g. x86) CPUs can have a major issue with the power-of-two strided access the first one uses (mostly due to cache associativity), so finding literature on how to avoid it should not be too hard.

The difference to global memory is that e.g.

a = shared[threadIdx.x + 1];

and

a = shared[constant];

are fast, too (with the GTX 2xx series these are not that slow for global memory either though).

Note I did not re-check the details, so better read the corresponding programming manual sections.

Note that even in the slow case, shared memory should be faster than global memory, just not by that much and the difference to my knowledge is mostly in latency so it might not help much if you have loads of threads that can hide the latency.

alex_dubinsky · August 26, 2008, 7:29am

Reimar, local, not shared

Sarnath · August 26, 2008, 7:45am

Local memory is very very slow – It means death to performance - unless you have enough warps and activities to hide that kind of latency! – which is usually difficult to achieve!

I have no idea why it is not fast etc… All I know is - better not use it.

Note that most local variables are allocated in register and are fast!! Only when you allocate arrays in local memory that are NOT indexed by constants – you have this local memory coming into picture.

alex_dubinsky · August 26, 2008, 7:56am

People here have a strange misconception about local memory. Yes, it’s much slower than registers and it’s great if registers is all your kernel needs. But it’s optimized vs global memory, and should be used whenever the shared nature of global memory isn’t needed. As I said, local memory is very similar to global except the storage is striped by the warp size and accesses to the ‘same’ address by all threads result in automatic coallescing. Emulating the same thing with global memory is of course possible but can get quite hairy.

Now I understand there’ll be an overhead converting the address from per-thread to global, but I don’t think that’s what I’m seeing.

Sarnath · August 26, 2008, 8:46am

Not sure why you are favoring local memory here… Do you mean to say local memory is more optimized than global memory? I dont understand why u feel so.

Reimar · August 26, 2008, 8:57am

Hm, sorry for confusing shared an local memory earlier. May I ask where you have all that info from? The programming guide has very little info on local memory, and what I read there sounds to me like it just uses global memory i.e. access to local variable a becomes access to global array element a[threadIdx.x] (assuming a one-dimensional block).

The few times a register ended up in local memory things got horribly slow for me and I forced it to use shared memory manually instead (declared as volatile so it can not be optimized away).

Fuchs · August 26, 2008, 2:40pm

http://www.ddj.com/hpc-high-performance-co…08401741?pgno=3

says:

Don’t be confused by the fact the illustration includes a block labeled “local memory” within the multi-processor. Local memory implies “local in the scope of each thread”. It is a memory abstraction, not an actual hardware component of the multi-processor. In actuality, local memory gets allocated in global memory by the compiler and delivers the same performance as any other global memory region. Local memory is basically used by the compiler to keep anything the programmer considers local to the thread but does not fit in faster memory for some reason. Normally, automatic variables declared in a kernel reside in registers, which provide very fast access. In some cases the compiler might choose to place these variables local memory, which might be the case when there are too many register variables, an array contains more than four elements, some structure or array would consume too much register space, or when the compiler cannot determine if an array is indexed with constant quantities.

Be careful because local memory can cause slow performance. Inspection of the ptx assembly code (obtained by compiling with the -ptx or -keep option) will tell if a variable has been placed in local memory during the first compilation phases as it will be declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics. If it has not, subsequent compilation phases might still decide otherwise though if they find it consumes too much register space for the targeted architecture.

kristleifur · August 26, 2008, 2:57pm

Thanks, Fuchs. It seems that people here have indeed had a strange misconception about local memory :)

E.D_Riedijk · August 26, 2008, 3:08pm

local memory = global memory, but local to each thread & always coalesced for you. So when you need thread-local memory & you have too many registers, it is better/easier to use local memory than global memory. Offcourse when you can you should be using shared memory in this case, but that can also not be an option.

alex_dubinsky · August 26, 2008, 11:26pm

Right. And what happens when you access an array as a[threadIdx.x]? You get coallesced reads. Which, you should all know, are very beneficial.

If you were to use global memory, you’d have to do the indexing by threadIdx yourself. Sometimes this isn’t too hard, sometimes it adds a large amount of complexity and mess. Local memory automagically solves the problem. It’s strange how little awareness there is for this aspect of CUDA.

So… does anyone have a clue why local memory performance plummets if you allocate even a modest amount (2kb+)? The cliffs seem to come at power-of-two boundaries. So a big fall at 2kb, a smaller one at 4kb (a total of a two-fold slowdown between them), another 20% if you pass 8kb. (I haven’t done specific testing, but that’s what I’ve seen working on my code the past week.)

alex_dubinsky · August 26, 2008, 11:31pm

Yes, exactly. Believe me, I’m maxing out my shared mem as well. Actually, it’d be nice if there was “local shared” memory. This wouldn’t automatically provide coallescing, it would automatically solve bank conflicts. I emulate it be having an array such as local_shared[blockDim.x][31]. Each thread has its own local storage and accessesing the “same” location (local_shared[threadIdx.x][7]) is confict-free. You don’t get much memory this way, but if you need it it’s very useful.

Sarnath · August 27, 2008, 6:05am

I really dont understand why suddenly people have started glorifying local memory!

Local memory is SLOW!

btw, Whats the deal about “coalesced access”?? WHere in manual does it talk about all these??

Lets for a momment consider that what that site (Doctor Dob…) says is true… ( i have no clue how they got this info…)

btw, just consider a local memory array of “100” integers – i.e. 400 bytes each.

Now, if you spawn a block with 512 threads – then almost 512*400 bytes are allocated in global memory such that thread 0’s 400 bytes come first, followed by threadid 1’s 400 bytes and so on…

Thus when the kernel accesses localArray[threadIdx.x] – then you are accessing global memory locations 400 bytes apart – which results in 32 independent un-coalesced global memory requests per WARP!

Thats bound to be slower in any sense!

Improper usage of local memory (if u r not hiding the big fat latency it generates) – your performance will come down heavily!

For example: only 1 Warp of a block accessing local memory and others refraining to do – MAY help the cause! but if all warps of the block r accessing then it can result in a tremendously slow kernel…

Best Regards,
Sarnath

E.D_Riedijk · August 27, 2008, 2:47pm

Local memory is sometimes the only option. Just look at e.g. the implementation of sinf(). And it is not slow, it is relatively slow. I can imagine you might bring register usage down quite a bit with using some local mem. Which gives you a higher occupancy & makes it possible to hide the latency of local memory.

Tigga · August 27, 2008, 2:58pm

Why would it be allocated like this rather than interleaved? And if it is allocated like this then it won’t be indexed by the threadID, it’ll be indexed in such a way that it is coalesced (threadID * 400 + elementID should do it I think… though I have to admit I have very little idea how coalescing works on CC < 1.2).

alex_dubinsky · August 27, 2008, 6:17pm

What’s Doctor Dob? Anyway, if local memory worked like how you just described it (localArray[threadIdx.x][i]) then yeah it would be very slow and uncoallesced. But if it worked like (localArray[i][threadIdx.x]) it would be fast.

To be frank, I don’t have proof local memory works one way or the other. But I know which way it should work. I’m going to go write a test kernel to get to the bottom of this. However in the code I’m working on, I replaced a local memory array with a global memory array and got a huge speedup. That should not have happened in either case!

alex_dubinsky · August 27, 2008, 6:20pm

By the way, I am using an 8600 GT on Vista with CUDA 2.0b2.

alex_dubinsky · August 27, 2008, 7:27pm

#include <stdlib.h>

#include <stdio.h>

#include <math.h>

#include <cutil.h>

unsigned const num_blocks = 4;

unsigned const num_threads_per_block = 32;

unsigned const num_threads = num_blocks * num_threads_per_block;

int const size = 512;

int const iter = 3000000/size;

__device__ float globalArray[num_threads * size];

#define localArray(i) localArray[i]

//#define localArray(i) globalArray

//#define localArray(i) globalArray[i * num_threads + (blockIdx.x*blockDim.x + threadIdx.x)]

__global__ void

testKernel( void* OutputMem ) 

{

	float localArray;

	int Â  i, j;

Â  Â for( Â j= 0; j< iter; j+= 1 Â ) Â 

 Â for( Â i= 0; i< size; i+= 1 Â ) Â  localArray(i) = localArray(i)*localArray((i+1)%size);

	((float*)OutputMem)[ Â blockIdx.x*blockDim.x + threadIdx.x Â ] = localArray[0];

}

////////////////////////////////////////////////////////////////////////////////

// Program main

////////////////////////////////////////////////////////////////////////////////

int

main( int argc, char** argv) 

{

	void * OutputMem;

Â  Â dim3 Â grid( num_blocks, 1, 1);

 Â  Â dim3 Â threads( num_threads_per_block, 1, 1);

Â  Â CUT_DEVICE_INIT(argc, argv);

	CUDA_SAFE_CALL( Â cudaMalloc( &OutputMem, num_threads*sizeof(float)) Â );

Â  Â // execute the kernel

	testKernel<<< grid, threads, 16350 >>>( OutputMem);

 Â  Â CUT_CHECK_ERROR("Kernel execution failed");

Â  Â float Output[num_threads*sizeof(float)];

 Â  Â CUDA_SAFE_CALL( cudaMemcpy( Output, OutputMem, num_threads*sizeof(float),

 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â cudaMemcpyDeviceToHost) );

	for(int i=0; i< 10; i++)

 Â printf("%f\n", Output[i]);

Â  Â // cleanup memory

 Â  Â CUDA_SAFE_CALL(cudaFree(OutputMem));

}

Ok, here are my results:

Using local memory: 2.67s

Using global memory uncoallesced: 3.08s

Using global memory coallesced: 2.06s

Hmm. Hmmmmmmm. Please feel free to play with this yourself. Complete VS project attached. Use timethis.exe for timing.

Frankly, I don’t even know why the uncoallesced and coallesced reads are so close.
Local_memory_test.rar (331 KB)

_Big_Mac · August 27, 2008, 8:06pm

Maybe because you only have 32 threads per block? I’m guessing here, I don’t have a CUDA GPU right now.

alex_dubinsky · August 27, 2008, 10:39pm

I just tried it with 256 threads per block. The numbers are similar (ie, you get 8x the parallelism, but the total runtime and the pattern don’t change much). An issue I kept running into was the driver crashing! Do you see any bugs in my simple code?

Topic		Replies	Views
Slow local memory, feigned constant memory. coalesced? global? CUDA Programming and Performance	29	7244	January 25, 2010
Using Shared Memory in CUDA C/C++ Technical Blog	36	1943	October 8, 2020
Please help with __shared__ memory different usage than in samples CUDA Programming and Performance	30	3304	January 10, 2010
Why is the performance more? Refering to Dr Dobbs article CUDA Programming and Performance	10	2637	April 23, 2010
Transfer-Bound Application Looking for ideas to speed it up CUDA Programming and Performance	36	29320	April 23, 2010
Putting the GPU at work CUDA Programming and Performance	21	20172	July 5, 2007
Memory problem? ...incredible slowdown CUDA Programming and Performance	29	16290	January 30, 2011
Coalesced Memory access related doubt CUDA Programming and Performance	13	1958	December 9, 2010
Effective global memory bandwidth? CUDA Programming and Performance	17	17559	September 18, 2007
How fast is local memory? the doc doesn't say much CUDA Programming and Performance	24	8186	August 20, 2007

Local memory performance Using more than 4kb kills it.. why?

Related topics