How fast is local memory? The doc doesn't say much

Today I wrote a local-memory-as-stack kernel just for fun, and found it surprisingly fast!
It allocates 1 KB of local memory per thread and reads and writes about 10-100 dwords of it repeatedly, in all kinds of weird places like branches and loops. As far as I'm aware, all of this goes directly against the doc's performance section. Nevertheless, the thing nearly outperformed my entirely-shared-memory version.
Has anyone else ever benchmarked kernels that use local memory this extensively?
Or could the NVIDIA guys give some further explanation?

It seems that if a lot of threads write exactly the same thing to the same offset in local memory, the write gets optimized. Is my guess correct?

I corrected my own bugs, and now local memory is insanely fast indeed!
NVIDIA should have said that in the doc!
Now all my hard work to reduce memory usage turns out to have been sheer stupidity…

asadafag, how are you allocating the local memory? The local keyword is deprecated in 1.0. Did you check the .ptx to see whether nvopencc hasn't actually turned it into shared mem?

Peter

I just declared an array and indexed it.
I have lmem=1024 in the .cubin, and shared memory can't possibly hold that much: I allocated 448 KB of local memory per block.
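Roughly like this (a minimal sketch, not my actual kernel; the names are made up):

__global__ void stackKernel(int *out)
{
    int stack[256];                 // 1 KB per thread; too large for
                                    // registers, so it lands in local
                                    // memory (lmem = 1024 in the .cubin)

    for (int i = 0; i < 256; ++i)
        stack[i] = out[i];

    int sum = 0;
    for (int i = 0; i < 256; ++i)
        sum += stack[out[i] & 255]; // data-dependent indexing keeps the
                                    // compiler from promoting the array
                                    // to registers
    out[threadIdx.x] = sum;
}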

Fair enough. Does the compiler produce ld.local instructions or does it use ld.global when accessing the array?

Peter

ld.local and st.local

Cool. So I guess that 1) you get good memory access performance because the compiler produces code that coalesces, and 2) you help the compiler a lot by reducing register pressure that way. Good work!

Peter

you get good memory access performance because the compiler produces code that coalesces

Maybe that's exactly my case!
Registers are also a side benefit :)
So basically, a local-memory stack is a good choice!

Local memory performance is the same as that of global memory. So, yes, coalescing is very important (up to 10x speedup).

Paulius

Well, glad to get that confirmed…

Now the question is: exactly WHAT results in coalescing for local memory? The address isn't even known.

My guess is: for an int array, writing to exactly the same offset in every thread of a warp results in coalescing, right?

Well, as paulius said, local mem is stored in device mem, in the same space as global mem. So all the coalescing requirements mentioned in the programming guide do apply. In particular, yes, ints aligned to threadIdx will coalesce. I assume the assembler also chooses a suitable start address automatically, so it works.

Peter

I have a hunch that it works a little differently from that. My guess is that local memory is stored so that the local arrays of different threads in a warp are already interleaved to enable memory coalescing.

This would mean that simultaneously accessing local_array[j] in each thread of a warp will coalesce, but accessing local_array[threadIdx.x] will not. Maybe somebody from NVIDIA can confirm this as true or false.
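Concretely, the two patterns I mean (a minimal sketch):

__global__ void patterns(int *out, int j)
{
    int local_array[256];              // per-thread local memory

    for (int i = 0; i < 256; ++i)
        local_array[i] = out[i];

    // Pattern 1: every thread in the warp reads the SAME index j.
    // If the per-thread arrays are interleaved word by word in device
    // memory, these 32 reads fall in one contiguous segment -> coalesced.
    int a = local_array[j];

    // Pattern 2: each thread reads a DIFFERENT index. Under the same
    // interleaved layout these reads scatter -> not coalesced.
    int b = local_array[threadIdx.x];  // assumes blockDim.x <= 256

    out[threadIdx.x] = a + b;
}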

/Pyry

Pyry, that is also a good access pattern.

We could dig into the .ptx to find out. Volunteers?

Peter

I'm afraid the .ptx is useless for that purpose… it's just a .local declaration.

Also, local memory is PER THREAD, not like global pointers, which are SHARED BY THREADS.

I tried to get the address inside the kernel and write it back to the CPU… It turns out one can't get that.

#include <stdio.h>

__global__ void ker0(int *ret)
{
    int lcl[256];
    int thid = threadIdx.x;

    // Force the array into local memory: data-dependent indexing
    // keeps the compiler from optimizing it into registers.
    for (int i = 0; i < 256; ++i)
        lcl[i] = ret[i];
    for (int i = 0; i < 256; ++i)
        ret[0] += lcl[ret[i]];      // races on ret[0], but we only need
                                    // the reads to look unpredictable
    __syncthreads();

    // Return the addresses of a few elements (device pointers are
    // 32-bit on these parts, so the truncating cast is harmless).
    ret[thid]       = (int)&lcl[0];
    ret[thid + 256] = (int)&lcl[1];
    ret[thid + 512] = (int)&lcl[2];
    ret[thid + 768] = (int)&lcl[16];
}

int main()
{
    int *a;
    int b[1024];

    cudaMalloc((void **)&a, 1024 * sizeof(int));
    cudaMemset(a, 0, 1024 * sizeof(int));

    ker0<<<1, 256>>>(a);

    cudaMemcpy(b, a, 1024 * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 256; ++j)
            printf("%08x ", b[i * 256 + j]);
        puts("");
    }
    return 0;
}

The ptx is correct, but the output indicates that local memory starts at address zero for every thread. It seems the hardware, or ptxas, has a few more tricks up its sleeve where local memory is concerned…

That’s exactly what I thought.

Hm, too bad. :( Someone from NVIDIA …?

Peter

I will bite on this one…

Local does start at 0 for every thread (you can take its address and hand it back OK). It seems to be an odd address space: the hardware calculates the device address from a hardware base-address register, held either per block or, more likely, per warp (you cannot tell from the outside), shifts the tid up 7 bits (for warp-based local), and adds it to the base register, so that accesses to 32-bit words in local are always fully coalesced (warp == 128-byte aligned). Address arithmetic always works, and accesses are optimal.
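Schematically, one interleaved mapping consistent with that behavior (the exact bit layout and the per-warp granularity are my guesses):

// Hypothetical device address of 32-bit word wordIdx in the local
// array of lane `lane` (0..31) of a warp. With this interleaving,
// all 32 lanes reading the same wordIdx touch one contiguous,
// 128-byte-aligned segment -> fully coalesced.
unsigned deviceAddr(unsigned warpBase, unsigned lane, unsigned wordIdx)
{
    return warpBase + wordIdx * 32 * 4 + lane * 4;  // warpBase assumed
}                                                   // 128-byte aligned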

As I mentioned elsewhere, the per-thread clock time between reads from device or local memory is only 40 clocks on a GTX at 100% occupancy (only 30 on an 8800 GTS with 900 MHz memory), and that does not leave much time for asadafag to do his random address calculations each loop (I found this with my device-memory benchmark). Running the same code against shared memory will give a 16-way bank conflict on writes (32 clocks each), and given that writes to device memory are asynchronous, it is easily possible to get higher apparent performance from local.
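To see where the 16-way conflict comes from, consider a per-thread stack carved out of shared memory the obvious way (sizes illustrative; launch with blockDim.x <= 64 for this sketch):

// G80 shared memory: 16 banks, 4 bytes wide, conflicts per half-warp.
__global__ void sharedStack(int *out)
{
    __shared__ int stack[64 * 16];            // 16 words per thread

    int *myStack = &stack[threadIdx.x * 16];  // contiguous chunk per thread

    // Word i of each thread's chunk sits at word address
    // threadIdx.x * 16 + i, i.e. stride 16 between threads. Stride 16
    // maps every thread in a half-warp to the SAME bank -> 16-way
    // conflict, serialized into 16 transactions.
    for (int i = 0; i < 16; ++i)
        myStack[i] = threadIdx.x + i;

    out[threadIdx.x] = myStack[threadIdx.x & 15];
}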

Eric
(Since this is something NV doesn't think you need to know, you won't get told.)

Thanks to Eric!
With this data, I’ll be able to optimize my code much better.

I would suggest against using the local memory space explicitly (you'll notice that it is no longer discussed in the Programming Guide). Coalescing requirements are the same as for global memory. Addresses are handled differently, since the space is partitioned differently from global memory. When the compiler makes use of local memory (for example, for register spilling or for large arrays local to kernels), it ensures coalescing.

Osiris, how are you measuring the time between reads? 40 cycles seems high, as one read should not affect the issue of another, independent read. Also, keep in mind that the time between instruction issues and the time before a value is ready are different things.

Paulius