How fast is local memory? The doc doesn't say much

Today I wrote a local-memory-as-stack kernel just for fun, and found it surprisingly fast!
It allocates 1 KB of local memory per thread and reads and writes about 10-100 dwords of it repeatedly, in all kinds of weird places like branches and loops. As far as I'm aware, all of this goes directly against the doc's performance section. Nevertheless, the thing nearly outperformed my entirely-shared-memory version.
Has anyone else ever benchmarked kernels that use local memory this extensively?
Or could the NVIDIA guys give some further explanation?

It seems that if a lot of threads write exactly the same thing to the same offset in local memory, the write gets optimized. Is my guess correct?

I corrected my own bugs, and now local memory is insanely fast indeed!
NVIDIA should have said that in the doc!
Now all my hard work to reduce memory usage turns out to have been sheer stupidity…

asadafag, how are you allocating the local memory? The local keyword is deprecated in 1.0. Did you check the .ptx to see whether nvopencc hasn't actually turned it into shared mem?

Peter

I just declared an array and indexed it.
I have lmem=1024 in the .cubin, and shared memory can't possibly hold that much: I allocated 448 KB of local memory per block.
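Roughly like this (a minimal sketch, not my actual kernel; the names are made up):

__global__ void stackKernel(int *out)
{
    int stack[256];                 // 1 KB per thread; too large for
                                    // registers, so it lands in local
                                    // memory (lmem = 1024 in the .cubin)

    for (int i = 0; i < 256; ++i)
        stack[i] = out[i];

    int sum = 0;
    for (int i = 0; i < 256; ++i)
        sum += stack[out[i] & 255]; // data-dependent indexing keeps the
                                    // compiler from promoting the array
                                    // to registers
    out[threadIdx.x] = sum;
}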

Fair enough. Does the compiler produce ld.local instructions or does it use ld.global when accessing the array?

Peter

ld.local and st.local

Cool. So I guess that 1) you get good memory access performance because the compiler produces code that coalesces, and 2) you help the compiler a lot by reducing register pressure that way. Good work!

Peter

you get good memory access performance because the compiler produces code that coalesces

Maybe that's exactly my case!
Registers are also a side benefit :)
So basically, a local-memory stack is a good choice!

Local memory performance is the same as that of global memory. So, yes, coalescing is very important (up to 10x speedup).

Paulius

Well, glad to get that confirmed…

Now the question is: exactly WHAT results in coalescing for local memory? The address isn't even known.

My guess is: for an int array, writing to exactly the same offset in every thread of a warp results in coalescing, right?

Well, as paulius said, local mem is stored in device mem, in the same space as global mem. So all the coalescing requirements mentioned in the programming guide do apply. In particular, yes, ints aligned to threadIdx will coalesce. I assume the assembler also chooses a suitable start address automatically, so it works.

Peter

I have a hunch that it works a little differently from that. My guess is that local memory is stored so that the local arrays of different threads in a warp are already interleaved to enable memory coalescing.

This would mean that simultaneously accessing local_array[j] in each thread of a warp will coalesce, but accessing local_array[threadIdx.x] will not. Maybe somebody from NVIDIA can confirm this as true or false.
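Concretely, the two patterns I mean (a minimal sketch):

__global__ void patterns(int *out, int j)
{
    int local_array[256];              // per-thread local memory

    for (int i = 0; i < 256; ++i)
        local_array[i] = out[i];

    // Pattern 1: every thread in the warp reads the SAME index j.
    // If the per-thread arrays are interleaved word by word in device
    // memory, these 32 reads fall in one contiguous segment -> coalesced.
    int a = local_array[j];

    // Pattern 2: each thread reads a DIFFERENT index. Under the same
    // interleaved layout these reads scatter -> not coalesced.
    int b = local_array[threadIdx.x];  // assumes blockDim.x <= 256

    out[threadIdx.x] = a + b;
}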

/Pyry

Pyry, that is also a good access pattern.

We could dig into the .ptx to find out. Volunteers?

Peter

I'm afraid the .ptx is useless for that purpose… it's just a .local declaration.

Also, local memory is PER THREAD, not like global pointers, which are SHARED BY THREADS.

I tried to get the address inside the kernel and write it back to the CPU… It turns out one can't get that.

#include <stdio.h>

__global__ void ker0(int *ret)
{
    int lcl[256];
    int thid = threadIdx.x;

    // Force the array into local memory: data-dependent indexing
    // keeps the compiler from optimizing it into registers.
    for (int i = 0; i < 256; ++i)
        lcl[i] = ret[i];
    for (int i = 0; i < 256; ++i)
        ret[0] += lcl[ret[i]];      // races on ret[0], but we only need
                                    // the reads to look unpredictable
    __syncthreads();

    // Return the addresses of a few elements (device pointers are
    // 32-bit on these parts, so the truncating cast is harmless).
    ret[thid]       = (int)&lcl[0];
    ret[thid + 256] = (int)&lcl[1];
    ret[thid + 512] = (int)&lcl[2];
    ret[thid + 768] = (int)&lcl[16];
}

int main()
{
    int *a;
    int b[1024];

    cudaMalloc((void **)&a, 1024 * sizeof(int));
    cudaMemset(a, 0, 1024 * sizeof(int));

    ker0<<<1, 256>>>(a);

    cudaMemcpy(b, a, 1024 * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 256; ++j)
            printf("%08x ", b[i * 256 + j]);
        puts("");
    }
    return 0;
}

The ptx is correct, but the output indicates that local memory starts at address zero for every thread. It seems the hardware, or ptxas, has a few more tricks up its sleeve where local memory is concerned…

That’s exactly what I thought.

Hm, too bad. :( Someone from NVIDIA …?

Peter

I will bite on this one…

Local does start at 0 for every thread (you can take its address and hand it back OK). It seems to be an odd address space: the hardware calculates the device address from a hardware base-address register, held either per block or, more likely, per warp (you cannot tell from the outside), shifts the tid up 7 bits (for warp-based local), and adds it to the base register, so that accesses to 32-bit words in local are always fully coalesced (warp == 128-byte aligned). Address arithmetic always works, and accesses are optimal.
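Schematically, one interleaved mapping consistent with that behavior (the exact bit layout and the per-warp granularity are my guesses):

// Hypothetical device address of 32-bit word wordIdx in the local
// array of lane `lane` (0..31) of a warp. With this interleaving,
// all 32 lanes reading the same wordIdx touch one contiguous,
// 128-byte-aligned segment -> fully coalesced.
unsigned deviceAddr(unsigned warpBase, unsigned lane, unsigned wordIdx)
{
    return warpBase + wordIdx * 32 * 4 + lane * 4;  // warpBase assumed
}                                                   // 128-byte aligned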

As I mentioned elsewhere, the per-thread clock time between reads from device or local memory is only 40 clocks on a GTX at 100% occupancy (only 30 on an 8800 GTS with 900 MHz memory), and that does not leave much time for asadafag to do his random address calculations each loop (I found this with my device-memory benchmark). Running the same code against shared memory will give a 16-way bank conflict on writes (32 clocks each), and given that writes to device memory are asynchronous, it is easily possible to get higher apparent performance from local.
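To see where the 16-way conflict comes from, consider a per-thread stack carved out of shared memory the obvious way (sizes illustrative; launch with blockDim.x <= 64 for this sketch):

// G80 shared memory: 16 banks, 4 bytes wide, conflicts per half-warp.
__global__ void sharedStack(int *out)
{
    __shared__ int stack[64 * 16];            // 16 words per thread

    int *myStack = &stack[threadIdx.x * 16];  // contiguous chunk per thread

    // Word i of each thread's chunk sits at word address
    // threadIdx.x * 16 + i, i.e. stride 16 between threads. Stride 16
    // maps every thread in a half-warp to the SAME bank -> 16-way
    // conflict, serialized into 16 transactions.
    for (int i = 0; i < 16; ++i)
        myStack[i] = threadIdx.x + i;

    out[threadIdx.x] = myStack[threadIdx.x & 15];
}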

Eric
(Since this is something NV doesn't think you need to know, you won't get told.)

Thanks to Eric!
With this data, I’ll be able to optimize my code much better.

I would suggest against using the local memory space explicitly (you'll notice that it is no longer discussed in the Programming Guide). Coalescing requirements are the same as for global memory. Addresses are handled differently, since the space is partitioned differently from global memory. When the compiler makes use of local memory (for example, for register spilling or for large arrays local to kernels), it ensures coalescing.

Osiris, how are you measuring the time between reads? 40 cycles seems high, as one read should not affect the issue of another, independent read. Also, keep in mind that the time between instruction issues and the time before a value is ready are different things.

Paulius