Problem with local memory

I’m having some problems with a kernel: I get correct results, but the kernel is very slow. It seems like each thread uses about 1.6 KB of local memory. I thought the L1 cache on Fermi cards would be fast enough to handle this… if the L1 cache is 16 KB, does that mean only 10 threads can run in parallel on each multiprocessor? My GPU is an Nvidia GTX 480. In each thread I invert a 7x7 matrix, and this uses a lot of registers.

Compiler output for kernel
ptxas info : Used 63 registers, 1644+4 bytes lmem, 8192+0 bytes smem, 128 bytes cmem[0], 54332 bytes cmem[2], 16 bytes cmem[16]

Wow, that’s a lot of local memory. It seems a bit excessive for the inversion of a 7x7 matrix. How many copies of the matrix do you need to store in order to invert it? A matrix of that size is only a couple hundred bytes.
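For scale: if the matrices are double precision, one 7x7 matrix is 49 x 8 = 392 bytes, so 1644 bytes is roughly four matrix-sized buffers per thread. The kernel below is not the poster’s code, just a minimal sketch of the usual pattern: per-thread arrays indexed by loop variables generally end up in local memory, and 49 doubles alone would need 98 32-bit registers while the Fermi limit of 63 registers per thread is already hit here. (The “ptxas info” line above is what nvcc prints when compiled with -Xptxas -v.)

#include <cuda_runtime.h>

#define N 7

// Illustrative only: each thread inverts one 7x7 matrix held in per-thread
// arrays.  Because the row/column indices are loop variables, ptxas typically
// places a[] and inv[] (2 x 49 doubles = 784 bytes) in local memory rather
// than registers.
__global__ void invert7x7(const double* __restrict__ in,
                          double* __restrict__ out, int count)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= count) return;

    double a[N][N];     // working copy of the input matrix
    double inv[N][N];   // accumulates the inverse (starts as identity)

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            a[i][j]   = in[m * N * N + i * N + j];
            inv[i][j] = (i == j) ? 1.0 : 0.0;
        }

    // Gauss-Jordan elimination without pivoting (assumes well-conditioned input).
    for (int k = 0; k < N; ++k) {
        double p = 1.0 / a[k][k];
        for (int j = 0; j < N; ++j) { a[k][j] *= p; inv[k][j] *= p; }
        for (int i = 0; i < N; ++i) {
            if (i == k) continue;
            double f = a[i][k];
            for (int j = 0; j < N; ++j) {
                a[i][j]   -= f * a[k][j];
                inv[i][j] -= f * inv[k][j];
            }
        }
    }

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[m * N * N + i * N + j] = inv[i][j];
}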

No, the L1 cache doesn’t limit how many threads can run. It just means that a lot of your local memory traffic is likely going to get flushed all the way out to DRAM and read back when you run out of cache, which slows things down.

The number one thing to do is up the cache config to 48 KB of L1. See the cache-config functions in the reference manual for more information (if you’re running CUDA 3.2, the new cudaThreadSetCacheConfig call sets the preference device-wide and is simpler than calling cudaFuncSetCacheConfig per kernel).
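A minimal sketch of what that looks like with the runtime API (the kernel name here is just a placeholder, not the poster’s):

#include <cuda_runtime.h>

__global__ void invertKernel(double* data) { /* ... inversion work ... */ }

int main()
{
    // Per-kernel setting: request the 48 KB L1 / 16 KB shared split on Fermi.
    // The 8 KB of shared memory reported by ptxas still fits in the remaining 16 KB.
    cudaFuncSetCacheConfig(invertKernel, cudaFuncCachePreferL1);

    // CUDA 3.2 alternative: set the preference once for the whole device.
    // cudaThreadSetCacheConfig(cudaFuncCachePreferL1);

    // ... allocate memory and launch invertKernel<<<grid, block>>>(...) as usual ...
    return 0;
}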

The number two thing to do is take a close look at the code and try to figure out where all that local memory use is coming from. Perhaps with some reuse of arrays or some clever rethinking of your code structure you can boost the performance significantly.
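One way to act on the array-reuse point, assuming the inversion is Gauss-Jordan-like (an assumption, since the actual code isn’t shown): do the elimination in place, so each thread carries a single 7x7 working array instead of a separate matrix and inverse. The sketch below uses no pivoting and illustrative names; the point is only that one 392-byte array replaces two.

#include <cuda_runtime.h>

#define N 7

// Sketch of the "reuse of arrays" idea: in-place Gauss-Jordan inversion,
// so each thread keeps one 7x7 working array instead of matrix + inverse.
__global__ void invert7x7_inplace(double* mats, int count)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= count) return;

    double a[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = mats[m * N * N + i * N + j];

    // In-place Gauss-Jordan without pivoting: after the k-loop, a[][] holds the inverse.
    for (int k = 0; k < N; ++k) {
        double p = a[k][k];
        for (int j = 0; j < N; ++j)
            if (j != k) a[k][j] /= p;
        for (int i = 0; i < N; ++i) {
            if (i == k) continue;
            double f = a[i][k];
            for (int j = 0; j < N; ++j)
                if (j != k) a[i][j] -= f * a[k][j];
            a[i][k] = -f / p;
        }
        a[k][k] = 1.0 / p;
    }

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            mats[m * N * N + i * N + j] = a[i][j];
}

Checking the new ptxas output after a change like this confirms whether the local memory figure actually drops.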