Kernel fails to run due to too much lmem, but why?

I’m using a GTX 260 and compiling with the -arch sm_13 flag, so I should be good for 16,384 bytes of local memory per thread, right? Basically, if the arrays in my kernel are too large, the kernel fails to run and writes nothing to the output.
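To tell a kernel that never launched from one that ran and wrote nothing, the launch can be checked explicitly. The following is only a minimal sketch: `bigLocalKernel`, `MYSIZE`, and `checkCuda` are hypothetical stand-ins for the real kernel and its array sizes, and `cudaDeviceSynchronize` assumes a toolkit that provides it (older toolkits used `cudaThreadSynchronize` for the same purpose).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical array size: 1024 doubles = 8192 bytes per thread.
#define MYSIZE 1024

// Hypothetical stand-in for the real kernel. A large per-thread array that is
// indexed with a runtime value is usually placed in local memory (lmem).
__global__ void bigLocalKernel(double *out)
{
    double mydata[MYSIZE];
    for (int i = 0; i < MYSIZE; ++i)
        mydata[i] = i * 0.5;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = mydata[(tid * 7) % MYSIZE];
}

// Returns true on success; otherwise prints the CUDA error string.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int threads = 64, blocks = 1;
    double *d_out = NULL;
    if (!checkCuda(cudaMalloc(&d_out, threads * blocks * sizeof(double)),
                   "cudaMalloc"))
        return 1;

    bigLocalKernel<<<blocks, threads>>>(d_out);

    // A launch that could not start is reported by cudaGetLastError.
    bool ok = checkCuda(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel runs surface at the next synchronization.
    ok = checkCuda(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

If the launch is rejected for resource reasons, the message printed by `checkCuda` should say so, which is more informative than an untouched output buffer.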

When I count up the local arrays declared within my kernel (e.g., double mydata[mysize], etc.), they add up to about 8,000 to 9,000 bytes. Compiling with the --cubin flag and finding the kernel in the file gives me the following:

code {
	name = _Z6DHtoH1PdS_S_S_S_S_P9complex64S1_PidS_S1_S1_
	lmem = 8528
	smem = 76
	reg  = 48
	bar  = 0
	const {
			segname = const
			segnum  = 1
			offset  = 0
			bytes   = 96
		mem {

I can launch up to 320 threads per block. But when I compile with the --ptxas-options=-v option, I get:

ptxas info : Compiling entry function '_Z6DHtoH1PdS_S_S_S_S_P9complex64S1_PidS_S1_S1_'

ptxas info : Used 48 registers, 8528+8488 bytes lmem, 76+72 bytes smem, 1936 bytes cmem[0], 96 bytes cmem[1]

Am I limited to 16,384 bytes of lmem in total, counting both figures on the second line of the ptxas output? Here 8528 + 8488 = 17,016 bytes, which exceeds that limit. Where do the extra 8488 bytes come from, and is there any way to lower that number? If I shrink my arrays slightly, I get:

ptxas info : Compiling entry function '_Z6DHtoH1PdS_S_S_S_S_P9complex64S1_PidS_S1_S1_'

ptxas info : Used 48 registers, 7096+7056 bytes lmem, 76+72 bytes smem, 1936 bytes cmem[0], 96 bytes cmem[1]

In this case (7096 + 7056 = 14,152 bytes of lmem) the kernel runs fine.

complex64 is an aligned struct containing two double values.
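The per-thread resource usage that ptxas prints can also be read back at run time, which makes it easier to compare against a budget. This is a rough sketch under stated assumptions: `bigLocalKernel`, `MYSIZE`, `fitsBudget`, and `LMEM_BUDGET` are hypothetical names, the 16,384-byte budget is the figure from the question, and summing the two lmem numbers against that budget is my working hypothesis, not something confirmed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical array size: 1024 doubles = 8192 bytes per thread.
#define MYSIZE 1024

// Hypothetical stand-in for the real kernel, with one large per-thread array.
__global__ void bigLocalKernel(double *out)
{
    double mydata[MYSIZE];
    for (int i = 0; i < MYSIZE; ++i)
        mydata[i] = i * 0.5;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = mydata[(tid * 7) % MYSIZE];
}

// Assumed per-thread local memory budget (the figure from the question).
static const size_t LMEM_BUDGET = 16384;

// True if the given number of lmem bytes fits within the budget.
static bool fitsBudget(size_t lmemBytes, size_t budget)
{
    return lmemBytes <= budget;
}

int main()
{
    // Ask the runtime what the compiled kernel needs per thread.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, bigLocalKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread    : %d\n", attr.numRegs);
    printf("local memory per thread : %lu bytes\n",
           (unsigned long)attr.localSizeBytes);
    printf("static shared memory    : %lu bytes\n",
           (unsigned long)attr.sharedSizeBytes);
    printf("max threads per block   : %d\n", attr.maxThreadsPerBlock);

    // The two ptxas reports from the question, summed under the assumption
    // that both lmem figures count against the same per-thread budget.
    printf("8528 + 8488 fits budget : %s\n",
           fitsBudget(8528 + 8488, LMEM_BUDGET) ? "yes" : "no");
    printf("7096 + 7056 fits budget : %s\n",
           fitsBudget(7096 + 7056, LMEM_BUDGET) ? "yes" : "no");
    return 0;
}
```

Under that assumption the first configuration (17,016 bytes) is over budget and the second (14,152 bytes) is under it, which matches which build runs and which does not.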