cudaError_enum error with a lot of shared memory

Hello! I was creating an array

__shared__ float4 table[4][16][16];

which requires 16 KB of shared memory…

This causes the error “cudaError_enum at memory location yada yada”, because this requires too much shared memory.

Two things:

I’m guessing this error occurs because, at run time, the kernel ends up referencing an invalid address? Can’t this error be detected at compile time, since it appears you are adding up the shared memory required by every code snippet (even ones that aren’t being called!)? I lost a lot of time until I did the math and realized I had exceeded the 8K limit.

This was actually in a function that wasn’t even being called anywhere, which means the shared memory was being allocated even though I wasn’t using it. That made it even harder to figure out, since I had to debug a function that never runs :-)


  1. Can’t the compiler figure out the shared memory required for a kernel, excluding functions that kernel never uses? I.e., if kernel A calls B, which calls C, then A requires the summed shared memory of A, B and C, but not that of some other functions X, Y and Z. Isn’t this possible, since every function is inlined?

  2. If not, say I have a whole bunch of helper functions, each requiring 8k, but I know I will only be calling one at a time. How do I get around this problem? Do I have to explicitly create separate .cu files?



(Windows XP, Quadro FX 5600, 0.8 SDK10)

This is a known bug in the current release. It also adds up the shared memory requirements of multiple kernels defined in the same file and stores the sum (!) with each one (take a look at the numbers in the .cubin). You might want to check this as well. NVIDIA said this will be fixed in the next release.


Also, since parameters to global functions are passed via shared memory, they consume some shared memory as well, so you likely won’t be able to allocate the full 16K of shared memory on G80 except for the simplest of kernels (i.e. ones with no inputs).