Hello! I was creating an array:

__shared__ float4 table[1024];   // 1024 * sizeof(float4) = 16 KB

which requires 16 KB of shared memory.
This causes the error “cudaError_enum at memory location yada yada”, because it requires too much shared memory.
I’m guessing this error only shows up at run time, when the kernel ends up referencing lalaland? Can’t it be detected at compile time, since it appears the compiler adds up the shared memory required by every code snippet (even ones that aren’t being called!)? I lost a lot of time until I did the math and realized I had exceeded the 8K limit.
This was actually in a function that wasn’t even being called anywhere, so the shared memory was being allocated even though I wasn’t using it. That made it even harder to figure out, since I had to debug a function that never runs :-)
Can’t the compiler figure out the shared memory required for a kernel, not including functions that won’t be used by that kernel? i.e., kernel A calls B, which calls C, so kernel A requires the summed shared memory of A, B and C, but not that of some other functions X, Y and Z? Isn’t this possible since you inline every function?
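For example, here’s what I’d expect (function names and sizes here are just made up for illustration):

```cuda
__device__ void C(void) { __shared__ float bufC[512]; /* 2 KB */ }
__device__ void B(void) { __shared__ float bufB[512]; C(); }   // 2 KB + C's
__global__ void A(void) { __shared__ float bufA[512]; B(); }   // 2 KB + B's

// An unrelated helper that kernel A never calls:
__device__ void Z(void) { __shared__ float bufZ[2048]; /* 8 KB */ }
```

I’d expect kernel A to need 6 KB of shared memory (bufA + bufB + bufC after inlining), with Z’s 8 KB not counted against it.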
If not, say I have a whole bunch of helper functions, each requiring 8k, but I know I will only be calling one at a time. How do I get around this problem? Do I have to explicitly create separate .cu files?
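One workaround I’m wondering about (just a sketch, all names made up): size a single shared buffer for the largest helper and pass it to whichever helper actually runs, so the helpers reuse one allocation instead of each declaring its own:

```cuda
#define SCRATCH_ELEMS 512   // 512 * sizeof(float4) = 8 KB

__device__ void helperA(float4 *scratch) { /* works in scratch, up to 8 KB */ }
__device__ void helperB(float4 *scratch) { /* works in scratch, up to 8 KB */ }

__global__ void kernel(int which)
{
    __shared__ float4 scratch[SCRATCH_ELEMS];   // allocated once per block
    if (which == 0) helperA(scratch);
    else            helperB(scratch);
}
```

Would that keep the total at 8 KB instead of 8 KB per helper, or do I really need separate .cu files?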
(Windows XP, Quadro FX 5600, 0.8 SDK10)