Entry function uses too much local data

Hi, I get the compiler error:

ptxas error : Entry function ‘ppush_kernel’ uses too much local data (0x30d40 bytes, 0x4000 max)

‘ppush_kernel’ is a global subroutine that is called by the main host program.

It takes 11 largish device arrays (50,000 elements of type real each) as parameters. These arrays are declared in a .h file that is included in a module file used by the main program. However, the ppush_kernel subroutine itself does not use this module.

Within the ppush_kernel subroutine I declare another 50,000 element (local) array. I am assuming it is automatically shared amongst threads, and so it is only stored in memory once, though I never explicitly declared it as a shared array.

I’m wondering which arrays are causing the problem, and what the solutions are.

Another question: how does the compiler know the memory limits of the GPU I am using? And if I compile on a machine different from the one I actually run on, could this cause a problem (e.g., when submitting a job to a cluster)?

Within the ppush_kernel subroutine I declare another 50,000 element (local) array. I am assuming it is automatically shared amongst threads, and so it is only stored in memory once, though I never explicitly declared it as a shared array.

Is this an automatic array whose size is set via the kernel’s execution configuration? If not, then you do need to explicitly add the “shared” attribute. Otherwise, each thread will get its own local copy of this array. Given the size 0x30d40 (i.e., 200,000 bytes, or 50,000 elements times 4 bytes per element), which is well over the 0x4000 (16,384-byte) maximum reported in the message, I’m guessing this is the problem.
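As a quick sanity check on that arithmetic, here is a small Python sketch that decodes the two hex values from the ptxas message. The 4-byte element size is an assumption (default real, no -r8 or similar); the variable names are just for illustration.

```python
# Values copied from the ptxas error message in the question.
used_bytes = 0x30d40    # local data the kernel asks for
limit_bytes = 0x4000    # maximum ptxas allows for this target

# Assumption: a default real occupies 4 bytes.
BYTES_PER_REAL = 4

# Number of real elements that would account for the reported size.
elements = used_bytes // BYTES_PER_REAL

print(used_bytes)                 # 200000
print(limit_bytes)                # 16384
print(elements)                   # 50000
print(used_bytes > limit_bytes)   # True
```

The 50,000 elements line up with the single local array declared inside the kernel, which suggests that array is the culprit rather than the 11 device arrays passed in as arguments.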

Another question: how does the compiler know the memory limits of the GPU I am using? And if I compile on a machine different from the one I actually run on, could this cause a problem (e.g., when submitting a job to a cluster)?

If I remember correctly, for the older Tesla (CC 1.3) cards, ptxas enforced the local memory size limit. I think these limits were lifted, or moved to a runtime check, when targeting newer devices.

By default, we target multiple devices, including CC 1.3. Try targeting a newer device such as CC 3.5 with “-Mcuda=cc35,cuda5.5” to see if it works around this limit.

  • Mat