Nvlink error: File uses too much global constant data (0x100a8 bytes, 0x10000 max)

Hi everyone,
I am new to CUDA; many thanks in advance for any help.

I am trying to add a couple of features to an already existing repo and I got the error in the title.
From other posts on the forum I understood that I have to check whether I am using too much cmem.

I am compiling for target sm_61, and if I compare the ptxas info of the repo before and after my changes, the only difference is 16 bytes of cmem[0], but I cannot find the maximum possible amount of cmem[0] documented anywhere.

You’re limited to 64K bytes of __constant__ data. The first thing I would suggest is to inspect your code. Do you have __constant__ declarations in your code? If so, those declarations must all be for a compile-time constant size. So it should be possible to total up that kind of usage through code inspection.
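For illustration, this is the kind of declaration to look for and how the usage can be totaled by inspection (names and sizes here are hypothetical):

```cuda
// Hypothetical example: each __constant__ declaration must have a
// compile-time-constant size, so total usage can be summed by hand.
__constant__ float  coeffs[1024];   // 1024 * 4 =  4096 bytes
__constant__ double lut[2048];      // 2048 * 8 = 16384 bytes
__constant__ int    params[16];     //   16 * 4 =    64 bytes
// Running total: 4096 + 16384 + 64 = 20544 of the 65536-byte (64 KB) limit.
```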

Thanks for your reply!
So the 64 KB refer to all the X “types” of cmem[X] together?
There were some __constant__ data already declared in the repo (which was compiling with no issues for sm_61); I haven’t added any other __constant__ declarations.

The 64KB are in reference to the one particular cmem[X] bank that is used for __constant__ data. There are other constant banks, e.g. the constant bank used to pass arguments to __global__ functions.

Note that all code that lives in user space can contribute to __constant__ usage. In particular this is also true for code in the CUDA standard math library. For example, the trigonometric functions require a small table of __constant__ data for argument reduction.

So, if your code changes not only added 16 bytes of __constant__ data, but also introduced a call to a double-precision trig function where no such function was used before, that could explain the difference in constant bank usage.


The size of the table used by the double-precision trig functions seems to be 144 bytes (= 0x90 in hex) and the corresponding table for the single-precision trig functions seems to be 20 bytes (= 0x14 in hex). These store a π-related constant to the precision necessary for a correct argument reduction. Other math functions might call trig functions, or employ small tables of their own, so this is not intended to be a comprehensive list of all details.
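As a concrete illustration of the scenario described above (a hypothetical kernel; exact table sizes may vary between CUDA versions):

```cuda
// Hypothetical kernel: calling a double-precision trig function pulls in
// the CUDA math library's argument-reduction table (~144 bytes of
// __constant__ data on the toolchain discussed here), even though this
// file declares no __constant__ data of its own.
__global__ void phase_kernel(double *out, const double *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = sin(in[i]);  // links in the argument-reduction table
    }
}
```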

Thanks a lot for your reply njuffa, every piece of information is gold!
I am trying to figure out what is happening.
I think I have only used more cmem[0] because I passed an extra pointer as input to a kernel.

As I stated, kernel arguments are placed in a different constant bank than __constant__ data. It would be highly unusual to pass 64 KB of kernel arguments.

The assignment of constant banks differs somewhat by GPU architecture. Looking at some disassembled code, for sm_61, kernel arguments appear to go into c[0x0] (constant bank 0), while __constant__ data appears to go into c[0x3] (constant bank 3), and compiler assigned constant data appears to go into c[0x2] (constant bank 2).

Unless the code is passing a huge amount of data to a kernel by value (this is not considered good practice in my book), I am not sure how one would overflow constant bank 0. Can you provide details of what the kernel invocation(s) look like, or (better yet) show a reproducer code that demonstrates the problem?
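To illustrate the by-value concern (hypothetical names; a sketch, not taken from the code under discussion):

```cuda
// Passing a large struct by value consumes kernel-argument space in
// cmem[0], while passing a pointer costs only 8 bytes there.
struct BigConfig { double table[512]; };   // 4096 bytes

// By value: all 4096 bytes land in the kernel-argument constant bank.
__global__ void kernel_by_value(BigConfig cfg, float *out);

// By pointer: only the 8-byte pointer goes into cmem[0]; the struct
// itself lives in global memory (copied there by the host beforehand).
__global__ void kernel_by_pointer(const BigConfig *cfg, float *out);
```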

From what I can see, on sm_61 the first 0x140 (= 320 decimal) bytes of c[0x0] are taken up by internal CUDA data, followed by programmer-specified kernel arguments.
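One way to observe this layout for oneself (assuming a compiled object file is at hand; the file name is a placeholder):

```shell
# Disassemble the SASS and look at the constant-bank offsets referenced:
# on sm_61, kernel arguments show up starting at c[0x0][0x140].
cuobjdump -sass myapp.o | grep 'c\[0x0\]'
```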

Thanks a lot again Njuffa, I have several versions of the code.
Looking at the ptxas info of the last working version and the current (non-working) one, the difference is just 2×8 bytes of cmem[0].
All the cmem[3] info is exactly the same.
For this reason, based on what I understood from your explanations, I thought it was related to kernel arguments.

However, I managed to avoid passing the extra arguments that led to the extra cmem[0] data, but unfortunately that didn’t solve the issue (the cmem[3] data in the ptxas info is still unchanged).

Probably I am missing something.

The kernel invocation looks like:

mykernel<<<getBlockCount(), getBlockDim()>>>(
    size, myStruct, myConfig, steps);

I cannot reproduce the error unfortunately.

Anyway, as I stated above, the cmem[0], cmem[2], and cmem[3] figures in the ptxas info now look exactly the same as in the previous version of the code, so the issue does not seem to be related to those.

In your experience, could this error arise for other reasons?

Well, I can now partially answer myself:

Even if all the cmem[X] info looks exactly the same, two versions of the code can apparently still differ in constant memory allocation and thus trigger the nvlink error.

I’m sorry, but I didn’t get your point:

are you referring to this previous comment?

Because then njuffa said:

And so now I’m confused.

Sorry, I deleted my previous comment, I was confused. I think if your application is linked with -rdc=true and you have device code in multiple modules linked together, it’s possible to run into this when the aggregate usage by a particular routine (which is really only discovered at link time) exceeds the 64K limit. If you run an nvcc compile command with the --verbose switch, you will discover that the nvlink phase runs after the ptxas phase (which makes sense), so ptxas cannot/does not know about all aspects of routine resource usage, and the ptxas output by itself may therefore appear to be “inconsistent” with the nvlink results.
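To see that phase ordering for yourself (a sketch; file names and flags are illustrative):

```shell
# With relocatable device code, ptxas runs per translation unit and
# nvlink runs afterwards over all of them, so only nvlink sees the
# aggregate __constant__ usage across modules.
nvcc --verbose -rdc=true -arch=sm_61 -c a.cu -o a.o
nvcc --verbose -rdc=true -arch=sm_61 -c b.cu -o b.o
nvcc --verbose -rdc=true -arch=sm_61 a.o b.o -o app   # nvlink runs here
```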

While the linker looks at the aggregate usage of all linked modules, the only two ways for this value to increase would appear to be (1) A module that uses constant memory is now linked in that was not linked in before (2) One or more of the modules being linked in has increased constant memory usage as shown by the diagnostic output of ptxas.

In other words, changes in the constant memory usage of the linked executable should be reflective of changes in the constant memory use of the constituent object files. I am not aware of a third way for constant memory usage in the executable to increase.

Does the build use the new link-time optimization feature (-dlto)?

thanks @Robert_Crovella and @njuffa for your replies.

To answer both of you I don’t see -rdc or -dlto given to the compiler.

I will try to understand whether this is what happens; maybe this is my case, and then it would make perfect sense to me.

I don’t think we are going to make any progress here in resolving this issue as neither the code base nor the build details are available to us. The natural outcome of which is speculation.

If this were my code base, I would revert to the original unproblematic code base, then re-introduce any changes in increasingly smaller deltas until the issue re-appears. In this way it should be possible to narrow down the problem to a small amount of code (ideally a single line of code), the effects of which can then be examined in detail.
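If the repo is under git, the delta-narrowing described above can be partly automated (a sketch; the build command is a placeholder for whatever the project actually uses):

```shell
# Mark the last known-good and first known-bad commits, then let git
# bisect pick intermediate commits; a build that hits the nvlink error
# marks a commit as bad.
git bisect start
git bisect bad HEAD
git bisect good last-working-tag
git bisect run sh -c \
  'make 2>&1 | grep -q "too much global constant data" && exit 1 || exit 0'
```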

While some weird issue in the toolchain is a possibility [*], the likely cause of the over-utilization of constant memory is in the code base itself. In any event there is now fair warning that the code as-is has maxed out the constant memory usage and might be in need of a partial redesign to prevent repeat occurrences of the issue at hand.

[*] If you are not using the latest CUDA version, you may want to try switching to that.

Thanks a lot again for the help @njuffa

Unfortunately I cannot share the code, otherwise I would have already sent it.

I’m trying to do this when I have spare time; for now I have partially fixed the issue by resizing a couple of constant buffers.
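Concretely, the resizing looked something like this (names and sizes are illustrative, not the actual ones):

```cuda
// Before: two buffers whose combined size left no headroom in the
// 64 KB __constant__ bank once the math library's tables were added.
// __constant__ double big_lut[4096];     // 32768 bytes
// __constant__ double other_lut[4096];   // 32768 bytes -> 65536 total

// After: trimmed to what is actually used, leaving room for the
// math library's own __constant__ data.
__constant__ double big_lut[3072];        // 24576 bytes
__constant__ double other_lut[3072];      // 24576 bytes
```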

I am still new to all of this, but at least now I have more knowledge to figure out what is going on!