Nvlink error: File uses too much global constant data (0x100a8 bytes, 0x10000 max)

Hi everyone,
I am new to CUDA; many thanks in advance for any help.

I am trying to add a couple of features to an already existing repo and I got the error in the title.
From other posts on the forum I understood that I have to check whether I am using too much cmem.

I am compiling for target sm_61, and if I compare the ptxas info of the repo before and after my changes, the only difference is 16 bytes of cmem[0], but I cannot find the maximum possible amount of cmem[0] documented anywhere.

You’re limited to 64K bytes of __constant__ data. The first thing I would suggest is to inspect your code. Do you have __constant__ declarations in your code? If so, those declarations must all be for a compile-time constant size. So it should be possible to total up that kind of usage through code inspection.
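For illustration, this is the kind of declaration to look for and how the usage can be totaled by inspection (names and sizes here are hypothetical):

```cuda
// Hypothetical example: each __constant__ declaration must have a
// compile-time-constant size, so total usage can be summed by hand.
__constant__ float  coeffs[1024];   // 1024 * 4 =  4096 bytes
__constant__ double lut[2048];      // 2048 * 8 = 16384 bytes
__constant__ int    params[16];     //   16 * 4 =    64 bytes
// Running total: 4096 + 16384 + 64 = 20544 of the 65536-byte (64 KB) limit.
```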

Thanks for your reply!
So the 64 KB refer to all the X “types” of cmem[X] together?
There were some __constant__ data already declared in the repo (which was compiling with no issues for sm_61); I haven’t added any other __constant__ declarations.

The 64KB are in reference to the one particular cmem[X] bank that is used for __constant__ data. There are other constant banks, e.g. the constant bank used to pass arguments to __global__ functions.

Note that all code that lives in user space can contribute to __constant__ usage. In particular this is also true for code in the CUDA standard math library. For example, the trigonometric functions require a small table of __constant__ data for argument reduction.

So, if your code changes not only added 16 bytes of __constant__ data, but also introduced a call to a double-precision trig function where no such function was used before, that could explain the difference in constant bank usage.


The size of the table used by the double-precision trig functions seems to be 144 bytes (= 0x90 in hex) and the corresponding table for the single-precision trig functions seems to be 20 bytes (= 0x14 in hex). These store a π-related constant to the precision necessary for a correct argument reduction. Other math functions might call trig functions, or employ small tables of their own, so this is not intended to be a comprehensive list of all details.
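As a concrete illustration of the scenario described above (a hypothetical kernel; exact table sizes may vary between CUDA versions):

```cuda
// Hypothetical kernel: calling a double-precision trig function pulls in
// the CUDA math library's argument-reduction table (~144 bytes of
// __constant__ data on the toolchain discussed here), even though this
// file declares no __constant__ data of its own.
__global__ void phase_kernel(double *out, const double *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = sin(in[i]);  // links in the argument-reduction table
    }
}
```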

Thanks a lot for your reply njuffa, every piece of information is gold!
I am trying to figure out what is happening.
I think I have only used more cmem[0] because I passed an extra pointer as input to a kernel.

As I stated, kernel arguments are placed in a different constant bank than __constant__ data. It would be highly unusual to pass 64 KB of kernel arguments.

The assignment of constant banks differs somewhat by GPU architecture. Looking at some disassembled code, for sm_61, kernel arguments appear to go into c[0x0] (constant bank 0), while __constant__ data appears to go into c[0x3] (constant bank 3), and compiler assigned constant data appears to go into c[0x2] (constant bank 2).

Unless the code is passing a huge amount of data to a kernel by value (this is not considered good practice in my book), I am not sure how one would overflow constant bank 0. Can you provide details of what the kernel invocation(s) look like, or (better yet) show a reproducer code that demonstrates the problem?
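To illustrate the by-value concern (hypothetical names; a sketch, not taken from the code under discussion):

```cuda
// Passing a large struct by value consumes kernel-argument space in
// cmem[0], while passing a pointer costs only 8 bytes there.
struct BigConfig { double table[512]; };   // 4096 bytes

// By value: all 4096 bytes land in the kernel-argument constant bank.
__global__ void kernel_by_value(BigConfig cfg, float *out);

// By pointer: only the 8-byte pointer goes into cmem[0]; the struct
// itself lives in global memory (copied there by the host beforehand).
__global__ void kernel_by_pointer(const BigConfig *cfg, float *out);
```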

From what I can see, on sm_61 the first 0x140 (= 320 decimal) bytes of c[0x0] are taken up by internal CUDA data, followed by programmer-specified kernel arguments.
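One way to observe this layout for oneself (assuming a compiled object file is at hand; the file name is a placeholder):

```shell
# Disassemble the SASS and look at the constant-bank offsets referenced:
# on sm_61, kernel arguments show up starting at c[0x0][0x140].
cuobjdump -sass myapp.o | grep 'c\[0x0\]'
```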

Thanks a lot again Njuffa, I have several versions of the code.
Looking at the ptxas info of the last working version and the current (non-working) one, the difference is just 2×8 bytes of cmem[0].
All the cmem[3] info is exactly the same.
For this reason, based on what I understood from your explanations, I thought it was related to kernel arguments.

However, I managed to avoid passing the extra arguments that led to the extra cmem[0] data, but unfortunately that didn’t solve the issue (the cmem[3] data in the ptxas info is still unchanged).

Probably I am missing something.

The kernel invocation looks like:

mykernel<<<getBlockCount(), getBlockDim()>>>(
    size, myStruct, myConfig, steps);

I cannot reproduce the error unfortunately.

Anyway, as I stated above, the cmem[0], cmem[2], and cmem[3] figures in the ptxas info now look exactly the same as in the previous version of the code, so the issue does not seem to be related to those.

In your experience, could this error arise for other reasons?

Well, I can now partially answer myself:

Even if all the cmem[X] info looks exactly the same, two versions of the code can apparently still differ in constant memory allocation and thus trigger the nvlink error.

I’m sorry, but I didn’t get your point:

are you referring to this previous comment?

Because then njuffa said:

And so now I’m confused.

Sorry, I deleted my previous comment, I was confused. I think if your application is linked with -rdc=true and you have device code in multiple modules linked together, it’s possible to run into this when the aggregate usage by a particular routine (which is really only discovered at link time) exceeds the 64K limit. If you run an nvcc compile command with the --verbose switch, you will discover that the nvlink phase runs after the ptxas phase (which makes sense), so ptxas cannot/does not know about all aspects of routine resource usage, and the ptxas output by itself may therefore appear to be “inconsistent” with the nvlink results.
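To see that phase ordering for yourself (a sketch; file names and flags are illustrative):

```shell
# With relocatable device code, ptxas runs per translation unit and
# nvlink runs afterwards over all of them, so only nvlink sees the
# aggregate __constant__ usage across modules.
nvcc --verbose -rdc=true -arch=sm_61 -c a.cu -o a.o
nvcc --verbose -rdc=true -arch=sm_61 -c b.cu -o b.o
nvcc --verbose -rdc=true -arch=sm_61 a.o b.o -o app   # nvlink runs here
```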

While the linker looks at the aggregate usage of all linked modules, the only two ways for this value to increase would appear to be (1) A module that uses constant memory is now linked in that was not linked in before (2) One or more of the modules being linked in has increased constant memory usage as shown by the diagnostic output of ptxas.

In other words, changes in the constant memory usage of the linked executable should be reflective of changes in the constant memory use of the constituent object files. I am not aware of a third way for constant memory usage in the executable to increase.

Does the build use the new link-time optimization feature (-dlto)?

thanks @Robert_Crovella and @njuffa for your replies.

To answer both of you I don’t see -rdc or -dlto given to the compiler.

I will try to understand whether this is what happens; maybe this is my case, and then it would make perfect sense to me.

I don’t think we are going to make any progress here in resolving this issue as neither the code base nor the build details are available to us. The natural outcome of which is speculation.

If this were my code base, I would revert to the original unproblematic code base, then re-introduce any changes in increasingly smaller deltas until the issue re-appears. In this way it should be possible to narrow down the problem to a small amount of code (ideally a single line of code), the effects of which can then be examined in detail.
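If the repo is under git, the delta-narrowing described above can be partly automated (a sketch; the build command is a placeholder for whatever the project actually uses):

```shell
# Mark the last known-good and first known-bad commits, then let git
# bisect pick intermediate commits; a build that hits the nvlink error
# marks a commit as bad.
git bisect start
git bisect bad HEAD
git bisect good last-working-tag
git bisect run sh -c \
  'make 2>&1 | grep -q "too much global constant data" && exit 1 || exit 0'
```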

While some weird issue in the toolchain is a possibility [*], the likely cause of the over-utilization of constant memory is in the code base itself. In any event there is now fair warning that the code as-is has maxed out the constant memory usage and might be in need of a partial redesign to prevent repeat occurrences of the issue at hand.

[*] If you are not using the latest CUDA version, you may want to try switching to that.

Thanks a lot again for the help @njuffa

Unfortunately I cannot share the code, otherwise I would have already sent it.

I’m trying to do this when I have spare time; for now I have partially fixed the issue by resizing a couple of constant buffers.
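Concretely, the resizing looked something like this (names and sizes are illustrative, not the actual ones):

```cuda
// Before: two buffers whose combined size left no headroom in the
// 64 KB __constant__ bank once the math library's tables were added.
// __constant__ double big_lut[4096];     // 32768 bytes
// __constant__ double other_lut[4096];   // 32768 bytes -> 65536 total

// After: trimmed to what is actually used, leaving room for the
// math library's own __constant__ data.
__constant__ double big_lut[3072];        // 24576 bytes
__constant__ double other_lut[3072];      // 24576 bytes
```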

I am still new to all of this, but at least now I have more knowledge to figure out what is going on!