The CUDA 11.1 toolchain seems to handle more than 64 KB of literal constants just fine.

What I see for `sm_70` is that the constants that don’t fit into the constant bank are loaded using two moves from immediate: a `MOV Rn, 0xnnnnnnnn` for the least significant 32 bits, followed by an `IMAD.MOV.U32 Rn+1, RZ, RZ, 0xnnnnnnnn` for the most significant 32 bits.

For `sm_75` and `sm_80`, literal `double` constants that don’t fit into the constant bank are loaded with two consecutive `UMOV Rn, 0xnnnnnnnn` instructions.
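To make the two-immediate sequence concrete, here is a minimal host-side sketch of how a `double` literal splits into the two 32-bit values that would appear as immediates. It assumes IEEE-754 binary64 and the low-half-first order described above; `ImmediatePair` and `split_double` are hypothetical names used only for illustration, not part of any CUDA API.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the two 32-bit immediates for one double literal.
struct ImmediatePair {
    std::uint32_t lo;  // least significant 32 bits (first move)
    std::uint32_t hi;  // most significant 32 bits (second move)
};

// Split the bit pattern of a double into its low and high 32-bit halves.
// Assumes double is a 64-bit IEEE-754 binary64 value.
inline ImmediatePair split_double(double value) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof(bits));  // well-defined type punning
    ImmediatePair pair;
    pair.lo = static_cast<std::uint32_t>(bits & 0xffffffffu);
    pair.hi = static_cast<std::uint32_t>(bits >> 32);
    return pair;
}
```

For example, `split_double(1.0)` yields `lo = 0x00000000` and `hi = 0x3ff00000`, so a literal `1.0` that spilled out of the constant bank would be materialized from those two immediates.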

The performance impact from having to use immediate loads instead of constant bank references is probably (conjecture!) minimal for double-precision code. For single-precision code, the compiler should be able to fit many single-precision floating-point literals into the arithmetic instructions themselves, so they don’t take up space in the constant bank, and twice as many constants fit into the constant bank in any case. Where that isn’t sufficient, a noticeable impact from the increased dynamic instruction count due to immediate load instructions seems likely.
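As a rough back-of-the-envelope sketch of the instruction-count argument: the code below assumes, purely for illustration, that a full 64 KB bank is available for literals (in practice other data may share it) and that each spilled literal costs one immediate move per 32 bits. The names `kBankBytes`, `literals_that_fit`, and `extra_immediate_moves` are hypothetical.

```cpp
#include <cstddef>

// Assumption: the entire 64 KB bank is available for literal constants.
const std::size_t kBankBytes = 64 * 1024;

// Number of literals of the given size (in bytes) that fit into the bank.
inline std::size_t literals_that_fit(std::size_t literal_size) {
    return kBankBytes / literal_size;
}

// Upper bound on extra immediate-move instructions for literals that spill:
// one 32-bit move per 4 bytes of each spilled literal. For float literals
// this overestimates, since many can be embedded in arithmetic instructions.
inline std::size_t extra_immediate_moves(std::size_t num_literals,
                                         std::size_t literal_size) {
    const std::size_t capacity = literals_that_fit(literal_size);
    const std::size_t spilled =
        num_literals > capacity ? num_literals - capacity : 0;
    return spilled * (literal_size / 4);
}
```

Under these assumptions, 8192 `double` literals or 16384 `float` literals fit, and 10000 distinct `double` literals would need 3616 additional immediate moves in total.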

The performance impact will probably depend on how the code is compiled. The original post suggests that there is one call to `sin()` for every two constants. So the majority of the time would likely be spent in trig function evaluation, unless the code is single-precision computation compiled with `-use_fast_math`.

Formatting hint: This forum supports markup like `<sup></sup>` for superscripts and `<sub></sub>` for subscripts: a<sup>b</sup>, log<sub>a</sub>b.