The CUDA 11.1 toolchain seems to handle literal constants in excess of 64KB just fine.
What I see for sm_70 is that the constants that don’t fit into the constant bank are loaded using two moves from immediate: a MOV Rn, 0xnnnnnnnn for the least significant 32 bits, followed by an IMAD.MOV.U32 Rn+1, RZ, RZ, 0xnnnnnnnn for the most significant 32 bits.
double constants that don’t fit into the constant bank are loaded with two consecutive UMOV Rn, 0xnnnnnnnn instructions.
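To make the two-immediate scheme concrete, here is a small host-side sketch of how a 64-bit double literal breaks down into the two 32-bit values that such a pair of moves would carry. The names Halves, split_double and join_double are made up for illustration, and the code assumes IEEE-754 binary64 doubles; it only mirrors the lo/hi split and is not what the compiler itself runs.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the two 32-bit immediates for one double literal.
struct Halves {
    std::uint32_t lo;  // least significant 32 bits (would go into Rn)
    std::uint32_t hi;  // most significant 32 bits (would go into Rn+1)
};

// Split a double into its low and high 32-bit words.
// Assumes IEEE-754 binary64 representation for double.
inline Halves split_double(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // bit-exact copy, no aliasing issues
    return { static_cast<std::uint32_t>(bits),
             static_cast<std::uint32_t>(bits >> 32) };
}

// Reassemble the double from the two 32-bit words (register pair Rn, Rn+1).
inline double join_double(Halves h) {
    std::uint64_t bits = (static_cast<std::uint64_t>(h.hi) << 32) | h.lo;
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

For example, split_double(1.0) gives lo = 0x00000000 and hi = 0x3FF00000, which are the kind of 0xnnnnnnnn immediates one would look for in the disassembly.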
The performance impact from having to use immediate loads instead of constant bank references is probably (conjecture!) minimal for double-precision code. For single-precision code, the compiler should be able to fit many single-precision floating-point literals into the arithmetic instructions themselves, so they don’t take up space in the constant bank, and twice as many constants can fit into the constant bank. Where that isn’t sufficient, a noticeable impact from the increased dynamic instruction count due to immediate load instructions seems likely.
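As a rough back-of-the-envelope check on the "twice as many" remark, the sketch below computes an upper bound on how many distinct literals fit into 64KB. It assumes the entire 64KB is available for literals, which is optimistic since other data may share the bank; kConstBankBytes and max_literals are names invented for this illustration.

```cpp
#include <cstddef>

// Assumption: the full 64KB mentioned above is usable for literal constants.
constexpr std::size_t kConstBankBytes = 64 * 1024;

// Upper bound on the number of distinct literals of a given size in bytes.
constexpr std::size_t max_literals(std::size_t bytes_per_literal) {
    return kConstBankBytes / bytes_per_literal;
}

// 8-byte double literals vs. 4-byte float literals.
static_assert(max_literals(8) == 8192, "double-precision literals");
static_assert(max_literals(4) == 16384, "single-precision literals");
```

So under this assumption at most 8192 distinct double literals fit, versus 16384 float literals, before the compiler has to fall back to immediate loads.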
The performance impact will probably depend on how the code is compiled. The original post suggests that there is one call to sin() for every two constants. So the majority of the time would likely be spent in trig function evaluation, unless the code is single-precision computation compiled with
Formatting hint: This forum supports markup like <sup></sup> for superscripts and <sub></sub> for subscripts: a<sup>b</sup>, log<sub>a</sub>b.