Strange behaviour of the compiler

LowLevelKB · December 24, 2024, 12:01am

Hello everyone,

I have recently come across the very strange behaviour of the nvcc compiler. In the following piece of code:


...


R = rint(fmin(fmax(R, 0.0),1.0) * 255);


G = rint(fmin(fmax(G, 0.0),1.0) * 255);


B = rint(fmin(fmax(B, 0.0),1.0) * 255);


...

the compiler utterly ignores fmin and fmax and in the generated ptx code I see only the assembly instructions, that perform rint of argument times 255. It’s worth noting, that R, G and B are of type double and I compile only to the PTX file using appropriate option of MSVCC in the file properties (the piece of code comes from the OptiX shader, so full compilation cannot be done due to the unresolved OptiX functions like OptixTrace etc).

The more strange fact is that both fmin and fmax are “seen” by the compiler. If I put single fmin or single fmax, its OK. I just cannot nest them or compute max for variable in which the minimum was previously computed.

njuffa · December 24, 2024, 12:10am

Cannot reproduce with CUDA 12.5. For my test code, I see DMNMX instructions for older architecture and a combo of {DSETP.MIN | DSETP.MAX} with FSEL for newer architectures.

LowLevelKB · December 24, 2024, 12:14am

Thank You for Your answer. I use CUDA 12.4.1.

LowLevelKB · December 24, 2024, 12:18am

In some fragments fmin(fmax(…)) works and in some doesn’t. That is the most strange thing about it.

njuffa · December 24, 2024, 12:19am

You looked at the machine code (SASS), correct? For which GPU architecture? My observation does not change when I switch to CUDA 12.4.1. There is a small chance that the CUDA compiler eliminates clamping constructs when it knows that this is safe to do. But best I know, for floating-point arithmetic the CUDA compiler does not perform the kind of range tracking that would allow it to safely perform such an optimization.

__device__ void scale_clamp (double *r, double *g, double *b)
{
   double R = *r;
   double G = *g;
   double B = *b;
   R = rint(fmin(fmax(R, 0.0),1.0) * 255); 
   G = rint(fmin(fmax(G, 0.0),1.0) * 255); 
   B = rint(fmin(fmax(B, 0.0),1.0) * 255);
   *r = R;
   *b = B;
   *g = G;
}

The above results in this SASS for sm_90 when compile with CUDA 12.4.1 (note DSETP and {F}SEL instructions used for clamping):

scale_clamp(double*, double*, double*):
 ULDC.64 UR4, c[0x0][0x208] 
 LD.E.64 R14, desc[UR4][R4.64] 
 LD.E.64 R12, desc[UR4][R6.64] 
 LD.E.64 R10, desc[UR4][R8.64] 
 IMAD.MOV.U32 R0, RZ, RZ, RZ 
 IMAD.MOV.U32 R32, RZ, RZ, 0x80000 
 DSETP.MAX.AND P0, P1, RZ, R14, PT 
 IMAD.MOV.U32 R3, RZ, RZ, R14 
 IMAD.MOV.U32 R14, RZ, RZ, RZ 
 FSEL R33, R0, R15, P0 
 SEL R14, R14, R3, P0 
 IMAD.MOV.U32 R3, RZ, RZ, 0x80000 
 IMAD.MOV.U32 R0, RZ, RZ, R14 
 @P1 LOP3.LUT R33, R15, 0x80000, RZ, 0xfc, !PT 
 IMAD.MOV.U32 R15, RZ, RZ, R33 
 IMAD.MOV.U32 R33, RZ, RZ, R12 
 DSETP.MIN.AND P0, P1, R14, 1, PT 
 FSEL R15, R15, 1.875, P0 
 SEL R14, R0, RZ, P0 
 IMAD.MOV.U32 R0, RZ, RZ, RZ 
 @P1 LOP3.LUT R15, R3, 0x3ff00000, RZ, 0xfc, !PT 
 DSETP.MAX.AND P1, P2, RZ, R12, PT 
 IMAD.MOV.U32 R3, RZ, RZ, RZ 
 IMAD.MOV.U32 R12, RZ, RZ, RZ 
 DMUL R14, R14, 255 
 FSEL R3, R3, R13, P1 
 SEL R12, R12, R33, P1 
 DSETP.MAX.AND P0, P1, RZ, R10, PT 
 @P2 LOP3.LUT R3, R13, 0x80000, RZ, 0xfc, !PT 
 IMAD.MOV.U32 R13, RZ, RZ, R10 
 FRND.F64 R14, R14 
 IMAD.MOV.U32 R10, RZ, RZ, RZ 
 SEL R10, R10, R13, P0 
 FSEL R13, R0, R11, P0 
 @P1 LOP3.LUT R13, R11, 0x80000, RZ, 0xfc, !PT 
 IMAD.MOV.U32 R11, RZ, RZ, R13 
 ST.E.64 desc[UR4][R4.64], R14 
 IMAD.MOV.U32 R13, RZ, RZ, R3 
 IMAD.MOV.U32 R0, RZ, RZ, R11 
 DSETP.MIN.AND P2, P3, R10, 1, PT 
 DSETP.MIN.AND P0, P1, R12, 1, PT 
 FSEL R3, R0, 1.875, P2 
 IMAD.MOV.U32 R0, RZ, RZ, R13 
 SEL R10, R10, RZ, P2 
 SEL R4, R12, RZ, P0 
 FSEL R5, R0, 1.875, P0 
 @P3 LOP3.LUT R3, R32.reuse, 0x3ff00000, RZ, 0xfc, !PT 
 @P1 LOP3.LUT R5, R32, 0x3ff00000, RZ, 0xfc, !PT 
 IMAD.MOV.U32 R11, RZ, RZ, R3 
 DMUL R4, R4, 255 
 DMUL R10, R10, 255 
 FRND.F64 R4, R4 
 FRND.F64 R10, R10 
 ST.E.64 desc[UR4][R8.64], R10 
 ST.E.64 desc[UR4][R6.64], R4 
 RET.ABS.NODEC R20 0x0 
.L_x_0:
 BRA `(.L_x_0)

LowLevelKB · December 24, 2024, 12:22am

Thank You for Your help. I looked only for the output ptx file, which constitutes the target od my compilation. I have rtx 4070 gpu (cc 8.9).

LowLevelKB · December 24, 2024, 12:24am

The error is very hard to reproduce since as I said some pieces of code with nested fmin and fmax compile correctly and in some fmin and fmax in the ptx code are completely missing.

njuffa · December 24, 2024, 12:24am

I can only speculate about code that I have not seen. Consider extracting relevant code into a standalone reproducer and posting that here so that others can build and inspect. Do you, by any chance, see instructions with a .sat suffix in the PTX code, indicating saturation? If so, read the specifications for those instructions in the PTX documentation.

So far no information has been provided that would allow us to conclude that the compiler generates incorrect code. While compiler bugs are always a possibility, they are quite rare these days. The more likely scenario is that the generated code is in fact correct, it just looks different than you expect.

LowLevelKB · December 24, 2024, 12:41am

You were absolutely right. Compiler generated cvt.sat.f64.f64. So the issue of code, that is not generated is closed. But there’s another problem. It works completely wrong. And when I put:


R = (R < 0) ? 0 : R;


R = (R > 1) ? 1 : R;


R = rint(R * 255)

everything is fine. Well, maybe RTX 4070 doesn’t support cvt.sat.f64.f64… .

LowLevelKB · December 24, 2024, 12:45am

Or translation from cvt.sat.f64.f64 to SASS is buggy… .

njuffa · December 24, 2024, 12:53am

PTX is a virtual ISA and a compiler intermediate format, so it contains many instructions that have no direct equivalent in the hardware of any particular GPU architecture.

I cannot make a diagnosis of incorrect code generation based on the information provided. If you are convinced there is such a bug, feel free to file a bug report with NVIDIA. Make sure to include code that reproduces the problem.

LowLevelKB · December 24, 2024, 12:55am

Thank You one more time.

system · January 7, 2025, 12:56am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Strange behavior of cvt.sat.f64.f64 CUDA Programming and Performance	30	116	December 26, 2024
Weird error when compiling CUDA Programming and Performance	3	1347	August 2, 2011
Strange PTX Output CUDA Programming and Performance	9	3320	December 19, 2014
Simple surface compilation problem on CUDA CUDA Programming and Performance	6	59590	May 25, 2012
Nvcc lower version than CUDA causes compiled code runtime error 300 CUDA NVCC Compiler	4	97	September 24, 2024
Compilation Errors with GCC Versions 11-14 and CUDA Toolkit 12.5/12.6 Due to Undefined `__builtin_ia32_ldtilecfg` and `__builtin_ia32_sttilecfg`, etc GPU-Accelerated Libraries cuda	2	1303	October 15, 2024
compile segmentation fault CUDA Programming and Performance	14	14558	September 2, 2010
Nvcc(cuda 11.6) compiled failed: __hmax undefined CUDA NVCC Compiler	15	1070	August 18, 2023
Nvcc 12.3 with gcc 13.2 not working CUDA NVCC Compiler	11	10302	March 12, 2024
CUDA 1.1 Bug - Compiler crash (ptxas) w/repro CUDA Programming and Performance	16	8627	May 19, 2008

Strange behaviour of the compiler

Related topics