Strange behaviour of the compiler

Hello everyone,

I have recently come across the very strange behaviour of the nvcc compiler. In the following piece of code:

...
R = rint(fmin(fmax(R, 0.0),1.0) * 255);
G = rint(fmin(fmax(G, 0.0),1.0) * 255);
B = rint(fmin(fmax(B, 0.0),1.0) * 255);
...

the compiler utterly ignores fmin and fmax and in the generated ptx code I see only the assembly instructions, that perform rint of argument times 255. It’s worth noting, that R, G and B are of type double and I compile only to the PTX file using appropriate option of MSVCC in the file properties (the piece of code comes from the OptiX shader, so full compilation cannot be done due to the unresolved OptiX functions like OptixTrace etc).

The more strange fact is that both fmin and fmax are “seen” by the compiler. If I put single fmin or single fmax, its OK. I just cannot nest them or compute max for variable in which the minimum was previously computed.

Cannot reproduce with CUDA 12.5. For my test code, I see DMNMX instructions for older architecture and a combo of {DSETP.MIN | DSETP.MAX} with FSEL for newer architectures.

Thank You for Your answer. I use CUDA 12.4.1.

In some fragments fmin(fmax(…)) works and in some doesn’t. That is the most strange thing about it.

You looked at the machine code (SASS), correct? For which GPU architecture? My observation does not change when I switch to CUDA 12.4.1. There is a small chance that the CUDA compiler eliminates clamping constructs when it knows that this is safe to do. But best I know, for floating-point arithmetic the CUDA compiler does not perform the kind of range tracking that would allow it to safely perform such an optimization.

__device__ void scale_clamp (double *r, double *g, double *b)
{
   double R = *r;
   double G = *g;
   double B = *b;
   R = rint(fmin(fmax(R, 0.0),1.0) * 255); 
   G = rint(fmin(fmax(G, 0.0),1.0) * 255); 
   B = rint(fmin(fmax(B, 0.0),1.0) * 255);
   *r = R;
   *b = B;
   *g = G;
}

The above results in this SASS for sm_90 when compile with CUDA 12.4.1 (note DSETP and {F}SEL instructions used for clamping):

scale_clamp(double*, double*, double*):
 ULDC.64 UR4, c[0x0][0x208] 
 LD.E.64 R14, desc[UR4][R4.64] 
 LD.E.64 R12, desc[UR4][R6.64] 
 LD.E.64 R10, desc[UR4][R8.64] 
 IMAD.MOV.U32 R0, RZ, RZ, RZ 
 IMAD.MOV.U32 R32, RZ, RZ, 0x80000 
 DSETP.MAX.AND P0, P1, RZ, R14, PT 
 IMAD.MOV.U32 R3, RZ, RZ, R14 
 IMAD.MOV.U32 R14, RZ, RZ, RZ 
 FSEL R33, R0, R15, P0 
 SEL R14, R14, R3, P0 
 IMAD.MOV.U32 R3, RZ, RZ, 0x80000 
 IMAD.MOV.U32 R0, RZ, RZ, R14 
 @P1 LOP3.LUT R33, R15, 0x80000, RZ, 0xfc, !PT 
 IMAD.MOV.U32 R15, RZ, RZ, R33 
 IMAD.MOV.U32 R33, RZ, RZ, R12 
 DSETP.MIN.AND P0, P1, R14, 1, PT 
 FSEL R15, R15, 1.875, P0 
 SEL R14, R0, RZ, P0 
 IMAD.MOV.U32 R0, RZ, RZ, RZ 
 @P1 LOP3.LUT R15, R3, 0x3ff00000, RZ, 0xfc, !PT 
 DSETP.MAX.AND P1, P2, RZ, R12, PT 
 IMAD.MOV.U32 R3, RZ, RZ, RZ 
 IMAD.MOV.U32 R12, RZ, RZ, RZ 
 DMUL R14, R14, 255 
 FSEL R3, R3, R13, P1 
 SEL R12, R12, R33, P1 
 DSETP.MAX.AND P0, P1, RZ, R10, PT 
 @P2 LOP3.LUT R3, R13, 0x80000, RZ, 0xfc, !PT 
 IMAD.MOV.U32 R13, RZ, RZ, R10 
 FRND.F64 R14, R14 
 IMAD.MOV.U32 R10, RZ, RZ, RZ 
 SEL R10, R10, R13, P0 
 FSEL R13, R0, R11, P0 
 @P1 LOP3.LUT R13, R11, 0x80000, RZ, 0xfc, !PT 
 IMAD.MOV.U32 R11, RZ, RZ, R13 
 ST.E.64 desc[UR4][R4.64], R14 
 IMAD.MOV.U32 R13, RZ, RZ, R3 
 IMAD.MOV.U32 R0, RZ, RZ, R11 
 DSETP.MIN.AND P2, P3, R10, 1, PT 
 DSETP.MIN.AND P0, P1, R12, 1, PT 
 FSEL R3, R0, 1.875, P2 
 IMAD.MOV.U32 R0, RZ, RZ, R13 
 SEL R10, R10, RZ, P2 
 SEL R4, R12, RZ, P0 
 FSEL R5, R0, 1.875, P0 
 @P3 LOP3.LUT R3, R32.reuse, 0x3ff00000, RZ, 0xfc, !PT 
 @P1 LOP3.LUT R5, R32, 0x3ff00000, RZ, 0xfc, !PT 
 IMAD.MOV.U32 R11, RZ, RZ, R3 
 DMUL R4, R4, 255 
 DMUL R10, R10, 255 
 FRND.F64 R4, R4 
 FRND.F64 R10, R10 
 ST.E.64 desc[UR4][R8.64], R10 
 ST.E.64 desc[UR4][R6.64], R4 
 RET.ABS.NODEC R20 0x0 
.L_x_0:
 BRA `(.L_x_0)

Thank You for Your help. I looked only for the output ptx file, which constitutes the target od my compilation. I have rtx 4070 gpu (cc 8.9).

The error is very hard to reproduce since as I said some pieces of code with nested fmin and fmax compile correctly and in some fmin and fmax in the ptx code are completely missing.

I can only speculate about code that I have not seen. Consider extracting relevant code into a standalone reproducer and posting that here so that others can build and inspect. Do you, by any chance, see instructions with a .sat suffix in the PTX code, indicating saturation? If so, read the specifications for those instructions in the PTX documentation.

So far no information has been provided that would allow us to conclude that the compiler generates incorrect code. While compiler bugs are always a possibility, they are quite rare these days. The more likely scenario is that the generated code is in fact correct, it just looks different than you expect.

You were absolutely right. Compiler generated cvt.sat.f64.f64. So the issue of code, that is not generated is closed. But there’s another problem. It works completely wrong. And when I put:

R = (R < 0) ? 0 : R;
R = (R > 1) ? 1 : R;
R = rint(R * 255)

everything is fine. Well, maybe RTX 4070 doesn’t support cvt.sat.f64.f64… .

Or translation from cvt.sat.f64.f64 to SASS is buggy… .

PTX is a virtual ISA and a compiler intermediate format, so it contains many instructions that have no direct equivalent in the hardware of any particular GPU architecture.

I cannot make a diagnosis of incorrect code generation based on the information provided. If you are convinced there is such a bug, feel free to file a bug report with NVIDIA. Make sure to include code that reproduces the problem.

Thank You one more time.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.