You looked at the machine code (SASS), correct? For which GPU architecture? My observation does not change when I switch to CUDA 12.4.1. There is a small chance that the CUDA compiler eliminates clamping constructs when it knows that this is safe to do. But best I know, for floating-point arithmetic the CUDA compiler does not perform the kind of range tracking that would allow it to safely perform such an optimization.
__device__ void scale_clamp (double *r, double *g, double *b)
{
double R = *r;
double G = *g;
double B = *b;
R = rint(fmin(fmax(R, 0.0),1.0) * 255);
G = rint(fmin(fmax(G, 0.0),1.0) * 255);
B = rint(fmin(fmax(B, 0.0),1.0) * 255);
*r = R;
*b = B;
*g = G;
}
The above results in this SASS for sm_90
when compile with CUDA 12.4.1 (note DSETP
and {F}SEL
instructions used for clamping):
scale_clamp(double*, double*, double*):
ULDC.64 UR4, c[0x0][0x208]
LD.E.64 R14, desc[UR4][R4.64]
LD.E.64 R12, desc[UR4][R6.64]
LD.E.64 R10, desc[UR4][R8.64]
IMAD.MOV.U32 R0, RZ, RZ, RZ
IMAD.MOV.U32 R32, RZ, RZ, 0x80000
DSETP.MAX.AND P0, P1, RZ, R14, PT
IMAD.MOV.U32 R3, RZ, RZ, R14
IMAD.MOV.U32 R14, RZ, RZ, RZ
FSEL R33, R0, R15, P0
SEL R14, R14, R3, P0
IMAD.MOV.U32 R3, RZ, RZ, 0x80000
IMAD.MOV.U32 R0, RZ, RZ, R14
@P1 LOP3.LUT R33, R15, 0x80000, RZ, 0xfc, !PT
IMAD.MOV.U32 R15, RZ, RZ, R33
IMAD.MOV.U32 R33, RZ, RZ, R12
DSETP.MIN.AND P0, P1, R14, 1, PT
FSEL R15, R15, 1.875, P0
SEL R14, R0, RZ, P0
IMAD.MOV.U32 R0, RZ, RZ, RZ
@P1 LOP3.LUT R15, R3, 0x3ff00000, RZ, 0xfc, !PT
DSETP.MAX.AND P1, P2, RZ, R12, PT
IMAD.MOV.U32 R3, RZ, RZ, RZ
IMAD.MOV.U32 R12, RZ, RZ, RZ
DMUL R14, R14, 255
FSEL R3, R3, R13, P1
SEL R12, R12, R33, P1
DSETP.MAX.AND P0, P1, RZ, R10, PT
@P2 LOP3.LUT R3, R13, 0x80000, RZ, 0xfc, !PT
IMAD.MOV.U32 R13, RZ, RZ, R10
FRND.F64 R14, R14
IMAD.MOV.U32 R10, RZ, RZ, RZ
SEL R10, R10, R13, P0
FSEL R13, R0, R11, P0
@P1 LOP3.LUT R13, R11, 0x80000, RZ, 0xfc, !PT
IMAD.MOV.U32 R11, RZ, RZ, R13
ST.E.64 desc[UR4][R4.64], R14
IMAD.MOV.U32 R13, RZ, RZ, R3
IMAD.MOV.U32 R0, RZ, RZ, R11
DSETP.MIN.AND P2, P3, R10, 1, PT
DSETP.MIN.AND P0, P1, R12, 1, PT
FSEL R3, R0, 1.875, P2
IMAD.MOV.U32 R0, RZ, RZ, R13
SEL R10, R10, RZ, P2
SEL R4, R12, RZ, P0
FSEL R5, R0, 1.875, P0
@P3 LOP3.LUT R3, R32.reuse, 0x3ff00000, RZ, 0xfc, !PT
@P1 LOP3.LUT R5, R32, 0x3ff00000, RZ, 0xfc, !PT
IMAD.MOV.U32 R11, RZ, RZ, R3
DMUL R4, R4, 255
DMUL R10, R10, 255
FRND.F64 R4, R4
FRND.F64 R10, R10
ST.E.64 desc[UR4][R8.64], R10
ST.E.64 desc[UR4][R6.64], R4
RET.ABS.NODEC R20 0x0
.L_x_0:
BRA `(.L_x_0)