Strange behavior of cvt.sat.f64.f64

It looks like cvt.sat.f64.f64, in addition to clamping a double to the range [0, 1] as it is supposed to, ALSO MISTAKENLY BINARIZES THE ARGUMENT. For example:
For -1 we get 0;
For -0.5 we get 0;
For 0 we get 0;
For 0.25 we get 0; (instead of 0.25)
For 0.75 we get 1; (instead of 0.75)
For 1 we get 1;
For 1.5 we get 1;

I have RTX 4070, CUDA 12.4.1 and Windows 10.

I built the following on Windows 10 with CUDA 12.3 using nvcc -arch=sm_75 -o cvt_sat_f64_f64.exe cvt_sat_f64_f64.cu

#include <stdio.h>
#include <stdlib.h>

__global__ void kernel (double a, double *r)
{
    asm ("cvt.sat.f64.f64 %0,%0;\n\t" : "+d"(a));
    *r = a;
}

int main (void)
{
    double a, r, *r_d = 0;
    cudaMalloc ((void**)&r_d, sizeof (r_d[0]));
    for (a = -1.2; a <= 1.2; a += 0.1) {
        kernel<<<1,1>>>(a, r_d);
        cudaMemcpy (&r, r_d, sizeof r, cudaMemcpyDeviceToHost);
        printf ("a = % 23.16e  sat(a)=% 23.16e\n", a, r);
    }
    cudaFree (r_d);
    return (EXIT_SUCCESS);
}

The output of the above program is as expected:

a = -1.2000000000000000e+00  sat(a)= 0.0000000000000000e+00
a = -1.0999999999999999e+00  sat(a)= 0.0000000000000000e+00
a = -9.9999999999999989e-01  sat(a)= 0.0000000000000000e+00
a = -8.9999999999999991e-01  sat(a)= 0.0000000000000000e+00
a = -7.9999999999999993e-01  sat(a)= 0.0000000000000000e+00
a = -6.9999999999999996e-01  sat(a)= 0.0000000000000000e+00
a = -5.9999999999999998e-01  sat(a)= 0.0000000000000000e+00
a = -5.0000000000000000e-01  sat(a)= 0.0000000000000000e+00
a = -4.0000000000000002e-01  sat(a)= 0.0000000000000000e+00
a = -3.0000000000000004e-01  sat(a)= 0.0000000000000000e+00
a = -2.0000000000000004e-01  sat(a)= 0.0000000000000000e+00
a = -1.0000000000000003e-01  sat(a)= 0.0000000000000000e+00
a = -2.7755575615628914e-17  sat(a)= 0.0000000000000000e+00
a =  9.9999999999999978e-02  sat(a)= 9.9999999999999978e-02
a =  1.9999999999999998e-01  sat(a)= 1.9999999999999998e-01
a =  2.9999999999999999e-01  sat(a)= 2.9999999999999999e-01
a =  4.0000000000000002e-01  sat(a)= 4.0000000000000002e-01
a =  5.0000000000000000e-01  sat(a)= 5.0000000000000000e-01
a =  5.9999999999999998e-01  sat(a)= 5.9999999999999998e-01
a =  6.9999999999999996e-01  sat(a)= 6.9999999999999996e-01
a =  7.9999999999999993e-01  sat(a)= 7.9999999999999993e-01
a =  8.9999999999999991e-01  sat(a)= 8.9999999999999991e-01
a =  9.9999999999999989e-01  sat(a)= 9.9999999999999989e-01
a =  1.0999999999999999e+00  sat(a)= 1.0000000000000000e+00
a =  1.2000000000000000e+00  sat(a)= 1.0000000000000000e+00

This is the closest I can approximate your platform configuration at this time. If you post code for others to build and run, we may get more data points. As the holidays are imminent, I would expect forum participation to be extremely low, though.

Thank you for your answer. The code comes from my OptiX shader. In my ray generation shader I have:

R = MIN_R(MAX_R(R, 0.0), 1.0);
G = MIN_R(MAX_R(G, 0.0), 1.0);
B = MIN_R(MAX_R(B, 0.0), 1.0);

which compiles into three cvt.sat.f64.f64 instructions that misbehave exactly as I described.

But if I change the piece of code into:

R = (R < 0) ? 0 : R;
R = (R > 1) ? 1 : R;
...

it compiles into:

setp.lt.f64 ...
selp.f64 ...
setp.gt.f64 ...
selp.f64 ...

and that works perfectly fine. I don’t have a clue what’s going on, since cvt.sat.f64.f64 should, theoretically speaking, be equivalent to the code with setp.lt.f64, setp.gt.f64, and selp.f64, and it shouldn’t perform the “binarization”.
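
For completeness, here is a minimal standalone kernel (a reduced reconstruction, not my actual shader; the kernel name is made up) containing both clamp variants, so the PTX emitted for each can be compared with nvcc -ptx:

__global__ void clamp_variants (const double *in, double *out_sat, double *out_selp)
{
    double x = in[threadIdx.x];

    // Variant 1: fmin/fmax clamp; in my build this is the form that gets
    // turned into cvt.sat.f64.f64.
    out_sat[threadIdx.x] = fmin (fmax (x, 0.0), 1.0);

    // Variant 2: explicit comparisons; this compiles to setp/selp pairs
    // and works correctly for me.
    double y = (x < 0.0) ? 0.0 : x;
    y = (y > 1.0) ? 1.0 : y;
    out_selp[threadIdx.x] = y;
}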


shaders.cu.ptx.txt (30.9 KB)
shaders.cu.txt (10.4 KB)

Are MIN_R/MAX_R preprocessor defines? What is their long form?

In your environment, when does PTX get translated to machine code (SASS), and what software component performs this translation? When I run nvcc, it obviously invokes the offline version of ptxas to compile PTX into SASS, and I can then inspect the resulting SASS with cuobjdump --dump-sass.

No idea how that works in the context of OptiX.

Yes,

#define MIN_G(x, y) fminf(x, y)
#define MAX_G(x, y) fmaxf(x, y)

Calling float-variants of the functions strikes me as a bad idea if your inputs and outputs are actually double-precision variables.
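
For example (a small host-side sketch to illustrate the point, not your shader code; the same applies to the device math functions): the float variants narrow the double argument to float, compute the min/max in single precision, and then widen the result back to double, losing precision along the way.

#include <math.h>
#include <stdio.h>

int main (void)
{
    double R = 0.1;   // rounds differently as float and as double
    // Narrowed to float by fminf/fmaxf, then widened back to double:
    double via_float  = fminf (fmaxf (R, 0.0f), 1.0f);
    // Stays in double precision throughout:
    double via_double = fmin (fmax (R, 0.0), 1.0);
    printf ("via float : %.17g\nvia double: %.17g\n", via_float, via_double);
    return 0;
}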

It’s compiled by the OptiX function optixModuleCreate.

FILE *f = fopen("C:/Users/pc/source/repos/GaussianRenderingCUDA/GaussianRenderingCUDA/x64/Release/shaders.cu.ptx", "rb");
fseek(f, 0, SEEK_END);
int ptxCodeSize = ftell(f);
fclose(f);
char *ptxCode = (char *)malloc(sizeof(char) * (ptxCodeSize + 1));
char *buffer = (char *)malloc(sizeof(char) * (ptxCodeSize + 1));
ptxCode[0] = 0; // !!! !!! !!!

f = fopen("C:/Users/pc/source/repos/GaussianRenderingCUDA/GaussianRenderingCUDA/x64/Release/shaders.cu.ptx", "rt");
while (!feof(f)) {
	fgets(buffer, ptxCodeSize + 1, f);
	ptxCode = strcat(ptxCode, buffer);
}
fclose(f);

free(buffer);

// *********************************************************************************************

OptixModuleCompileOptions moduleCompileOptions = {};
OptixPipelineCompileOptions pipelineCompileOptions = {};

moduleCompileOptions.maxRegisterCount = 50;
moduleCompileOptions.optLevel = OPTIX_COMPILE_OPTIMIZATION_DEFAULT;
moduleCompileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_NONE;

pipelineCompileOptions.traversableGraphFlags = OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_GAS;
pipelineCompileOptions.usesMotionBlur = false;
pipelineCompileOptions.numPayloadValues = 2; // (12) !!! !!! !!!
pipelineCompileOptions.numPayloadValues = 2; // (19) !!! !!! !!!
pipelineCompileOptions.numAttributeValues = 0;
pipelineCompileOptions.exceptionFlags = OPTIX_EXCEPTION_FLAG_NONE;
pipelineCompileOptions.pipelineLaunchParamsVariableName = "optixLaunchParams";

OptixModule module;
error_OptiX = optixModuleCreate(
	params_OptiX.optixContext,
	&moduleCompileOptions,
	&pipelineCompileOptions,
	ptxCode,
	strlen(ptxCode),
	NULL, NULL,
	&module
);
if (error_OptiX != OPTIX_SUCCESS) goto Error;

free(ptxCode);

Thanks for showing the compilation step. Unfortunately I wouldn’t know what that really does and how things could go wrong. Any chance this JIT-compilation step is misconfigured in some way?
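
One generic suggestion (a sketch only, based on the code you posted; I am not an OptiX user, and I am assuming the two NULL arguments of optixModuleCreate are the optional log buffer and its in/out size, as in other OptiX create calls): read the whole PTX file with a single fread and pass a log buffer, so any JIT compilation diagnostics become visible.

// Sketch only; variable names reuse the ones from your snippet.
FILE *f = fopen("shaders.cu.ptx", "rb");   // same path as in your code
fseek(f, 0, SEEK_END);
long fileSize = ftell(f);
fseek(f, 0, SEEK_SET);
char *ptxCode = (char *)malloc(fileSize + 1);
size_t ptxCodeSize = fread(ptxCode, 1, fileSize, f);   // read whole file at once
ptxCode[ptxCodeSize] = '\0';                           // null-terminate for strlen-style use
fclose(f);

char log[2048];
size_t logSize = sizeof(log);
OptixModule module;
OptixResult error_OptiX = optixModuleCreate(
    params_OptiX.optixContext,
    &moduleCompileOptions,
    &pipelineCompileOptions,
    ptxCode,
    ptxCodeSize,
    log, &logSize,   // instead of NULL, NULL: JIT diagnostics land here
    &module
);
if (logSize > 1) printf("optixModuleCreate log:\n%s\n", log);
if (error_OptiX != OPTIX_SUCCESS) goto Error;
free(ptxCode);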

The people in the OptiX sub-forum might have more insights, given that this is not a pure CUDA scenario.

Thank you for your help. Frankly speaking, I don’t know :(

I posted the wrong piece of code. It belongs to a bigger fragment:

#ifndef RENDERER_OPTIX_USE_DOUBLE_PRECISION
#define MIN_R(x, y) fminf(x, y)
#define MAX_R(x, y) fmaxf(x, y)
#else
#define MIN_R(x, y) fmin(x, y)
#define MAX_R(x, y) fmax(x, y)
#endif

So I am sure this part is OK.

Yes, I perform the computation on doubles.

Could you extract a direct example of code using fmin and/or fmax with a specific argument giving a wrong result (like a one-liner), which could be manually reasoned about?

E.g. do you get

fmin(fmax(0.25, 0.0), 1.0) == 0.0

and, more specifically, do you get

fmax(0.25, 0.0) == 0.0 ?

R = MIN_R(MAX_R(((REAL_R)R), ((REAL_R)0)), ((REAL_R)1));
G = MIN_R(MAX_R(((REAL_R)G), ((REAL_R)0)), ((REAL_R)1));
B = MIN_R(MAX_R(((REAL_R)B), ((REAL_R)0)), ((REAL_R)1));

Yes, it behaves exactly that way. But it’s worth noting that the PTX is loaded dynamically by the OptiX function optixModuleCreate.

Just look at the pixel colors.

Please replace with a more direct approach, e.g.

if (fmax(0.25, 0.0) == 0.0)
    R = G = B = 1.0;
else
    if (fmax(0.25, 0.0) == 0.25)
        R = G = B = 0.0;
    else
        R = G = B = 0.5;

0.25 has an exact representation in float and double, so a comparison with no tolerance (epsilon) should work.

According to your post, the result would be white.
If not (e.g. black), find out where the difference lies, e.g. change to fmin(fmax(…)) with different numbers, and so on.

Excellent idea! I’ll check this out!

Since the compiler optimized away

if (MIN_R(MAX_R(0.25, 0.0), 1.0) == 0.25) {
    R = 1.0;
    G = 1.0;
    B = 1.0;
}

(likely because the whole condition can be evaluated at compile time), I used something similar instead:

R = (R < 0.0) ? 0.0 : R;
R = (R > 1.0) ? 1.0 : R;
if (MIN_R(MAX_R(R, 0.0), 1.0) < R) {
    R = 1.0;
    G = 1.0;
    B = 1.0;
}

Here is the resulting render:

Note that this was AFTER CLAMPING R to the interval [0, 1] using

R = (R < 0.0) ? 0.0 : R;
R = (R > 1.0) ? 1.0 : R;