Strange behavior of cvt.sat.f64.f64

It looks like cvt.sat.f64.f64, in addition to clamping a double to the range [0, 1] as it is supposed to, ALSO MISTAKENLY BINARIZES THE ARGUMENT. For example:
For -1 we get 0;
For -0.5 we get 0;
For 0 we get 0;
For 0.25 we get 0; (instead of 0.25)
For 0.75 we get 1; (instead of 0.75)
For 1 we get 1;
For 1.5 we get 1;

I have RTX 4070, CUDA 12.4.1 and Windows 10.

I built the following on Windows 10 with CUDA 12.3 using nvcc -arch=sm_75 -o cvt_sat_f64_f64.exe cvt_sat_f64_f64.cu

#include <stdio.h>
#include <stdlib.h>

__global__ void kernel (double a, double *r)
{
    asm ("cvt.sat.f64.f64 %0,%0;\n\t" : "+d"(a));
    *r = a;
}

int main (void)
{
    double a, r, *r_d = 0;
    cudaMalloc ((void**)&r_d, sizeof (r_d[0]));
    for (a = -1.2; a <= 1.2; a += 0.1) {
        kernel<<<1,1>>>(a, r_d);
        cudaMemcpy (&r, r_d, sizeof r, cudaMemcpyDeviceToHost);
        printf ("a = % 23.16e  sat(a)=% 23.16e\n", a, r);
    }
    cudaFree (r_d);
    return (EXIT_SUCCESS);
}

The output of the above program is as expected:

a = -1.2000000000000000e+00  sat(a)= 0.0000000000000000e+00
a = -1.0999999999999999e+00  sat(a)= 0.0000000000000000e+00
a = -9.9999999999999989e-01  sat(a)= 0.0000000000000000e+00
a = -8.9999999999999991e-01  sat(a)= 0.0000000000000000e+00
a = -7.9999999999999993e-01  sat(a)= 0.0000000000000000e+00
a = -6.9999999999999996e-01  sat(a)= 0.0000000000000000e+00
a = -5.9999999999999998e-01  sat(a)= 0.0000000000000000e+00
a = -5.0000000000000000e-01  sat(a)= 0.0000000000000000e+00
a = -4.0000000000000002e-01  sat(a)= 0.0000000000000000e+00
a = -3.0000000000000004e-01  sat(a)= 0.0000000000000000e+00
a = -2.0000000000000004e-01  sat(a)= 0.0000000000000000e+00
a = -1.0000000000000003e-01  sat(a)= 0.0000000000000000e+00
a = -2.7755575615628914e-17  sat(a)= 0.0000000000000000e+00
a =  9.9999999999999978e-02  sat(a)= 9.9999999999999978e-02
a =  1.9999999999999998e-01  sat(a)= 1.9999999999999998e-01
a =  2.9999999999999999e-01  sat(a)= 2.9999999999999999e-01
a =  4.0000000000000002e-01  sat(a)= 4.0000000000000002e-01
a =  5.0000000000000000e-01  sat(a)= 5.0000000000000000e-01
a =  5.9999999999999998e-01  sat(a)= 5.9999999999999998e-01
a =  6.9999999999999996e-01  sat(a)= 6.9999999999999996e-01
a =  7.9999999999999993e-01  sat(a)= 7.9999999999999993e-01
a =  8.9999999999999991e-01  sat(a)= 8.9999999999999991e-01
a =  9.9999999999999989e-01  sat(a)= 9.9999999999999989e-01
a =  1.0999999999999999e+00  sat(a)= 1.0000000000000000e+00
a =  1.2000000000000000e+00  sat(a)= 1.0000000000000000e+00

This is the closest I can approximate your platform configuration at this time. If you post code for others to build and run, we may get more data points. As the holidays are imminent, I would expect forum participation to be extremely low, though.

Thank you for your answer. The code comes from my OptiX shader. In my ray generation shader I have:

R = MIN_R(MAX_R(R, 0.0), 1.0);
G = MIN_R(MAX_R(G, 0.0), 1.0);
B = MIN_R(MAX_R(B, 0.0), 1.0);

which compiles into three cvt.sat.f64.f64 instructions that misbehave exactly as I described.

But if I change the piece of code into:

R = (R < 0) ? 0 : R;
R = (R > 1) ? 1 : R;
...

it compiles into:

setp.lt.f64 ...
selp.f64 ...
setp.gt.f64 ...
selp.f64 ...

and that works perfectly fine. I don’t have a clue what’s going on, since cvt.sat.f64.f64 should, theoretically speaking, be equivalent to the code with setp.lt.f64, setp.gt.f64, and selp.f64, and it shouldn’t perform the “binarization”.
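
For completeness, here is a minimal standalone kernel (a reduced reconstruction, not my actual shader; the kernel name is made up) containing both clamp variants, so the PTX emitted for each can be compared with nvcc -ptx:

__global__ void clamp_variants (const double *in, double *out_sat, double *out_selp)
{
    double x = in[threadIdx.x];

    // Variant 1: fmin/fmax clamp; in my build this is the form that gets
    // turned into cvt.sat.f64.f64.
    out_sat[threadIdx.x] = fmin (fmax (x, 0.0), 1.0);

    // Variant 2: explicit comparisons; this compiles to setp/selp pairs
    // and works correctly for me.
    double y = (x < 0.0) ? 0.0 : x;
    y = (y > 1.0) ? 1.0 : y;
    out_selp[threadIdx.x] = y;
}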


shaders.cu.ptx.txt (30.9 KB)
shaders.cu.txt (10.4 KB)

Are MIN_R/MAX_R preprocessor defines? What is their long form?

In your environment, when does PTX get translated to machine code (SASS), and what software component performs this translation? When I run nvcc, it obviously invokes the offline version of ptxas to compile PTX into SASS, and I can then inspect the resulting SASS with cuobjdump --dump-sass.

No idea how that works in the context of OptiX.

Yes,

#define MIN_G(x, y) fminf(x, y)
#define MAX_G(x, y) fmaxf(x, y)

Calling float-variants of the functions strikes me as a bad idea if your inputs and outputs are actually double-precision variables.
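
For example (a small host-side sketch to illustrate the point, not your shader code; the same applies to the device math functions): the float variants narrow the double argument to float, compute the min/max in single precision, and then widen the result back to double, losing precision along the way.

#include <math.h>
#include <stdio.h>

int main (void)
{
    double R = 0.1;   // rounds differently as float and as double
    // Narrowed to float by fminf/fmaxf, then widened back to double:
    double via_float  = fminf (fmaxf (R, 0.0f), 1.0f);
    // Stays in double precision throughout:
    double via_double = fmin (fmax (R, 0.0), 1.0);
    printf ("via float : %.17g\nvia double: %.17g\n", via_float, via_double);
    return 0;
}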

It’s compiled by the OptiX function optixModuleCreate.

FILE *f = fopen("C:/Users/pc/source/repos/GaussianRenderingCUDA/GaussianRenderingCUDA/x64/Release/shaders.cu.ptx", "rb");
fseek(f, 0, SEEK_END);
int ptxCodeSize = ftell(f);
fclose(f);
char *ptxCode = (char *)malloc(sizeof(char) * (ptxCodeSize + 1));
char *buffer = (char *)malloc(sizeof(char) * (ptxCodeSize + 1));
ptxCode[0] = 0; // !!! !!! !!!

f = fopen("C:/Users/pc/source/repos/GaussianRenderingCUDA/GaussianRenderingCUDA/x64/Release/shaders.cu.ptx", "rt");
while (!feof(f)) {
	fgets(buffer, ptxCodeSize + 1, f);
	ptxCode = strcat(ptxCode, buffer);
}
fclose(f);

free(buffer);

// *********************************************************************************************

OptixModuleCompileOptions moduleCompileOptions = {};
OptixPipelineCompileOptions pipelineCompileOptions = {};

moduleCompileOptions.maxRegisterCount = 50;
moduleCompileOptions.optLevel = OPTIX_COMPILE_OPTIMIZATION_DEFAULT;
moduleCompileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_NONE;

pipelineCompileOptions.traversableGraphFlags = OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_GAS;
pipelineCompileOptions.usesMotionBlur = false;
pipelineCompileOptions.numPayloadValues = 2; // (12) !!! !!! !!!
pipelineCompileOptions.numPayloadValues = 2; // (19) !!! !!! !!!
pipelineCompileOptions.numAttributeValues = 0;
pipelineCompileOptions.exceptionFlags = OPTIX_EXCEPTION_FLAG_NONE;
pipelineCompileOptions.pipelineLaunchParamsVariableName = "optixLaunchParams";

OptixModule module;
error_OptiX = optixModuleCreate(
	params_OptiX.optixContext,
	&moduleCompileOptions,
	&pipelineCompileOptions,
	ptxCode,
	strlen(ptxCode),
	NULL, NULL,
	&module
);
if (error_OptiX != OPTIX_SUCCESS) goto Error;

free(ptxCode);

Thanks for showing the compilation step. Unfortunately I wouldn’t know what that really does and how things could go wrong. Any chance this JIT-compilation step is misconfigured in some way?
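
One generic suggestion (a sketch only, based on the code you posted; I am not an OptiX user, and I am assuming the two NULL arguments of optixModuleCreate are the optional log buffer and its in/out size, as in other OptiX create calls): read the whole PTX file with a single fread and pass a log buffer, so any JIT compilation diagnostics become visible.

// Sketch only; variable names reuse the ones from your snippet.
FILE *f = fopen("shaders.cu.ptx", "rb");   // same path as in your code
fseek(f, 0, SEEK_END);
long fileSize = ftell(f);
fseek(f, 0, SEEK_SET);
char *ptxCode = (char *)malloc(fileSize + 1);
size_t ptxCodeSize = fread(ptxCode, 1, fileSize, f);   // read whole file at once
ptxCode[ptxCodeSize] = '\0';                           // null-terminate for strlen-style use
fclose(f);

char log[2048];
size_t logSize = sizeof(log);
OptixModule module;
OptixResult error_OptiX = optixModuleCreate(
    params_OptiX.optixContext,
    &moduleCompileOptions,
    &pipelineCompileOptions,
    ptxCode,
    ptxCodeSize,
    log, &logSize,   // instead of NULL, NULL: JIT diagnostics land here
    &module
);
if (logSize > 1) printf("optixModuleCreate log:\n%s\n", log);
if (error_OptiX != OPTIX_SUCCESS) goto Error;
free(ptxCode);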

The people in the OptiX sub-forum might have more insights, given that this is not a pure CUDA scenario.

Thank you for your help. Frankly speaking, I don’t know :(

I posted the wrong piece of code. It belongs to a bigger fragment:

#ifndef RENDERER_OPTIX_USE_DOUBLE_PRECISION
#define MIN_R(x, y) fminf(x, y)
#define MAX_R(x, y) fmaxf(x, y)
#else
#define MIN_R(x, y) fmin(x, y)
#define MAX_R(x, y) fmax(x, y)
#endif

So I am sure this part is OK.

Yes, I perform the computation on doubles.

Could you extract a direct example of code using fmin and/or fmax with a specific argument giving a wrong result (like a one-liner), which could be manually reasoned about?

E.g. do you get

fmin(fmax(0.25, 0.0), 1.0) == 0.0

and, more specifically, do you get

fmax(0.25, 0.0) == 0.0 ?

R = MIN_R(MAX_R(((REAL_R)R), ((REAL_R)0)), ((REAL_R)1));
G = MIN_R(MAX_R(((REAL_R)G), ((REAL_R)0)), ((REAL_R)1));
B = MIN_R(MAX_R(((REAL_R)B), ((REAL_R)0)), ((REAL_R)1));

Yes, it behaves exactly that way. But it’s worth noting that the PTX is loaded dynamically by the OptiX function optixModuleCreate.

Just look at the pixel colors.

Please replace with a more direct approach, e.g.

if (fmax(0.25, 0.0) == 0.0)
    R = G = B = 1.0;
else
    if (fmax(0.25, 0.0) == 0.25)
        R = G = B = 0.0;
    else
        R = G = B = 0.5;

0.25 has an exact representation in float and double, so a comparison with no tolerance (epsilon) should work.

According to your post, the result would be white.
If not (e.g. black), find out where the difference lies, e.g. change to fmin(fmax(…)) with different numbers, and so on.

Excellent idea! I’ll check this out!

Since the compiler optimized away

if (MIN_R(MAX_R(0.25, 0.0), 1.0) == 0.25) {
    R = 1.0;
    G = 1.0;
    B = 1.0;
}

(likely because the whole condition can be evaluated at compile time), I used something similar instead:

R = (R < 0.0) ? 0.0 : R;
R = (R > 1.0) ? 1.0 : R;
if (MIN_R(MAX_R(R, 0.0), 1.0) < R) {
    R = 1.0;
    G = 1.0;
    B = 1.0;
}

Here is the resulting render:

Note that this was AFTER CLAMPING R to the interval [0, 1] using

R = (R < 0.0) ? 0.0 : R;
R = (R > 1.0) ? 1.0 : R;