Cost of data type?

What's the difference between lines 1 and 2?
Why isn't double's cost just double that of float's?
Why doesn't the compiler replace the constant expression with 1?

__global__ void sub_kernel(int* counter, int* b)
{
    int k = *counter;
    for (int i = 0; i < 10; i++)
    {
        (*counter) += *counter / 2;  // line 1: fast
        (*counter) += 2 / *counter;  // line 2: much slower
        (*counter) += 2 / k;         // line 3: as fast as line 1
        (*counter) += 2 / 1.3f;      // line 4: slower
        (*counter) += 2 / 1.3;       // line 5: slowest (tens of times slower than the 1.3f line)
    }
}

Because typical consumer GPUs process double-precision operations at 1/32 or 1/64 of the throughput of single-precision operations. You are presumably using a consumer GPU rather than an expensive professional-grade GPU.

Which “const expr” are you referring to?

First line: An integer division by a compile-time constant that is a power of two resolves to a single right-shift instruction.

Second line: An integer division by a variable resolves to an actual integer division operation, which is an inlined instruction sequence of some 16 instructions or so.

2 / k is loop invariant. It gets computed once, using an integer division operation (a canned sequence of 16 or so instructions). The compiler notices that the loop is taken ten times, so it computes (2 / k) * 10 and stores that. Loop eliminated.
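
For illustration, here is roughly what these three cases amount to in plain C (a sketch of the transformations described above, not actual compiler output):

int a = *counter / 2;  // line 1: divisor is a power-of-two constant, reducible to a right shift
int b = 2 / *counter;  // line 2: divisor is a run-time value, needs the full ~16-instruction division sequence
// line 3: 2 / k is loop-invariant, so with the known trip count of 10
//   for (int i = 0; i < 10; i++) (*counter) += 2 / k;
// can be collapsed to
int q = 2 / k;         // one canned division sequence
(*counter) += q * 10;  // loop eliminated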

"Which 'const expr' are you referring to?"

I mean:
(*counter) += 2 / 1.3; can be replaced with
(*counter) += 1; at compile time, obviously.
I think the last two lines are exactly the same after compilation, but the results show they are quite different.

The type conversion is implicit, so it is really (int)((float)2 / 1.3f) = (int)(1.538461565971f). At least the older CUDA compiler I am looking at here on my web-browsing PC does not propagate the constant through the FP32 → I32 conversion, so it never gets to see the 1. Thus it happily performs a floating-point addition, and the FP64 addition is a lot slower than the FP32 addition, leading to your observation.
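
In other words, per loop iteration the two slow lines effectively compute something like the following (my reading of the generated code; the double constant is the approximate value of 2 / 1.3):

(*counter) = (int)((float)(*counter) + 1.538461565971f);      // line 4: FP32 add
(*counter) = (int)((double)(*counter) + 1.5384615384615385);  // line 5: FP64 add, throttled on consumer GPUs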

Let me check what CUDA 12.3 does on my workstation.

Nope, no luck with CUDA 12.3 either. I am not a compiler engineer, but to my thinking there is nothing that should preclude the compiler from propagating the constant through the F2I conversion to find that it results in 1, which could then be used to further strength-reduce the loop.

You may wish to file a bug with NVIDIA on this. Non-contrived scenarios where this could cause a negative performance impact seem possible.

Now, I may be overlooking something tricky in terms of C++ semantics, so a good sanity check would be to hand this code to the host compiler to see whether it will propagate the constant through the float-to-int conversion.

I tried gcc 13.2 and clang 18.1.0 with -O3 -march=core-avx2. They both do what the CUDA compiler does: convert integer to float, add 1.538461565971f, convert float to int; this instruction sequence is repeated ten times.

Intel icx 2024.0.0 with -O3 -march=core-avx2 is a tiny bit smarter. It converts *counter to float once, then adds 1.538461565971f followed by a trunc operation (vroundss mode 11); rinse and repeat 10 times, then converts float to int once at the end.
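
Expressed as C, my paraphrase of the icx-generated code looks roughly like this:

float acc = (float)(*counter);            // convert to float once
for (int i = 0; i < 10; i++) {
    acc = truncf(acc + 1.538461565971f);  // add, then truncate (vroundss with mode 11)
}
*counter = (int)acc;                      // convert back to int once at the end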

The refusal of these highly-optimizing compilers to propagate 1.538461565971f through the float-to-int conversion likely means something, but I don’t know what that could be. If you file a performance bug with NVIDIA you may find out.

[Later:]

I think I understand it now, and the compiler behavior follows from plain C++ semantics.

(*counter) += 2 / 1.3f;

means

(*counter) = (*counter) + 2 / 1.3f;

means

(*counter) = (int)((float)(*counter) + ((float)2 / 1.3f));

which can be optimized to (as reflected in the generated machine code):

(*counter) = (int)((float)(*counter) + 1.538461565971f);

Basically we have the addition of an int and a float on the RHS, causing the int to be promoted to float by the rules for implicit conversions applied during expression evaluation. To get the desired code optimization, you would need to write

(*counter) += (int)(2 / 1.3f);
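
With the explicit cast, the addend on the right-hand side is an integer constant expression, so the strength reduction one expects becomes available (a sketch, not verified compiler output):

// (int)(2 / 1.3f) is the integer constant 1, so
//   for (int i = 0; i < 10; i++) (*counter) += (int)(2 / 1.3f);
// can be reduced to
(*counter) += 10;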


[Even later:]

What seems to be missing from the CUDA compiler is a potential peephole optimization that contracts

int i;
float f, r;
i = f;
r = i;

into

int i;
float f, r;
r = truncf (f);

Analogous for double and long long int. The ISO-C++11 standard specifies in section 4.9:

A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.

Since the only cases where the behavior of the two code variants above could differ are those inputs where the “truncated value cannot be represented in the destination type”, and the compiler assumes the absence of undefined behavior when optimizing, this seems like a legitimate transformation.

Unfortunately, it is not legitimate under strict IEEE-754 floating-point semantics: the two code variants differ for an input of -0.0f, for which truncf() delivers a negative zero, while the conversion through the intermediate integer delivers a positive zero. So the optimization is not suitable by default, but it may be suitable when -use_fast_math is specified, or when the compiler can prove that negative zero is not among the possible inputs.
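
That corner case is easy to check on the host (standard C++, nothing CUDA-specific):

#include <cmath>
#include <cstdio>

int main() {
    float f = -0.0f;
    float via_int   = (float)(int)f;    // -0.0f -> int 0 -> +0.0f
    float via_trunc = std::truncf(f);   // truncf(-0.0f) = -0.0f
    // signbit() distinguishes the two zeros; this prints "0 1"
    std::printf("%d %d\n", (int)std::signbit(via_int), (int)std::signbit(via_trunc));
    return 0;
}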