Several questions on CUDA (arithmetic, rounding, for loops and performance)

Hello,

I’m fairly new to CUDA programming, and I’m currently struggling with some code that I’ve tried to port from CPU to GPU, and I have several questions. Sorry, I can’t share the code; the project is currently private.

  1. In my code there is a line like this (not exactly this):

int val = static_cast<int>(ptr[idx] * vec.a + vec.b - ptr[idx]);

Here, ptr is just a pointer to uchar, and vec is a structure with two floats inside (a and b). Currently, in the CUDA kernel this line is exactly the same, without any additional CUDA arithmetic functions. I know that there is a __maf function which adds 3 floats and does a rounding (maf_rn, maf_rz, etc.). This one helps me get better performance, and I’m writing, for example, this line instead of the line above:

int val = static_cast<int>(__fmaf_rn(ptr[idx] * vec.a, vec.b, (-1)*ptr[idx]));

Unfortunately, this change leads to worse results at the end of the function. The difference between the values calculated with those two lines can be as large as val_maf = -12890 vs. val_normal = -6258, which is unacceptable. Maybe I’m applying this function wrong somehow? Also, are there any maf or add operations for int? I’ve only found __hadd, which calculates the average of two ints… (well, I guess I can use __vadd2 for this case, so never mind).

  2. Is there a normal, arithmetical rounding in CUDA? I don’t need round up, round down, round toward zero, or round to nearest even. I need rounding like in school, you know: 5.5 = 6; 5.4 = 5. I’ve tried searching NVIDIA’s page https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html#group__CUDA__MATH__INTRINSIC__SINGLE_1ga8255ea2b671a8488813d9d3527e661a and I’ve tried to just google it, but I haven’t found anything. Of course, there is always the option to just use floor(x + 0.5), but what do I do in the case of maf? I can’t just add 0.5 to all three arguments, and I don’t think that would even help. So… is there another __maf with a different rounding? Or without any rounding?

  3. Is it true that a sequential for loop inside a CUDA kernel is slower than a for loop in CPU code? I mean, if I have some for loop that I want to move to the GPU but it can’t be parallelized (each iteration depends on the result of the previous one), will I gain any performance boost by just putting it into a kernel? As I understand it, if I just put it in as-is, I won’t. So, as I understand it, I need to replace, for example, any arithmetic operations I have in the for loop with __maf, __add, __mul, and the other fast arithmetic functions CUDA has?

  4. And finally, what could cause a for loop put into a kernel, without any parallelization or CUDA arithmetic additions, to give a different result from the regular for loop? Assuming that all malloc, memcpy, and free operations are done right? I know nobody has the extrasensory ability to read the code I cannot provide; I’m sorry about that, I really am. But if you could just give me any possible options, any variants of what could be wrong (assuming, I repeat, that the CUDA memory operations are fine and that the for loop is just copy-pasted into the kernel and running sequentially, with the kernel launched with just <<<1,1>>> blocks/threads), I’ll be really grateful.

Thanks a lot in advance. Sorry for my language; English is not my native language.

If you have not had a chance to read NVIDIA’s whitepaper “Precision and Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs” (https://docs.nvidia.com/pdf/Floating_Point_on_NVIDIA_GPU.pdf) now would be a perfect time to do so. Generally speaking, the floating-point arithmetic on NVIDIA GPUs adheres to the same IEEE-754 standard that CPUs from Intel, AMD, ARM, etc adhere to. Also, the CUDA compiler optimizes floating-point expressions conservatively because floating-point arithmetic is not associative.

How has it been established that the results are worse? Simply comparing to the output of another platform is meaningless. It’s the dilemma of the person with two watches showing different time: which one is correct? What is needed for a correct assessment at minimum is a higher-precision reference.

In general, what one typically finds when using fused multiply-add operations (either by explicit use, or by the compiler generating them) is that results differ from the equivalent computation that does not use FMA operations. If one then looks more closely, one finds in most cases that the computation using FMA computes the result more accurately because (1) there is only a single rounding in FMA, but two roundings in an FMUL/FADD sequence, and (2) FMA uses the full double-width product, so it offers some protection against subtractive cancellation. Huge numeric differences between a computation expressed with or without FMA typically indicate that the computation is poorly conditioned: time to re-think the algorithm.

You can turn off contraction of FMUL/FADD into FMA by using the compiler switch -fmad=false.
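Since the question of how to pass this flag from a CMake build comes up later in the thread, here is one common way to do it. This is a sketch; “mytarget” is a placeholder, not a name from the project.

```cmake
# Pass -fmad=false to nvcc for CUDA sources only, using a generator expression.
target_compile_options(mytarget PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:-fmad=false>)

# Alternative: append it to the global CUDA flags instead.
# string(APPEND CMAKE_CUDA_FLAGS " -fmad=false")
```

Either form ends up on the nvcc command line; the generator-expression form avoids leaking the flag to host-only C++ targets.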

Rounding modes: Except for a few operations that can be exact, finite-precision floating-point operations generally need to round their results according to some rule. The IEEE-754 standard specifies multiple such rules. All programming environments using floating point based on IEEE-754 that I am familiar with default to the rounding mode “to nearest or even”. That is what CUDA does as well. For some use cases, it can be advantageous to use other rounding modes specified by IEEE-754, so CUDA provides device-function intrinsics to use these with various operations.

As for rounding to an integer, all rounding modes are incorporated into standard C++ functions as follows: ceil (round up), floor (round down), trunc (round towards zero), rint (round to nearest or even), round (round to nearest, ties away from zero). As CUDA is a C++-derivative language, it offers all of those, too.

Re multiply-add for integers: Different GPU architectures have varying degrees of support for integer multiply-add operations. The compiler will use them automatically where available, by applying the same kind of contracting optimization that is also used to generate FMAs from code without explicitly written FMAs.

Thanks, njuffa, for your reply.

If you have not had a chance to read NVIDIA’s whitepaper

Yes, I’ve read parts of this paper. Not entirely from A to Z, but I got the main idea that I’ll get slightly different results on CUDA and on the CPU. But:

How has it been established that the results are worse?

My kernel function produces one element of a vector at the end. Launching it n times with different inputs, I receive a vector of floats. So I’m comparing two vectors of floats with each other (CPU vs. GPU). And the question is: how much difference could I get between them? For example, I got a difference of about 10^-4 for some vector elements, but a difference of about 14 for others, and sometimes even higher. What is the normal difference between two numbers computed by GPU and CPU?

computation using FMA computes the result more accurately

So, if I’ve understood this right, I need to use not __maf but something else? Can you point me to the documentation page with the function names?

You can turn off contraction of FMUL/FADD into FMA by using the compiler switch -fmad=false

Hm… And if I’m using CLion and CMake to build and run the project, can you tell me how to do that? Thanks.

For some use cases, it can be advantageous to use other rounding modes specified by IEEE-754, so CUDA provides device-function intrinsics to use these with various operations.

So, as I’ve seen in the documentation, there are only 4 rounding variants for __maf and the other arithmetic operations: __fmaf_rn, __fmaf_rd, __fmaf_ru, __fmaf_rz (for example). The question was: is there another rounding like x.(5-9) = x+1, x.(0-4) = x? But I guess I can just turn these roundings off by using -fmad=false? So how will those fmaf functions work then? The same way, regardless of the _rz, _ru endings?

As CUDA is a C++ –derivative language, it offers all of those, too.

Well, in the documentation I’m once again seeing only those 4 endings for the function __float2int, for example: rd, ru, rz, rn. No rounding like C++’s round. Or am I missing something?

Different GPU architectures have varying degrees of support for multiply-add integer operations

My GPU is a GTX 1070 in a laptop; as I remember, compute capability 6.2 or 6.3.

The compiler will use them automatically where available

If I got you right there, the compiler will force something like this

ptr[idx] * vec.a + vec.b - ptr[idx]

to use FMAs?

Huge numeric differences between a computation expressed with or without FMA typically indicate that the computation is poorly conditioned

Well, I mentioned the simple example of

int val = static_cast<int>(ptr[idx] * vec.a + vec.b - ptr[idx]);

and

int val = static_cast<int>(__fmaf_rn(ptr[idx] * vec.a, vec.b, (-1)*ptr[idx]));

And the difference between those two is huge. I don’t think that’s an algorithmic issue, since that’s the only line I’ve been changing to see the result, and I’ve compared the resulting vals directly using printf from the kernel. For example, here is a simple kernel
__global__ void check_maf()
{
    unsigned char ptr_idx = 1;
    float vec_a = 5.5f;
    float vec_b = 7.9f;
    int val  = static_cast<int>(ptr_idx * vec_a + vec_b - ptr_idx);
    int val2 = static_cast<int>(__fmaf_rn(ptr_idx * vec_a, vec_b, (-1) * ptr_idx));
    printf("val = %d, val2 = %d\n", val, val2);
}

just to reproduce the issue. In this example I’m receiving val = 12, val2 = 42. If I’m getting something wrong, I’ll be glad to hear where, since I’ve been struggling with this for a while… And the interesting thing is, no matter which ending of the __fmaf function I use (rd, rn, etc.), it still gives me the same result…

There is no such thing as a “normal” difference. Depending on your computation, a single small difference in intermediate computation could result in a very large difference in the output. Your computation might be ill-conditioned. Differences in floating-point results between platforms are common and not something particular to host/device comparison.

That’s correct. GPU hardware supports four IEEE-754 rounding modes for arithmetic operations. On a CPU where you can select the rounding mode dynamically with the standard C++ function fesetround() you get the same four rounding modes, except they are called FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD. As I pointed out, at the HLL-level you get five rounding modes for rounding to integers specifically, where on the GPU, round() uses an emulation sequence, while trunc(), ceil(), floor(), rint() each map directly to a hardware instruction.

No. Rounding modes for arithmetic operations have nothing to do with -fmad=false. Except for a few arithmetic operations that can be exact, such as remainder() and remquo(), all arithmetic operations provide non-exact, rounded, results most of the time. -fmad=false disables the contraction of an FMUL operation followed by a dependent FADD operation into a single FMA operation.

I am afraid I don’t know what “c++ round” is. Can you show me an example? Ordinary floating-point arithmetic in C++ uses operations that round. So when you write “a+b”, that maps to fadd_rn(a,b) at the machine language level on a GPU. Or, on your x86 CPU you might see a plain fadd instruction whose rounding mode is controlled dynamically by a control register, with the rounding mode defaulting to “rn” (round to nearest or even). Either way you get the same IEEE-754 specified addition functionality.

By default, because it is generally conducive to performance and accuracy, the CUDA compiler turns on FMA-contraction, so you would see an FMA instruction used for an expression like this. You can check the generated code with the utility cuobjdump --dump-sass. You can turn off FMA contraction by passing -fmad=false on the nvcc command line.

__fmaf_rn(ptr_idx * vec_a, vec_b, (-1)*ptr_idx) computes this: ptr_idx * vec_a * vec_b - ptr_idx, so the two expressions don’t compute the same thing. You could use FMA to compute __fmaf_rn(ptr_idx, vec_a, vec_b - ptr_idx) or __fmaf_rn(ptr_idx, vec_a, vec_b) - ptr_idx or __fmaf_rn(ptr_idx, vec_a, -ptr_idx) + vec_b.

Note that in general, float to int conversion can turn small differences into large ones. E.g. ((int)1.999999f) evaluates to 1, while ((int)2.000001f) evaluates to 2.

For rounding to an integer specifically, this is what round() does. It rounds to nearest, with ties (i.e., x.500000…) rounded away from zero. So rint(3.5) = 4, rint(4.5) = 4, rint(5.5) = 6, but round(3.5) = 4, round(4.5) = 5, round(5.5) = 6. This rounding mode, “round to nearest, ties away from zero”, is not available for other floating-point operations. If you look at Intel’s documentation for AVX2, I think you will find the exact same situation when it comes to available rounding modes.

Alright. Then, I guess, I need to check every line, line by line. Though it is strange. As I said, I just moved the same for loop from CPU to GPU and changed almost nothing except adding atomicAdd functions. Oh, well.

I am afraid I don’t know what “c++ round” is

Well, as I said, it is the rounding that uses the rule x.(5-9) = x+1 and x.(0-4) = x, i.e. round() in C++. But you’ve already answered this, so never mind then.

__fmaf_rn(ptr_idx * vec_a, vec_b, (-1)*ptr_idx) computes this: ptr_idx * vec_a * vec_b - ptr_idx

So you want to say that fmaf(x, y, z) computes x*y + z? Oh, man. My bad. I misread x × y + z as x + y + z. Sorry about that.

Note that in general, float to int conversion can turn small differences into large ones

Yes, I totally understand this.

Thanks for all your answers. Though, questions 3 and 4 are still open. We haven’t addressed question 3, and I’ll explain what bothers me about question 4. I’ve got your two points: a) there is a difference between CPU computations and CUDA computations, and at least some of the differences between them can be explained by that; and b) there are probably some issues with the algorithm in general (though, as I said, I’m just taking a ready-to-use algorithm from the CPU and transferring it as-is to the GPU; the CPU function works fine and has been used by our group for a long time, so its algorithm should be just fine). I guess I’m just missing something in the transfer, and it is not as simple as I thought.

But here is what still bothers me: there are two kernels. One with two for loops (one inside the other) and another without for loops. The same algorithm lies inside both, but in one case I’m launching it like kernel_with_for<<<1,1>>>(input parameters) and in the other like kernel_parallel<<<blocks, threads>>>(same input parameters). The only difference inside those kernels is the presence of the for loop. And in the end, I’m receiving different results for the same inputs from those two kernels. That’s what bothers me. Moreover, the parallel one gives better results, though they are not perfect. And I actually don’t know what the cause could be.

and changed almost nothing except adding atomicAdd functions.

Floating-point arithmetic is not associative. By using atomic adds, you are changing the order of summation in an unpredictable way.

Floating-point arithmetic is not associative. By using atomic adds, you are changing the order of summation in an unpredictable way.

Alright. So using atomicAdd is bad. But without it the results are even worse, since I have to do a += operation on the variable each iteration, and it depends on the previous one. So atomic operations are the only possible way to do such a thing. Or am I wrong? If there is a better way to ensure that += won’t affect anything in a wrong way, I’ll be glad to know, since I haven’t found anything like it.

It’s not that atomic adds are “bad”. It’s a matter of being aware that using atomic adds leads to an unspecified order of summation and that this can alter results because floating-point arithmetic is not associative.