Hello,
I'm fairly new to CUDA programming and I'm currently struggling with some code that I've tried to port from the CPU to the GPU, so I have several questions. Sorry, I can't share the code; the project is currently private.
In my code there is a line (not exactly this one):
int val = static_cast<int>(ptr[idx] * vec.a + vec.b - ptr[idx]);
Here, ptr is a pointer to unsigned char and vec is a structure holding two floats (a and b). In the CUDA kernel this line is currently the same, without any CUDA-specific arithmetic functions. I know there is the __fmaf family of intrinsics (__fmaf_rn, __fmaf_rz, etc.), which, as I understand it, takes three floats and rounds the result once. It gives me better performance, so I write, for example, this line instead of the one above:
int val = static_cast<int>(__fmaf_rn(ptr[idx] * vec.a, vec.b, (-1) * ptr[idx]));
Unfortunately, this change leads to worse results at the end of the function. The values calculated with those two lines can differ a lot, for example val_maf = 12890 versus val_normal = 6258, which is unacceptable. Maybe I'm applying this function wrong somehow? Also, are there any fused multiply-add or add intrinsics for int? I've found only __hadd, which calculates the average of two ints... (well, I guess I can use __vadd2 for this case, so never mind).
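To make the first question concrete, here is a minimal host-side sketch of what the two lines compute. It assumes the original expression is ptr[idx] * vec.a + vec.b - ptr[idx], that the third FMA argument is meant to be -ptr[idx], and it uses std::fmaf as a stand-in for __fmaf_rn (both evaluate x * y + z with a single rounding). The Vec struct and the function names are hypothetical, chosen only for illustration.

```cpp
#include <cmath>

// Hypothetical stand-in for the two-float structure described in the post.
struct Vec { float a; float b; };

// The original CPU expression: p * a + b - p, truncated toward zero by the cast.
int original_line(unsigned char p, Vec v) {
    return static_cast<int>(p * v.a + v.b - p);
}

// Same argument order as the __fmaf_rn line in the post.
// fmaf(x, y, z) computes x * y + z, so this evaluates (p * a) * b - p,
// which is a different formula from the original.
int fma_as_written(unsigned char p, Vec v) {
    return static_cast<int>(std::fmaf(p * v.a, v.b, -1.0f * p));
}

// One possible way to map the original expression onto a single FMA:
// p * a + (b - p). The rounding may differ slightly from the original
// because the additions are grouped differently.
int fma_regrouped(unsigned char p, Vec v) {
    return static_cast<int>(std::fmaf(static_cast<float>(p), v.a, v.b - p));
}
```

With the hypothetical inputs p = 200, a = 2.5, b = 3, the original line and the regrouped FMA both give 303, while the FMA written in the same argument order as in the post gives 1300. Whether this is what actually happens in the private code is only a guess based on the lines as quoted.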

Is there a normal, arithmetic rounding in CUDA? I don't need round up, round down, round toward zero or round to nearest even. I need the kind taught in school, you know: 5.5 = 6; 5.4 = 5. I've searched NVIDIA's page https://docs.nvidia.com/cuda/cudamathapi/group__CUDA__MATH__INTRINSIC__SINGLE.html#group__CUDA__MATH__INTRINSIC__SINGLE_1ga8255ea2b671a8488813d9d3527e661a and I've tried just googling it, but I haven't found anything. Of course there is always the option of floor(x + 0.5), but what should I do in the case of __fmaf? I can't just add 0.5 to all three arguments, and I don't think that would even help. So... is there any other __fmaf variant with a different rounding? Or one without any rounding?
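For reference, a small host-side sketch of the rounding being asked about. It relies on lroundf/roundf from the standard C math library, which round halfway cases away from zero and which, as far as I can tell, are also listed among the CUDA single-precision math functions. The function names below are hypothetical. The sketch also assumes that the _rn/_rz suffix of an FMA only controls how the float result is rounded, so the conversion to int is a separate step.

```cpp
#include <cmath>

// "School" rounding (half away from zero): 5.5 -> 6, 5.4 -> 5, -5.5 -> -6.
int round_school(float x) {
    return static_cast<int>(std::lround(x));
}

// A plain cast truncates toward zero, as static_cast<int> does in the post.
int round_truncate(float x) {
    return static_cast<int>(x);
}

// Compute x * y + z with one rounding to float, then apply school rounding
// to the float result as a separate step.
int fma_then_round(float x, float y, float z) {
    return static_cast<int>(std::lround(std::fmaf(x, y, z)));
}
```

Under these assumptions nothing needs to be added to the three FMA arguments; the rounding to an integer is applied once, to the result.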

Is it true that a sequential for loop inside a CUDA kernel is slower than the same for loop in CPU code? I mean, if I have a for loop that I want to move to the GPU but that can't be parallelized (each iteration depends on the result of the previous one), will I gain any performance just by putting it into a kernel? As I understand it, if I put it in as is, I won't. So, do I need to replace the arithmetic operations in the loop with __fmaf_rn, __fadd_rn, __fmul_rn and the other fast arithmetic intrinsics that CUDA has?

And finally, what could cause a for loop placed in a kernel, without any parallelization or CUDA arithmetic intrinsics, to give a different result from the regular CPU for loop, assuming that all malloc, memcpy and free operations are done right? I know nobody has psychic abilities to read code that I cannot provide; I'm sorry about that, I really am. But if you could give me any possible explanations of what could be wrong (assuming, I repeat, that the CUDA memory operations are fine and that the for loop is just copy-pasted into the kernel and runs sequentially, with the kernel launched with <<<1,1>>> blocks/threads), I'll be really grateful.
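One candidate I can think of, offered as an assumption and not as a diagnosis: nvcc by default (-fmad=true) may contract a multiply followed by an add into a single fused multiply-add, which rounds once instead of twice. The host-side sketch below uses hypothetical inputs, picked so that the two paths disagree, to show how the same expression can produce different floats. With a truncating cast to int, and a loop where each iteration feeds the next, such a small difference could in principle grow.

```cpp
#include <cmath>

// Multiply, round the product to float, then add: two roundings.
// The volatile store keeps the compiler from fusing the two operations.
float mul_then_add(float x, float y, float z) {
    volatile float product = x * y;
    return product + z;
}

// Fused multiply-add: the exact product is added to z and rounded once.
float fused(float x, float y, float z) {
    return std::fmaf(x, y, z);
}

// Hypothetical inputs. x = 1 + 2^-12, so x * x = 1 + 2^-11 + 2^-24 exactly;
// the 2^-24 term is lost when the product alone is rounded to float.
float sample_x() { return 1.0f + std::ldexp(1.0f, -12); }
float sample_z() { return -(1.0f + std::ldexp(1.0f, -11)); }
```

With these inputs, mul_then_add(x, x, z) returns exactly 0 while fused(x, x, z) returns 2^-24. If this is the cause, compiling the kernel with -fmad=false should bring the GPU result closer to the CPU one, though I have not been able to confirm that for the private code.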
Thanks a lot in advance. Sorry for my language; English is not my native language.