I don’t know how to use reduction. Can you please clarify what a reduction is and what it does when I use it with the code?
A reduction in this context takes a set of values and “reduces” them into a single value. A sum reduction is the most common, but OpenACC also supports min, max, product, and bitwise or/and.
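For example, a sum reduction over an array can be written like the minimal sketch below (the array, its size, and its contents are placeholders made up for illustration):

#include <cstdio>

int main()
{
    const int N = 1000;
    double x[N];
    for (int i = 0; i < N; ++i)
        x[i] = 1.0;

    double total = 0.0;
    // The reduction clause gives each gang/vector its own private copy of
    // "total", then combines the partial sums into the final value.
    #pragma acc parallel loop reduction(+:total)
    for (int i = 0; i < N; ++i)
        total += x[i];

    printf("total = %f\n", total);   // expect 1000.000000
    return 0;
}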
Is reduction a pragma like acc parallel loop, and why is it correct to use it in this case?
Reduction is a clause associated with a loop directive. See section 2.9.11 of the OpenACC standard for full details.
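Applied to your factorial code, an explicit clause would look something like the sketch below (a fixed input value is assumed instead of scanf, just for brevity); it mirrors the implicit reduction(*:factorial) that the compiler reports in the -Minfo output later in this thread:

#include <cstdio>

int main()
{
    int num = 100;          // assumption: fixed input instead of scanf, for brevity
    double factorial = 1;
    // The reduction clause is attached to the loop directive; "*" selects a
    // product reduction, matching the implicit reduction(*:factorial) the
    // compiler reports.
    #pragma acc parallel loop reduction(*:factorial)
    for (int a = 1; a <= num; a++)
        factorial *= a;
    printf("Factorial of Given Number is =%lf\n", factorial);
    return 0;
}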
How does OpenACC solve the problem of shared memory here? If that is what is happening, why can’t I see it when I print the factorial of 40, for example, and how can the GPU handle this type of example using acc parallel loop?
Reductions don’t use shared memory. As I noted before, reductions use independent variables (at the gang, worker, or vector level, depending upon where the reduction clause appears) to accumulate a “partial reduction”. Then, in a separate step added by the compiler, the partial reductions are accumulated again into the final reduction, which is then used to set the value back to the original variable.
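Conceptually, the compiler-generated code behaves something like the serial sketch below. The thread count, the round-robin distribution of iterations, and the variable names are assumptions for illustration only; the real implementation differs by target architecture:

#include <cstdio>

int main()
{
    const int num = 100;          // assumption: same input as the runs below
    const int num_threads = 4;    // assumption: 4 parallel workers
    double partial[num_threads];
    for (int t = 0; t < num_threads; ++t)
        partial[t] = 1.0;         // each worker gets its own private accumulator

    // Step 1: each worker multiplies its share of the iterations into its
    // own partial product; no shared variable is touched.
    for (int t = 0; t < num_threads; ++t)
        for (int a = t + 1; a <= num; a += num_threads)
            partial[t] *= a;

    // Step 2: a separate, compiler-added step combines the partial products
    // and stores the result back into the original variable.
    double factorial = 1.0;
    for (int t = 0; t < num_threads; ++t)
        factorial *= partial[t];

    printf("Factorial of Given Number is =%lf\n", factorial);
    return 0;
}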
Do the GPU and the number of GPU cores affect the result and its correctness?
Again, the issue really isn’t about correctness or GPUs; it’s about the order of operations. Both values are correct (or both are “wrong”, depending on how you look at it), and you’d see similar issues if you were to run this in parallel on a CPU. Also, using an atomic won’t necessarily help, since atomics don’t enforce the order of operations, only that the update is visible to all threads.
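The root cause is that floating-point arithmetic is not associative, so regrouping the same operations changes the rounding. Here is a tiny stand-alone illustration (using addition for simplicity; the same applies to the product):

#include <cstdio>

int main()
{
    // The same three values summed in two different groupings.
    double left  = (0.1 + 0.2) + 0.3;   // serial, left-to-right order
    double right = 0.1 + (0.2 + 0.3);   // a regrouping, as a parallel run might produce
    printf("%.17g\n%.17g\n", left, right);   // the two results differ in the last bits
    return 0;
}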
In general, yes, the number of threads in use can affect the order of operations, and hence affect the variance. Though for a GPU I doubt it will matter here, since you’re simply using too few of the cores. Let’s look at using OpenACC targeting a multicore CPU with different numbers of threads:
First, let’s disable optimization and FMA and run the code sequentially:
% nvc++ -Minfo=accel prod.cpp -O0 -Kieee -Mnofma
% a.out
100
Factorial of Given Number is =93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248.000000
Next, compile targeting a multicore CPU, but use a single thread:
% nvc++ -Minfo=accel prod.cpp -O0 -Kieee -Mnofma -acc=multicore
main:
7, Generating Multicore code
9, #pragma acc loop gang
9, Generating implicit reduction(*:factorial)
% setenv ACC_NUM_CORES 1
% a.out
100
Factorial of Given Number is =93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248.000000
The results match because the code is running with a single thread in both cases. Now, let’s vary the number of cores:
% setenv ACC_NUM_CORES 2
% a.out
100
Factorial of Given Number is =93326215443944126576986155452264574909672612614072335170089899970495111151406684221849403559869420087934419480475499584406636590300478012164871871456230768640.000000
% setenv ACC_NUM_CORES 20
% a.out
100
Factorial of Given Number is =93326215443944138771316430124109228744036791493628217000551394755538669194988557758457758324578873682880134572241011927480586283295098697508526868676095246336.000000
The results now vary depending upon the number of threads in use.
If my professor wants me to use atomic, what is your suggestion, Mat, and are there any alternatives?
Unfortunately, since this is a compiler issue, there’s not much you can do with the product example until the fix is available in a released compiler. However, atomic will work with the sum example.
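For reference, an atomic update on a sum would look something like the sketch below (a made-up sum loop, not your factorial code). Note that the atomic only makes each individual update safe; it does not restore the sequential order of operations:

#include <cstdio>

int main()
{
    const int num = 100;
    double sum = 0.0;

    // "copy(sum)" keeps a single shared copy of sum on the device; otherwise
    // the scalar would be implicitly firstprivate and the updates lost.
    #pragma acc parallel loop copy(sum)
    for (int a = 1; a <= num; a++)
    {
        // The atomic guarantees no update is lost, but it does not fix
        // the order in which the additions happen.
        #pragma acc atomic update
        sum += a;
    }

    printf("Sum of 1..%d = %lf\n", num, sum);   // expect 5050.000000
    return 0;
}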
- Does reduction solve the correctness of the result, since I get this unknown error when I am using atomic?
- And, as you mentioned, the compiler has already applied a reduction, but the result is still not correct. How can I reach a perfect result, since my professor said we are looking for a perfect result, not the execution time?
Again, I would argue that both results are correct (or equally wrong) and verification should check to see if the results are within an acceptable tolerance. The only way to get the results to agree would be to lower optimization, enable IEEE 754 compliance, disable FMA (depending on what CPU you’re using), and run serially on the device.
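A verification step along those lines might look like the following sketch. The tolerance and the two result values are placeholders; choose whatever tolerance your assignment allows:

#include <cstdio>
#include <cmath>

// Returns true when the computed value agrees with the reference to within
// a relative tolerance. The default 1e-12 is an assumption; use whatever
// your assignment allows.
static bool close_enough(double computed, double reference, double rel_tol = 1e-12)
{
    return std::fabs(computed - reference) <= rel_tol * std::fabs(reference);
}

int main()
{
    double serial_result   = 9.332621544394410e+157;  // placeholder: serial run
    double parallel_result = 9.332621544394413e+157;  // placeholder: parallel run
    printf("%s\n", close_enough(parallel_result, serial_result) ? "PASS" : "FAIL");
    return 0;
}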
% cat prod.cpp
#include <cstdio>

int main()
{
    int num;
    double factorial = 1;
    scanf("%d", &num);
#pragma acc serial loop
    for (int a = 1; a <= num; a++)
    {
        /* shared data
         * non-deterministic
         */
        factorial *= a;
    }
    printf("Factorial of Given Number is =%lf\n", factorial);
    return 0;
}
% nvc++ -Minfo=accel prod.cpp -O0 -Kieee -Mnofma -acc
main:
7, Accelerator serial kernel generated
Generating Tesla code
9, #pragma acc for seq
% a.out
100
Factorial of Given Number is =93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248.000000
% nvc++ -Minfo=accel prod.cpp -O0 -Kieee -Mnofma
% a.out
100
Factorial of Given Number is =93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248.000000
Note that optimization, even when running sequentially on the CPU, can affect accuracy:
% nvc++ -Minfo=accel prod.cpp -O2
% a.out
100
Factorial of Given Number is =93326215443944175354307254139643190247129328132295862491935879110669343325734178368282822618707234467717279847537548956702435362278960753539491860335688679424.000000