Why do I get different values when I run the code in serial and when I run it using OpenACC?

I am using this code:

#include <iostream>
using namespace std;
int main()
{
int num;
double factorial = 1;
scanf("%d", &num);
#pragma acc parallel loop
for (int a = 1; a <= num; a++)
{
/* shared data
* non-deterministic
*/
factorial = factorial * a;
}
printf("Factorial of Given Number is =" "%lf\n" , factorial );
return 0;
}

When I run it in serial and in parallel with a small number like 40, I get the same result from both:

Factorial of Given Number is =815915283247897683795548521301193790359984930816.000000

But when I provide a bigger number like 100, I get this result in serial:

Factorial of Given Number is =93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248.000000

and this result in parallel:

Factorial of Given Number is =93326215443944126576986155452264574909672612614072335170089899970495111151406684221849403559869420087934419480475499584406636590300478012164871871456230768640.000000

My professor told me that factorial is in shared memory (non-deterministic), so the factorial inside the printf won’t know the last value it received from the loop, and he said I should use the atomic pragma with it. I replaced the pragma with #pragma acc atomic and it gave me this error:

NVC++-S-0155-Invalid atomic expression (Acc_factorial.cpp: 12)
NVC++-S-0155-Invalid atomic expression (Acc_factorial.cpp: 18)
NVC++-S-0155-Invalid atomic expression (Acc_factorial.cpp: 18)
NVC++-F-0000-Internal compiler error. bitcast with differing sizes -4 (Acc_factorial.cpp: 12)
NVC++/x86-64 Linux 20.7-0: compilation aborted

I also tried it inside the loop and got this error:

NVC++-F-0000-Internal compiler error. bitcast with differing sizes -4 (Acc_factorial.cpp: 12)
NVC++/x86-64 Linux 20.7-0: compilation aborted

The same issue happens in this code as well:

#include <iostream>
#include <iomanip>
#include <cstdlib>

using std::cout;

int main(int argc, char *argv[])
{
long i, nsteps;
double pi, step, sum = 0.0, x;
nsteps = 100;
step = (1.0) / nsteps;

for (i = 0; i < nsteps; ++i)
{
x = (i + 0.5) * step;

sum = sum + 1.0 / (1.0 + x * x);

}
pi = 4.0 * step * sum;
cout << std::fixed;
cout << "pi is " << std::setprecision(15) << pi << "\n";
}

When I increase the setprecision value, I get the same issue as with the previous factorial code, and my professor said it is the same issue that happens in factorial.

My question is: why does a smaller number give me the correct answer while a huge number gives me the wrong result? Is that related to my GPU model or something else, and does atomic work for this type of issue or not?
My GPU is a GeForce GTX 960M.

Thank you in advance

-Hisham

Hi Hisham,

My professor told me that factorial is in shared memory (non-deterministic), so the factorial inside the printf won’t know the last value it received from the loop, and he said I should use the atomic pragma with it.

In this case you can use a reduction. Looking at the compiler feedback messages, you can see that the compiler is implicitly adding a reduction for you:

% nvc++ fact.cpp -acc -Minfo=accel -V20.7
main:
      7, Generating Tesla code
          9, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating implicit reduction(*:factorial)
      7, Generating implicit copy(factorial) [if not already present]

Reductions can be performed in parallel by having each thread accumulate a partial reduction and then launching a second kernel to combine the partial reductions into the final reduction.

NVC++-F-0000-Internal compiler error. bitcast with differing sizes -4 (Acc_factorial.cpp: 12)
NVC++/x86-64 Linux 20.7-0: compilation aborted

This is a known compiler error that will be fixed in the upcoming 20.11 release. Sans this error, you could use an atomic to accomplish the same thing, but reductions are more performant so should be used in this case.

My question is: why does a smaller number give me the correct answer while a huge number gives me the wrong result? Is that related to my GPU model or something else, and does atomic work for this type of issue or not?

The result isn’t necessarily wrong, just different. It’s best to think of all floating point computation as always “wrong” in that almost all floating point values can’t be represented exactly. Floating point operations result in a slight amount of rounding error, with the amount of error being affected by the precision, choice of operations (such as FMA), optimizations, order of operations, etc.

Now you can increase the precision, lower the optimization level (-O0), tell the compiler to use strict IEEE754 compliance (-Kieee) and disable/enable FMA operations (-Mfma (default) / -Mnofma), to get closer to consistent results. However when run in parallel, the order of operations will be different and hence you should not expect bit for bit agreement. Instead you should be checking if the results are within an acceptable tolerance.
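A tolerance check of that kind could look like the following minimal sketch. The helper name `within_tolerance` and the choice of a relative (rather than absolute) tolerance are assumptions for illustration; the right tolerance depends on the application.

```cpp
#include <algorithm>
#include <cmath>

// True if a and b agree to within rel_tol, measured relative to the
// larger of the two magnitudes (hypothetical helper, not a library call).
bool within_tolerance(double a, double b, double rel_tol) {
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * scale;
}
```

The serial and parallel values of 100! printed earlier in this thread (roughly 9.3326215443944102e157 and 9.3326215443944126e157) differ by a relative amount on the order of 1e-16, so `within_tolerance(serial, parallel, 1e-12)` accepts them as equal.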

Thank you for your reply Mat,

  • I don’t know how to use a reduction. Can you please clarify what a reduction is and what it does when I use it with this code?

  • Is reduction a pragma like acc parallel loop, and why is it correct when I use it in this case?

  • How does OpenACC solve the problem of shared memory here? If it is happening here, why can’t I see it when I print the factorial of 40, for example, and how can the GPU handle this type of example using acc parallel loop?

  • Do the GPU and the number of GPU cores affect the result and its correctness?

  • If my professor wants me to use atomic, what is your suggestion Mat, and are there any alternatives?

you mentioned:

but reductions are more performant so should be used in this case.

  • Does reduction solve the correctness in the result since I have this unknown error when I am using atomic?
  • And as you mentioned, the compiler already applied a reduction, but the result is still not correct. How can I reach the perfect result, since my professor said we are looking for a perfect result, not execution time?

Thank you in advance
-Hisham

I don’t know how to use a reduction. Can you please clarify what a reduction is and what it does when I use it with this code?

A reduction in this context takes a set of values and “reduces” them to a single value. A sum reduction is the most common, but OpenACC also supports min, max, product, and bitwise or/and.

Is reduction a pragma like acc parallel loop, and why is it correct when I use it in this case?

Reduction is a clause associated with a loop directive. See section 2.9.11 of the OpenACC standard for full details.
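As a minimal sketch of what the clause looks like when written out explicitly on the factorial loop from this thread (the function name `factorial_reduction` is made up for illustration):

```cpp
// Product of 1..n with the reduction stated explicitly. If OpenACC is not
// enabled at compile time, the pragma is ignored and the loop runs serially.
double factorial_reduction(int n) {
    double factorial = 1.0;
    #pragma acc parallel loop reduction(*:factorial)
    for (int a = 1; a <= n; ++a) {
        factorial *= a;
    }
    return factorial;
}
```

The clause names the operator (`*`) and the variable being reduced. Each thread then works on its own private copy of `factorial`, and the copies are combined at the end of the loop.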

How does OpenACC solve the problem of shared memory here? If it is happening here, why can’t I see it when I print the factorial of 40, for example, and how can the GPU handle this type of example using acc parallel loop?

Reductions don’t use shared memory. As I noted before, reductions use independent variables (at the gang, worker, or vector level, depending upon where the reduction clause appears) to accumulate a “partial reduction”. Then, in a separate step added by the compiler, the partial reductions are combined into the final reduction, which is then written back to the original variable.

Do the GPU and the number of GPU cores affect the result and its correctness?

Again, the issue really isn’t about correctness or GPUs, it’s about the order of operations. Both values are correct (or both are “wrong” depending on how you look at it) and you’d see similar issues if you were to run this in parallel on a CPU. Also, using an atomic won’t necessarily help since atomics don’t enforce the order of operations, only that the update is visible to all threads.

In general, yes, the number of threads in use can affect the order of operations, and hence affect the variance. Though, for GPU I doubt that it will matter since you’re simply using too few cores. Still, let’s look at using OpenACC targeting a multicore CPU with different numbers of threads:

First, let’s disable optimization and FMA and run the code sequentially:

% nvc++ -Minfo=accel prod.cpp -O0 -Kieee -Mnofma
% a.out
100
Factorial of Given Number is =93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248.000000

Next, compile targeting multicore CPU, but use a single thread:

% nvc++ -Minfo=accel prod.cpp -O0 -Kieee -Mnofma -acc=multicore
main:
      7, Generating Multicore code
          9, #pragma acc loop gang
      9, Generating implicit reduction(*:factorial)
% setenv ACC_NUM_CORES 1
% a.out
100
Factorial of Given Number is =93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248.000000

The results match because the code is running with a single thread in both cases. Now, let’s vary the number of cores:

% setenv ACC_NUM_CORES 2
% a.out
100
Factorial of Given Number is =93326215443944126576986155452264574909672612614072335170089899970495111151406684221849403559869420087934419480475499584406636590300478012164871871456230768640.000000
% setenv ACC_NUM_CORES 20
% a.out
100
Factorial of Given Number is =93326215443944138771316430124109228744036791493628217000551394755538669194988557758457758324578873682880134572241011927480586283295098697508526868676095246336.000000

The results now vary depending upon the number of threads in use.

If my professor wants me to use atomic, what is your suggestion Mat, and are there any alternatives?

Unfortunately since this is a compiler issue, there’s not much you can do with the product example until the fix is available in a released compiler. Though, atomic will work with the sum example.

  • Does reduction solve the correctness in the result since I have this unknown error when I am using atomic?
  • And as you mentioned, the compiler already applied a reduction, but the result is still not correct. How can I reach the perfect result, since my professor said we are looking for a perfect result, not execution time?

Again, I would argue that both results are correct (or equally wrong) and verification should check to see if the results are within an acceptable tolerance. The only way to get the results to agree would be to lower optimization, enable IEEE 754 compliance, disable FMA (depending on what CPU you’re using), and run serially on the device.

% cat prod.cpp
#include <iostream>
using namespace std;
int main()
{
int num;
double factorial = 1;
scanf("%d", &num);
#pragma acc serial loop
for (int a = 1; a <= num; a++)
{
/* shared data
* non-deterministic
*/
factorial *= a;
}
printf("Factorial of Given Number is =%lf\n" , factorial );
return 0;
}
% nvc++ -Minfo=accel prod.cpp -O0 -Kieee -Mnofma -acc
main:
      7, Accelerator serial kernel generated
         Generating Tesla code
          9, #pragma acc for seq
% a.out
100
Factorial of Given Number is =93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248.000000
% nvc++ -Minfo=accel prod.cpp -O0 -Kieee -Mnofma
% a.out
100
Factorial of Given Number is =93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248.000000

Note that optimization, even when running sequentially on the CPU, can affect accuracy:

% nvc++ -Minfo=accel prod.cpp -O2
% a.out
100
Factorial of Given Number is =93326215443944175354307254139643190247129328132295862491935879110669343325734178368282822618707234467717279847537548956702435362278960753539491860335688679424.000000

Thank you for your reply Mat,

and I am sorry for my questions, but I am a beginner with OpenACC and I am trying my best to understand it very well.
my questions are:

  • If we assume the error I am getting does not exist, where should the atomic pragma be written in the code?

You mentioned:

Though, for GPU I doubt that it will matter since you’re simply using too few cores.

  • How can I know the number of cores that I used? I believe my GPU has 640 cores.

When I run the code using #pragma acc parallel loop

9, Generating Tesla code
11, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Generating implicit reduction(*:factorial)
9, Generating implicit copy(factorial) [if not already present]

What does 128 mean in #pragma acc loop gang, vector(128), and is this the reduction clause? I read the file that you shared with me and I found this directive:
#pragma acc loop gang worker vector reduction(+:x)
but I don’t really understand it very well.

Thank you in advance
-Hisham

No worries. I’m happy to help answer questions to the best of my abilities. I just want to make sure you understand that floating point accuracy is a much larger topic, and that differing results due to round-off error and order of operations are a general issue in parallel programming, not something specific to OpenACC.

If we assume the error I am getting does not exist, where should the atomic pragma be written in the code?

You could use atomics but I’d recommend using reductions for these situations. Atomics aren’t going to solve the order of operations accuracy issue, and will most likely be less performant in this context.

  • How can I know the number of cores that I used? I believe my GPU has 640 cores.

This is the wrong question to ask. What you should be thinking about are the number of CUDA threads in use and the occupancy. Occupancy is the percentage utilization of the hardware resources. For both, you need to use a profiler.

You can set the environment variable “NV_ACC_TIME=1” (or PGI_ACC_TIME=1) and this will give you a quick profile showing the execution times of the kernels as well as the number of gangs (CUDA blocks) and vectors (CUDA threads). The total number of threads is the number of blocks multiplied by the threads per block.

For a quick estimate of the theoretical occupancy, you can use the CUDA Occupancy Calculator spreadsheet (https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html). Theoretical occupancy is based on the number of blocks, threads per block, register usage per thread, and shared memory usage per block. To get the number of registers per thread, add the compiler option “-gpu=ptxinfo” (or -ta=tesla:ptxinfo).

To get the actual occupancy, as well as an in-depth analysis of the kernel performance, you can use Nsight-Compute (https://developer.nvidia.com/nsight-compute). Nsight-Compute will take some effort to understand (definitely read the docs), but is a very powerful tool to understand the low-level performance of the kernels.

Nsight Systems is also useful but is more focused on overall system performance rather than individual kernel performance.

what does 128 mean in #pragma acc loop gang, vector(128)

That’s the vector length, i.e. the number of vectors per gang (in CUDA, this is the number of threads per block).
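To make the arithmetic concrete, here is a small sketch that estimates how many gangs are needed to cover a loop for a given vector length, and how many threads that launches in total. The helper names are hypothetical, and the "one loop iteration per thread" mapping is an assumption; the compiler and runtime are free to pick a different schedule.

```cpp
// Gangs (CUDA blocks) needed so that every iteration gets its own vector
// lane (CUDA thread). One iteration per thread is an assumption here.
long gangs_needed(long iterations, long vector_length) {
    return (iterations + vector_length - 1) / vector_length;  // ceiling division
}

// Total threads launched = number of gangs * vector length.
long total_threads(long gangs, long vector_length) {
    return gangs * vector_length;
}
```

Under that assumption, 100 iterations with a vector length of 128 fit in a single gang of 128 threads (28 of them idle), while 10000 iterations need 79 gangs, i.e. 10112 threads.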

is this the reduction clause

No, the reduction clause would be “reduction(*:factorial)”. Though note from the compiler feedback messages that the compiler is implicitly adding the reduction for you. So when using the NVHPC compiler, the reduction clause is optional. However, to be compliant with the OpenACC standard, you should be explicitly adding it. The GNU OpenACC implementation is a bit more of a stickler on this point and complains if you don’t have it.

because I read the file that you shared with me and I found this directive:
#pragma acc loop gang worker vector reduction(+:x)
but I don’t really understand it very well.

The “parallel” and “kernels” directives (i.e. the compute regions) define the region of code to accelerate.
The “loop” directive defines how the work will be distributed. While separate directives, they are used together often enough that they can be combined on the same line. For example:

#pragma acc parallel loop
is equivalent to
#pragma acc parallel
{
#pragma acc loop

}

“gang worker vector” are the loop schedule clauses. In this case all three levels of parallelism are applied to a single loop, with the computation split across the levels. When these clauses are not present, the loop uses an “auto” schedule, which allows the compiler to choose the schedule and how to distribute the work.

“reduction(+:x)” is a reduction clause, which tells the compiler that the variable “x” is the target of a sum reduction.
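Putting those pieces together, a minimal sketch of the split form might look like this. The function name `sum_to` and the explicit `copy(x)` clause on the compute region are additions for this illustration, not something taken from the file referenced above:

```cpp
// Sum of 1..n with the compute region and the loop schedule written separately.
// If OpenACC is not enabled, the pragmas are ignored and the loop runs serially.
double sum_to(int n) {
    double x = 0.0;
    #pragma acc parallel copy(x)            // the region of code to accelerate
    {
        // How to distribute the loop, plus the sum reduction on x.
        #pragma acc loop gang worker vector reduction(+:x)
        for (int i = 1; i <= n; ++i) {
            x += i;
        }
    }
    return x;
}
```

Dropping `gang worker vector` would leave the schedule up to the compiler, which is usually a reasonable default.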

I’m happy to help answer any question you have, but you might consider reviewing the resources available on the OpenACC website (https://www.openacc.org/resources). In particular Jeff Larkin’s recorded on-line classes may be helpful. OpenACC.org and NVIDIA also have several online GPU “Boot-camps” which are intensive two-day training sessions. There are a few more this year (https://www.gpuhackathons.org/events) if you’d like to apply, though there’ll be more next year as well.

Hello Mat,
regarding this error:

NVC++-F-0000-Internal compiler error. bitcast with differing sizes -4 (Acc_factorial.cpp: 12)
NVC++/x86-64 Linux 20.7-0: compilation aborted

I read the following in the OpenACC Programming and Best Practices Guide (2015):

For more complex operations, the atomic directive will ensure that no two threads
can attempt to perform the contained operation simultaneously.

that’s why I am asking this question:

  • Does the error exist in old PGI compilers, or only in newer ones?

-Hisham

No, sorry. Atomic multiply will be a new feature not available in older compiler versions.

I read the following in the OpenACC Programming and Best Practices Guide (2015):

For more complex operations, the atomic directive will ensure that no two threads
can attempt to perform the contained operation simultaneously.

Correct, but the order in which those operations occur is not fixed. Again, the issue is that when the order of floating point operations differs, the rounding error can give slightly divergent results.

To help illustrate this, let’s run your factorial code, but this time also compute the factorial in reverse order, i.e. start at “num” and work down to 1. Mathematically the two should be the same, but they actually show a large absolute difference due to the order of operations once the floating point format can no longer represent the exact value and rounding error occurs. The relative difference, though, is very small.

 % cat test.cpp
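    #include <cmath>    // declares the abs(double) overload used below
    #include <cstdio>   // scanf, printf
    #include <iostream>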
    using namespace std;
    int main()
    {
    int num;
    double factorial = 1;
    scanf("%d", &num);
    for (int a = 1; a <= num; a++)
    {
    factorial = factorial * a;
    }
    double rfactorial = 1;
    for (int a = num; a > 0; a--)
    {
    rfactorial = rfactorial * a;
    }
    printf("  Forward Factorial of Given Number is %f\n", factorial );
    printf("Backwards Factorial of Given Number is %f\n", rfactorial );
    printf("ABS Difference is %f\n", abs(rfactorial-factorial) );
    printf("REL Difference is %22.20f\n", abs(rfactorial-factorial)/factorial );
    return 0;
    }
    % nvc++ test.cpp -O0 -Kieee
    % a.out
    20
      Forward Factorial of Given Number is 2432902008176640000.000000
    Backwards Factorial of Given Number is 2432902008176640000.000000
    ABS Difference is 0.000000
    REL Difference is 0.00000000000000000000
    % a.out
    25
      Forward Factorial of Given Number is 15511210043330986055303168.000000
    Backwards Factorial of Given Number is 15511210043330983907819520.000000
    ABS Difference is 2147483648.000000
    REL Difference is 0.00000000000000013845
    % a.out
    30
      Forward Factorial of Given Number is 265252859812191032188804700045312.000000
    Backwards Factorial of Given Number is 265252859812191104246398737973248.000000
    ABS Difference is 72057594037927936.000000
    REL Difference is 0.00000000000000027166

Note, I looked up what the actual factorial values should be. So while 20! is correct, the rest are wrong no matter the order:

20! = 2432902008176640000
25! = 15511210043330985984000000
30! = 265252859812191058636308480000000 

When run in parallel, the order in which the iterations are run is non-deterministic.

Thank you Mat, now I understand.
You said:

No, sorry. Atomic multiply will be a new feature not available in older compiler versions.

So because factorial uses a multiply,
#pragma acc atomic update
can’t handle it, since atomic multiply is not supported yet, while for a sum it is supported (even though with bigger numbers I still have the order-of-operations issue). I found it strange that when I use
#pragma acc atomic update
with sum = sum + i;
no error appears, but when I use it with
factorial = factorial * a;
I get this error:

NVC++-F-0000-Internal compiler error. bitcast with differing sizes -4 (Acc_factorial.cpp: 12)
NVC++/x86-64 Linux 20.7-0: compilation aborted

which I mentioned previously.
But it is also strange that when I use #pragma acc atomic update in this example

#pragma acc parallel loop
for (i = 0; i < nsteps; ++i)
{
x = (i + 0.5) * step;
#pragma acc atomic update
sum = sum + 1.0 / (1.0 + x * x);
}

it doesn’t give any error, even though there is a multiplication in the statement

sum = sum + 1.0 / (1.0 + x * x);

and the result is correct and matches the serial run.
But why doesn’t the compiler complain about (1.0 + x * x)?

The expression doesn’t matter, it’s still an atomic sum (sum = sum + expr).

and the result is correct and matches the serial run.

That’s just an artifact of the number of steps (100) being smaller than the vector length (128). All the threads are in the same gang (CUDA thread block) essentially causing the code to serialize. If you increase the number of steps, or reduce the vector length so that more gangs are used, you start to see divergence.

% cat sum.cpp
#include <iostream>
#include <iomanip>
#include <cstdlib>

#ifndef VL
#define VL 128
#endif
#ifndef NSTEPS
#define NSTEPS 100
#endif

using std::cout;

int main(int argc, char *argv[])
{
long i, nsteps;
double pi, step, sum = 0.0, x;
nsteps = NSTEPS;
step = (1.0) / nsteps;

cout << "Nsteps="<<nsteps<<"\n";
for (i = 0; i < nsteps; ++i)
{
x = (i + 0.5) * step;
sum = sum + 1.0 / (1.0 + x * x);
}
pi = 4.0 * step * sum;
cout << std::fixed;
cout << "Serial pi is " << std::setprecision(25) << pi << "\n";
sum=0.0;
#pragma acc parallel loop reduction(+:sum) vector_length(VL)
for (i = 0; i < nsteps; ++i)
{
x = (i + 0.5) * step;
sum = sum + 1.0 / (1.0 + x * x);
}
pi = 4.0 * step * sum;
cout << std::fixed;
cout << "Reduct pi is " << std::setprecision(25) << pi << "\n";
sum=0.0;
#pragma acc parallel loop vector_length(VL)
for (i = 0; i < nsteps; ++i)
{
x = (i + 0.5) * step;
#pragma acc atomic update
sum = sum + 1.0 / (1.0 + x * x);
}
pi = 4.0 * step * sum;
cout << std::fixed;
cout << "Atomic pi is " << std::setprecision(25) << pi << "\n";
}
% nvc++ sum.cpp -acc -DNSTEPS=100 -DVL=128 ; a.out
Nsteps=100
Serial pi is 3.1416009869231253937016390
Reduct pi is 3.1416009869231245055232193
Atomic pi is 3.1416009869231253937016390
% nvc++ sum.cpp -acc -DNSTEPS=100 -DVL=32 ; a.out
Nsteps=100
Serial pi is 3.1416009869231253937016390
Reduct pi is 3.1416009869231249496124292
Atomic pi is 3.1416009869231249496124292
% nvc++ sum.cpp -acc -DNSTEPS=10000 -DVL=128 ; a.out
Nsteps=10000
Serial pi is 3.1415926544231340677981734
Reduct pi is 3.1415926544231265182816060
Atomic pi is 3.1415926544231349559765931
% nvc++ sum.cpp -acc -DNSTEPS=10000 -DVL=32 ; a.out
Nsteps=10000
Serial pi is 3.1415926544231340677981734
Reduct pi is 3.1415926544231265182816060
Atomic pi is 3.1415926544231327355305439

Hello Mat,
Thank you for your reply,

You said

All the threads are in the same gang (CUDA thread block) essentially causing the code to serialize

Does that mean the code doesn’t run in parallel, and how can I increase the number of vectors?
This is the message I got after compiling the code using:

nvc++ -o parallel code.cpp -Minfo=accel -acc

main:
          9, Generating Tesla code
             11, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
          9, Generating implicit copy(factorial) [if not already present]

I also have another question. When I use atomic in the following code:

  #include <iostream>
  using namespace std;

  int main()
  {
      int num, factorial = 1;
      scanf("%d", &num);
      #pragma acc parallel loop
      for (int a = 1; a <= num; a++)
      {
          #pragma acc atomic update
          factorial = factorial * a;
           printf("inside the loop =" "%d\n" , factorial );
      }
      printf("The factorial of Given Number is =" "%d\n" , factorial );

      return 0;
  }

and enter the number 5, for example, the result looks like this:

inside the loop =120
inside the loop =120
inside the loop =120
inside the loop =120
inside the loop =120
The factorial of Given Number is =120

and the same happens when I put a different number like 10:

inside the loop =3628800
inside the loop =3628800
inside the loop =3628800
inside the loop =3628800
inside the loop =3628800
inside the loop =3628800
inside the loop =3628800
inside the loop =3628800
inside the loop =3628800
inside the loop =3628800
The factorial of Given Number is =3628800

My question is: what is the explanation for the repeated number inside the loop? When I build the application with g++, it prints

inside the loop =1
inside the loop =2
inside the loop =6
inside the loop =24
inside the loop =120
The factorial of Given Number is =120

Why doesn’t the same happen when I use atomic? I know it gives the correct final value, but why do the repeated values appear inside the loop?

Thank you in advance
-Hisham

No, not exactly. CUDA-capable devices use threads as the lowest unit of compute, with an OpenACC vector mapping to a CUDA thread. 32 threads are grouped into a “warp”, and the warp runs in SIMT (single instruction, multiple threads) fashion: all 32 threads execute the same instruction at the same time. For an atomic, this means that each thread in a warp needs to wait until all the other threads in the same warp complete the atomic before the warp can continue. (Note that this is a somewhat simplistic explanation, and since the atomics are performed in L2 cache it’s not too bad, but hopefully it gives you an idea as to why you’d see serialization within a warp.)

and how can I increase the number of vectors?

In your case, the number of vectors matches the number of loop iterations, so you need to increase the iterations. With 5 or 10 iterations you’re not even using a full warp, and 100 iterations is really small as well. To be performant, you really want loop counts in the thousands, tens of thousands, or even millions.

and enter the number 5, for example, the result looks like this:

Given the small loop count and if the threads in a warp serialize waiting for the atomic operations to complete, when the threads call printf, they’re using the last value of the factorial. Basically, you have a race condition where the value of “factorial” has been updated since this thread last updated it. To be correct, you should be using an atomic capture so you save this iteration’s value to a local variable.

% cat fact.1.cpp
  #include <iostream>
  using namespace std;

  int main()
  {
      int num;
      int factorial = 1;
      int f;
      scanf("%d", &num);
      #pragma acc parallel loop
      for (int a = 1; a <= num; a++)
      {
          #pragma acc atomic capture
          {
            factorial = factorial * a;
            f = factorial;
          }
          printf("inside the loop =" "%d\n" , f );
      }
      printf("The factorial of Given Number is =" "%d\n" , factorial );

      return 0;
  }
% nvc++ -acc fact.1.cpp -V20.11
% a.out
5
inside the loop =1
inside the loop =2
inside the loop =6
inside the loop =24
inside the loop =120
The factorial of Given Number is =120

Note the behavior will change somewhat depending on the device you’re using in that the warp blocking behavior on the atomic changed with the Volta architecture. With Volta, you’d see the same output with or without the capture, but to be correct, you do want to use the capture.

-Mat