Output using -ta=host or -ta=tesla:cc60

Hi everyone,

I parallelized a C++ code using OpenACC on a computer running Ubuntu with an NVIDIA GTX 1080. To compile the code I used

pgc++ -Minform=inform -Minfo=all,ccff -std=c++11 -ta=tesla:cc60 -ta=time *.cpp

To test the correctness of the OpenACC pragmas, I compared the output of the code run on the device against the output obtained after compiling with -ta=host.

As expected, after a number of time steps the two outputs differ; that’s why I used short simulations first and longer simulations later. However, for the short simulations (100 time steps, sampling every 10 time steps), the match between the two versions was almost perfect.

Now I have to use another computer with the same code. That computer also runs Ubuntu with an NVIDIA GTX 1080 (I’m not sure whether the driver versions match).

The point is that, compiling and running the same code on the new computer, I’m getting a much larger discrepancy when comparing the host and tesla builds.

Can you provide some help? (I looked for existing posts but wasn’t able to find one, which doesn’t mean it doesn’t already exist.)

Thank you very much for your help,

Xavier

Hi Xavier,

I’m not too sure what’s going on.

It sounds like you expect that running in parallel on the device will produce slightly different answers, but the problem is that you’re seeing different results when running on two different but similar systems, both using GTX 1080s.

My first thought is to look for race conditions or, more likely, uninitialized device memory. If the code reads uninitialized data, the answers will depend on the system state, and could appear to “work” in one case but not in another.
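
Here’s a rough sketch of the kind of bug I mean (the array names are made up): an OpenACC “create” clause allocates device memory but copies nothing over, so a read before a write sees whatever was already sitting in that memory:

#include <cstdio>

int main() {
    const int N = 1000;
    float a[N], b[N];
    for (int i = 0; i < N; ++i) a[i] = 1.0f;  // host copy is initialized...

    // ...but create(a) only allocates device memory, it copies nothing,
    // so the device copy of "a" is read uninitialized below and the
    // result depends on whatever was left in device memory.
    // Fix: change create(a) to copyin(a).
    #pragma acc data create(a) copyout(b)
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            b[i] = a[i] + 1.0f;
    }
    printf("b[0] = %f\n", b[0]);
    return 0;
}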

If that’s not it, I’d want to understand the differences between the two systems. What PGI compiler versions are you using? We do rely on the GNU headers for C++11 support, so different underlying GNU versions could have an effect; granted, this is a low-probability cause for this particular issue. What CPUs are in the two systems? Perhaps the difference is due to the host-side computation rather than the device?

One thing you can try is building the binary on one system and then running it on the other. If it fails in the same way, it’s much less likely a system difference and more likely a problem in your code.

Also, try compiling at low optimization and with strict IEEE 754 floating-point compliance enabled (-O0 -Kieee).
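
With your original command line, that would look something like:

pgc++ -Minform=inform -Minfo=all,ccff -std=c++11 -O0 -Kieee -ta=tesla:cc60 *.cpp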

If you can’t figure it out, please feel free to post a reproducing example, or send it to PGI Customer Service (trs@pgroup.com). I can take a look to see if I can determine what’s going on.

-Mat

Hi Mat,

thank you very much for your reply.

I’ve tried adding -O0 -Kieee and I got back the results I was getting on the previous computer.

I’ve also tried playing with the -O flag (without including -Kieee). Discrepancies start when using -O1 or -O. With -O2 or -O3, the execution (on the GPU, I think) becomes very slow.

Any further advice?

Thanks again,

Xavier

Hi Xavier,

Numerical differences between running serially on the host and in parallel on the device are more easily explained. Order of operations can cause slight rounding differences. Especially when performing parallel reductions (sum, min, max), this rounding can be magnified, creating results that diverge from a serial computation.
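
To illustrate: floating-point addition isn’t associative, so a parallel reduction, which combines partial sums in an unspecified order, can round differently than a strictly left-to-right serial sum. A small sketch:

#include <cstdio>

int main() {
    const int N = 1 << 20;
    float cpu_sum = 0.0f, gpu_sum = 0.0f;

    // Serial sum: additions happen strictly left to right.
    for (int i = 0; i < N; ++i)
        cpu_sum += 1.0f / (float)(i + 1);

    // Parallel sum: the reduction combines partial sums in an
    // unspecified order, so the rounding can differ slightly.
    #pragma acc parallel loop reduction(+:gpu_sum)
    for (int i = 0; i < N; ++i)
        gpu_sum += 1.0f / (float)(i + 1);

    printf("serial   = %.8f\nparallel = %.8f\n", cpu_sum, gpu_sum);
    return 0;
}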

FMA (Fused Multiply-Add) instructions are used on the device, so depending on your CPU and whether it also supports FMA, you can see differences. FMA is actually more precise since there’s less rounding error, but its results can differ from non-FMA operations.
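
A small sketch of the effect, with values chosen so the single rounding of FMA is visible (note that whether the compiler contracts a * a + c into an FMA on its own depends on optimization flags, which is where -Mnofma comes in):

#include <cstdio>
#include <cmath>

int main() {
    float a = 1.0f + 1.0f / 4096.0f;     // 1 + 2^-12, exactly representable
    float c = -(1.0f + 1.0f / 2048.0f);  // -(1 + 2^-11), also exact

    // a*a is exactly 1 + 2^-11 + 2^-24, but a float mantissa can't hold
    // the 2^-24 bit here, so the separate multiply rounds it away.
    float separate = a * a + c;          // two roundings: typically 0
    float fused    = std::fma(a, a, c);  // one rounding: the 2^-24 survives

    printf("separate = %g\nfused    = %g\n", separate, fused);
    return 0;
}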

Also, math intrinsics such as sin, cos, pow, etc. can yield slightly different results on the device. We try to be as consistent as possible, keeping to within 1 ULP of imprecision, but this can cause differences.

For numerically sensitive algorithms, you should consider always using “-Kieee”. This tells the compiler to adhere strictly to the IEEE 754 standard. However, it won’t help with order-of-operation differences between serial and parallel execution.

There’s also the “-Mnofma” option which will disable FMA operations on both the host and device.

Another source of difference is missing a point where you need to synchronize the host and device data. I have seen codes where the user forgot to update a variable on the host after computing on the device; the device and host copies were only slightly different, so it wasn’t easily noticed.
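
A rough sketch of what I mean (names made up): after computing on the device inside a data region, the host copy is stale until you explicitly update it:

#include <cstdio>

int main() {
    const int N = 1000;
    double x[N];
    for (int i = 0; i < N; ++i) x[i] = 0.0;

    #pragma acc data copy(x)
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            x[i] = 0.5 * i;

        // Without this update, host-side reads of x inside the data
        // region still see the stale host copy, not the device results.
        #pragma acc update self(x)

        printf("x[10] = %f\n", x[10]);
    }
    return 0;
}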

As for the differences between the GPUs, those I can’t account for. If you see differences using the same binary on both devices, then I’d look in your code for race conditions or uninitialized memory. Tools such as “cuda-memcheck”, which ships with the CUDA SDK, or “valgrind” for your host code, might help find the problem.
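
For example, assuming your binary is ./a.out:

cuda-memcheck ./a.out
cuda-memcheck --tool initcheck ./a.out
valgrind ./a.out

(For valgrind, use a host build, e.g. compiled with -ta=host, so you’re checking the CPU code paths.)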

Also, it’s unclear why higher optimization would be slower. Maybe you have a convergence loop and it’s simply taking longer to converge? Or possibly at higher optimization the code uses more registers, lowering occupancy on the device. You can check register usage either through a profiler such as pgprof or nvprof, or at compile time by adding the flag “-ta=tesla:cc60,ptxinfo”.
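
With your original command line, that would be, for example:

pgc++ -Minform=inform -Minfo=all,ccff -std=c++11 -ta=tesla:cc60,ptxinfo *.cpp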

Having a reproducing example would be very helpful, otherwise I’m just guessing.

-Mat

Hi Mat,

I apologize for my late reply, but I’ve finally been able to test the code further. I’ll answer all your queries.

I’m using an Intel Xeon processor (E5-2630). I believe this processor supports FMA, so I assume the difference cannot be due to that.

Our code uses functions from the math library (pow, sqrt, …), so I assume that some of the error can be explained by differences between the GPU and CPU implementations. By the way, I’ll take advantage of this mail to ask whether I have to mark these functions with the routine directive. I tried using “#pragma acc routine(pow) seq” after reading the PGI technical news from August 2014 (Using the OpenACC Routine Directive), but I get the following compilation error: “extra text after expected end of preprocessing directive”. Am I missing something? Is this required in order to use math functions properly in OpenACC?

With respect to synchronizing the host and device: I load the required data onto the device at the beginning and perform all required computations on the device, only sending the global outputs to the host before recording them. So I don’t think this is a likely cause.

We have tested the same binaries (compiled using -Kieee and -O0 or -O1) on the two computers and found the same results. The discrepancies I found at the beginning came from using the default optimization setting. However, we were using the PGI Community Edition 17.4 compiler.

I’ve used Valgrind and found a couple of deallocation errors but no initialization issues. I’ve also used “cuda-memcheck” and “cuda-memcheck --tool initcheck” to look for uninitialized data, and got no errors. I also ran “cuda-memcheck --tool racecheck” with “--racecheck-report” in both hazard and analysis modes. In analysis mode I got “1145 hazards (1145 errors, 0 warnings)”. In hazard mode I got “0 errors, 3654607 warnings”. The first line of the warnings reads: “WARN: (Warp Level Programming) Potential RAW hazard detected at ‘XXX’ in block (‘I’,’J’,’K’)”, with ‘XXX’, ‘I’, ‘J’ and ‘K’ standing for changing values. To be honest, I’m not familiar with the racecheck report, but after looking at the warning definition I’m not sure whether it indicates a bug in the code or whether it’s intentional and comes from the OpenACC implementation itself. I think the diffusion code might be a problem: I seem to lose mass when the optimizations are applied. Could this be the problem? What strategies can we take when using OpenACC to maintain data integrity?

Finally, regarding the slowdown with optimization on: I realized that the compiler fails to parallelize the code when using the “-O2” or “-O3” optimization flags. The compiler shows the message “complex loop carried dependencies prevent parallelization.” Do you think this is down to the diffusion code not being embarrassingly parallel?

Do you have any idea of what is going on? My next step, if required, would be to create a minimal reproducing example to send to you.

Thank you very much for your help,

Xavier

Hi Xavier,

Am I missing something? Is this required in order to use math functions properly in OpenACC?

For the error, I’m not sure and would need to see it in context.

For the device definitions of “pow” and the other “math.h” functions, include the “accelmath.h” header file.
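
A minimal sketch of what that looks like:

#include <accelmath.h>   // device versions of the math.h functions
#include <cstdio>

int main() {
    const int N = 100;
    double y[N];

    // With accelmath.h included, pow and sqrt resolve to device
    // implementations inside the compute region; no "acc routine"
    // directive is needed for them.
    #pragma acc parallel loop copyout(y)
    for (int i = 0; i < N; ++i)
        y[i] = pow((double)i, 1.5) + sqrt((double)i);

    printf("y[4] = %f\n", y[4]);
    return 0;
}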

Could this be the problem?

Possibly, though I’ve only used racecheck a few times, so I’m not an expert with it.

What strategies can we take when using openACC to maintain data integrity?

If you do have a data collision, you might be able to use the “atomic” directive. Or, if the collision is on a scratch array, you can privatize it.
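
Rough sketches of both, with made-up names:

#include <cstdio>

int main() {
    const int N = 1000, NBINS = 16;
    int hist[NBINS] = {0};
    double out[N], scratch[NBINS];

    // Collision case: many iterations update the same bin, so the
    // increment must be atomic or updates get lost.
    #pragma acc parallel loop copy(hist)
    for (int i = 0; i < N; ++i) {
        #pragma acc atomic update
        hist[i % NBINS] += 1;
    }

    // Scratch-array case: private(scratch) gives every iteration its
    // own copy, so there is no collision to begin with.
    #pragma acc parallel loop private(scratch) copyout(out)
    for (int i = 0; i < N; ++i) {
        double s = 0.0;
        for (int j = 0; j < NBINS; ++j) {
            scratch[j] = 0.5 * i + j;
            s += scratch[j];
        }
        out[i] = s;
    }

    printf("hist[0] = %d, out[0] = %f\n", hist[0], out[0]);
    return 0;
}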

Do you think this is down to the diffusion code not being embarrassingly parallel?

Probably not, though I’m hesitant to speculate on why this is occurring. Loop dependencies often occur in C/C++ because the compiler can’t tell that two pointers aren’t aliased. This can be resolved by decorating the pointers with the “restrict” keyword, using the “parallel” construct, or adding the “independent” clause to a loop directive. Why this would only occur with optimization is unclear.
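
For illustration, a sketch of those options (function and variable names made up; note that “parallel loop” already asserts independence, while “kernels” lets the compiler analyze, hence the explicit “independent” clause):

#include <cstdio>
#include <cstdlib>

// __restrict promises the compiler that a and b never alias, so it no
// longer has to assume a loop-carried dependency between them.
static void scale(double *__restrict a, const double *__restrict b, int n) {
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];
}

// Alternative: keep plain pointers and assert independence yourself
// with the "independent" clause on a kernels loop.
static void scale_independent(double *a, const double *b, int n) {
    #pragma acc kernels loop independent copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];
}

int main() {
    const int n = 1000;
    double *a = (double *)malloc(n * sizeof(double));
    double *b = (double *)malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) b[i] = i;

    scale(a, b, n);
    scale_independent(a, b, n);

    printf("a[3] = %f\n", a[3]);
    free(a);
    free(b);
    return 0;
}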

My next step, if required, would be to create a minimal reproducing example to send to you.

I’ll do my best to diagnose the issues without it, but having a reproducing example would be very helpful.

-Mat