The only thing PGI_ACC_DEBUG would do extra is add more synchronization.
Are you using “async”? If so, you might be copying data via an update directive before the compute region has finished updating the data.
Also, up until recently, using managed memory would cause the runtime to ignore “async”. We lifted this restriction in PGI 17.7 when running with CUDA 8 on P100s, since there was no longer a danger of segfaults from accessing the same memory on both the host and the device.
If you’re not using “async”, then my best guess is that you’re missing an “update” directive someplace and one of your device arrays isn’t synchronized with the host copy of the array.
I am not using any asyncs, but knowing that debug mode adds more synchronization is a good place to start my bug-squashing hunt.
(I am using PGI 17.9, but the problem exists using 17.4 as well).
Could a race condition within a parallel region be the culprit?
Is there anything I could look for in the PGI_ACC_NOTIFY=2 output when using managed memory, to see when/where the managed memory is doing an update/sync? There is a ton of output there and I am not sure what to grep for.
I don’t think PGI_ACC_NOTIFY is going to help here. That only reports what the PGI runtime is doing; UVM is managed by the CUDA driver, so its transfers wouldn’t be reported. Plus, this only shows which updates occur, but what you need to know is which update you’re missing (assuming that’s the cause).
Could a race condition within a parallel region be the culprit?
I guess it’s possible, but PGI_ACC_DEBUG only affects synchronization between kernel launches; it has no effect on the kernel (the parallel region) itself.
I’m leaning towards a missing update or an uninitialized device array. Though, this doesn’t explain why the code works with PGI_ACC_DEBUG set.
The way “-ta=tesla:managed” works is that the compiler simply replaces the underlying memory allocator (malloc, new, allocate) with a call to cudaMallocManaged. And you don’t need to use it on all files. So one thought is to compile everything without managed, then start a binary search where you add managed to half the files until you can pin down one or more files that, when compiled with managed, allow the test to pass. This should hopefully give you a list of potential arrays to track.
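A sketch of one round of that binary search (the file names are made up; I’ve put managed on the link line here so the managed allocator is pulled in, but check the documentation for your PGI version on mixed-object linking):

```shell
# Baseline: build everything without managed memory.
pgfortran -ta=tesla -c solver.f90 grid.f90 io.f90 setup.f90

# Round 1: rebuild half the files with managed and relink.
pgfortran -ta=tesla:managed -c solver.f90 grid.f90
pgfortran -ta=tesla         -c io.f90 setup.f90
pgfortran -ta=tesla:managed -o app solver.o grid.o io.o setup.o

# Run the test. If it now passes, the suspect arrays are allocated
# in solver.f90 or grid.f90 -- halve that set and repeat.
./app
```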
Then recompile without managed and add “update” directives for these arrays before and after each parallel region where they are used. If it starts passing, iteratively take out the update directives until it starts failing again. Then you’ll know where the missing update goes.
Another tactic you can try is using the environment variables “PGI_ACC_FILL=1” and “PGI_ACC_FILL_VALUE=<value>”. This causes the PGI runtime to initialize all allocated device data with the fill value (the default being zero). My one thought is that maybe an array is getting zeroed out when it’s created using UVM with PGI_ACC_DEBUG, but is uninitialized otherwise. I’m just guessing, but it’s an easy thing to try.
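Usage is just environment variables on the run, along these lines (the program name “app” is made up; the variables are the ones named above):

```shell
# Fill every device allocation with zero (the default fill value).
export PGI_ACC_FILL=1
export PGI_ACC_FILL_VALUE=0
./app

# A deliberately odd pattern can make reads of uninitialized
# device memory stand out in the results.
PGI_ACC_FILL=1 PGI_ACC_FILL_VALUE=255 ./app
```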
I just tested “zeroinit” on a toy program and it worked fine for me. Not sure why it’s not working in your case. If possible, could you send a reproducing example to PGI Customer Service (trs@pgroup.com) so we can investigate?