The only thing PGI_ACC_DEBUG would do extra is add more synchronization.
Are you using “async”? If so, you might be copying data via an update directive before the compute region has finished updating the data.
Also, up until recently, using managed memory would cause the runtime to ignore “async”. We lifted this restriction in PGI 17.7 when running with CUDA 8 on P100s, since there was no longer a danger of segfaults from accessing the same memory on both the host and the device.
If you’re not using “async”, then my best guess is that you’re missing an “update” directive someplace and one of your device arrays isn’t synchronized with the host copy of the array.
I am not using any asyncs, but knowing that debug mode adds more synchronization is a good place to start my bug-squashing hunt.
(I am using PGI 17.9, but the problem exists using 17.4 as well).
Could a race condition within a parallel region be the culprit?
Is there anything I could look for in the PGI_ACC_NOTIFY=2 output when using managed memory, to see when/where the managed memory is doing an update/sync? There is a ton of output there and I am not sure what to grep for.
I don’t think PGI_ACC_NOTIFY is going to help here. That only reports what the PGI runtime is doing; UVM is managed by the CUDA driver, so its transfers wouldn’t be reported. Plus, this only shows which updates occur, but what you need to know is which update you’re missing (assuming that’s the cause).
Could a race condition within a parallel region be the culprit?
I guess it’s possible, but PGI_ACC_DEBUG only affects synchronization between kernel launches; it has no effect on the kernel (the parallel region) itself.
I’m leaning towards a missing update or an uninitialized device array. Though, this doesn’t explain why the code works with PGI_ACC_DEBUG set.
The way “-ta=tesla:managed” works is that the compiler simply replaces the underlying memory allocator (malloc, new, allocate) with a call to cudaMallocManaged. And you don’t need to use it on all files. So one thought is to compile everything without managed, then start a binary search where you add managed to half the files until you can pin down one or more files that, when compiled with managed, allow the test to pass. This should hopefully give you a list of potential arrays to track.
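A sketch of one round of that binary search (the file names are made up; I’ve put managed on the link line here so the managed allocator is pulled in, but check the documentation for your PGI version on mixed-object linking):

```shell
# Baseline: build everything without managed memory.
pgfortran -ta=tesla -c solver.f90 grid.f90 io.f90 setup.f90

# Round 1: rebuild half the files with managed and relink.
pgfortran -ta=tesla:managed -c solver.f90 grid.f90
pgfortran -ta=tesla         -c io.f90 setup.f90
pgfortran -ta=tesla:managed -o app solver.o grid.o io.o setup.o

# Run the test. If it now passes, the suspect arrays are allocated
# in solver.f90 or grid.f90 -- halve that set and repeat.
./app
```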
Then recompile without managed and add “update” directives for these arrays before and after each parallel region where they are used. If it starts passing, iteratively take out the update directives until it starts failing again. Then you’ll know where the missing update goes.
Another tactic you can try is using the environment variables “PGI_ACC_FILL=1” and “PGI_ACC_FILL_VALUE=<value>”. This causes the PGI runtime to initialize all allocated device data with the fill value (the default being zero). My one thought is that maybe an array is getting zeroed out when it’s created using UVM with PGI_ACC_DEBUG, but is uninitialized otherwise. I’m just guessing, but it’s an easy thing to try.
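Usage is just environment variables on the run, along these lines (the program name “app” is made up; the variables are the ones named above):

```shell
# Fill every device allocation with zero (the default fill value).
export PGI_ACC_FILL=1
export PGI_ACC_FILL_VALUE=0
./app

# A deliberately odd pattern can make reads of uninitialized
# device memory stand out in the results.
PGI_ACC_FILL=1 PGI_ACC_FILL_VALUE=255 ./app
```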
I just tested “zeroinit” on a toy program and it worked fine for me. Not sure why it’s not working in your case. If possible, could you send a reproducing example to PGI Customer Service (trs@pgroup.com) so we can investigate?