I am currently working with #pragma acc directives on a CUDA accelerator. It works rather smoothly, but lately I have run into a curious behavior. The code has at least 5-6 levels of nested loops, but the computations executed from the penultimate loop inward have all their variables zeroed, and that zeroed result is what the algorithm outputs. Without the #pragma acc directives the code runs fine. I tried to use -Mvect=levels: but it has no effect, while -Mconcur=levels: makes the code crash.
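To give an idea of the structure, the loop nest looks roughly like this (a simplified sketch only, not the actual code; all names and bounds are placeholders):

#pragma acc parallel loop gang
for (i = 0; i < ni; ++i) {            /* level 1 */
    #pragma acc loop vector
    for (j = 0; j < nj; ++j) {        /* level 2 */
        for (k = 0; k < nk; ++k)      /* level 3 */
            for (l = 0; l < nl; ++l)  /* levels 4-5: the values computed in these innermost loops come back zeroed */
                for (m = 0; m < nm; ++m)
                    out[i][j] += coeff[k][l] * w[m];
    }
}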
We currently max out at 7 loop levels (though we are in the process of expanding this), but since you’re only at 5-6 levels, this shouldn’t matter. Something else is going on.
Can you please post or send to PGI Customer Service (trs@pgroup.com) a reproducing example?
If not, what is the output from “-Minfo=accel”? How are the loops being scheduled?
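For example, something along these lines (the file name is just a placeholder; add whatever other options your mex build needs):

pgcc -acc -ta=nvidia -Minfo=accel -c yourfile.c

The compiler will then print, for each loop in the accelerated region, whether it was parallelized and how it was scheduled (gang, vector, or sequential).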
Thank you for the prompt answer. I cannot send the code around, but I can provide the output of the compilation. Please note that this is a mex function for Matlab, and the whole environment I built up to compile and run such Matlab extensions is working properly. The loops at lines 966 and 1035 are the ones not working, zeroing the variables computed above them in the code.
PGC-W-0095-Type cast required for this conversion (addTotalClutter_rain_mex.c: 183)
PGC-W-0095-Type cast required for this conversion (addTotalClutter_rain_mex.c: 183)
PGC-W-0095-Type cast required for this conversion (addTotalClutter_rain_mex.c: 183)
PGC/x86-64-Extractor Windows 12.10-0: completed with warnings
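(For reference, PGC-W-0095 typically comes from an implicit conversion between incompatible pointer types; the pattern is roughly like the following, shown only as an illustration and not the actual line 183:)

double *d = mxGetPr(prhs[0]);   /* the mex API returns double* */
float  *f = d;                  /* implicit double* -> float* conversion: PGC-W-0095 */
float  *g = (float *) d;        /* with an explicit cast the warning goes away */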
Please note that this is a mex function for Matlab, and the whole environment I built up to compile and run such Matlab extensions is working properly.
Interesting. I have a background project to write an article on using OpenACC in Matlab, but unfortunately I have gotten sidetracked with other projects, so I haven’t had the opportunity to work on it. Glad to see that you are experimenting with it.
I’m not liking the schedule being generated:
249, Loop is parallelizable
Accelerator kernel generated
249, #pragma acc loop gang /* blockIdx.x */
CC 1.3 : 108 registers; 136 shared, 836 constant, 40 local memory bytes
CC 2.0 : 63 registers; 120 shared, 736 constant, 0 local memory bytes
368, #pragma acc loop vector(128) /* threadIdx.x */
272, Loop is parallelizable
356, Loop is parallelizable
368, Loop is parallelizable
470, Loop is parallelizable
570, Loop is parallelizable
644, Loop is parallelizable
651, Loop is parallelizable
768, Loop is parallelizable
769, Loop is parallelizable
774, Loop is parallelizable
785, Loop is parallelizable
790, Loop is parallelizable
966, Loop is parallelizable
1035, Loop is parallelizable
It looks to me like you’re using the “parallel” construct and only have loop directives around the loops at lines 249 and 368. The rest of the loops are parallelizable, but they are getting executed sequentially within the “gang”.
What I’d like you to try is changing to the “kernels” construct and removing any loop directives. This will allow the compiler to generate what it thinks is the best schedule. I’m not sure this will fix the problem, but I’m curious what it comes up with.
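In other words, instead of something like this (sketch only; your loop bodies will differ):

#pragma acc parallel loop gang
for (i = 0; i < n; ++i) {
    #pragma acc loop vector
    for (j = 0; j < m; ++j)
        a[i][j] += b[i][j];
}

let the compiler pick the schedule:

#pragma acc kernels
for (i = 0; i < n; ++i)
    for (j = 0; j < m; ++j)
        a[i][j] += b[i][j];

Then rebuild with -Minfo=accel and post the new schedule.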
Now the code is running almost perfectly. The problem was that the nested loops were not rectangular. This produced the strange behavior of the loop bounds being set to zero, so the loops were never executed.
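(By “not rectangular” I mean that the inner loop bounds depend on the outer index; a simplified example, not the actual code:)

for (i = 0; i < n; ++i)
    for (j = i + 1; j < n; ++j)   /* the inner trip count depends on i, so the nest is triangular */
        s[i] += a[i][j];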
With that fixed, I now have the problem that an array which I initialize to zero before the accelerated region is no longer initialized inside the region. Since the array is not zeroed, the summation r_[j] += k accumulates a lot of garbage, producing Inf as output instead of the correct result. This is the only remaining problem; otherwise the code performs really well, with an exceptional gain of about two orders of magnitude over the plain Matlab code.
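(The pattern is roughly the following, heavily simplified; the real array has more dimensions and the computation is different:)

memset(r_, 0, nr * sizeof(double));   /* zeroed on the host, before the region */
#pragma acc kernels
for (j = 0; j < nr; ++j)
    for (i = 0; i < ni; ++i) {
        double k = a[i] * b[j];       /* placeholder for the real computation */
        r_[j] += k;                   /* on the device r_ appears to start from garbage, not zero */
    }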
Mex functions + PGI compilers work and perform well. It is time that MathWorks supported PGI compilers.
It appears that r is being copied to the device, so I’m not sure why this would occur. Maybe the working set of r is smaller than the actual size, so it’s not being mapped correctly? Try setting the full extent of both dimensions.
Could r be a ragged array or not contiguous in memory?
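If r is a plain contiguous n-by-m array, specifying the full extent on the data clause would look something like this (names and bounds are placeholders):

#pragma acc kernels copy(r[0:n][0:m])
for (i = 0; i < n; ++i)
    for (j = 0; j < m; ++j)
        r[i][j] += 1.0;

If instead r is built as an array of separately allocated row pointers (double **), the rows are not contiguous in memory and the copy generally won’t map them correctly on the device; a single contiguous allocation indexed as r[i*m + j] avoids that.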
Mex functions + PGI compilers work and perform well. It is time that MathWorks supported PGI compilers.
The folks at MathWorks are very open to supporting PGI, and I have contacts there, but unfortunately I haven’t had time to push them on it. Please do contact them with your experience, since a push from a user goes further than one from a vendor.