Hello,
I’ve been trying to parallelize my scientific code using OpenACC. I had assumed that OpenACC would be easy to use, but mapping it onto my code has proved quite difficult. It works nicely on simple examples, but not so nicely on larger nested loops with several arrays.
Anyway, I have the following questions (excuse the length):
- What is the best way to parallelize the following loop?
long cnti, cntj, cntk;
double psi2, psi2lin, tmp;
/* Update psi in place: psi *= exp(-dt * (pot + G * psi^2)) */
#pragma acc parallel loop collapse(3) //private (cnti, cntj, cntk, psi2, psi2lin, tmp)
for(cnti = 0; cnti < NXDUMMY; cnti++) {
   for(cntj = 0; cntj < NYDUMMY; cntj++) {
      for(cntk = 0; cntk < NZDUMMY; cntk++) {
         psi2 = psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk];
         psi2lin = psi2 * G;
         tmp = dt * (pot[cnti][cntj][cntk] + psi2lin);
         psi[cnti][cntj][cntk] *= exp(-tmp);
      }
   }
}
As you can see, I have done it by inserting the collapse clause into the pragma. To be honest, I arrived at that by trying out different solutions, most of which did not work. Some came from this forum, some from the scarce tutorials on OpenACC.
This one works, but I’m not sure why, or whether it is the best solution. What exactly does collapse do there? I’m a native OpenMP programmer, so I’m used to parallelizing the outermost loop across threads while the inner loops run sequentially in each thread. Here I’m not sure what happens with the inner loops: are they divided among threads, workers or gangs, whatever you call them? I’m not a CUDA programmer, so those terms are vague to me. I understand that the outermost loop should be divided among gangs, right? But what happens inside, and how is it synchronized?
On the other hand, if I don’t put the collapse clause there, the compiler reports loop-carried dependencies on several variables (psi and pot, for example), although logically the work should be divided among gangs at the first loop (cnti) and there shouldn’t be any loop dependencies from then on, right? cntj and cntk should be independent within each gang, right? Also, commenting or uncommenting the private clause makes no difference to the end result (which is correct when compared to the original OpenMP code), although from what I have been taught, all variables declared on the host (psi2 and tmp, for example) or copied to the device are shared by default unless declared private, right? How come the result is correct if all those variables are shared between the gangs?
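One alternative I considered (just a sketch, I have not checked whether it is actually better than collapse) is to spell out the levels of parallelism explicitly and declare the temporaries inside the innermost loop, which as far as I understand makes them private by construction:
/* Sketch: explicit gang/worker/vector mapping instead of collapse(3).
   The scalars are declared inside the body, which (I believe) makes
   them private to each iteration. */
#pragma acc parallel loop gang
for(long cnti = 0; cnti < NXDUMMY; cnti++) {
   #pragma acc loop worker
   for(long cntj = 0; cntj < NYDUMMY; cntj++) {
      #pragma acc loop vector
      for(long cntk = 0; cntk < NZDUMMY; cntk++) {
         double psi2 = psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk];
         double psi2lin = psi2 * G;
         double tmp = dt * (pot[cnti][cntj][cntk] + psi2lin);
         psi[cnti][cntj][cntk] *= exp(-tmp);
      }
   }
}
Would something like that be preferable, or does collapse(3) already distribute the combined iteration space over gangs, workers and vectors on its own?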
Keep in mind that I copy the psi and pot arrays to the device before executing the code above, so data movement is not an issue here.
- As you saw, I used NXDUMMY, NYDUMMY and NZDUMMY as the array bounds in my loops. Those are constants defined with #define at the beginning of my program. If I don’t do that, and instead use the Nx, Ny and Nz variables that are read from an input file at runtime, the program returns the following error:
call to clSetKernelArg returned error -51: invalid argument size
Keep in mind that the accelerator I’m using is a Radeon R9 280X, so under the hood it uses OpenCL rather than CUDA.
Does this mean that I have to define my array sizes at compile time (via #define) rather than at runtime? Bear in mind that my arrays are dynamically allocated with malloc and then copied to the device.
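To make this concrete, the failing variant is essentially the same kernel with runtime bounds (a sketch, not my exact code):
/* Same kernel, but with the bounds taken from variables that are set
   at runtime (read from the input file); this is the version that
   fails with error -51. */
#pragma acc parallel loop collapse(3)
for(long cnti = 0; cnti < Nx; cnti++) {
   for(long cntj = 0; cntj < Ny; cntj++) {
      for(long cntk = 0; cntk < Nz; cntk++) {
         double psi2 = psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk];
         double tmp = dt * (pot[cnti][cntj][cntk] + psi2 * G);
         psi[cnti][cntj][cntk] *= exp(-tmp);
      }
   }
}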
- When I run my program, the code above is executed inside a function call (it is basically a function with psi as its only parameter). Nothing else happens in this program besides array allocation, initialization, deinitialization, and that function being called, say, 1000 times in a loop. So basically that is the only real work going on. However, when I watch the system monitor, all 6 cores of my processor are 100% busy, and I don’t understand why. The arrays are copied to the device before the 1000 iterations and copied back afterwards, the device should be doing the loops, and the only job the host has is issuing the 1000 function calls, while the function body itself executes on the device. The only thing I can think of is that invoking
#pragma acc parallel loop collapse(3)
1000 times is what produces the full processor usage, although I think that is highly unlikely. Do you have any clue why this happens?
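For reference, the overall structure of the program is roughly this (a simplified sketch, not my exact code; calcpsi is just a placeholder name for the function whose body is the parallel loop above):
/* Simplified skeleton: both arrays stay resident on the device for the
   whole time loop, so the host should only be launching kernels. */
#pragma acc data copyin(pot[0:NXDUMMY][0:NYDUMMY][0:NZDUMMY]) copy(psi[0:NXDUMMY][0:NYDUMMY][0:NZDUMMY])
{
   for(long iter = 0; iter < 1000; iter++)
      calcpsi(psi);   /* placeholder name; contains the acc parallel loop shown above */
}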
As an end note, 1000 iterations of the OpenACC variant run about 2 times faster than the OpenMP variant, at least for this single function. That is a nice result, although I believe it can be even better.
Thank you in advance.