Atomic usage

I’d like to parallelize a nested for loop. In addition to finding the peak value, I need to keep track of some other parameters, so I use “atomic” as shown below. Can someone tell me whether this is valid? Any suggestions? Thanks!

int peak = 0, maxi = 0, maxj = 0, maxk = 0;
#pragma acc kernels
for (int i = 0; i <= m; i++) {
    for (int j = 0; j <= n; j++) {
        for (int k = 0; k <= o; k++) {
            double q = func1(i, j, k);
            #pragma acc atomic
            {
                if (q > peak) {
                    peak = q;
                    maxi = i;
                    maxj = j;
                    maxk = k;
                }
            }
        }
    }
}

Frngs(cimage, numr);
// compare with all reference images
// for iref in xrange(len(crefim)):
for (iref = 0; iref < (int)crefim_len; iref++) {
    if (fabs(n1[iref]*imn1 + n2[iref]*imn2 + n3[iref]*imn3) >= ant) {
        Dict retvals;
        if ((psi-90.0f) < 90.0f) retvals = Crosrng_sm_psi(crefim[iref], cimage, numr, 0.0f, 0, psi_max);
        else retvals = Crosrng_sm_psi(crefim[iref], cimage, numr, 180.0f, 0, psi_max);

        double qn = retvals["qn"];
        if (qn >= peak) {
            sx = -ix;
            sy = -iy;
            nref = iref;
            ang = ang_n(retvals["tot"], mode, numr[numr.size()-1]);
            peak = static_cast<float>(qn);
            mirror = 0;
        }
    }
}
delete cimage; cimage = 0;
}
}

Hi Po Chun LAI,

Atomic operations can’t be done over a group of statements since this would require critical sections, which are not available on the device. However, you could put an atomic on each individual write. Something like:

if(q > peak){
   #pragma acc atomic write
   peak = q;
   #pragma acc atomic write
   maxi = i;
   #pragma acc atomic write
   maxj = j;
   #pragma acc atomic write
   maxk = k;
}

Though this is problematic since there’s a race condition on peak: its value could change between the time it’s compared against q and the time it’s updated. Hence, atomics don’t really work for max-value-plus-location (maxval/maxloc) style operations.
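As an aside, if all you needed were the peak value itself (not its i/j/k location), a “reduction” clause would handle it directly. A minimal sketch, assuming the same loop bounds and “func1” as your code, with peak declared as a double to match q:

// Sketch only: reduction(max:) finds the maximum value, but gives no way
// to recover the (i,j,k) indices where that maximum occurred.
double peak = 0.0;
#pragma acc parallel loop collapse(3) reduction(max:peak)
for (int i = 0; i <= m; i++) {
    for (int j = 0; j <= n; j++) {
        for (int k = 0; k <= o; k++) {
            double q = func1(i, j, k);
            if (q > peak) peak = q;
        }
    }
}

It’s the index bookkeeping (maxi/maxj/maxk) that forces the two-step approach below.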

While not ideal, to handle these operations in parallel you first need to capture the peak value and max indices for each outer iteration (which runs the inner loops serially), and then perform a second, final pass to find the overall peak and its location. Something like:

int peak = 0, maxi = 0, maxj = 0, maxk = 0;
int *peaks = (int *)malloc(sizeof(int)*(m+1));
int *maxis = (int *)malloc(sizeof(int)*(m+1));
int *maxjs = (int *)malloc(sizeof(int)*(m+1));
int *maxks = (int *)malloc(sizeof(int)*(m+1));
for (int i = 0; i <= m; i++) {
    peaks[i] = 0;
    maxis[i] = 0;
    maxjs[i] = 0;
    maxks[i] = 0;
}

#pragma acc parallel loop gang vector copy(peaks[0:m+1],maxis[0:m+1],maxjs[0:m+1],maxks[0:m+1])
for (int i = 0; i <= m; i++) {
    #pragma acc loop seq
    for (int j = 0; j <= n; j++) {
        #pragma acc loop seq
        for (int k = 0; k <= o; k++) {
            double q = func1(i, j, k);
            if (q > peaks[i]) {
                peaks[i] = q;
                maxis[i] = i;
                maxjs[i] = j;
                maxks[i] = k;
            }
        }
    }
}

for (int i = 0; i <= m; i++) {
    if (peaks[i] > peak) {
        peak = peaks[i];
        maxi = maxis[i];
        maxj = maxjs[i];
        maxk = maxks[i];
    }
}
free(peaks);
free(maxis);
free(maxjs);
free(maxks);

Hope this helps,
Mat

Hi Mat,

You are right. I was trying to form a critical section like what we do in CPU multi-thread programming.
What you proposed is the approach I had in mind to resolve this race condition issue, which confirms my idea is right. Thanks!

Hi Mat,

I have another question, about array allocation. My code looks like the following: each iteration has its own dynamically allocated double arrays for storing temporary data. I use GPU/CPU unified memory with the compiler option “-ta=tesla:managed”. Does OpenACC with unified memory enabled handle the GPU memory allocation and perform the data movement when it encounters malloc? Or should we handle this by invoking cudaMallocManaged, as in the second code fragment? Thank you!



#pragma acc loop independent
for (iref = 0; iref < (int)crefim_len; iref++) {
    double *t = (double *)malloc(sizeof(double) * maxrin);
    double *q = (double *)malloc(sizeof(double) * maxrin);
    Crosrng_ms_p(crefim_l[iref], cimage, numr, 0.0f, qn[i+lky][j+lkx][iref], tot[i+lky][j+lkx][iref], qm[i+lky][j+lkx][iref], tmt[i+lky][j+lkx][iref], t, q);
} //delete cimage; cimage = 0;



//#pragma acc loop independent
for (iref = 0; iref < (int)crefim_len; iref++) {
    double *t;
    double *q;
    cudaMallocManaged(&t, maxrin * sizeof(double));
    cudaMallocManaged(&q, maxrin * sizeof(double));

    Crosrng_ms_p(crefim_l[iref], cimage, numr, 0.0f, qn[i+lky][j+lkx][iref], tot[i+lky][j+lkx][iref], qm[i+lky][j+lkx][iref], tmt[i+lky][j+lkx][iref], t, q);
} //delete cimage; cimage = 0;

Hi Po Chun,

Does OpenACC with unified memory enabled handle the GPU memory allocation and perform the data movement when it encounters malloc?

When using the flag “-ta=tesla:managed”, the PGI compiler will essentially replace your calls to “malloc” with calls to “cudaMallocManaged”. It’s a bit more complicated in that we’re using a managed memory pool allocator for small allocations to help with the allocation overhead, but from the user perspective that’s basically what’s happening.
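As a rough sketch of the idea (a hypothetical routine, not your code), a plain host “malloc” used by a compute region needs no data clauses when the binary is built with “-ta=tesla:managed”:

#include <stdlib.h>

// Hypothetical example: with -ta=tesla:managed the runtime backs this
// host malloc with CUDA Unified Memory, so the same pointer can be
// dereferenced on the device without copy/copyin clauses.
void scale(int n)
{
    double *a = (double *)malloc(sizeof(double) * n);
    for (int i = 0; i < n; i++) a[i] = (double)i;

    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        a[i] *= 2.0;

    free(a);
}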

For more details see this blog post: https://www.pgroup.com/blogs/posts/openacc-unified-memory.htm

Hope this helps,
Mat

Oh, I just noticed that you’re trying to call malloc from device code. CUDA Unified Memory calls can only be done from the host side.

While “malloc” is supported in device code, I highly recommend not using it. The device has a very small heap (default 8MB, max 32MB), and it’s very easy to crash the program if you go over this limit. Plus, mallocs are serialized on the device, so they can have a negative impact on performance.

What’s the goal of this bit of code? Are you trying to insert the “t” and “q” arrays into another structure? If so, then it may be better to do this on the host and then just offload the compute portions. Trying to build this structure on the device will cause issues if the same structure is used on the host side.
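One way to do that, sketched below with hypothetical workspace names (“t_buf”, “q_buf”) and assuming the same “maxrin” and “crefim_len” from your snippet, is to allocate one host buffer before the compute region and give each iteration its own slice, so no allocation happens in device code:

// Sketch only: host-side allocation, per-iteration slices on the device.
double *t_buf = (double *)malloc(sizeof(double) * maxrin * crefim_len);
double *q_buf = (double *)malloc(sizeof(double) * maxrin * crefim_len);

#pragma acc parallel loop
for (int iref = 0; iref < (int)crefim_len; iref++) {
    double *t = t_buf + (size_t)iref * maxrin;  // private slice for this iteration
    double *q = q_buf + (size_t)iref * maxrin;
    // ... call the compute routine with t and q here ...
}

free(t_buf);
free(q_buf);

Note that the called routine would also need to be compiled for the device (e.g. with an “acc routine” directive) for a call like this to work inside the loop, and without “-ta=tesla:managed” the buffers would need to appear in a data clause.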

-Mat

Hi Mat,

I encountered problems which might be what you mentioned.
It shows:
Call to cuMemcpyDtoHAsync returned error 700: Illegal address during kernel execution
Call to cuMemFreeHost returned error 700: Illegal address during kernel execution

t and q are buffers for the subroutine “Crosrng_ms_p”. Because their size is dynamic, I use malloc.
The malloc is in host code; I think OpenACC will translate it to the corresponding CUDA allocation calls at compile time and allocate GPU memory at run time. Any suggestion to resolve this dynamically sized array in OpenACC?

While compiling, there are some messages in the log saying “prevents parallelization”.

------- log ----------

1072, Complex loop carried dependence of rdata->,data->,limage->rdata-> prevents parallelization
      Loop carried dependence due to exposed use of data[:*],data+(((numr->__b_St12_Vector_baseIiSaIiEE._M_impl._M_start->)-1)*4)[:2] prevents parallelization
      Complex loop carried dependence of ->__b_St12_Vector_baseIfSaIfEE._M_impl._M_start->,...inline,data+(((numr->__b_St12_Vector_baseIiSaIiEE._M_impl._M_start->)-1)*4)-> prevents parallelization
      Loop carried dependence due to exposed use of data+(((numr->__b_St12_Vector_baseIiSaIiEE._M_impl._M_start->)-1)*4)[:*] prevents parallelization
      Complex loop carried dependence of data+(((numr->__b_St12_Vector_baseIiSaIiEE._M_impl._M_start->)-1)*4)+-> prevents parallelization
      Loop carried dependence due to exposed use of data+(((numr->__b_St12_Vector_baseIiSaIiEE._M_impl._M_start->)-1)*4)+[:*] prevents parallelization
      Parallelization would require privatization of array data[:*]
Accelerator kernel generated

----- log -----------
limage->rdata is a member of a C++ class. The structure is like:

class EMData
{
private:
    float *rdata;
};

What do these messages mean? Should I take any action?


Thank you!
Po Chun

Hi Po Chun,

For #1, what I was writing about would only apply if you’re using “malloc” within device code. If you’re only using malloc from the host, then this error is coming from something else. Though in the code snippet you showed, “t” and “q” are allocated within an “acc loop independent”, which, assuming it sits inside an outer compute region, would be performed on the device.

Could you post, or send to PGI Customer Service (trs@pgroup.com), an example where I could reproduce the issue? I don’t want to point you in the wrong direction, which can happen if I’m given only partial snippets of the code.

Note, an “illegal address” error is similar to a host-side seg fault where a bad device address was dereferenced. It’s somewhat generic in that it could be coming from dereferencing a host address, an out-of-bounds memory access, accessing a null pointer, stack/heap overflows, etc. Although it’s listed as coming from the call to “cuMemcpyDtoHAsync”, it’s most likely that the kernel that runs before this call is actually causing the error, but the CUDA runtime doesn’t trigger the error until the next API call.

I would suggest setting the environment variable “PGI_ACC_DEBUG=1” to see which OpenACC compute region (i.e. which CUDA kernel) is executed before the error. It won’t tell you why it’s erroring, just where to start looking.

For #2, the compiler will attempt to auto-parallelize loops when using the “kernels” compute region or loops within a “parallel” compute region that are not explicitly marked with a “loop” directive. Though to safely auto-parallelize a loop, the compiler must first prove that the loop does not contain dependencies.

In C/C++, since pointers of the same type are allowed to alias, the compiler must assume that they do. Since aliased pointers aren’t safe to parallelize, this often prevents auto-parallelization. You can force parallelization by adding a “loop” directive when using a “parallel” compute region, or a “loop independent” directive when using “kernels”. Though it’s up to you to make sure that the loop really is independent and safe to parallelize.
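For example (a generic sketch, not your loop), declaring the pointers restrict and/or asserting independence lets the compiler parallelize:

// Generic sketch: "restrict" (use __restrict with the PGI C++ compiler)
// promises a and b don't alias; "loop independent" asserts there is no
// loop-carried dependence, so the compiler can parallelize safely.
void add(float *restrict a, const float *restrict b, int n)
{
    #pragma acc kernels loop independent
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}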

Again, having a full example would help me here. I can’t tell from this info if the problem is potential aliasing between “rdata”, “data”, and “limage->rdata” or something else.

Also, are you using vectors? Vectors are not thread safe, so they should be used with caution in parallel execution. It’s usually ok if all you’re doing is reading from or writing to an existing vector, but it will become problematic if you’re trying to push or pop elements. Vectors are also difficult to perform device data management on, since they simply consist of three pointers which need to be translated to device pointers. It’s possible to do, but it’s much easier to use CUDA Unified Memory (-ta=tesla:managed) when using vectors, since both the device and host can then access the same pointer addresses.
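As an illustration of the read/write-only case (again just a sketch, assuming the code is built with “-ta=tesla:managed”):

#include <vector>

// Sketch: the vector's storage is managed memory under -ta=tesla:managed,
// so grabbing the raw pointer on the host and only reading/writing existing
// elements on the device is fine. Resizing (push_back, etc.) from within
// the compute region is not.
void square(std::vector<float> &v)
{
    float *p = v.data();
    int n = (int)v.size();
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        p[i] *= p[i];
}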

-Mat

Hi Mat,

I made some changes according to your suggestion and sent my project and source code to PGI Customer Service. Many thanks for your support.

Po Chun