Reductions on pointer type must have bounds specified

I am using HPC SDK v22.7 with CUDA Toolkit 11.7, trying to compile some C/C++ code to do OpenMP target offloading onto my GeForce GTX 1650. I want to do a reduction on an array. I am not using CUDA unified memory, but I keep getting this compiler error:

NVC++-F-0155-Reductions on pointer type must have bounds specified:  heap_sum_arr

My simple program is something like this:

#define N 1000

double *heap_sum_arr = new double[N];
//initialize heap_sum_arr here

#pragma omp target data map(tofrom:heap_sum_arr[0:N])
{
    #pragma omp target teams loop reduction(+:heap_sum_arr[0:N])
    for (int i = 0; i < N; i++) {
        heap_sum_arr[i] += <some computations>
    }
}


delete [] heap_sum_arr;

The same thing happens with a malloc'ed C pointer. It seems to me I have specified the bounds in compliance with the OpenMP 5.0 standard.

It appears that OpenMP target offload array reduction only works with variables allocated on the stack, i.e. something declared with double arr[N];
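For instance, a stack-declared variant along these lines compiles for me without the error (just a sketch, with a dummy increment standing in for my actual computation):

#define N 1000

double stack_sum_arr[N];   // fixed-size array with automatic storage
//initialize stack_sum_arr here

#pragma omp target teams loop map(tofrom:stack_sum_arr) reduction(+:stack_sum_arr)
for (int i = 0; i < N; i++) {
    stack_sum_arr[i] += 1;   // dummy stand-in for the real computation
}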

This severely limits the sizes of reduction arrays I can work with.

Why is this so? Is this intended behavior? Is there a workaround?

PS: the compile command I use is:
nvc++ -mp=gpu -gpu=cc75,cuda11.7 src.c

Why is this so? Is this intended behavior?

The code looks correct to me, so it’s likely a compiler issue. I filed a problem report, TPR #32404, and have asked our engineers to investigate.

Is there a workaround?

In this particular example, the reduction isn't needed since each loop iteration accumulates into a unique index. Hence the workaround is to simply remove the reduction clause, as in the sketch below.
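For your snippet that would look roughly like this (sketch only, keeping your placeholder computation):

#pragma omp target data map(tofrom:heap_sum_arr[0:N])
{
    // No reduction clause: each iteration writes only its own element
    #pragma omp target teams loop
    for (int i = 0; i < N; i++) {
        heap_sum_arr[i] += <some computations>
    }
}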

If you did have code with collisions, then the workaround would be to use atomics. Also, array reductions can have a high overhead, especially for larger arrays, so atomics may be the better option anyway.

For example:

% cat test.cpp
#include <iostream>
#include <cstdlib>
#include <omp.h>
#define N 1000

using namespace std;

int main() {
    double *hist = new double[10];
    for (int i = 0; i < 10; i++) {
        hist[i] = 0;
    }
    // Iterations collide on the same bin, so the atomic update is needed
    #pragma omp target teams loop map(tofrom:hist[:10])
    for (int i = 0; i < N; i++) {
        int idx = i % 10;
        #pragma omp atomic update
        hist[idx] += 1;
    }
    for (int i = 0; i < 10; i++) {
        cout << i << ":" << hist[i] << endl;
    }
    delete [] hist;
}
% nvc++ test.cpp -mp=gpu; a.out
0:100
1:100
2:100
3:100
4:100
5:100
6:100
7:100
8:100
9:100

Thanks for the report,
Mat


Thank you, Mat, for reporting this issue. In the example I gave, I used indices with no collisions for simplicity's sake; my actual use case has collisions.

I don’t think atomics are the solution for me, because

  1. I am working with double complex numbers, and I don't think atomic update of complex numbers is currently supported (please correct me if I'm wrong). When I prepend #pragma omp atomic update to an accumulation statement on a double complex array, compilation fails with NVC++-S-1073-Procedures called in a OpenMP target region must have 'omp declare target' information - __atomic_load. This does not occur when the array type is double rather than double complex.

  2. I might be able to work around this by accumulating the real and imaginary parts of my calculations separately (roughly as sketched below), but this adds complexity to my code, and I have also heard that hitting the same address repeatedly with atomic statements carries a performance penalty. This might end up being slower than doing a reduction over a larger array. Please correct me if I'm wrong on this as well.
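For reference, the split-accumulation idea I have in mind would look roughly like this (just a sketch with made-up names and dummy contributions, not my actual code):

#include <complex>
#define N 1000

// Separate real/imaginary accumulators so plain double atomics can be used
double *sum_re = new double[N];
double *sum_im = new double[N];
//initialize sum_re and sum_im to zero here

#pragma omp target teams loop map(tofrom:sum_re[0:N],sum_im[0:N])
for (int i = 0; i < N; i++) {
    int idx = i % N;             // in my real code the indices collide
    double contrib_re = 1.0;     // dummy stand-in for the real part
    double contrib_im = 2.0;     // dummy stand-in for the imaginary part
    #pragma omp atomic update
    sum_re[idx] += contrib_re;
    #pragma omp atomic update
    sum_im[idx] += contrib_im;
}

// Reassemble the complex result on the host
std::complex<double> *heap_sum_arr = new std::complex<double>[N];
for (int i = 0; i < N; i++) {
    heap_sum_arr[i] = std::complex<double>(sum_re[i], sum_im[i]);
}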

Thanks!

Yes, I believe you're correct that there's no support for atomics on double complex (at least none that I'm aware of). I do see that we have an open RFE for this in OpenACC (which could be applied to OpenMP as well), but I don't have details on when/if engineering will implement it.

Yes, atomics can also incur a performance penalty, but this largely depends on the timing of when the threads access the same memory. If they all hit it at the same time, then yes, it can be a problem; but if the accesses are staggered, the overhead can be quite low. I have no idea what it would be in your case.

Another possibility is to switch to using OpenACC, since, at least for this example, we're able to compile and run the code correctly.

% cat test1.cpp
#include <iostream>
#define N 1000

using namespace std;

int main() {

    double *heap_sum_arr = new double[N];
    #pragma acc data copy(heap_sum_arr[:N])
    {
        #pragma acc parallel loop reduction(+:heap_sum_arr[0:N])
        for (int i = 0; i < N; i++) {
            heap_sum_arr[i] += 1;
        }
    }

    cout << heap_sum_arr[1] << endl;
    delete [] heap_sum_arr;

}
% nvc++ -acc test1.cpp -Minfo=accel -V22.7; a.out
main:
     10, Generating copy(heap_sum_arr[:1000]) [if not already present]
         Generating NVIDIA GPU code
         12, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(+:heap_sum_arr[:1000])
     10, Local memory used for heap_sum_arr
1
