OpenACC produces suspicious results when acc_malloc() is used and built with -gpu=mem:separate

Let’s consider the following reproducer.

When c→pis allocated with acc_malloc()summing all the elements of the vector produces the result I expect when cis passed as an argument to a subroutine (and then c→pis used) but returns 0when the pointer c→pis passed as an argument to an other subroutine.
The program always produces the result I expect when malloc() is used instead of acc_malloc()and the program is built with -gpu=mem:managed

Is the program valid OpenACC? If so, is this the expected result or a bug?

I use NVHPC 25.9 but the same behavior occurs with older releases.

$ cat reproducer.cpp 
#include <openacc.h>
#include <iostream>

double reduce_pointer(double *array, int size) {
	double res = 0.0;
#pragma acc parallel loop present(array[0:size]) reduction(+:res)
	for (int i=0; i<size; i++) res += array[i];
	return res;
}

typedef struct {
	double *p;
	int size;
} chunk;

double reduce_chunk(chunk *c) {
	double res = 0.0;
#pragma acc parallel loop present(c->p[0:c->size]) reduction(+:res)
	for (int i=0; i<c->size; i++) res += c->p[i];
	return res;
}
#define N 1024
int main(int argc, char *argv[]) {
	chunk *c = (chunk *)malloc(sizeof(chunk));
	// no issue if malloc() is used instead of acc_malloc()
	c->p = (double *)acc_malloc(N * sizeof(double));
	c->size = N;
	double res = 0.0;
#pragma acc enter data create(c->p[0:N])

#pragma acc parallel loop present(c->p[0:N])
	for (int i=0; i<N; i++) c->p[i] = i;

#pragma acc parallel loop present(c->p[0:N]) reduction(+:res)
	for (int i=0; i<N; i++) res += c->p[i];
	std::cout << "bozzo  : " << res << std::endl;

	// returns 0 when acc_malloc() is used!
	res = reduce_pointer(c->p, N);
	std::cout << "pointer: " << res << std::endl;

	res = reduce_chunk(c);
	std::cout << "chunk  : " << res << std::endl;


#pragma acc exit data delete(c->p[0:N])
	return 0;
}

$ nvc++ -acc -gpu=mem:separate reproducer.cpp
$ ./a.out
bozzo  : 523776
pointer: 0
chunk  : 523776


Hi Gilles,

The primary issue here is that given “p” is allocated via “acc_malloc”, it’s a device pointer. Hence you need to use the “deviceptr” clause when using it as a raw pointer (i.e. not through “c”). Pointers allocated via acc_malloc are user managed device pointers and aren’t added to the runtime’s “present” table.

Some other suggestions are to put “c” in a data directive, put “c” in the present clause, and use “acc_free” to deallocate “p”. See below for my edits.

The program always produces the result I expect when malloc() is used instead of acc_malloc() and the program is built with -gpu=mem:managed

Makes sense though I presume you’re adding “p” to a data directive. With “malloc”, “p” is a host pointer so when added to data directive creates a mirrored device array which is managed via the OpenACC present table.

With “managed”, the created pointer from acc_malloc can be accessed on both the host and device.

Example:

#include <openacc.h>
#include <iostream>

double reduce_pointer(double *array, int size) {
        double res = 0.0;

// MEC: Given "p" is a device pointer created by acc_malloc,
//      "deviceptr" should be used in place of "present"
//      "array" isn't in the present table
//#pragma acc parallel loop present(array[0:size]) reduction(+:res)
#pragma acc parallel loop deviceptr(array) reduction(+:res)
        for (int i=0; i<size; i++) res += array[i];
        return res;
}

typedef struct {
        double *p;
        int size;
} chunk;

double reduce_chunk(chunk *c) {
        double res = 0.0;
//#pragma acc parallel loop present(c->p[0:c->size]) reduction(+:res)
#pragma acc parallel loop present(c) reduction(+:res)
        for (int i=0; i<c->size; i++) res += c->p[i];
        return res;
}
#define N 1024
int main(int argc, char *argv[]) {
        chunk *c = (chunk *)malloc(sizeof(chunk));
        // no issue if malloc() is used instead of acc_malloc()
        c->p = (double *)acc_malloc(N * sizeof(double));
        c->size = N;
// Copy "c" to the device so it's present on the device
// A shallow copy is performed so the device pointer "p"
//  will get copied over to the device.
#pragma acc enter data copyin(c)
        double res = 0.0;

//#pragma acc parallel loop present(c->p[0:N])
#pragma acc parallel loop present(c)
        for (int i=0; i<N; i++) c->p[i] = i;

//#pragma acc parallel loop present(c->p[0:N]) reduction(+:res)
#pragma acc parallel loop present(c) reduction(+:res)
        for (int i=0; i<N; i++) res += c->p[i];
        std::cout << "bozzo  : " << res << std::endl;

        // returns 0 when acc_malloc() is used!
        res = reduce_pointer(c->p, N);
        std::cout << "pointer: " << res << std::endl;

        res = reduce_chunk(c);
        std::cout << "chunk  : " << res << std::endl;

// Since "p" is a device pointer, it needs to be free'd
//#pragma acc exit data delete(c->p[0:N])
        acc_free(c->p);
#pragma acc exit data delete(c)
        return 0;
}
% nvc++ -acc reproducer.cpp; a.out
bozzo  : 523776
pointer: 523776
chunk  : 523776

Hope this helps,
Mat

1 Like

Thanks a lot Mat, it definitely helped!

I acknowledge my program was incorrect and a fix to use the deviceptr() clause when using a pointer returned by acc_malloc().

That being said, I am a bit surprised the compiler nor the runtime detected this.
Also, do you know why reduce_pointer() returned 0?
Did the compiler flag the openacc region as undefined behavior and removed the dead code?
Or did some code run on the GPU and used an other memory region that happened to be initialized with zeros?

Also, do you know why reduce_pointer() returned 0?

When I originally tried your code on x86, I got an illegal address error. It’s not until I moved over to a Grace-Hopper system was I able to reproduce your results.

Exactly why it “works”, I’m not sure but suspect that the stack address for “array” is getting picked up by the GPU since, on a GH system, the GPU can read host memory directly. The stack address is a host address so likely writing to the wrong spot.

FWIW, I used a Xeon machine with two A100 GPU and never got a crash.

“I was lucky it did not crash” is a reasonable explanation.

I will let you decide whether the fact the compiler/runtime did not detect this should be escalated.

Thanks again!