OpenACC deep copy approaches in c++. GPU pointers crashing

aivakhnenko · December 16, 2017, 11:06am

Hello,

I used this post Data Structures in C · fomics/EuroHack15 Wiki · GitHub as a reference to organize manual deep copy to GPU.

Unfortunately, I found out that it works with very large limitations.

It is matrix-matrix multiplication sample.
First of all, I created a class

class matrix
{
	public:
	int _n;
	float * _mat;
	matrix(){}
	~matrix(){}
	matrix(int n, float * mat)
	{
		_n=n;
		_mat=mat;
	}
	matrix(int n)
	{
		_mat = (float*)malloc(n * n * sizeof(float));
		_n=n;
	}
	
	void create(int n)
	{
		_mat = (float*)malloc(n * n * sizeof(float));
		_n=n;
	}	
	
	#pragma acc routine seq
	float * get(){return &_mat[0];}	
};

Then I allocate, init and copy objects to the GPU

	int n = 1000;
	matrix *a,*b,*c;
		
	a=new matrix(n);
	b=new matrix(n);
	c=new matrix(n);
		
	for (int i = 0; i < n * n; i++)
	{
		a->get()[i] = (float)rand() / RAND_MAX;
		b->get()[i] = (float)rand() / RAND_MAX;
	}

	acc_copyin( a, sizeof(matrix));
	acc_copyin( b, sizeof(matrix));
	acc_copyin( c, sizeof(matrix));
	acc_copyin( a->_mat, sizeof(float)*n*n);
	acc_copyin( b->_mat, sizeof(float)*n*n);
	acc_copyin( c->_mat, sizeof(float)*n*n);

Then I call kernels region:

	#pragma acc kernels present (a->_mat[0:n*n],b->_mat[0:n*n],c->_mat[0:n*n])
	{
	#pragma acc loop independent
	for (int i = 0; i < n; i++)
	{
		#pragma acc loop independent
		for (int j = 0; j < n; j++)
		{
			float temp = 0.0f;
			#pragma acc loop reduction(+:temp)
			for (int k = 0; k < n; k++)
			{
				temp += a->_mat[i * n + k] * b->_mat[k * n + j];
			}
			c->_mat[i * n + j] = temp;
		}	
	}	
	}

and copy data back

	acc_copyout( a->get(), sizeof(float)*n*n);
	acc_copyout( b->get(), sizeof(float)*n*n);
	acc_copyout( c->get(), sizeof(float)*n*n);

In this case everything works fine.
But if replace the pointers to objects with objects, like here:

	matrix a(n);
	matrix b(n);
	matrix c(n);

And replace all “->” calls with “.” changing copyin as follows:

	acc_copyin( &a, sizeof(matrix));
	acc_copyin( &b, sizeof(matrix));
	acc_copyin( &c, sizeof(matrix));
	acc_copyin( a._mat, sizeof(float)*n*n);
	acc_copyin( b._mat, sizeof(float)*n*n);
	acc_copyin( c._mat, sizeof(float)*n*n);

Then it crashes during the kernel execution with

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Moreover, I found out that inside of kernels region I’m unable to make statements like this:

				float * temp1 = a->_mat;
				temp += temp1[i * n + k] * b->_mat[k * n + j];

or this

			float &temp1=c->_mat[i*n+j];
			temp1 = temp;

or even call a simple function, returning pointers to a device variables:

	#pragma acc routine seq
	float * get(){return &_mat[0];}	
...
				temp += a->get()[i * n + k] * b->get()[k * n + j];

All of this works fine with cpu binary.
Compile options and PGI version:

pgc++ -fast matmul.cpp -o matmul.cpu -std=c++11
pgc++ -acc -Minfo=accel -fast -ta=tesla:cc35 matmul.cpp -o matmul.acc -std=c++11

pgc++ 17.10-0 64-bit target on x86-64 Linux -tp nehalem

MatColgrove · December 18, 2017, 7:40pm

Hi aivakhnenko,

You missed that you need to add the calls to “acc_attach” which will go back and fill in the device pointer addresses to “_mat”. Otherwise, you’re dereferencing the host addresses, and hence get the illegal memory address errors.

Though what I would suggest is to update your matrix class to handle the device data management itself, rather than managing the device data from main. I’ve written an example below to show how to do this.

For the other errors, I’d need to see a full reproducing example to see what’s going on. I’ve done similar things inside of compute regions, so need to see the error in context.

Hope this helps,
Mat

“Matrix.h”

#ifdef _OPENACC
#include <openacc.h>
#endif
#include <stdlib.h>

class matrix
 {
    public:
    int _n;
    float * _mat;
    bool _external_mat; // Track if _mat is managed by this class
#ifdef _OPENACC
    bool _device_mat;  // Track if the device data is managed by this class
                       // or externally
#endif
    matrix( ) {
// Initalize data in default constructor including adding the
//  device copy of the this pointer.
       _n=0;
       _mat=NULL;
       _external_mat=false;
#ifdef _OPENACC
       _device_mat=false;
       #pragma acc enter data copyin(this)
#endif
    }
    ~matrix( ){
// Delete the data on the device only if it's managed by this class
       if (_mat != NULL && !_external_mat) {
#ifdef _OPENACC
           if (_device_mat) {
              #pragma acc exit data delete(_mat)
           }
#endif
           delete [] _mat;
       }
       #pragma acc exit data delete(this)
    }
    matrix(int n, float * mat)
    {
// Since the array is being passed in, check to see if it's already
// on the device, if so, then attach it.  Otherwise create and copy it in.
       _n=n;
       _mat=mat;
       _external_mat=true;
#ifdef _OPENACC
       #pragma acc enter data copyin(this)
       if (acc_is_present((void*)mat,_n*_n*sizeof(float))) {
          _device_mat=false;
          acc_attach((void**)&_mat);
       } else {
          _device_mat=true;
          #pragma acc enter data copyin(_mat[0:_n*_n])
       }
#endif
    }

    explicit matrix(int n)
    {
       _mat = (float*)malloc(n * n * sizeof(float));
       _n=n;
       _external_mat=false;
#ifdef _OPENACC
       _device_mat=true;
       #pragma acc enter data copyin(this)
       #pragma acc enter data create(_mat[0:n*n])
#endif

    }

    void create(int n)
    {
       _mat = (float*)malloc(n * n * sizeof(float));
       _n=n;
#ifdef _OPENACC
       _device_mat=true;
       #pragma acc enter data create(_mat[0:_n*_n])
#endif
    }

    float * get(){return &_mat[0];}



#ifdef _OPENACC
// Extend the class so that it can copy the data to/from the device
    void acc_update_self( ) {
        #pragma acc update self(_mat[0:_n*_n])
    }

    void acc_update_device( ) {
        #pragma acc update device(_mat[0:_n*_n])
    }
#endif

};

“Matrix.cpp”

#include <iostream>
#include "matrix.h"

int main( ) {

    int n = 1000;
    float * c_arr;
    c_arr = new float[n*n];
#pragma acc enter data create(c_arr[0:n*n])

// Create the arrays three different ways to test each of the cases in 
//  the matrix class.  
//  1) Use the constructor with the size passed in
//  2) Use the default constructor and then call "create"
//  3) Pass in an already created array

    matrix a(n),b,c(n,c_arr);
    b.create(n);

    for (int i = 0; i < n; i++)
    {
       for (int j = 0; j < n; j++)
       {
           a._mat[i * n + j]=1.0;
           b._mat[i * n + j]=2.0;
       }
    }
#ifdef _OPENACC
    a.acc_update_device();
    b.acc_update_device();
#endif

    #pragma acc parallel loop collapse(2) present(a,b,c,a._mat,b._mat,c._mat)
    for (int i = 0; i < n; i++)
    {
       for (int j = 0; j < n; j++)
       {
          float temp = 0.0f;
          float * temp1 = a._mat;
          #pragma acc loop reduction(+:temp)
          for (int k = 0; k < n; k++)
          {
             temp += temp1[i * n + k] * b._mat[k * n + j];
          }
          c._mat[i * n + j] = temp;
       }
    }
#ifdef _OPENACC
    c.acc_update_self();
#endif
    for (int i = 0; i < 10; i++)
    {
       std::cout << i << ":";
       for (int j = 0; j < 10; j++)
       {
          std::cout << " " << c._mat[i * n + j];
       }
       std::cout << std::endl;
    }
#pragma acc exit data delete(c_arr)
    delete [] c_arr;
    exit(0);
}

% pgc++ matrix.cpp -fast -ta=tesla:cc60 -Minfo=accel -V17.10
main:
     11, Generating enter data create(c_arr[:n*n])
     24, Generating present(a._mat[:],b._mat[:],a,b,c._mat[:],c)
         Accelerator kernel generated
         Generating Tesla code
         28, #pragma acc loop gang collapse(2) /* blockIdx.x */
         30,   /* blockIdx.x collapsed */
         34, #pragma acc loop vector(128) /* threadIdx.x */
             Generating reduction(+:temp)
     34, Loop is parallelizable
     54, Generating exit data delete(c_arr[:1])
matrix::~matrix():
      3, include "matrix.h"
          33, Generating exit data delete(_mat[:1])
          38, Generating exit data delete(this[:1])
matrix::matrix(int, float *):
      3, include "matrix.h"
          48, Generating enter data copyin(this[:1])
          54, Generating enter data copyin(_mat[:_n*_n])
matrix::matrix(int):
      3, include "matrix.h"
          69, Generating enter data copyin(this[:1])
              Generating enter data create(_mat[:n*n])
matrix::create(int):
      3, include "matrix.h"
          79, Generating enter data create(_mat[:_n*_n])
matrix::acc_update_self():
      3, include "matrix.h"
          89, Generating update self(_mat[:_n*_n])
matrix::acc_update_device():
      3, include "matrix.h"
          93, Generating update device(_mat[:_n*_n])
% a.out
0: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
1: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
2: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
3: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
4: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
5: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
6: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
7: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
8: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
9: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000

[/code]

aivakhnenko · December 18, 2017, 8:31pm

Hi Mat,

Thanks for your help!
Your example helped me to find the main problem in my code. Of course I tried both with acc_attach and without it. But all my troubles came from this line:

#pragma acc kernels present (a._mat[0:n*n],b._mat[0:n*n],c._mat[0:n*n])

In general case, replacing it with the following fixes everything.

	#pragma acc kernels present (a,b,c,a._mat[0:n*n],b._mat[0:n*n],c._mat[0:n*n])

Topic		Replies	Views
OpenACC Fortran Derived Types with Pointer Elements Legacy PGI Compilers	3	6282	March 15, 2016
C array of pointers in OpenACC Legacy PGI Compilers	4	4975	August 26, 2015
OpenACC pgc++ compiling error duplicate lives at 0x7ffff78db5c0 size 96 partially present Legacy PGI Compilers	4	794	September 15, 2022
OpenACC regions with C++ structs nvc, nvc++ and nvfortran	3	804	January 8, 2021
Strange issues with OpenACC data and loop directives in C++ classes Legacy PGI Compilers	6	1026	January 8, 2021
OpenACC: Copying array, returned by a function, to device nvc, nvc++ and nvfortran	10	925	August 11, 2021
Questions on incorrect results with openacc in GPU nvc, nvc++ and nvfortran	33	2404	December 4, 2023
Calling object function in another object's function causes OpenACC code to crash nvc, nvc++ and nvfortran	6	588	October 12, 2021
C++ class in OpenACC nvc, nvc++ and nvfortran	3	419	May 18, 2023
OpenACC copy clause for pointer member of struct does not get attached if parent is copied implicitly nvc, nvc++ and nvfortran	5	1116	February 24, 2022

OpenACC deep copy approaches in c++. GPU pointers crashing

Related topics