OpenACC deep copy approaches in c++. GPU pointers crashing

Hello,

I used this post Data Structures in C · fomics/EuroHack15 Wiki · GitHub as a reference to organize manual deep copy to GPU.

Unfortunately, I found out that it works with very large limitations.

It is matrix-matrix multiplication sample.
First of all, I created a class

class matrix
{
	public:
	int _n;
	float * _mat;
	matrix(){}
	~matrix(){}
	matrix(int n, float * mat)
	{
		_n=n;
		_mat=mat;
	}
	matrix(int n)
	{
		_mat = (float*)malloc(n * n * sizeof(float));
		_n=n;
	}
	
	void create(int n)
	{
		_mat = (float*)malloc(n * n * sizeof(float));
		_n=n;
	}	
	
	#pragma acc routine seq
	float * get(){return &_mat[0];}	
};

Then I allocate, init and copy objects to the GPU

	int n = 1000;
	matrix *a,*b,*c;
		
	a=new matrix(n);
	b=new matrix(n);
	c=new matrix(n);
		
	for (int i = 0; i < n * n; i++)
	{
		a->get()[i] = (float)rand() / RAND_MAX;
		b->get()[i] = (float)rand() / RAND_MAX;
	}

	acc_copyin( a, sizeof(matrix));
	acc_copyin( b, sizeof(matrix));
	acc_copyin( c, sizeof(matrix));
	acc_copyin( a->_mat, sizeof(float)*n*n);
	acc_copyin( b->_mat, sizeof(float)*n*n);
	acc_copyin( c->_mat, sizeof(float)*n*n);

Then I call kernels region:

	#pragma acc kernels present (a->_mat[0:n*n],b->_mat[0:n*n],c->_mat[0:n*n])
	{
	#pragma acc loop independent
	for (int i = 0; i < n; i++)
	{
		#pragma acc loop independent
		for (int j = 0; j < n; j++)
		{
			float temp = 0.0f;
			#pragma acc loop reduction(+:temp)
			for (int k = 0; k < n; k++)
			{
				temp += a->_mat[i * n + k] * b->_mat[k * n + j];
			}
			c->_mat[i * n + j] = temp;
		}	
	}	
	}

and copy data back

	acc_copyout( a->get(), sizeof(float)*n*n);
	acc_copyout( b->get(), sizeof(float)*n*n);
	acc_copyout( c->get(), sizeof(float)*n*n);

In this case everything works fine.
But if replace the pointers to objects with objects, like here:

	matrix a(n);
	matrix b(n);
	matrix c(n);

And replace all “->” calls with “.” changing copyin as follows:

	acc_copyin( &a, sizeof(matrix));
	acc_copyin( &b, sizeof(matrix));
	acc_copyin( &c, sizeof(matrix));
	acc_copyin( a._mat, sizeof(float)*n*n);
	acc_copyin( b._mat, sizeof(float)*n*n);
	acc_copyin( c._mat, sizeof(float)*n*n);

Then it crashes during the kernel execution with

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Moreover, I found out that inside of kernels region I’m unable to make statements like this:

				float * temp1 = a->_mat;
				temp += temp1[i * n + k] * b->_mat[k * n + j];

or this

			float &temp1=c->_mat[i*n+j];
			temp1 = temp;

or even call a simple function, returning pointers to a device variables:

	#pragma acc routine seq
	float * get(){return &_mat[0];}	
...
				temp += a->get()[i * n + k] * b->get()[k * n + j];

All of this works fine with cpu binary.
Compile options and PGI version:

pgc++ -fast matmul.cpp -o matmul.cpu -std=c++11
pgc++ -acc -Minfo=accel -fast -ta=tesla:cc35 matmul.cpp -o matmul.acc -std=c++11

pgc++ 17.10-0 64-bit target on x86-64 Linux -tp nehalem

Hi aivakhnenko,

You missed that you need to add the calls to “acc_attach” which will go back and fill in the device pointer addresses to “_mat”. Otherwise, you’re dereferencing the host addresses, and hence get the illegal memory address errors.

Though what I would suggest is to update your matrix class to handle the device data management itself, rather than managing the device data from main. I’ve written an example below to show how to do this.

For the other errors, I’d need to see a full reproducing example to see what’s going on. I’ve done similar things inside of compute regions, so need to see the error in context.

Hope this helps,
Mat

“Matrix.h”

#ifdef _OPENACC
#include <openacc.h>
#endif
#include <stdlib.h>

class matrix
 {
    public:
    int _n;
    float * _mat;
    bool _external_mat; // Track if _mat is managed by this class
#ifdef _OPENACC
    bool _device_mat;  // Track if the device data is managed by this class
                       // or externally
#endif
    matrix( ) {
// Initalize data in default constructor including adding the
//  device copy of the this pointer.
       _n=0;
       _mat=NULL;
       _external_mat=false;
#ifdef _OPENACC
       _device_mat=false;
       #pragma acc enter data copyin(this)
#endif
    }
    ~matrix( ){
// Delete the data on the device only if it's managed by this class
       if (_mat != NULL && !_external_mat) {
#ifdef _OPENACC
           if (_device_mat) {
              #pragma acc exit data delete(_mat)
           }
#endif
           delete [] _mat;
       }
       #pragma acc exit data delete(this)
    }
    matrix(int n, float * mat)
    {
// Since the array is being passed in, check to see if it's already
// on the device, if so, then attach it.  Otherwise create and copy it in.
       _n=n;
       _mat=mat;
       _external_mat=true;
#ifdef _OPENACC
       #pragma acc enter data copyin(this)
       if (acc_is_present((void*)mat,_n*_n*sizeof(float))) {
          _device_mat=false;
          acc_attach((void**)&_mat);
       } else {
          _device_mat=true;
          #pragma acc enter data copyin(_mat[0:_n*_n])
       }
#endif
    }

    explicit matrix(int n)
    {
       _mat = (float*)malloc(n * n * sizeof(float));
       _n=n;
       _external_mat=false;
#ifdef _OPENACC
       _device_mat=true;
       #pragma acc enter data copyin(this)
       #pragma acc enter data create(_mat[0:n*n])
#endif

    }

    void create(int n)
    {
       _mat = (float*)malloc(n * n * sizeof(float));
       _n=n;
#ifdef _OPENACC
       _device_mat=true;
       #pragma acc enter data create(_mat[0:_n*_n])
#endif
    }

    float * get(){return &_mat[0];}



#ifdef _OPENACC
// Extend the class so that it can copy the data to/from the device
    void acc_update_self( ) {
        #pragma acc update self(_mat[0:_n*_n])
    }

    void acc_update_device( ) {
        #pragma acc update device(_mat[0:_n*_n])
    }
#endif

};

“Matrix.cpp”

#include <iostream>
#include "matrix.h"

int main( ) {

    int n = 1000;
    float * c_arr;
    c_arr = new float[n*n];
#pragma acc enter data create(c_arr[0:n*n])

// Create the arrays three different ways to test each of the cases in 
//  the matrix class.  
//  1) Use the constructor with the size passed in
//  2) Use the default constructor and then call "create"
//  3) Pass in an already created array

    matrix a(n),b,c(n,c_arr);
    b.create(n);

    for (int i = 0; i < n; i++)
    {
       for (int j = 0; j < n; j++)
       {
           a._mat[i * n + j]=1.0;
           b._mat[i * n + j]=2.0;
       }
    }
#ifdef _OPENACC
    a.acc_update_device();
    b.acc_update_device();
#endif

    #pragma acc parallel loop collapse(2) present(a,b,c,a._mat,b._mat,c._mat)
    for (int i = 0; i < n; i++)
    {
       for (int j = 0; j < n; j++)
       {
          float temp = 0.0f;
          float * temp1 = a._mat;
          #pragma acc loop reduction(+:temp)
          for (int k = 0; k < n; k++)
          {
             temp += temp1[i * n + k] * b._mat[k * n + j];
          }
          c._mat[i * n + j] = temp;
       }
    }
#ifdef _OPENACC
    c.acc_update_self();
#endif
    for (int i = 0; i < 10; i++)
    {
       std::cout << i << ":";
       for (int j = 0; j < 10; j++)
       {
          std::cout << " " << c._mat[i * n + j];
       }
       std::cout << std::endl;
    }
#pragma acc exit data delete(c_arr)
    delete [] c_arr;
    exit(0);
}



% pgc++ matrix.cpp -fast -ta=tesla:cc60 -Minfo=accel -V17.10
main:
     11, Generating enter data create(c_arr[:n*n])
     24, Generating present(a._mat[:],b._mat[:],a,b,c._mat[:],c)
         Accelerator kernel generated
         Generating Tesla code
         28, #pragma acc loop gang collapse(2) /* blockIdx.x */
         30,   /* blockIdx.x collapsed */
         34, #pragma acc loop vector(128) /* threadIdx.x */
             Generating reduction(+:temp)
     34, Loop is parallelizable
     54, Generating exit data delete(c_arr[:1])
matrix::~matrix():
      3, include "matrix.h"
          33, Generating exit data delete(_mat[:1])
          38, Generating exit data delete(this[:1])
matrix::matrix(int, float *):
      3, include "matrix.h"
          48, Generating enter data copyin(this[:1])
          54, Generating enter data copyin(_mat[:_n*_n])
matrix::matrix(int):
      3, include "matrix.h"
          69, Generating enter data copyin(this[:1])
              Generating enter data create(_mat[:n*n])
matrix::create(int):
      3, include "matrix.h"
          79, Generating enter data create(_mat[:_n*_n])
matrix::acc_update_self():
      3, include "matrix.h"
          89, Generating update self(_mat[:_n*_n])
matrix::acc_update_device():
      3, include "matrix.h"
          93, Generating update device(_mat[:_n*_n])
% a.out
0: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
1: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
2: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
3: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
4: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
5: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
6: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
7: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
8: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
9: 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000

[/code]

Hi Mat,

Thanks for your help!
Your example helped me to find the main problem in my code. Of course I tried both with acc_attach and without it. But all my troubles came from this line:

#pragma acc kernels present (a._mat[0:n*n],b._mat[0:n*n],c._mat[0:n*n])

In general case, replacing it with the following fixes everything.

	#pragma acc kernels present (a,b,c,a._mat[0:n*n],b._mat[0:n*n],c._mat[0:n*n])