OpenACC pgc++ compiling error duplicate lives at 0x7ffff78db5c0 size 96 partially present

Hello,

I used OpenCV to implement data transfer with OpenACC, but I got an error when I compiled and ran the code. Please take a look at the information below:

#include <opencv2/opencv.hpp>
#include<queue>
#include <vector>
#include<random>
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/highgui.hpp>
#include<openacc.h>

using namespace std;
using namespace cv;


int main(){

    cv::Mat srcImg=cv::imread("/home/usera/images/blue-mountains.jpg");
    Mat grayScale(srcImg.rows, srcImg.cols, CV_8UC1, Scalar::all(0));
    Mat duplicate(srcImg.rows,srcImg.cols, CV_8UC1,Scalar::all(255) );

    if(!srcImg.data){

     cout<<"The file is not loaded or does not exist"<<endl;
     return -1;

     }

             
       for(int i = 0; i < srcImg.rows; i++) {
          for(int j = 0; j < srcImg.cols; j++) {
             double gray = 0.21 * srcImg.at<cv::Vec3b>(i,j)[0] +
                           0.72 * srcImg.at<cv::Vec3b>(i,j)[1] +
                           0.07 * srcImg.at<cv::Vec3b>(i,j)[2];
             grayScale.at<uchar>(i,j) = (uchar) gray;
            }
       
        }

     cout<<"Matrix grayScale :"<<grayScale.rows<<" "<<grayScale.cols<<endl;

int vrows=srcImg.rows;
int vcols=srcImg.cols;

cout<<"Step"<<grayScale.step<<endl;


int b[3][5];

#pragma acc parallel loop copy(b[:3][:5])
for(int i=0;i<3;i++){
  #pragma acc loop	
  for(int j=0;j<5;j++){
    b[i][j]=i+j;


   }
}

cout<<"b[N-1][M-2] :"<<b[1][1]<<endl;
cout<<"b[N][M] :"<<b[2][4]<<endl;
 
auto *startaddress=grayScale.data;

cout<<(int)*(startaddress+1)<<endl;
cout<<(int)*(startaddress+2)<<endl;



        #pragma acc parallel loop collapse(2)
        for(int i=0;i<vrows;i++){
        //#pragma acc loop
        for(int j=0;j<vcols;j++){

          duplicate.at<uchar>(i,j)=10;//grayScale.at<uchar>(i,j);

       }

      }

    cout<<"duplicate"<<": "<<(int)grayScale.at<uchar>(23,45)<<endl;
    cout<<"duplicate"<<": "<<(int)duplicate.at<uchar>(23,45)<<endl;
}



The result shows:

Matrix grayScale :810 1440
Step1440
b[N-1][M-2] :2
b[N][M] :6
194
195
duplicate lives at 0x7ffff78db5c0 size 96 partially present
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 8.6, threadid=1
host:0x7ffff78db610 device:0x7f1ba56fa000 size:8 presentcount:1+0 line:129 name:(null)
allocated block device:0x7f1ba56fa000 size:512 thread:1
FATAL ERROR: variable in data clause is partially present on the device: name=duplicate

I wonder whether there is not enough space for the gangs or vectors requested by the object duplicate.
Could anyone provide any hints or suggestions?

Thanks in advance.

Good, it looks like you got past the libgomp issue. Did you end up building OpenCV with nvc++, or were you able to solve it another way?

FATAL ERROR: variable in data clause is partially present on the device: name=duplicate

A partially present error means that there’s an overlap in the memory being created on the GPU between two or more variables where the sizes don’t match up.

Given you don’t use data regions to copy “duplicate”, I suspect in this case you’re relying on the compiler attempting to implicitly copy the data structure for you. Hence it’s probably creating a pointer data member in “duplicate” first, so when it then creates “duplicate”, the memory overlaps.

The compiler is limited in that it can only do an implicit shallow copy. When one or more members of an aggregate are dynamically allocated, you need to do a manual deep copy. In C/C++, a pointer member carries no bounds, so the compiler has no shape or size to use for performing the deep copy itself.

Full details on manual deep copy can be found at: Deep Copy Support in OpenACC | PGI

Though I presume that since the “Mat” class is defined in OpenCV, you may or may not be able to easily do the manual deep copy. Hence, as I suggested in your first post, try using CUDA Unified Memory instead (i.e. add “-gpu=managed” to your compilation).

In this case, allocated memory will be managed by the CUDA driver so you don’t need to do the deep copy. The caveat is that static memory isn’t managed, so depending on whether “duplicate” itself is allocated or not, you may still want to add it to a data region. Something like:


#pragma acc enter data copyin(duplicate)
... 
        #pragma acc parallel loop collapse(2) present(duplicate)
        for(int i=0;i<vrows;i++){
        //#pragma acc loop
        for(int j=0;j<vcols;j++){
          duplicate.at<uchar>(i,j)=10;//grayScale.at<uchar>(i,j);
       }
      }
....
#pragma acc exit data delete(duplicate)

Hope this helps,
Mat

Hello Mat,

Thanks for your reply.

I added the new directives to the code and updated the build script to add -gpu=managed:

  #pragma acc enter data copyin(duplicate)

        #pragma acc parallel loop collapse(2) present(duplicate)
        for(int i=0;i<vrows;i++){
        //#pragma acc loop
        for(int j=0;j<vcols;j++){

          duplicate.at<uchar>(i,j)=10;//grayScale.at<uchar>(i,j);

        }

       }

    #pragma acc exit data delete(duplicate)

But I got an error when I ran the executable:
$./test4

Matrix grayScale :810 1440
Step1440
b[N-1][M-2] :2
b[N][M] :6
194
195
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution.

Does this mean the duplicate object is copied to the wrong address on the device? Or is the duplicate object a data type that OpenACC cannot handle?

Well, at least you got past the partially present error.

An illegal address is like a seg fault on the host, meaning that the kernel encountered a bad address. It’s somewhat generic, with no single root cause. Typical culprits are accessing host pointers on the device, out-of-bounds errors, heap overflows, stack overflows, null pointers, etc.

I should have thought of this before, but when using UM, nvc++ will replace underlying allocation calls (like ‘new’ and ‘malloc’) with calls to “cudaMallocManaged”. But given your OpenCV is built with g++, this replacement didn’t occur and the code is accessing a host address on the device.

I can think of two paths forward. Either go back and rebuild OpenCV with “nvc++” adding “-gpu=managed” to the compile flag (or at least the source that contains the allocator for “Mat”), or do a manual deep copy of the data structure.

For the deep copy, you’ll need to go find the definition for “Mat” to determine the data members. You may get lucky and it’s a simple structure, so relatively easy, but it could be complex as well. Worst would be if the data members are private, because then you can’t access them directly from main. You’d need to put the deep copy in the class itself or make the members public and rebuild.

If it is simple, then you can follow the examples in Michael’s blog post to do the deep copy.

    typedef struct{
        float *x, *y, *z;
        float coefx, coefy, coefz;
        size_t n;
    }points1;
....
    points1 p1;
    ...
    #pragma acc enter data copyin(p1)
    #pragma acc enter data copyin(p1.x[0:p1.n], p1.y[0:p1.n], p1.z[0:p1.n])
    #pragma acc parallel loop default(present) ...
    for (i = 0; i < p1.n; ++i) {
        p1.x[i] = ....
    }

Hello Mat,

Thanks for your suggestion.

After reading the information, I updated the directives:

 #pragma acc enter data copyin(startaddress[0:vrows*vcols])
//   #pragma acc enter data copyin(grayScale) attach(grayScale.data)
   #pragma acc parallel loop default(present)
   for(int i=0;i<vrows*vcols;i++){
           startaddress[i]=20;
   }

 #pragma acc exit data copyout(startaddress[0:vrows*vcols])

Finally, it works, with this result:

main:
    106, Generating copy(b[:][:]) [if not already present]
         Generating NVIDIA GPU code
        109, #pragma acc loop gang /* blockIdx.x */
        111, #pragma acc loop seq
    111, Loop is parallelizable
    129, Generating enter data copyin(startaddress[:vcols*vrows])
         Generating NVIDIA GPU code
        151, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    129, Generating default present(startaddress[:vcols*vrows])
    159, Generating exit data copyout(startaddress[:vcols*vrows])

$ ./test4
Matrix grayScale :810 1440
Step1440
b[N-1][M-2] :2
b[N][M] :6
194
195
-------------------------
20
20
-------------------------
duplicate: 20
duplicate: 255

code for reference:

#include <opencv2/opencv.hpp>
#include<queue>
#include <vector>
#include<random>
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/highgui.hpp>
#include<openacc.h>

using namespace std;
using namespace cv;


int main(){

     cv::Mat srcImg=cv::imread("/home/usera/images/blue-mountains.jpg");



      Mat grayScale(srcImg.rows, srcImg.cols, CV_8UC1, Scalar::all(0));
      Mat duplicate(srcImg.rows,srcImg.cols, CV_8UC1,Scalar::all(255) );

    if(!srcImg.data){

     cout<<"The file is not loaded or does not exist"<<endl;
     return -1;

     }

         
     
       for(int i = 0; i < srcImg.rows; i++) {
          for(int j = 0; j < srcImg.cols; j++) {
             double gray = 0.21 * srcImg.at<cv::Vec3b>(i,j)[0] +
                           0.72 * srcImg.at<cv::Vec3b>(i,j)[1] +
                           0.07 * srcImg.at<cv::Vec3b>(i,j)[2];
             grayScale.at<uchar>(i,j) = (uchar) gray;
            }
       
        }




     cout<<"Matrix grayScale :"<<grayScale.rows<<" "<<grayScale.cols<<endl;

int vrows=srcImg.rows;
int vcols=srcImg.cols;

cout<<"Step"<<grayScale.step<<endl;

int b[3][5];

#pragma acc parallel loop copy(b[:3][:5])
for(int i=0;i<3;i++){
  #pragma acc loop	
  for(int j=0;j<5;j++){
    b[i][j]=i+j;


   }
}



cout<<"b[N-1][M-2] :"<<b[1][1]<<endl;
cout<<"b[N][M] :"<<b[2][4]<<endl;

  
auto *startaddress=grayScale.data;

cout<<(int)*(startaddress+1)<<endl;
cout<<(int)*(startaddress+2)<<endl;

   #pragma acc enter data copyin(startaddress[0:vrows*vcols])
//   #pragma acc enter data copyin(grayScale) attach(grayScale.data)
   #pragma acc parallel loop default(present)
   for(int i=0;i<vrows*vcols;i++){
	   startaddress[i]=20;
   }
   
 #pragma acc exit data copyout(startaddress[0:vrows*vcols])

   cout<<"-------------------------"<<endl;
   cout<<(int)grayScale.data[20]<<endl;
   cout<<(int)startaddress[20]<<endl;
   cout<<"-------------------------"<<endl;
   cout<<"duplicate"<<": "<<(int)grayScale.at<uchar>(23,45)<<endl;
   cout<<"duplicate"<<": "<<(int)duplicate.at<uchar>(23,45)<<endl;

}

One more thing: I did not see “Loop is parallelizable” for the startaddress[0:vrows*vcols] loop. Does that mean the parallelization has failed?

I am also wondering whether I could use two nested loops over i and j to do the value assignment, something like:
#pragma acc …
for(int i;…;){
#pragma acc
for(){
}
}