Parallelization was blocked by object in OpenACC region

Hello,

I have a question about the code below:

    vector< vector<int> > kernel;   // filled in elsewhere before use
    Mat dill(image.rows, image.cols, CV_8UC1, Scalar::all(0));
    auto *dillstart = dill.data;
    dilate(image, dill, element, Point(-1, -1), 1);
    int drows = dill.rows;
    int dcols = dill.cols;
    // drows/dcols must be defined before they are used here
    vector< vector<int> > shouldBeZeroImage(drows, vector<int>(dcols, 0));
    int n = kernel.size();
    int m = kernel[0].size();


    #pragma acc enter data copyin(dillstart[:drows*dcols], shouldBeZeroImage[:drows][:dcols], kernel[:n][:m])
    #pragma acc parallel loop collapse(2) default(present)
    for(int i = n / 2; i < drows - n / 2; i++) {
        for(int j = m / 2; j < dcols - m / 2; j++) {
            if((int)dillstart[i*dcols+j] == ONE) {
                int shouldBeZero = 0;
                //#pragma acc parallel loop
                for(int crtX = i - n / 2, x = 0; crtX <= i + n / 2; crtX++, x++) {
                    // #pragma omp parallel for
                    for(int crtY = j - m / 2, y = 0; crtY <= j + m / 2; crtY++, y++) {
                        if((int)dillstart[crtX*dcols+crtY] == ZERO && kernel[x][y] == 1) {
                            shouldBeZero = 1;
                            break;
                        }
                    }
                }
                if(shouldBeZero) {
                    shouldBeZeroImage[i][j] = 1;
                }
            }
        }
    }
    #pragma acc exit data copyout(dillstart[:drows*dcols], shouldBeZeroImage[:drows][:dcols], kernel[:n][:m])








Related information:
			  WatershedAlg::erosion(cv::Mat, std::vector<std::vector<int, std::allocator<int>>, std::allocator<std::vector<int, std::allocator<int>>>>):
     21, Generating enter data copyin(dillstart[:dcols*drows],shouldBeZeroImage,kernel[:m])
         Generating NVIDIA GPU code
         24, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
         25,   /* blockIdx.x threadIdx.x collapsed */
         30, #pragma acc loop seq
     21, Generating default present(dillstart[:],shouldBeZeroImage,kernel[:])
     30, Scalar last value needed after loop for shouldBeZero at line 41
         Complex loop carried dependence of dillstart->,kernel-> prevents parallelization
     32, Complex loop carried dependence of kernel->,dillstart-> prevents parallelization
     47, Generating exit data copyout(dillstart[:dcols*drows],shouldBeZeroImage)
         Generating enter data copyin(shouldBeZeroImage,dillstart[:dcols*drows])
         Generating exit data copyout(kernel[:m])
         Generating NVIDIA GPU code
         81, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
         82,   /* blockIdx.x threadIdx.x collapsed */

I don’t know why the parallelization was blocked by dillstart and kernel. Could anyone provide a hint or suggestion? Thanks in advance!

The message is just informational: the compiler can’t implicitly auto-parallelize the inner two for loops due to potential aliasing of the dillstart and kernel pointers (i.e., at compilation the compiler won’t know if they point to the same data, but since they could, it has to assume they do).

Now, I wouldn’t expect this given that neither array is assigned, but I might be missing some context. That said, if dillstart and kernel aren’t assigned in the compute region, why are you copying them out? If they remain unchanged, you should put them in a “delete” clause instead and save the data movement time.
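For example, since only shouldBeZeroImage is assigned in your loop nest, the exit directive could be written like this (a sketch based on your snippet):

    #pragma acc exit data copyout(shouldBeZeroImage[:drows][:dcols]) \
                          delete(dillstart[:drows*dcols], kernel[:n][:m])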

In general, you can explicitly parallelize loops that the compiler can’t auto-parallelize by adding “#pragma acc loop”. Though here, the “crtY” loop can’t be parallelized due to the “break”.

You have this in your comments, but you’ll want to remove “parallel”: “parallel” defines a region to offload, while “loop” expresses the loops to parallelize. By nesting “parallel” constructs, you’re asking each thread to launch additional kernels. Technically OpenACC allows this, but we don’t support nested parallelism since we’ve never found a compelling use case for it.
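To illustrate the distinction, here’s a minimal sketch (with hypothetical arrays “in” and “out”) of “loop” used inside an existing “parallel” region:

    // "parallel" opens one offload region; a plain "loop" directive (no
    // second "parallel") marks the inner loop for parallelization.
    void scale(const unsigned char *in, unsigned char *out, int rows, int cols) {
        #pragma acc parallel loop gang copyin(in[:rows*cols]) copyout(out[:rows*cols])
        for (int i = 0; i < rows; i++) {
            #pragma acc loop vector   // "loop", not another "parallel"
            for (int j = 0; j < cols; j++) {
                out[i * cols + j] = (unsigned char)(in[i * cols + j] * 2);
            }
        }
    }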

-Mat

Hello Mat,

Thanks for your reply.

I used auto* __restrict to remove the warning.

But I still have one question: is a “copyin” a shallow copy? If I use a pointer to the data address from an OpenCV Mat, is that a deep copy or a shallow copy?

If dillstart and kernel were not deleted, may I know if I could use kernel and dillstart in a following parallel region in the same C++ source file?

If dillstart and kernel are reused in a new acc parallel region, do I need any directive, or nothing at all?

Thanks in advance!

Shallow.

If I use a pointer to the data address from an OpenCV Mat, is that a deep copy or a shallow copy?

If you have something like “Mat * dill”, which is then allocated:

#pragma acc enter data copyin(dill) ← will copy just the pointer, so not what you want
#pragma acc enter data copyin(dill[0:1]) ← does a shallow copy of the class data members.

Assuming Mat has dynamic data members, you then need to perform a deep copy of those members. Something like:

#pragma acc enter data copyin(dill[0].data[0:size])

This creates the data array on the device and then implicitly “attaches” the data pointer to the parent object’s data member.
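Putting those pieces together, here’s a minimal sketch using a hypothetical simplified stand-in for Mat:

    // Hypothetical simplified stand-in for cv::Mat with a dynamic member.
    struct SimpleMat {
        int rows, cols;
        unsigned char *data;   // dynamically allocated on the host
    };

    void to_device(SimpleMat *dill) {
        int size = dill->rows * dill->cols;
        // Shallow copy of the object itself: rows, cols, and the raw
        // (host) pointer value of "data".
        #pragma acc enter data copyin(dill[0:1])
        // Deep copy: create the array on the device; the runtime then
        // implicitly attaches the device pointer to dill's "data" member.
        #pragma acc enter data copyin(dill[0].data[0:size])
    }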

Now this presumes that the class data members are accessible, i.e. “public”, and you know the underlying definition of the data members. Which may or may not be the case here.

It’s the main reason why I encouraged you to look at using CUDA Unified Memory, but since the allocation is done within OpenCV, you’d need to build OpenCV with nvc++ and add the “-gpu=managed” flag so the compiler can replace any “new” or “malloc” calls with calls to cudaMallocManaged.

If dillstart and kernel were not deleted, may I know if I could use kernel and dillstart in a following parallel region in the same C++ source file?

Kernel, yes, but if you delete dillstart (presumably you mean delete the device copy) then you’d need to recreate it to be used in another compute region.

If dillstart and kernel are reused in a new acc parallel region, do I need any directive, or nothing at all?

A data region defines the scope and lifetime of the device data. Any compute region between the start and end data pragmas, including those in subroutines, will have access to the device copies of the data defined in this data region.

In other words, yes, they will be reused, presuming that the exit data directive is placed after this second compute region.
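A minimal sketch of that reuse (one unstructured data region spanning two compute regions):

    void two_kernels(unsigned char *dillstart, int drows, int dcols) {
        #pragma acc enter data copyin(dillstart[:drows*dcols])

        // first compute region uses the device copy created above
        #pragma acc parallel loop present(dillstart[:drows*dcols])
        for (int i = 0; i < drows * dcols; i++)
            dillstart[i] = (dillstart[i] > 0) ? 1 : 0;

        // the device copy is still live here, so no second copyin is needed
        #pragma acc parallel loop present(dillstart[:drows*dcols])
        for (int i = 0; i < drows * dcols; i++)
            dillstart[i] *= 255;

        // copy the result back and end the data lifetime
        #pragma acc exit data copyout(dillstart[:drows*dcols])
    }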

-Mat


Hello Mat,

Thanks for your reply and hints.

auto * __restrict dillstart = dill.data; points to the first address of the data in the OpenCV matrix, and dill.data[0] is the first element of the matrix. Based on the definition of a deep copy, is ‘copyin dillstart[:drows*dcols]’ a deep copy instead of a shallow copy?

If it is a deep copy and there is no exit data directive to move the updates from device back to host, could I continue to use dillstart in the next parallel region without specifying ‘copyin dillstart[:drows*dcols]’ one more time?

When the updated data needs to be used, and ‘copyin dillstart[:drows*dcols]’ is a deep copy, should I use ‘exit data copyout’ to move the updated data out of the device?

No matter whether it is a shallow copy or a deep copy, there must be an ‘exit data copyout’, because the updated data must be transferred from device to host.

Is the reason we use a deep copy that the whole structure of the data type can be loaded onto the device and used there, for instance member access with ‘.’ on an object (p1.x[0])? Otherwise, one could only operate on plain arrays or vectors on the device. I am not quite sure whether my understanding is correct; if it is not, please kindly provide suggestions.

Thanks in advance

This is actually a shallow copy of the array. A deep copy is needed when there’s an aggregate type (ex. struct, class, jagged array) that contains a dynamic data member (ex. a pointer type).

If it is a deep copy and there is no exit data directive to move the updates from device back to host, could I continue to use dillstart in the next parallel region without specifying ‘copyin dillstart[:drows*dcols]’ one more time?

Again, it’s a shallow copy, but no, you do not need another “copyin” directive. “copyin” does two things: it creates the data on the device and then copies it to the device at the start of the data region. If the data is already present on the device and used in another nested data region, “present_or” semantics apply, meaning the nested copy is ignored.
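A short sketch of those “present_or” semantics (hypothetical array “a”):

    void example(float *a, int n) {
        // creates the device copy and transfers host -> device once
        #pragma acc enter data copyin(a[0:n])

        // this nested copyin finds the data already present, so no
        // second transfer happens ("present_or" semantics)
        #pragma acc data copyin(a[0:n])
        {
            #pragma acc parallel loop present(a[0:n])
            for (int i = 0; i < n; i++) a[i] += 1.0f;
        }

        #pragma acc exit data copyout(a[0:n])
    }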

When the updated data needs to be used, and ‘copyin dillstart[:drows*dcols]’ is a deep copy, should I use ‘exit data copyout’ to move the updated data out of the device?

To synchronize the host and device data within the scope of a data region, you’ll want to use the “update” directive. For example:

    // create and copy "dillstart" to the device
    #pragma acc enter data copyin(dillstart[:drows*dcols])

    // use dillstart in a device compute region
    #pragma acc parallel loop present(dillstart)
    for (.....

    // copy the data back from the device if dillstart gets assigned values
    // so the host and device copies are in sync
    #pragma acc update self(dillstart[:drows*dcols])

    // use dillstart on the host
    for (....

    // assuming dillstart is updated on the host, synchronize the device copy
    #pragma acc update device(dillstart[:drows*dcols])

    // use dillstart in a device compute region
    #pragma acc parallel loop present(dillstart)
    for (.....
    ...
    // delete the device copy and end the data region
    #pragma acc exit data delete(dillstart)

For performance reasons, you do want to minimize data movement, so only use “update” when really needed. Ideally, you want to create the data on the device, do all computation on the device, then bring back the results at the end.

Is the reason we use a deep copy that the whole structure of the data type can be loaded onto the device and used there, for instance member access with ‘.’ on an object (p1.x[0])? Otherwise, one could only operate on plain arrays or vectors on the device. I am not quite sure whether my understanding is correct; if it is not, please kindly provide suggestions.

Correct, “p1” is an object of an aggregate type with “x” being an allocated array (dynamic data member), in which case the deep copy is needed. If “x” were a fixed-size array (i.e., the size is fixed at compilation, like “double x[10]”), then only a shallow copy of “p1” is needed.
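A sketch of the contrast (hypothetical types “P1” and “P2”):

    struct P1 { double x[10]; };        // fixed-size member
    struct P2 { double *x; int n; };    // dynamic member

    void copy_examples(P1 *p1, P2 *p2) {
        // fixed-size array: a shallow copy of the object is enough
        #pragma acc enter data copyin(p1[0:1])
        // dynamic member: the shallow copy brings over only the pointer value...
        #pragma acc enter data copyin(p2[0:1])
        // ...so a deep copy of "x" is needed (creates and attaches it)
        #pragma acc enter data copyin(p2[0].x[0:p2->n])
    }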

-Mat


Hello Mat,

Thanks for your reply; the information is very helpful.

I still have one question.

If dill is an object of the Mat data type and a member function of dill needs to be used, how should I write the directives with #pragma acc? For instance, I want to call functions such as dill.at(x_coordinate, y_coordinate) or dill.copyTo(SOURCE_FILE). Or do I have to find the library file with the Mat definition and add #pragma acc routine? Please provide a general hint.

Thanks in advance!

If the compiler has access to the definition of the called function (i.e., if it’s a template or the definition is in a header file), then the compiler can implicitly create the device function.

If the definition is in a separate source file, then add “#pragma acc routine seq” to the prototype in the header file. This lets the compiler know, on the caller side, that a device routine exists, and also has it create a device version when compiling the source file containing the definition.
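For example, a sketch with a hypothetical function “clamp255” defined in a separate source file:

    // mymath.h: the prototype carries the routine directive
    #pragma acc routine seq
    int clamp255(int v);

    // mymath.cpp: compiled separately; the compiler also builds a device version
    int clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

    // caller.cpp: the device kernel can now call clamp255
    #include "mymath.h"
    void scale(unsigned char *a, int n) {
        #pragma acc parallel loop copy(a[:n])
        for (int i = 0; i < n; i++)
            a[i] = (unsigned char)clamp255(a[i] * 2);
    }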

Of course, if “dill.at” is in OpenCV, you’ll need to modify the OpenCV source and build it using nvc++ and the “-acc” flag.

-Mat