Openacc - nested loops - illegal address during kernel execution

Inside a function I developed, I’m attempting to parallelize the following nested loops using OpenACC.

#ifdef USE_OPENACC
    #pragma acc parallel loop collapse(3) present(
        RGrid_->values_[0:NRGrid_], ZGrid_->values_[0:NZGrid_], PhiGrid_->values_[0:NPhiGrid_], 
        isInsideTheLastCMS_->values_[0:NRGrid_*NZGrid_*NPhiGrid_], 
        iZetaMin_for_pointsInsideTheLastCMS_->values_[0:NRGrid_*NZGrid_*NPhiGrid_], 
        iZetaMax_for_pointsInsideTheLastCMS_->values_[0:NRGrid_*NZGrid_*NPhiGrid_]
    )
#else
    #pragma omp parallel for collapse(3)
#endif
for (U iR = 0; iR < NRGrid_; iR++) {
    for (U iZ = 0; iZ < NZGrid_; iZ++) {
        for (U iPhi = 0; iPhi < NPhiGrid_; iPhi++) {
            
            // Define point coordinates
            T RPoint = RGrid_->values_[iR];
            T ZPoint = ZGrid_->values_[iZ];
            T PhiPoint = PhiGrid_->values_[iPhi];

            // Convert to Cartesian coordinates
            T XPoint = RPoint * std::cos(PhiPoint);
            T YPoint = RPoint * std::sin(PhiPoint);

            // Initialize variables
            U iRho = NRho_ - 1;
            U iZeta_FC_1 = 0, iZeta_FC_2 = 0; 
            U iTheta_FC_1_p1 = 0, iTheta_FC_1_p2 = 0, iTheta_FC_2_p = 0;
            
            T X3d, Y3d, Z3d;
            T ddist2_min, ddist2;

            // Compute initial distance
            X3d = (*R3D_)(iRho, 0, 0) * std::cos((*Phi3D_)(iRho, 0, 0));
            Y3d = (*R3D_)(iRho, 0, 0) * std::sin((*Phi3D_)(iRho, 0, 0));
            Z3d = (*Z3D_)(iRho, 0, 0);
            
            ddist2_min = (X3d - XPoint) * (X3d - XPoint) +
                         (Y3d - YPoint) * (Y3d - YPoint) +
                         (Z3d - ZPoint) * (Z3d - ZPoint);
            
            // Find the closest point on LCMS
            #ifdef USE_OPENACC
                #pragma acc seq  
            #endif
            for (U iTheta = 0; iTheta < NTheta_; iTheta++) {   
                for (U iZeta = 0; iZeta < NZeta_; iZeta++) {
                    X3d = (*R3D_)(iRho, iTheta, iZeta) * std::cos((*Phi3D_)(iRho, iTheta, iZeta));
                    Y3d = (*R3D_)(iRho, iTheta, iZeta) * std::sin((*Phi3D_)(iRho, iTheta, iZeta));
                    Z3d = (*Z3D_)(iRho, iTheta, iZeta);
                    
                    ddist2 = (X3d - XPoint) * (X3d - XPoint) + 
                             (Y3d - YPoint) * (Y3d - YPoint) + 
                             (Z3d - ZPoint) * (Z3d - ZPoint);
                    
                    if (ddist2 < ddist2_min) {
                        ddist2_min = ddist2;
                        iTheta_FC_1_p1 = iTheta;
                        iZeta_FC_1 = iZeta;                    
                    }
                }
            }
        }
    }
}

During compilation, I don’t have any problem, Howeve, when executing the code, I recieve the following message

Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
 File: /home/homam/Documents/MyCodes/SLIM/GoAHead/ToroidalSolver/FluxSurfaces3DReduced.h
 Function: _ZN21FluxSurfaces3DReducedIjdE39locateCylGridPointsWithRespectToLastCMSEv:1378
 Line: 1409

I don’t understand the source of this error. Moreover, the error disappears when I comment out the following lines in the code above.

 ddist2_min = ddist2;
                        iTheta_FC_1_p1 = iTheta;
                        iZeta_FC_1 = iZeta;    

Do you have an idea what is happening? Did I make a mistake in the way I’m parallelizing these 5 nested loops?

thanks

An illegal access error is fairly generic and just means the device code encountered a bad address. It’s similar to a segmentation violation on the CPU. The most common causes are things like using a host address on the device, an out-of-bounds access, or a stack/heap overflow.

For example, if you didn’t properly do a deep copy on one of the structs (assuming you’re not using managed or unified memory), then that could cause a host address to be passed in.

Though since you don’t get the error when commenting out this one section, let’s focus there.

iTheta_FC_1_p1 = iTheta;
iZeta_FC_1 = iZeta;

I don’t see how “iZeta” and “iTheta” are declared, but I’ll assume that these are scalars declared on the host so are implicitly firstprivate so should be fine. Plus if they weren’t I expect the code to error elsewhere.

The “iTheta_FC_1_p1” and “iZeta_FC_1” are statically initialized local variables that don’t appear to be used elsewhere in the code. Hence one theory would be that you’re getting a stack overflow due to the static initialization. But when commented out, the compiler can remove these variables since they aren’t used and the rest of the variables just fit in stack.

To test this theory, try setting the environment variable “NV_ACC_CUDA_STACKSIZE” to a larger value like 32MB. The device stack size is quite small and has a hard limit though the limit will depend on the device and driver. Though if you go too big, you’ll get an error (too many resources) at the start of the program.

Also, you can try switching to runtime initialization for all these variables. For example:

        U iRho, iZeta_FC_1, iZeta_FC_2; 
        U iTheta_FC_1_p1, iTheta_FC_1_p2, iTheta_FC_2_p;
        iRho = NRho_ - 1;
        iZeta_FC_1 = iZeta_FC_2 = 0; 
        iTheta_FC_1_p1 = iTheta_FC_1_p2 = iTheta_FC_2_p = 0;

If it still fails then I’ll likely need a reproducing example so I can investigate.

-Mat

1 Like

Thank you very much for your response, and sorry for being late in my reply. As you rightly pointed out in your answer

“if you didn’t properly do a deep copy on one of the structs (assuming you’re not using managed or unified memory), then that could cause a host address to be passed in.”

This was exactly my mistake! The function is a member of the fluxsurfaces class, which also contains tensors like X3D. In my case, I tried to move all these tensors to the device without first copying the class object itself., i.e., using

#pragma acc enter data copyin(this)

Thanks for your help

1 Like