OpenACC error: call to cuMemFreeHost returned error 700

Hi, I’m using the PGI C compiler version 16.10 to parallelize my C code.
But an error occurs when I run the executable: “call to cuMemFreeHost returned error 700: Illegal address during kernel execution”.
I have no idea how it happened; please help me resolve the problem.
My code is shown below.

typedef struct {
    double position;
    double direction;
    double weight;
    int cell;
    int group;
    int alive;
} Particle;

int size = 100000;             // number of particles to be simulated
int tot = (int) (1.3 * size);  // maximum number of next-generation particles
int capacity = 0;              // actual number of next-generation particles

/* particles to be simulated */
Particle *par = (Particle *) malloc(size * sizeof(Particle));

/* prepared memory for next generation particles */
Particle *next = (Particle *) malloc(tot * sizeof(Particle));

/* initialization */
for (int i = 0; i < size; i++){
    par[i].position = rand1() * 100.0;  // rand1() returns a random number in [0.0, 1.0]
    par[i].direction = rand2();         // rand2() returns a random number in [-1.0, 1.0]
    par[i].weight = 1.0;
    par[i].cell = 2;
    par[i].group = rand1() > 0.5 ? 1 : 2;
    par[i].alive = 1;
}

/* some parameters used in simulation */
double keff = 1.0;
double tracklength, collision, absorption;

/* start simulating */
int generation;
for (generation = 1; generation <= 100; generation++){
    int CellID, MatID, GroupID;
    int k;   // k-th particle to be simulated
    #pragma acc parallel copy(capacity) copyin(par[0:size],size, keff) copyout(next[0:tot])
    #pragma acc loop reduction(+:tracklength, collision, absorption)
    for (k = 0; k < size; k++){
        while (par[k].alive){
            /* do some calculation with par[k] */
            /* secondary particle produced under certain circumstances */
            if (condition){
                next[capacity].position = par[k].position;
                next[capacity].direction = rand2();
                next[capacity].weight = 1.0;
                next[capacity].cell = par[k].cell;
                next[capacity].group = rand1() < 0.9 ? 1 : 2;
                next[capacity].alive = 1;
                capacity++;
            }
        }
    }
    /* after simulation of current generation, update the parameters */
    keff = ........          // one formula to update keff

    size = capacity;
    capacity = 0;
    tot = (int) (1.3 * size);
    free(par);
    par = next;
    next = (Particle *) malloc(tot * sizeof(Particle));
}

free(par);
free(next);

/* do some cleanup */

Hi Rubic,

An “illegal address” error on the GPU occurs when the code dereferences a bad memory address. It often occurs when the program accesses memory beyond the end of an array or when it accesses a host address on the device. Less often, it can occur when the program runs out of heap space on the device. Without a reproducing example I can’t say for sure why this is occurring in your case.

What I do see, however, is a race condition in your code. “capacity” is shared across all threads. Hence when you access “next[capacity].position” there’s a good chance “capacity” will have been incremented before “next[capacity].direction” is assigned. To fix, you’ll want to use an atomic capture of “capacity”’s value and then use the captured value when indexing “next”. Something like:

    int k;   // k-th particle to be simulated
    #pragma acc parallel copy(capacity) copyin(par[0:size], size, keff) copyout(next[0:tot])
    #pragma acc loop reduction(+:tracklength, collision, absorption)
    for (k = 0; k < size; k++){
        int mycap;
        while (par[k].alive){
            /* do some calculation with par[k] */
            /* secondary particle produced under certain circumstances */
            if (condition){
                #pragma acc atomic capture
                mycap = capacity++;
                next[mycap].position = par[k].position;
                next[mycap].direction = rand2();
                next[mycap].weight = 1.0;
                next[mycap].cell = par[k].cell;
                next[mycap].group = rand1() < 0.9 ? 1 : 2;
                next[mycap].alive = 1;
            }
        }
    }

It’s possible that this is causing your runtime error if “capacity” goes beyond the end of the “next” array. If you are still having problems, please post a reproducing example or send one to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me.

Note that scalars are firstprivate by default, so there’s no need to copyin “size” (and most likely “keff”, but you don’t show its use so I can’t tell).

I’m curious about your “rand1” and “rand2” functions. RNGs typically have a globally shared “seed” variable and thus can’t be called from within parallel code. Are your rand functions written so they can be used in parallel device code, for example by keeping an array of “seeds”, one per thread, or by calling cuRAND?

Also, be sure to look over the compiler feedback messages (-Minfo=accel) for any errors or warnings. Feel free to post the output if you have difficulty interpreting it.

-Mat

Hi mkcolg, thanks for your reply.

The error was indeed caused by “capacity”, and everything seems fine after adding #pragma acc atomic capture to my code.

But there are still some warnings during compilation.

My whole code is shown below.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include "timer.h"       // for time record

#define RN_MULT ((uint64_t) 3512401965023503517ULL)
#define RN_ADD ((uint64_t) 0)
#define RN_MOD ((uint64_t) 1 << 63)
#define RN_MASK (((uint64_t)1 << 63) - 1)

#define getrand(seed) ((double) (seed) / (double) (RN_MOD))
#define update_seed(seed) ((((RN_MULT * (seed)) & RN_MASK) + RN_ADD) & RN_MASK)

typedef struct {
    double position;
    double direction;
    double wgt;
    int cell;
    int group;
    uint64_t seed;
    int alive;
} Particle;

typedef struct {
    double SigmaT[3];
    double SigmaA[3];
    double NuSigmaF[3];
    double SigmaS[5];
    int fissionable; /* 0: no   1:yes */
} Material;

typedef struct {
    int mat;
    double left;
    double right;
    /* bc=0: normal, bc=1: vacuum, bc=2: reflective */
    int leftbc;
    int rightbc;
} Cell;

int main(void) {
    StartTimer();

    const int Neutron = 100000;
    const int Batch = 100;

    Cell InpCell[4];

    Material InpMat[3];

    Cell cell0 = {0, 0., 0., 0, 0};
    Cell cell1 = {2, 0., 30., 2, 0};
    Cell cell2 = {1, 30., 130., 0, 0};
    Cell cell3 = {2, 130., 160., 0, 2};
    InpCell[0] = cell0;
    InpCell[1] = cell1;
    InpCell[2] = cell2;
    InpCell[3] = cell3;

    Material mat0 = {{0., 0., 0.},           // SigmaT
                     {0., 0., 0.},           // SigmaA
                     {0., 0., 0.},           // NuSigmaF
                     {0., 0., 0., 0., 0.},   // SigmaS
                     0};                     // fissionable
    Material mat1 = {{0., 2.412957E-01, 9.134388E-01},
                     {0., 9.078832E-03, 7.483482E-02},
                     {0., 5.688290E-03, 1.047830E-01},
                     {0., 2.165434E-01, 1.567340E-02, 0., 8.386040E-01},
                     1};
    Material mat2 = {{0., 2.604754E-01, 1.239454E+00},
                     {0., 2.043440E-03, 2.747390E-02},
                     {0., 0., 0.},
                     {0., 2.296604E-01, 2.877160E-02, 0., 1.211980E+00},
                     0};
    InpMat[0] = mat0;
    InpMat[1] = mat1;
    InpMat[2] = mat2;

    int capacity = 0;
    int size = Neutron;
    int tot = (int) (1.3 * size);

    Particle *par = (Particle *) malloc(size * sizeof(Particle));
    Particle *next = (Particle *) malloc(tot * sizeof(Particle));

    /* allocate initial random number seed */
    par[0].seed = update_seed((uint64_t) 1);
    for (int a = 1; a < Neutron; a++){
        par[a].seed = update_seed(par[a - 1].seed);
        for (int b = 0; b < 1000; b++){
            par[a].seed = update_seed(par[a].seed);
        }
    }

    uint64_t last_seed = par[Neutron - 1].seed;

    par[0].position = 80.0;
    par[0].direction = 2.0 * getrand(par[0].seed) - 1.0;
    par[0].seed = update_seed(par[0].seed);
    par[0].wgt = 1.0;
    par[0].cell = 2;
    par[0].group = getrand(par[0].seed) > 0.5 ? 1 : 2;
    par[0].seed = update_seed(par[0].seed);
    par[0].alive = 1;

    for (int i = 1; i < Neutron; i++){
        par[i].position = 80.0;
        par[i].direction = 2.0 * getrand(par[i].seed) - 1.0;
        par[i].seed = update_seed(par[i].seed);
        par[i].wgt = 1.0;
        par[i].cell = 2;
        par[i].group = 2 * getrand(par[i].seed) > 1.0 ? 1 : 2;
        par[i].seed = update_seed(par[i].seed);
        par[i].alive = 1;
    }

    double global_tracklength;
    double global_collision;
    double global_absorption;
    double d_collision;
    double d_boundary;
    double distance;
    double k_tracklength;
    double k_collision;
    double k_absorption;
    double keff = 1.0;

    printf("Gen     k-tra       k-abs       k-col       keff        size\n");

    int gen;
    for (gen = 1; gen <= Batch; gen++){
        global_tracklength = 0.;
        global_collision = 0.;
        global_absorption = 0.;
        int CellID, GroupID, MatID;

        int k;
        #pragma acc parallel copy(capacity) copyin(par[0:size], InpCell[0:4], InpMat[0:3]) copyout(next[0:tot])
        #pragma acc loop reduction(+:global_collision, global_absorption, global_tracklength)
        for (k = 0; k < size; k++){
            int mycap;
            while (par[k].alive){
                CellID = par[k].cell;
                GroupID = par[k].group;
                MatID = InpCell[CellID].mat;
                double SigmaS = InpMat[MatID].SigmaS[2 * GroupID - 1] + InpMat[MatID].SigmaS[2 * GroupID];

                /* find distance to the nearest boundary */
                d_boundary = ((par[k].direction > 0 ? InpCell[CellID].right : InpCell[CellID].left) -
                              par[k].position) / par[k].direction;

                /* sample a distance to collision */
                d_collision = -log(getrand(par[k].seed)) / InpMat[MatID].SigmaT[GroupID];
                par[k].seed = update_seed(par[k].seed);

                /* select the smaller between the two distances*/
                distance = d_boundary < d_collision ? d_boundary : d_collision;

                /* adjust the position of the particle */
                par[k].position += distance * par[k].direction;

                /* score track-length estimate of keff */
                global_tracklength += par[k].wgt * distance * InpMat[MatID].NuSigmaF[GroupID];

                /* particle crosses surface */
                if (d_boundary < d_collision){
                    if (par[k].direction > 0.0){
                        if (InpCell[CellID].rightbc == 1){
                            par[k].alive = 0;
                        }
                        else if (InpCell[CellID].rightbc == 2){
                            par[k].direction = -par[k].direction;
                        }
                        else{
                            par[k].cell += 1;
                        }
                    }
                    else{
                        if (InpCell[CellID].leftbc == 1){
                            par[k].alive = 0;
                        }
                        else if (InpCell[CellID].leftbc == 2){
                            par[k].direction = -par[k].direction;
                        }
                        else{
                            par[k].cell -= 1;
                        }
                    }
                }

                /* or the particle has a collision */
                else{
                    /* score collision estimate of keff */
                    global_collision += par[k].wgt * InpMat[MatID].NuSigmaF[GroupID] / InpMat[MatID].SigmaT[GroupID];

                    /* sample reaction for the material the particle is in */
                    if (InpMat[MatID].fissionable){
                        /* determine expected number of neutrons produced */
                        double nu_t = par[k].wgt / keff * InpMat[MatID].NuSigmaF[GroupID] / InpMat[MatID].SigmaT[GroupID];
                        int nu = getrand(par[k].seed) > nu_t - (int) nu_t ? (int) nu_t : (int) nu_t + 1;
                        par[k].seed = update_seed(par[k].seed);
                        int m;
                        for (m = 0; m < nu; m++){
                            #pragma acc atomic capture
                            mycap = capacity++;
                            next[mycap].position = par[k].position;
                            next[mycap].wgt = 1.0;
                            next[mycap].group = getrand(par[k].seed) < 0.9 ? 1 : 2;
                            par[k].seed = update_seed(par[k].seed);
                            next[mycap].direction = getrand(par[k].seed) * 2.0 - 1.0;
                            par[k].seed = update_seed(par[k].seed);
                            next[mycap].cell = par[k].cell;
                            next[mycap].alive = 1;
                        }
                    }

                    /* determine weight absorbed in survival biasing */
                    double absorb_wgt = par[k].wgt * InpMat[MatID].SigmaA[GroupID] / InpMat[MatID].SigmaT[GroupID];
                    par[k].wgt -= absorb_wgt;

                    /* score implicit absorption estimate of keff */
                    global_absorption +=
                            absorb_wgt * InpMat[MatID].NuSigmaF[GroupID] / InpMat[MatID].SigmaA[GroupID];

                    /* sample a scattering reaction and determine the secondary energy of the exiting neutron */
                    double inf = 0, sup = 0;
                    double roulette = getrand(par[k].seed);
                    par[k].seed = update_seed(par[k].seed);
                    int n;
                    for (n = 1; n <= 2; n++){
                        inf = sup;
                        sup = inf + InpMat[MatID].SigmaS[(GroupID - 1) * 2 + n];
                        if (roulette > inf / SigmaS && roulette < sup / SigmaS){
                            par[k].group = n;
                            par[k].direction = getrand(par[k].seed) * 2.0 - 1.0;
                            par[k].seed = update_seed(par[k].seed);
                        }
                    }

                    /* play russian roulette */
                    if (par[k].wgt < 0.25){
                        double tmp = getrand(par[k].seed);
                        par[k].seed = update_seed(par[k].seed);
                        if (tmp < par[k].wgt / 1.0){
                            par[k].wgt = 1.0;
                        }
                        else{
                            par[k].alive = 0;
                        }
                    }
                }
            }
        }

        k_tracklength = global_tracklength / (double) size;
        k_collision = global_collision / (double) size;
        k_absorption = global_absorption / (double) size;
        keff = (k_tracklength + k_collision + k_absorption) / 3;

        next[0].seed = update_seed(last_seed);
        for (int a = 0; a < 1000; a++){
            next[0].seed = update_seed(next[0].seed);
        }
        for (int a = 1; a < capacity; a++){
            next[a].seed = update_seed(next[a - 1].seed);
            for (int b = 0; b < 1000; b++){
                next[a].seed = update_seed(next[a].seed);
            }
        }
        last_seed = next[capacity - 1].seed;

        size = capacity;
        capacity = 0;
        tot = (int) (1.3 * size);
        printf(" %-3d   %0.5f     %0.5f     %0.5f     %0.5f     %d\n", gen, k_tracklength, k_absorption, k_collision,
               keff, size);
        free(par);
        par = next;
        next = (Particle *) malloc(tot * sizeof(Particle));
    }

    free(par);
    free(next);

    double total_time = GetTimer();
    printf("total time: %0.5fs\n", total_time / 1000.0);
    return 0;
}

I’m using an NVIDIA GTX 680 and compiling the code with pgcc -acc -Minfo=accel -ta=tesla:cc30,time -O0 main.c -o test

Here are the feedback messages

main:
    109, Generating copyin(InpCell[:],InpMat[:])
         Generating copy(capacity)
         Generating copyout(next[:tot])
         Generating copyin(par[:size])
         Accelerator kernel generated
         Generating Tesla code
        111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(+:global_absorption,global_collision,global_tracklength)
        113, #pragma acc loop seq
        174, #pragma acc loop seq
        201, #pragma acc loop seq
    113, Loop carried dependence of par->alive,par->cell,par->direction,par->group,par->position,par->seed,par->wgt prevents parallelization
         Complex loop carried dependence of InpMat.NuSigmaF,InpMat.SigmaA,InpMat.SigmaS,InpMat.SigmaT,InpMat.fissionable,next->alive,next->cell,next->direction,next->group,next->position,next->wgt,par->alive,par->cell,par->direction,par->group,par->position,par->seed,par->wgt prevents parallelization
         Loop carried reuse of next->alive,next->cell,next->direction,next->group,next->position,next->wgt prevents parallelization
         Loop carried backward dependence of par->alive,par->cell,par->direction,par->group,par->position,par->seed,par->wgt prevents vectorization
         Loop carried dependence of par->direction,par->group,par->position,par->seed prevents vectorization
    174, Loop carried dependence of par->seed prevents parallelization
         Complex loop carried dependence of next->alive,next->cell,next->direction,next->group,next->position,next->wgt,par->cell,par->position,par->seed prevents parallelization
         Loop carried reuse of next->position prevents parallelization
         Loop carried backward dependence of par->seed prevents vectorization
    201, Loop carried dependence of par->group prevents parallelization
         Complex loop carried dependence of InpMat.SigmaS,par->direction,par->group,par->seed prevents parallelization
         Loop carried backward dependence of par->group prevents vectorization
         Loop carried scalar dependence for sup at line 202

The good news is that the executable runs normally; the bad news is that the results are quite different from the CPU version.

So I’d like to know whether there is any connection between the compiler feedback messages and the wrong results, or whether it’s due to other reasons such as the RNG functions.

Hi Rubic,

It looks to me like the loop dependency messages are correct, in that they correspond to the inner while and for loops, which do look to have dependencies. Since the outer loop is getting off-loaded, you can ignore these messages, or modify your code to try to remove the dependencies.

As for the wrong answers, I’m concerned about the correctness of your code. While it seems to “work” with GNU, the code seg faults when compiled with either Intel’s C compiler (icc) or PGI’s (pgcc), due to a divide by zero (“size” gets set to zero). This is without any optimization and without OpenACC enabled.

% icc -g test.c -std=c99 -w
% a.out
Gen     k-tra       k-abs       k-col       keff        size
 1     -436.77467     0.97678     0.97678     -144.94037     990
 2     -462.28514     0.88082     0.88082     -153.50784     0
 3     -nan     -nan     -nan     -nan     0
Segmentation fault
% pgcc -g test.c
% a.out
Gen     k-tra       k-abs       k-col       keff        size
 1     -81112131.20105     0.88758     0.88758     -27037376.47530     910
 2     -60102857.20889     0.81698     0.81698     -20034285.19165     0
 3     -nan     -nan     -nan     -nan     0
Segmentation fault

Are you able to get the code to run correctly on the CPU when not using GNU?

-Mat

Hi mkcolg,

I have tested the code with gcc version 6.3.0 on Linux Mint 18.1, clang version 8.1.0 on a MacBook Pro, and icc version 15.0.1. All of them give correct and exactly identical results.

Maybe the code I posted here is incomplete. Please have a look at my repo on GitHub https://github.com/x1314aq/AccTest for more details.

Thanks Rubic. I was able to get the code to run using the github source.

However, the error I see is a seg fault, not wrong answers. What’s happening is that “capacity” is getting a value of 138261, which is beyond the end of the “next” array.

It doesn’t look like a problem with the atomic capture, but rather an optimization issue with the back-end device compiler. If I set the opt level to O1, the program works as expected. At O2, I get the seg fault.

I’ve added a problem report, TPR#24131, and sent it on to our compiler engineers for investigation.

Can you try lowering the optimization level to see if it works around the problem for you as well? Note that I added some code to print out capacity’s value.

% pgcc -fast -acc main.c -ta=tesla:cc60,O2 -DDEBUG
 % a.out
 Gen k-tra k-abs k-col keff size
 1 Before Size: 100000 Capacity: 0
 1 After Size: 100000 Capacity: 138261
 Segmentation fault

 % pgcc -fast -acc main.c -ta=tesla:cc60,O1
 % a.out
 Gen k-tra k-abs k-col keff size
 1 1.25793 1.25779 1.25779 1.25784 125694
 2 1.14003 1.14058 1.14058 1.14040 113720
 3 1.14054 1.14078 1.14078 1.14070 113779
 4 1.13045 1.13129 1.13129 1.13101 113570
 5 1.12350 1.12490 1.12490 1.12443 113398
 ... cut ...
 98 1.10328 1.10315 1.10315 1.10320 109607
 99 1.10570 1.10451 1.10451 1.10490 109838
 100 1.09884 1.09867 1.09867 1.09873 109595
 total time: 17.76495s

-Mat

Hi mkcolg, the seg fault is due to an out-of-bounds access of the array “next”.

As you pointed out, “capacity” is 138261 after the first generation, while the length of “next” is set to 130000 (see line 52 and line 245 in the source code).

So the solution is quite simple here: replace the factor 1.3 with 1.5 or even larger.

But I have to say, the coefficient of 1.3 comes from the analytical answer, with a strict proof, which means something must be going wrong during execution and resulting in the seg fault.

One more interesting thing: -O2 and -O0 give completely different results after modifying the coefficient. Obviously the results given by opt level O2 are wrong.

Hi mkcolg,

I am curious whether you have noticed that running the OpenACC version of the code twice gives completely different results.

Precisely speaking, the differences start from “Gen 2”, and from that generation on, each run’s results differ from the previous run’s.

I wonder what the reason for this phenomenon is and how to avoid it.

I am curious whether you have noticed that running the OpenACC version of the code twice gives completely different results.

That shouldn’t normally happen unless you’ve got a race condition or are reading uninitialized memory. Though your code does include some random number generation, which could be the cause. Is your RNG able to produce the same sequence of random numbers for a given set of seeds?

-Mat

Is your RNG able to produce the same sequence of random numbers for a given set of seeds?

Yes, the CPU version gives exactly the same results at each execution.

Hi Rubic,

I think this is a problem with your code.

When creating the “next” array, the order in which particles are added is non-deterministic and depends on the order in which the threads encounter the atomic capture of “capacity”. After the end of the parallel loop, the initial “next” seed is reset based on “last_seed”, and each subsequent seed is updated from the previous one. Since the ordering of “next” changes with thread execution order, so does each particle’s seed value, and hence the resulting values.

Hope this helps,
Mat