Cannot run code using OpenACC - Illegal address during kernel execution

Hello guys, I need some help. However I try to use OpenACC directives in my code, I can compile it, but I can’t get it to work at all. My code plots Mandelbrot set using SFML graphics, and I am trying to do all of my calculations on gpu using OpenACC, put them in array of rgb colors and then plot the image using that array. I think I managed to construct the array so that each iteration of the loop that calculates rgb color writes that color in it’s own elements in array (for each pixel color there are 4 elements in imageColor array -> r,g,b,alpha). Code works as expected when I compile it using -ta=host. All my code is located at: https://github.com/uros97/mandelbrot. File with code is mandelbrot.cpp. When I compile it using -ta=tesla:managed it gives following compiler output :

mandelbrot(double, double):
42, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
updateImageSlice(double, double, double, int, int):
94, Generating copy(imageColors[:2400000],colors[:384]) [if not already present]
Generating Tesla code
95, #pragma acc loop gang /* blockIdx.x /
99, #pragma acc loop vector(128) /
threadIdx.x */
105, Generating implicit reduction(+:imag)
107, Generating implicit reduction(+:real)
46, Loop carried scalar dependence for …inline at line 46,52,47
53, Accelerator restriction: induction variable live-out from loop: …inline
54, Accelerator restriction: induction variable live-out from loop: …inline
99, Loop is parallelizable

I realize that output for scalar dependence and accelerator restrictions is because loop in mandelbrot function does not have independent iterations, so when I put #pragma acc loop seq for that part, compiler gives following output without those warnings for dependence and out of loop variable:

mandelbrot(double, double):
42, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
updateImageSlice(double, double, double, int, int):
94, Generating copy(imageColors[:2400000],colors[:384]) [if not already present]
Generating Tesla code
95, #pragma acc loop gang /* blockIdx.x /
99, #pragma acc loop vector(128) /
threadIdx.x */
105, Generating implicit reduction(+:imag)
107, Generating implicit reduction(+:real)
99, Loop is parallelizable

However, in both cases, when I run the compiled programe, I get the following error:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

From the research I did, I can assume that the problem has something to do with pointers pointing to host memory instead of gpu, however, I tried various different approaches to copying and deleting data to and from GPU, but every time I manage to compile, but not run the programe. Also I left some commented pragmas of what I tried to do, I tried many different copies with various loops, so some are left commented. Any help or advice is appreciated, and thank you in advance.

Hi FreonW,

Since I don’t have SFML installed, I can’t run your code so don’t know the exact cause of the illegal memory error. Though these are like seg faults on the CPU where the code is accessing memory it doesn’t own, such as accessing an array out of bounds, or dereferencing a host address.

Though in looking at the loops, they’re not actually parallelizable. You have dependencies on “real” and “imag” where their current value depends on value from the previous iterations of the loop. Since you force parallelization, the compiler does attempt to put these in an implicit reduction, but you can’t use intermediate reduction values in the loop itself. So I suspect bad values are being used causing the return “value” to be a bad index into the “colors” arrays, thus causing the illegal address. I’m not positive since I can’t run the code, but seems likely.

To fix, I’d compute the value of real and imag from the loop indices rather than incrementing them. Something like:

    #pragma acc parallel loop copy(colors[:(MAX+1)*3], imageColors[:IMAGE_HEIGHT * IMAGE_WIDTH *4])
    //#pragma acc kernels
    {
    for (int x = 0; x < IMAGE_WIDTH; x++) {
        double real = realstart + (x*zoom);

        #pragma acc loop
        for (int y = minY; y < maxY; y++) {
            double imag = imagstart + ((y-minY)*zoom);
            int value = mandelbrot(real, imag);
            imageColors[(y*IMAGE_WIDTH + x)*4 + 0] = colors[value*3];
            imageColors[(y*IMAGE_WIDTH + x)*4 + 1] = colors[value*3 + 1];
            imageColors[(y*IMAGE_WIDTH + x)*4 + 2] = colors[value*3 + 2];
            imageColors[(y*IMAGE_WIDTH + x)*4 + 3] = 255;
        }

    }
    }

Hopefully this fixes the issue.

-Mat

Hi Mat, thank you for your fast reply.

I did what you suggested and got rid of those dependencies, but I still get the same error. I don’t think there are other dependencies like that in those loops. There is such dependency in “mandelbrot” function that is called inside inner loop, but I set the sequential loop directive for that so I don’t think it’s causing problems. I tried again with different copies, and I tried to copy “realstart” and “imagstart” to gpu too, but that also didn’t work. I got rid of the code that has to do with SFML library, because it is only used for drawing a picture once all the calculations are done, so anyone should be able to compile and run the code now.
Here is the changed code:

#include <array>
#include <vector>
#include <stdint.h>

static constexpr int IMAGE_WIDTH = 1000;
static constexpr int IMAGE_HEIGHT = 600;

struct MyColor{
    uint8_t red;
    uint8_t green;
    uint8_t blue;
};


void updateImage(double zoom, double offsetX, double offsetY) ;
static const int MAX = 127; // maximum number of iterations for mandelbrot()
                           // don't increase MAX or the colouring will look strange
                           
uint8_t *__restrict imageColors = (uint8_t *) calloc(IMAGE_HEIGHT * IMAGE_WIDTH *4 , sizeof(uint8_t));
uint8_t *__restrict colors = (uint8_t *) calloc((MAX+1)*3 , sizeof(uint8_t));
    
int mandelbrot(double startReal, double startImag) ;
MyColor getColor(int iterations) ;
void updateImageSlice(double zoom, double offsetX, double offsetY, int minY, int maxY) ;


void initColor() {
    for (int i=0; i <= MAX; ++i) {
        MyColor color = getColor(i);
        
        colors[i*3] = color.red;
        colors[i*3 + 1] = color.green;
        colors[i*3 + 2] = color.blue;
        
    }
//#pragma acc enter data copyin(colors[:(MAX+1)*3])
//#pragma acc enter data copyin(imageColors[:IMAGE_HEIGHT * IMAGE_WIDTH *4])
}

int mandelbrot(double startReal, double startImag) {
    double zReal = startReal;
    double zImag = startImag;
    #pragma acc loop seq
    for (int counter = 0; counter < MAX; ++counter) {
        double r2 = zReal * zReal;
        double i2 = zImag * zImag;
        if (r2 + i2 > 4.0) {
            return counter;
        }
        zImag = 2.0 * zReal * zImag + startImag;
        zReal = r2 - i2 + startReal;
    }
    return MAX;
}

MyColor getColor(int iterations){
    int r, g, b;

    // colour gradient:      Red -> Blue -> Green -> Red -> Black
    // corresponding values:  0  ->  16  ->  32   -> 64  ->  127 (or -1)
    if (iterations < 16) {
        r = 16 * (16 - iterations);
        g = 0;
        b = 16 * iterations - 1;
    } else if (iterations < 32) {
        r = 0;
        g = 16 * (iterations - 16);
        b = 16 * (32 - iterations) - 1;
    } else if (iterations < 64) {
        r = 8 * (iterations - 32);
        g = 8 * (64 - iterations) - 1;
        b = 0;
    } else { // range is 64 - 127
        r = 255 - (iterations - 64) * 4;
        g = 0;
        b = 0;
    }
    MyColor retVal;
    retVal.red = r;
    retVal.green = g;
    retVal.blue = b;
    return retVal;
}

void updateImageSlice(double zoom, double offsetX, double offsetY, int minY, int maxY)
{
    double realstart = 0 * zoom - IMAGE_WIDTH / 2.0 * zoom + offsetX;
    double imagstart = minY * zoom - IMAGE_HEIGHT / 2.0 * zoom + offsetY;

    
    //#pragma acc kernels
    #pragma acc parallel loop copy(colors[:(MAX+1)*3], imageColors[:IMAGE_HEIGHT * IMAGE_WIDTH *4])
    {
    for (int x = 0; x < IMAGE_WIDTH; x++) {
        double real = realstart + (x*zoom);
        #pragma acc loop
        for (int y = minY; y < maxY; y++) {
            double imag = imagstart + ((y-minY)*zoom);
            int value = mandelbrot(real, imag);
            imageColors[(y*IMAGE_WIDTH + x)*4 + 0] = colors[value*3];
            imageColors[(y*IMAGE_WIDTH + x)*4 + 1] = colors[value*3 + 1];
            imageColors[(y*IMAGE_WIDTH + x)*4 + 2] = colors[value*3 + 2];
            imageColors[(y*IMAGE_WIDTH + x)*4 + 3] = 255;
        }
        
    }
    }
    
}

void updateImage(double zoom, double offsetX, double offsetY)
{
    const int STEP = IMAGE_HEIGHT; //do whole image in one step for simplicity
    for (int i = 0; i < IMAGE_HEIGHT; i += STEP) {
        updateImageSlice(zoom, offsetX, offsetY, i, std::min(i+STEP, IMAGE_HEIGHT));
    }
}

int main() {
    double offsetX = -0.7; // move around
    double offsetY = 0.0;
    double zoom = 0.004; // allow the user to zoom in and out
    initColor();
    updateImage(zoom, offsetX, offsetY);

    
    //#pragma acc exit data delete(imageColors[:IMAGE_HEIGHT * IMAGE_WIDTH *4])
    //#pragma acc exit data delete(colors[:(MAX+1)*3])
    
    free(imageColors);
    free(colors);

}

Looks like it may be a compiler issue in 19.10 since the code works as is with 20.1.

The problem appears to be the calloc calls located in the header of the code. Moving the calloc calls into the executable part of the code (in main) seems to work around the issue. Since it’s fixed already, I did not add a problem report.

% cat man.cpp
#include <array>
#include <vector>
#include <stdint.h>

static constexpr int IMAGE_WIDTH = 1000;
static constexpr int IMAGE_HEIGHT = 600;

struct MyColor{
    uint8_t red;
    uint8_t green;
    uint8_t blue;
};


void updateImage(double zoom, double offsetX, double offsetY) ;
static const int MAX = 127; // maximum number of iterations for mandelbrot()
                           // don't increase MAX or the colouring will look strange

//uint8_t *__restrict imageColors = (uint8_t *) calloc(IMAGE_HEIGHT * IMAGE_WIDTH *4 , sizeof(uint8_t));
//uint8_t *__restrict colors = (uint8_t *) calloc((MAX+1)*3 , sizeof(uint8_t));
uint8_t *__restrict imageColors;
uint8_t *__restrict colors;

int mandelbrot(double startReal, double startImag) ;
MyColor getColor(int iterations) ;
void updateImageSlice(double zoom, double offsetX, double offsetY, int minY, int maxY) ;


void initColor() {
    for (int i=0; i <= MAX; ++i) {
        MyColor color = getColor(i);

        colors[i*3] = color.red;
        colors[i*3 + 1] = color.green;
        colors[i*3 + 2] = color.blue;

    }
//#pragma acc enter data copyin(colors[:(MAX+1)*3])
//#pragma acc enter data copyin(imageColors[:IMAGE_HEIGHT * IMAGE_WIDTH *4])
}

int mandelbrot(double startReal, double startImag) {
    double zReal = startReal;
    double zImag = startImag;
//    #pragma acc loop seq
    for (int counter = 0; counter < MAX; ++counter) {
        double r2 = zReal * zReal;
        double i2 = zImag * zImag;
        if (r2 + i2 > 4.0) {
            return counter;
        }
        zImag = 2.0 * zReal * zImag + startImag;
        zReal = r2 - i2 + startReal;
    }
    return MAX;
}

MyColor getColor(int iterations){
    int r, g, b;

    // colour gradient:      Red -> Blue -> Green -> Red -> Black
    // corresponding values:  0  ->  16  ->  32   -> 64  ->  127 (or -1)
    if (iterations < 16) {
        r = 16 * (16 - iterations);
        g = 0;
        b = 16 * iterations - 1;
    } else if (iterations < 32) {
        r = 0;
        g = 16 * (iterations - 16);
        b = 16 * (32 - iterations) - 1;
    } else if (iterations < 64) {
        r = 8 * (iterations - 32);
        g = 8 * (64 - iterations) - 1;
        b = 0;
    } else { // range is 64 - 127
        r = 255 - (iterations - 64) * 4;
        g = 0;
        b = 0;
    }
    MyColor retVal;
    retVal.red = r;
    retVal.green = g;
    retVal.blue = b;
    return retVal;
}

void updateImageSlice(double zoom, double offsetX, double offsetY, int minY, int maxY)
{
    double realstart = 0 * zoom - IMAGE_WIDTH / 2.0 * zoom + offsetX;
    double imagstart = minY * zoom - IMAGE_HEIGHT / 2.0 * zoom + offsetY;


    //#pragma acc kernels
    #pragma acc parallel loop copy(colors[:(MAX+1)*3], imageColors[:IMAGE_HEIGHT * IMAGE_WIDTH *4])
    {
    for (int x = 0; x < IMAGE_WIDTH; x++) {
        double real = realstart + (x*zoom);
        #pragma acc loop
        for (int y = minY; y < maxY; y++) {
            double imag = imagstart + ((y-minY)*zoom);
            int value = mandelbrot(real, imag);
            long idx = (y*IMAGE_WIDTH + x)*4;
            imageColors[idx] = colors[value*3];
            imageColors[idx + 1] = colors[value*3 + 1];
            imageColors[idx + 2] = colors[value*3 + 2];
            imageColors[idx + 3] = 255;
        }

    }
    }

}

void updateImage(double zoom, double offsetX, double offsetY)
{
    const int STEP = IMAGE_HEIGHT; //do whole image in one step for simplicity
    for (int i = 0; i < IMAGE_HEIGHT; i += STEP) {
        updateImageSlice(zoom, offsetX, offsetY, i, std::min(i+STEP, IMAGE_HEIGHT));
    }
}

int main() {
    double offsetX = -0.7; // move around
    double offsetY = 0.0;
    double zoom = 0.004; // allow the user to zoom in and out
 imageColors = (uint8_t *) calloc(IMAGE_HEIGHT * IMAGE_WIDTH *4 , sizeof(uint8_t));
 colors = (uint8_t *) calloc((MAX+1)*3 , sizeof(uint8_t));
    initColor();
    updateImage(zoom, offsetX, offsetY);


    //#pragma acc exit data delete(imageColors[:IMAGE_HEIGHT * IMAGE_WIDTH *4])
    //#pragma acc exit data delete(colors[:(MAX+1)*3])

    free(imageColors);
    free(colors);

}

% pgc++ man.cpp -ta=tesla -Minfo=accel -V19.10 ; a.out
mandelbrot(double, double):
     42, Generating implicit acc routine seq
         Generating acc routine seq
         Generating Tesla code
updateImageSlice(double, double, double, int, int):
     95, Generating copy(imageColors[:2400000],colors[:384]) [if not already present]
         Generating Tesla code
         96, #pragma acc loop gang /* blockIdx.x */
         99, #pragma acc loop vector(128) /* threadIdx.x */
     99, Loop is parallelizable

Accelerator Kernel Timing data
man.cpp
  _Z16updateImageSlicedddii  NVIDIA  devicenum=0
    time(us): 550
    95: compute region reached 1 time
        95: kernel launched 1 time
            grid: [1000]  block: [128]
             device time(us): total=111 max=111 min=111 avg=111
            elapsed time(us): total=803 max=803 min=803 avg=803
    95: data region reached 2 times
        95: data copyin transfers: 2
             device time(us): total=221 max=205 min=16 avg=110
        110: data copyout transfers: 2
             device time(us): total=218 max=190 min=28 avg=109

Moving calloc to main solved the issue. Thank you very much for your help, couldn’t have done it without you!