Randomly got error "FATAL ERROR: variable in data clause is partially present on the device: name=(unknown)"

Hi,

I am porting a code to GPU using OpenACC, but I randomly get the error

“FATAL ERROR: variable in data clause is partially present on the device: name=(unknown)”

The implementation looks like this:

// d is a structure including “double *X”
d->X = tmalloc(double,nx);
#pragma acc enter data copyin(d[0])
#pragma acc enter data create(d->X[0:nx])

Sometimes it crashes with this error. Sometimes it works after recompiling or rerunning. Note that many other variables have already been created or copied in on the GPU using OpenACC directives. Could the error be related to those other variables being created or freed on the GPU? Is there any debugging tool available for this kind of issue?

Thanks. /Jing

Hi Jing,

It’s probably the “copyin(d[0])”, this should be a range with one element, i.e. “d[:1]”.

Example:

% cat test.c
#include <stdlib.h>
#include <stdio.h>

typedef struct {
   double * X;
   int size;
} foo;

int main () {

   foo *d;
   d = (foo*) malloc(sizeof(foo)*1);
   d->size = 1024;
   d->X = (double*) malloc(sizeof(double)*d->size);
#ifdef FAILS
   #pragma acc enter data copyin(d[0])
#else
   #pragma acc enter data copyin(d[:1])
#endif
   #pragma acc enter data create(d->X[:d->size])

   #pragma acc parallel loop present(d)
   for (int i=0; i < d->size; ++i) {
        d->X[i] = ((double)i)/((double)d->size);
   }
   #pragma acc update self(d->X[:d->size])
   printf("%f \n",d->X[4]);

   #pragma acc exit data delete(d->X,d)
   free(d->X);
   free(d);
   exit(0);
}

% nvc test.c -acc -DFAILS; a.out
hostptr=0x1d78260,stride=1,size=1,extent=-1,eltsize=16,name=d,flags=0x200=present,async=-1,threadid=1
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1
host:0x1d78260 device:(nil) size:0 presentcount:0+1 line:14 name:d
host:0x1d78280 device:0x1553e5afa000 size:8192 presentcount:0+1 line:14 name:(null)
allocated block device:0x1553e5afa000 size:8192 thread:1
FATAL ERROR: data in PRESENT clause was not found on device 1: name=d host:0x1d78260
 file:/local/home/colgrove/test.c main line:14

% nvc test.c -acc ; a.out
0.003906

-Mat

Hi Mat,

I have now changed it to d[:1], and the compiler output shows it is correct.

  ...
   921, Generating enter data copyin(d[:1])
   922, Generating enter data create(d->Xp[:np])
   923, Generating enter data create(d->X[:nx])
  ...

However, I still randomly get the error. The structure includes a few pointers, and only d->X does not work. The other strange thing about the error is “name=(unknown)”, whereas the name d is captured in your example.

"FATAL ERROR: variable in data clause is partially present on the device: name=(unknown)"

Thanks. /Jing

I’m not sure what the “unknown” variable is, but am guessing that there might be an anonymous pointer someplace getting passed into a kernel.

Let’s try tracking down where the error is occurring by setting “NVCOMPILER_ACC_DEBUG=1” in your environment and looking for where the present check is occurring. There’ll be a lot of output, so pipe it to a log file and then post the tail of the log. Also, once you know where it’s occurring, if you can post that section of code, that might be helpful.

Hi Mat,

Unfortunately the code has to be run on at least 2 GPUs to reproduce the error (d->X is like the data shared with GPUs, and it is empty when using only a single GPU). So, with the environment variable “NVCOMPILER_ACC_DEBUG=1”, the output from the processes was interleaved, and it is difficult to identify which variables were present before the crash point.

Is it possible to set the variable for only one GPU, or could I send you the whole log file to have a look? If you want, you can also add two lines to the source to reproduce these random errors.

Thanks. /Jing

Hi Jing,

Are you using MPI to run across multiple GPUs? If so, write a wrapper script which can then pipe the logs to separate files. If not MPI, then no, the debug output would be mixed, but it does show the device number for each call.

“d->X is like the data shared with GPUs”

Not sure what you mean by this. Are you creating a shared IPC handle so the memory is accessible to both devices? Or are you creating multiple copies, one for each device? If you are creating the memory on one device and then trying to access it on another, this could lead to the partially present error.

I’m happy to take a look at the log (even if it’s mixed) and you can direct message it to me or I can send you an email (I can access your email info). Though if you have a reproducing example of the code so I can better understand what the code is doing, that might be better.

-Mat