PGI Accelerator async clause results in error

Hi,
I use PGI Compiler 12.3 and try to use the async clause with PGI Accelerator directives (parallel regions and update) and different async handles (values 1 or 2). In some cases, my programs runs without any error. However, in a couple of times I get:

call to cuModuleGetGlobal returned error 500: Not found                                                                                                                                
CUDA driver version: 4020                                                                                                                                                              
Segmentation fault (core dumped)

Any ideas?
Thanks, Sandra

Hi Sandra,

I have not seen this error before but my best guess is that you have a “reflected” clause in your code. For some reason when it tries to look up a variable, it doesn’t exist yet. Maybe adding a “wait” directive would help.

What you can do is set the environment variable “NVDEBUG=1” to have the runtime print out every device call made. You can then seen when cuModuleGetGlobal is called and with which variables.

Granted, async is new so it could be a PGI issue as well. If you want, please send an example to PGI Customer Service and we can try an track down the error.

  • Mat

Hi Mat,
no, I don’t have a reflected clause in my code, but all my arrays are “restrict”. The structure of my code is:

int queue1 = 1
int queue2 = 2
#pragma acc data region vars
{
// more stuff
#pragma acc update device x

#pragma acc for region async(queue2)
#pragma acc for parallel
// loop with REDUCTION to "funval2"

#pragma acc for region copyin(bhat) async(queue1)
#pragma acc for parallel
// loop with REDUCTION to "funval1"

#pragma acc wait(queue1)
#pragma acc wait(queue2)
funval = funval1 + funval2

//more stuff
}

I am not sure whether I can really reduce it to send it to you. By the way, I call the code above several times in a row and I also have a loop within the data region that surrounds both kernel calls and the wait directive.

I tried to set NVDEBUG=1. But the output is overwhelming. If I get the seg fault, I will get output like the following:

[..]
                    __pgi_cu_init( file=<path>/pgiacc_eval_f.c, function=funval, line=93, startline=37, endline=177 )
__pgi_cu_module3( lineno=93 )
__pgi_cu_module3 module already loaded at 0x800b80
 0x0007d000 0x00000000 0x00200000 0x00000002 0x0007d000 0x00000000 0x00000000 0x40590000__pgi_cu_module_function( name=0x440d41=funval_96_gpu, lineno=96, argname=0x0=, argsize=72, SWcachesize=4459876 )

                     0xcccccccd 0x3ff4ccccFunction handle is 0x7f5f80
 0x1fa00000 0x00000002
__pgi_cu_module_function( name=0x440d50=funval_155_gpu_red, lineno=96, argname=0x0=, argsize=0, SWcachesize=0 )
dequeue operation 14:launch queue:2 position:109 dataoffset:192 datasize:208
Function handle is 0x817a60
__pgi_cu_launch_a(func=0x7fcd40, grid=1x1x1, block=256x1x1, lineno=84)
__pgi_cu_launch_a(func=0x7fcd40, params=0x8091f0, bytes=52, sharedbytes=0)
enqueuing operation 12:alloc flags:0x2 async:1 finfo[0x7fffa864e510:32] vinfo[(nil):0] frec:(nil)
enqueue operation 12:alloc queue[1] position[58] data[336]
First arguments are:
                    CPU waiting on ACC async queue:1
     512000          0    2097152          2     512000          0          0 1079574528
                     -858993459 1073007820  530579456          2       2000

                     0x0007d000 0x00000000 0x00200000 0x00000002 0x0007d000 0x00000000 0x00000000 0x40590000
                     0xcccccccd 0x3ff4cccc 0x1fa00000 0x00000002 0x000007d0
dequeue operation 12:alloc queue:1 position:58 dataoffset:336 datasize:32
__pgi_cu_alloc(size=1024,lineno=93,name=bhat)
__pgi_cu_alloc(1024) returns 0x21fb00000
dequeue operation 11:downloadC queue:2 position:110 dataoffset:400 datasize:48
__pgi_cu_downloadc( "b1", size=8, offset=0, lineno=80 )
call to cuModuleGetGlobal returned error 500: Not found
CUDA driver version: 4020
CPU done waiting on ACC async queue:1
enqueuing operation 6:upload flags:0x2 async:1 finfo[0x7fffa864e4e0:48] vinfo[0x7fffa864e600:48] frec:(nil)
enqueue operation 6:upload queue[1] position[59] data[368]
enqueuing operation 3:datadone flags:0x2 async:1 finfo[(nil):0] vinfo[(nil):0] frec:(nil)
enqueue operation 3:datadone queue[1] position[60] data[464]
__pgi_cu_alloc(size=8,lineno=96,name=)
__pgi_cu_alloc(8) returns 0x21fb00400
enqueuing operation 8:uploadC flags:0x2 async:1 finfo[0x7fffa864e4f0:40] vinfo[(nil):0] frec:(nil)
enqueue operation 8:uploadC queue[1] position[61] data[464]
enqueuing operation 14:launch flags:0x2 async:1 finfo[0x7fffa864e450:136] vinfo[0x7fffa864e660:72] frec:(nil)
enqueue operation 14:launch queue[1] position[62] data[512]
enqueuing operation 14:launch flags:0x2 async:1 finfo[0x7fffa864e450:136] vinfo[0x7fffa864e660:76] frec:(nil)
enqueue operation 14:launch queue[1] position[63] data[736]
enqueuing operation 11:downloadC flags:0x2 async:1 finfo[0x7fffa864e4f0:40] vinfo[(nil):0] frec:(nil)
enqueue operation 11:downloadC queue[1] position[64] data[960]
CPU waiting on ACC async queue:1
make: *** [run1] Segmentation fault (core dumped)

Could you help me in reading and interpreting that?
I see that the second kernel is called with “__pgi_cu_alloc”. But I don’t understand “__pgi_cu_downloadc(“b1”,size=8,offset=0,lineno=80)”. Line 80 correspond to the for-loop in my first kernel, but I don’t have a variable called “b1”. Might that be the a temporary variable because of the reduction?

Thanks. Sandra

Thanks Sandra, the NVDEBUG output was very helpful.

We think know what’s wrong. “b1” is the struct to hold the argument list to your kernels. “cuModuleGetGlobal” is called to get the address of “b1” so the kernel can access it. In asynchronous mode there are multiple modules so we think the wrong module handle is being used.

I’ve created a problem report (TPR#18723) and sent on to engineering. If you can send us a reproducing example that would be great since this will allow us to confirm this is indeed the problem and allow us to verify a fix, but we understand if you can’t.

Best Regards,
Mat

FYI, the workaround would be to disable asynchronous via setting the environment variable “ACC_SYNCHRONOUS” to 1.

  • Mat

Dear Mat,
Thanks for your notice. Yes, wit synchronous execution everything works fine. I didn’t know this ACC_SYNCHRONOUS variable so far. What will happen if I set it to 0? Will then all kenrels execute asychronously (within the same stream) or the like?
I will try to send you part of my program next week.
Sandra

What will happen if I set it to 0? Will then all kenrels execute asychronously (within the same stream) or the like?

Yes, provided that you use the async clause. This environment variable is mainly there for debugging purposes so you can disable async without having to change your code.


On a side note, I’m on the SPEC HPG committee and we’re starting to look for OpenACC applications that could be used as benchmarks. Do you or anyone in your group have codes that might be good candidates?

  • Mat