Hi Mat,
no, I don’t have a reflected clause in my code, but all my arrays are “restrict”. The structure of my code is:
int queue1 = 1
int queue2 = 2
#pragma acc data region vars
{
// more stuff
#pragma acc update device x
#pragma acc for region async(queue2)
#pragma acc for parallel
// loop with REDUCTION to "funval2"
#pragma acc for region copyin(bhat) async(queue1)
#pragma acc for parallel
// loop with REDUCTION to "funval1"
#pragma acc wait(queue1)
#pragma acc wait(queue2)
funval = funval1 + funval2
//more stuff
}
I am not sure whether I can really reduce it to send it to you. By the way, I call the code above several times in a row and I also have a loop within the data region that surrounds both kernel calls and the wait directive.
I tried to set NVDEBUG=1. But the output is overwhelming. If I get the seg fault, I will get output like the following:
[..]
__pgi_cu_init( file=<path>/pgiacc_eval_f.c, function=funval, line=93, startline=37, endline=177 )
__pgi_cu_module3( lineno=93 )
__pgi_cu_module3 module already loaded at 0x800b80
0x0007d000 0x00000000 0x00200000 0x00000002 0x0007d000 0x00000000 0x00000000 0x40590000__pgi_cu_module_function( name=0x440d41=funval_96_gpu, lineno=96, argname=0x0=, argsize=72, SWcachesize=4459876 )
0xcccccccd 0x3ff4ccccFunction handle is 0x7f5f80
0x1fa00000 0x00000002
__pgi_cu_module_function( name=0x440d50=funval_155_gpu_red, lineno=96, argname=0x0=, argsize=0, SWcachesize=0 )
dequeue operation 14:launch queue:2 position:109 dataoffset:192 datasize:208
Function handle is 0x817a60
__pgi_cu_launch_a(func=0x7fcd40, grid=1x1x1, block=256x1x1, lineno=84)
__pgi_cu_launch_a(func=0x7fcd40, params=0x8091f0, bytes=52, sharedbytes=0)
enqueuing operation 12:alloc flags:0x2 async:1 finfo[0x7fffa864e510:32] vinfo[(nil):0] frec:(nil)
enqueue operation 12:alloc queue[1] position[58] data[336]
First arguments are:
CPU waiting on ACC async queue:1
512000 0 2097152 2 512000 0 0 1079574528
-858993459 1073007820 530579456 2 2000
0x0007d000 0x00000000 0x00200000 0x00000002 0x0007d000 0x00000000 0x00000000 0x40590000
0xcccccccd 0x3ff4cccc 0x1fa00000 0x00000002 0x000007d0
dequeue operation 12:alloc queue:1 position:58 dataoffset:336 datasize:32
__pgi_cu_alloc(size=1024,lineno=93,name=bhat)
__pgi_cu_alloc(1024) returns 0x21fb00000
dequeue operation 11:downloadC queue:2 position:110 dataoffset:400 datasize:48
__pgi_cu_downloadc( "b1", size=8, offset=0, lineno=80 )
call to cuModuleGetGlobal returned error 500: Not found
CUDA driver version: 4020
CPU done waiting on ACC async queue:1
enqueuing operation 6:upload flags:0x2 async:1 finfo[0x7fffa864e4e0:48] vinfo[0x7fffa864e600:48] frec:(nil)
enqueue operation 6:upload queue[1] position[59] data[368]
enqueuing operation 3:datadone flags:0x2 async:1 finfo[(nil):0] vinfo[(nil):0] frec:(nil)
enqueue operation 3:datadone queue[1] position[60] data[464]
__pgi_cu_alloc(size=8,lineno=96,name=)
__pgi_cu_alloc(8) returns 0x21fb00400
enqueuing operation 8:uploadC flags:0x2 async:1 finfo[0x7fffa864e4f0:40] vinfo[(nil):0] frec:(nil)
enqueue operation 8:uploadC queue[1] position[61] data[464]
enqueuing operation 14:launch flags:0x2 async:1 finfo[0x7fffa864e450:136] vinfo[0x7fffa864e660:72] frec:(nil)
enqueue operation 14:launch queue[1] position[62] data[512]
enqueuing operation 14:launch flags:0x2 async:1 finfo[0x7fffa864e450:136] vinfo[0x7fffa864e660:76] frec:(nil)
enqueue operation 14:launch queue[1] position[63] data[736]
enqueuing operation 11:downloadC flags:0x2 async:1 finfo[0x7fffa864e4f0:40] vinfo[(nil):0] frec:(nil)
enqueue operation 11:downloadC queue[1] position[64] data[960]
CPU waiting on ACC async queue:1
make: *** [run1] Segmentation fault (core dumped)
Could you help me in reading and interpreting that?
I see that the second kernel is called with “__pgi_cu_alloc”. But I don’t understand “__pgi_cu_downloadc(“b1”,size=8,offset=0,lineno=80)”. Line 80 correspond to the for-loop in my first kernel, but I don’t have a variable called “b1”. Might that be the a temporary variable because of the reduction?
Thanks. Sandra