Hi Mat,
Thanks very much for your prompt reply. I got a lot of output after setting the debugging flag you suggested, which is progress :-). I have pasted below a small excerpt from just before the crash point up to and including it. Perhaps you can spot something revealing. My eye is still untrained in these things, but it seems to me that the first few lines of the excerpt show that some variables (pft_tmp, neighbours) were successfully copied from host to device. Then the argument list for the offending kernel, calc_force_des_404_gpu, is set up. Then a launch is attempted, which fails with error 700 (the last line of the excerpt). If you need to look at more of the debugging output, let me know.
I know exactly which acc loop directive is causing the failure. In fact, I have now reduced the number of loop directives to just two. With just the first one, the code runs fine, but when I include the second loop directive it fails. This second loop does indeed start at line 404, as the debugging output indicates. It is a somewhat long and complicated for loop.
Next I can try your other suggestion of commenting out lines inside the for loop to narrow down which particular line, if any, is causing trouble. It would of course be great if you could look at the code too. I don’t think there are any weird library dependencies in the code. It comes with its own build script rather than using standard GNU make or CMake, but compiling should still be painless: you just run that one script.
Let me know the best way to share the code with you. For example, I can look into whether you can be given access to the Pittsburgh Supercomputing Center machine I am using. That way, nothing needs to be transferred, and any needed dependencies are already in place. Once you have access to the code, I can tell you how to compile it.
pgi_uacc_dataon( devid=1, threadid=1 )
pgi_uacc_dataon( devid=1, threadid=1 ) dindex=1
NO map for host:0x533e7f00
pgi_uacc_alloc(size=24,devid=1,threadid=1) returns 0xb01f03c00
map dev:0xb01f03c00 host:0x533e7f00 size:24 offset:0 data[dev:0xb01f03c00 host:0x533e7f00 size:24] (line:404 name:pft_tmp)
alloc done with devptr at 0xb01f03c00
MemHostRegister( 0x533e7f00, 24, 0 )
upload 0x533e7f00->0xb01f03c00 for 24 bytes stream (nil)
pgi_uacc_dataon( devid=1, threadid=1 )
pgi_uacc_dataon( devid=1, threadid=1 ) dindex=1
NO map for host:0x7f9ca8ee3860
pgi_uacc_alloc(size=1996488,devid=1,threadid=1) returns 0xb02500000
map dev:0xb02500000 host:0x7f9ca8ee3860 size:1996488 offset:0 data[dev:0xb02500000 host:0x7f9ca8ee3860 size:1996488] (line:404 name:neighbours) dims=18486x27
alloc done with devptr at 0xb02500000
MemHostRegister( 0x7f9ca8ee3860, 1996488, 0 )
pgi_uacc_datadone( async=-1, devid=1 )
pgi_uacc_cuda_wait(sync on stream=(nil))
pgi_uacc_cuda_wait done
pgi_uacc_cuda_uploads(hostsrc=0x7fff50fdac60,size=24,offset=0,lineno=404) returns 0xb01f00000
pgi_uacc_launch funcnum=0 argptr=0x7fff50fdacf0 sizeargs=0x7fff50fdace0 async=-1 devid=1
reduction array of 370444 bytes at 0xb02700000
Arguments to function 0 calc_force_des_404_gpu:
18471 0 32505856 11 40894464 11 3 3
3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3
3 3 18471 1 1 1 1 1
1 499122 3589632 11 38797312 11 32521216 11
4194304 11 33629696 11 32506880 11 32513536 11
32514048 11 32514560 11 32513024 11 32515072 11
32515584 11 35651584 11 32511488 11 33481888 11
32510464 11 32509952 11 32509440 11 32508928 11
32508416 11 32506368 11 32507904 11 32518144 11
32519680 11 32507392 11 32520704 11 3145728 11
34603008 11 32520192 11
0x00004827 0x00000000 0x01f00000 0x0000000b 0x02700000 0x0000000b 0x00000003 0x00000003
0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003
0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003
0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003
0x00000003 0x00000003 0x00004827 0x00000001 0x00000001 0x00000001 0x00000001 0x00000001
0x00000001 0x00079db2 0x0036c600 0x0000000b 0x02500000 0x0000000b 0x01f03c00 0x0000000b
0x00400000 0x0000000b 0x02012600 0x0000000b 0x01f00400 0x0000000b 0x01f01e00 0x0000000b
0x01f02000 0x0000000b 0x01f02200 0x0000000b 0x01f01c00 0x0000000b 0x01f02400 0x0000000b
0x01f02600 0x0000000b 0x02200000 0x0000000b 0x01f01600 0x0000000b 0x01fee4a0 0x0000000b
0x01f01200 0x0000000b 0x01f01000 0x0000000b 0x01f00e00 0x0000000b 0x01f00c00 0x0000000b
0x01f00a00 0x0000000b 0x01f00200 0x0000000b 0x01f00800 0x0000000b 0x01f03000 0x0000000b
0x01f03600 0x0000000b 0x01f00600 0x0000000b 0x01f03a00 0x0000000b 0x00300000 0x0000000b
0x02100000 0x0000000b 0x01f03800 0x0000000b
cuda_launch argument bytes=812, max=240 move 572 bytes at offset 240 to devaddr 0xb00200000
call to cuMemcpyDtoHAsync returned error 700: Launch failed
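
In case it helps with reading the excerpt above, here is a small Python sketch of how I have been cross-checking it. It is my own helper, not part of the PGI runtime. It assumes that each device pointer in the hexadecimal argument dump is written as two 32-bit words, low word first; that is only my reading of the dump, and the names parse_maps and locate_pointers are made up for illustration. The input strings are copied from the excerpt.

```python
import re

# Two "map" lines and one row of the hexadecimal argument dump, copied from the excerpt.
MAP_LINES = [
    "map dev:0xb01f03c00 host:0x533e7f00 size:24 offset:0 "
    "data[dev:0xb01f03c00 host:0x533e7f00 size:24] (line:404 name:pft_tmp)",
    "map dev:0xb02500000 host:0x7f9ca8ee3860 size:1996488 offset:0 "
    "data[dev:0xb02500000 host:0x7f9ca8ee3860 size:1996488] "
    "(line:404 name:neighbours) dims=18486x27",
]
ARG_ROW = ("0x00000001 0x00079db2 0x0036c600 0x0000000b "
           "0x02500000 0x0000000b 0x01f03c00 0x0000000b")

MAP_RE = re.compile(r"map dev:(0x[0-9a-f]+) host:0x[0-9a-f]+ size:(\d+) .*name:(\w+)\)")


def parse_maps(lines):
    """Return {name: (device_address, size_in_bytes)} for each 'map dev:' line."""
    maps = {}
    for line in lines:
        match = MAP_RE.search(line)
        if match:
            dev, size, name = match.groups()
            maps[name] = (int(dev, 16), int(size))
    return maps


def locate_pointers(words, maps):
    """Find where each mapped device address appears as a (low, high) word pair.

    Assumption: a 64-bit pointer is dumped as two 32-bit words, low word first.
    """
    found = {}
    for i in range(len(words) - 1):
        candidate = words[i] | (words[i + 1] << 32)
        for name, (dev, _size) in maps.items():
            if candidate == dev:
                found[name] = i
    return found


maps = parse_maps(MAP_LINES)
words = [int(w, 16) for w in ARG_ROW.split()]
positions = locate_pointers(words, maps)

for name, index in sorted(positions.items()):
    print(f"{name}: device address {maps[name][0]:#x} at words {index} and {index + 1}")

# dims=18486x27 is taken from the neighbours map line.
elements = 18486 * 27
print("bytes per element of neighbours:", maps["neighbours"][1] // elements)
print("element count appears in the argument row:", elements in words)
```

Under that assumption, both pft_tmp and neighbours show up in the argument row with the same device addresses that the map lines report, and the size of neighbours works out to 4 bytes per element. So the arguments for these two arrays at least look consistent with the earlier allocations.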