While trying to answer the question in this post, I decided to try out the ‘tdot’ program on page 16 of the PGI OpenACC Users Guide. I copied and pasted the code verbatim into tman.f90, built it as instructed (linking against BLAS rather than ACML), and got the following:
$ pgfortran -mp -acc tman.f90 -Minfo -lblas
tdot:
35, Parallel region activated
37, Parallel region terminated
49, Parallel region activated
52, Generating copyin(y(offs+1:nsec+offs))
Generating copyin(x(offs+1:nsec+offs))
Generating NVIDIA code
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
Generating compute capability 3.0 binary
53, Loop is parallelizable
Accelerator kernel generated
53, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
54, Sum reduction generated for z
56, Parallel region terminated
59, sum reduction inlined
(821) $ ./a.out
Host Serial 2489.612915315796
upload CUDA data file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=52 device=1 variable=y bytes=40000
upload CUDA data file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=52 device=1 variable=x bytes=40000
upload CUDA data file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=52 device=0 variable=y bytes=40000
upload CUDA data file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=52 device=0 variable=x bytes=40000
launch CUDA kernel file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=53 device=1 grid=40 block=128 sharedbytes=2048
launch CUDA kernel file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=53 device=0 grid=40 block=128 sharedbytes=2048
call to cuEventSynchronize returned error 700: Launch failed
Accelerator Kernel Timing data
/home/mathomp4/F90Files/OMP-ACC/tman.f90
tdot thread=0 NVIDIA devicenum=0
time(us): 238
52: compute region reached 1 time
52: data copyin reached 4 times
device time(us): total=238 max=72 min=45 avg=59
53: kernel launched 2 times
grid: [40] block: [128]
device time(us): total=0 max=0 min=0 avg=0
/home/mathomp4/F90Files/OMP-ACC/tman.f90
tdot thread=1 NVIDIA devicenum=1
time(us): 0
52: compute region reached 1 time
call to cuEventSynchronize returned error 700: Launch failed
That run used PGI 13.5. So I thought, well, let’s try 13.4 (since that seems to be what Mat always does when I pass him errors :) ):
$ pgfortran -V13.4 -mp -acc tman.f90 -Minfo -lblas
tdot:
35, Parallel region activated
37, Parallel region terminated
49, Parallel region activated
52, Generating copyin(y(offs+1:nsec+offs))
Generating copyin(x(offs+1:nsec+offs))
Generating NVIDIA code
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
Generating compute capability 3.0 binary
53, Loop is parallelizable
Accelerator kernel generated
53, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
54, Sum reduction generated for z
56, Parallel region terminated
59, sum reduction inlined
(840) $ ./a.out
Host Serial 2489.612915315796
Multi-Device Parallel 2489.612915315794
Accelerator Kernel Timing data
/home/mathomp4/F90Files/OMP-ACC/tman.f90
tdot NVIDIA devicenum=0
time(us): 89
52: data copyin reached 2 times
device time(us): total=57 max=32 min=25 avg=28
53: kernel launched 1 times
grid: [40] block: [128]
device time(us): total=24 max=24 min=24 avg=24
elapsed time(us): total=39 max=39 min=39 avg=39
53: reduction kernel launched 1 times
grid: [1] block: [256]
device time(us): total=8 max=8 min=8 avg=8
elapsed time(us): total=20 max=20 min=20 avg=20
/home/mathomp4/F90Files/OMP-ACC/tman.f90
tdot NVIDIA devicenum=1
time(us): 71
52: data copyin reached 2 times
device time(us): total=49 max=25 min=24 avg=24
53: kernel launched 1 times
grid: [40] block: [128]
device time(us): total=14 max=14 min=14 avg=14
elapsed time(us): total=28 max=28 min=28 avg=28
53: reduction kernel launched 1 times
grid: [1] block: [256]
device time(us): total=8 max=8 min=8 avg=8
elapsed time(us): total=19 max=19 min=19 avg=19
So, any idea what I did wrong? Is there some odd control character in my code, invisible to me after the cut-and-paste, that is throwing this off? Or did something in the OpenACC implementation change between 13.4 and 13.5?
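For reference, the overall shape of the program is roughly the following. This is only a minimal sketch reconstructed from the -Minfo messages above, not the verbatim listing from the guide. The array size n = 10000 is an assumption inferred from the 40000-byte uploads per device (taking the data to be 8-byte reals), the names tdot_sketch, zs and myid are my own, the reduction clause is written out explicitly, and the intrinsic dot_product stands in for the BLAS ddot call so the sketch needs no -lblas.

```fortran
program tdot_sketch
  ! Sketch only: NOT the verbatim listing from the PGI guide.
  ! Intended build (assumed): pgfortran -mp -acc tdot_sketch.f90
  use omp_lib
  use openacc
  implicit none

  integer, parameter :: n = 10000     ! assumed: 2 sections of 5000 8-byte reals
  integer, parameter :: numthr = 2    ! one host thread per GPU (two devices assumed)
  real(8) :: x(n), y(n), z, zs(numthr)
  integer :: j, nsec, offs, myid

  call random_number(x)
  call random_number(y)

  ! Host reference result (intrinsic used here in place of BLAS ddot)
  print *, 'Host Serial', dot_product(x, y)

  nsec = n / numthr
  call omp_set_num_threads(numthr)

  !$omp parallel private(j, z, offs, myid)
  myid = omp_get_thread_num()
  ! Bind each OpenMP thread to its own device
  call acc_set_device_num(myid, acc_device_nvidia)
  offs = myid * nsec
  z = 0.0d0
  ! Each thread copies in and reduces only its own section of x and y
  !$acc kernels loop reduction(+:z) copyin(x(offs+1:offs+nsec), y(offs+1:offs+nsec))
  do j = offs + 1, offs + nsec
     z = z + x(j) * y(j)
  end do
  zs(myid + 1) = z                    ! per-thread partial sum
  !$omp end parallel

  ! Combine the per-device partial sums on the host
  print *, 'Multi-Device Parallel', sum(zs)
end program tdot_sketch
```

The point of the structure is that each host thread selects a different device before entering its compute region, so the 13.5 log showing both uploads and both kernel launches attributed to thread 0 is what looks odd to me.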
Matt