Error with Sample tdot Program with PGI 13.5

While trying to answer the question in this post, I decided to try out the ‘tdot’ program on page 16 of the PGI OpenACC Users Guide. I copied and pasted the code verbatim into tman.f90, built it as instructed (using BLAS, not ACML), and then:

$ pgfortran -mp -acc tman.f90 -Minfo -lblas
tdot:
     35, Parallel region activated
     37, Parallel region terminated
     49, Parallel region activated
     52, Generating copyin(y(offs+1:nsec+offs))
         Generating copyin(x(offs+1:nsec+offs))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     53, Loop is parallelizable
         Accelerator kernel generated
         53, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     54, Sum reduction generated for z
     56, Parallel region terminated
     59, sum reduction inlined
(821) $ ./a.out
 Host Serial    2489.612915315796     
upload CUDA data  file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=52 device=1 variable=y bytes=40000
upload CUDA data  file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=52 device=1 variable=x bytes=40000
upload CUDA data  file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=52 device=0 variable=y bytes=40000
upload CUDA data  file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=52 device=0 variable=x bytes=40000
launch CUDA kernel  file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=53 device=1 grid=40 block=128 sharedbytes=2048
launch CUDA kernel  file=/home/mathomp4/F90Files/OMP-ACC/tman.f90 function=tdot line=53 device=0 grid=40 block=128 sharedbytes=2048
call to cuEventSynchronize returned error 700: Launch failed

Accelerator Kernel Timing data
/home/mathomp4/F90Files/OMP-ACC/tman.f90
  tdot  thread=0  NVIDIA  devicenum=0
    time(us): 238
    52: compute region reached 1 time
        52: data copyin reached 4 times
             device time(us): total=238 max=72 min=45 avg=59
        53: kernel launched 2 times
            grid: [40]  block: [128]
             device time(us): total=0 max=0 min=0 avg=0
/home/mathomp4/F90Files/OMP-ACC/tman.f90
  tdot  thread=1  NVIDIA  devicenum=1
    time(us): 0
    52: compute region reached 1 time
call to cuEventSynchronize returned error 700: Launch failed

This was run with PGI 13.5. So I thought, well, let’s try 13.4 (as that seems to be what Mat always does when I pass him errors :) ):

pgfortran -V13.4 -mp -acc tman.f90 -Minfo -lblas
tdot:
     35, Parallel region activated
     37, Parallel region terminated
     49, Parallel region activated
     52, Generating copyin(y(offs+1:nsec+offs))
         Generating copyin(x(offs+1:nsec+offs))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     53, Loop is parallelizable
         Accelerator kernel generated
         53, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     54, Sum reduction generated for z
     56, Parallel region terminated
     59, sum reduction inlined
(840) $ ./a.out
 Host Serial    2489.612915315796     
 Multi-Device Parallel    2489.612915315794     

Accelerator Kernel Timing data
/home/mathomp4/F90Files/OMP-ACC/tman.f90
  tdot  NVIDIA  devicenum=0
        time(us): 89
        52: data copyin reached 2 times
             device time(us): total=57 max=32 min=25 avg=28
        53: kernel launched 1 times
            grid: [40]  block: [128]
             device time(us): total=24 max=24 min=24 avg=24
            elapsed time(us): total=39 max=39 min=39 avg=39
        53: reduction kernel launched 1 times
            grid: [1]  block: [256]
             device time(us): total=8 max=8 min=8 avg=8
            elapsed time(us): total=20 max=20 min=20 avg=20
/home/mathomp4/F90Files/OMP-ACC/tman.f90
  tdot  NVIDIA  devicenum=1
        time(us): 71
        52: data copyin reached 2 times
             device time(us): total=49 max=25 min=24 avg=24
        53: kernel launched 1 times
            grid: [40]  block: [128]
             device time(us): total=14 max=14 min=14 avg=14
            elapsed time(us): total=28 max=28 min=28 avg=28
        53: reduction kernel launched 1 times
            grid: [1]  block: [256]
             device time(us): total=8 max=8 min=8 avg=8
            elapsed time(us): total=19 max=19 min=19 avg=19

So, any idea what I did wrong? Do I have some odd control character in my code I can’t see from the cut-and-paste throwing this off? Did the ACC standard change between 13.4 and 13.5?

Matt

Hi Matt,

“So, any idea what I did wrong? Do I have some odd control character in my code I can’t see from the cut-and-paste throwing this off? Did the ACC standard change between 13.4 and 13.5?”

No, this looks like a compiler error having to do with the multi-device support that is starting to be added. If I compile with “-ta=nvidia”, the test seems to work.
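
For anyone wanting to try the same workaround, a minimal sketch of the rebuild might look like the following. It assumes “-ta=nvidia” is simply added to the original command line; the thread does not say whether “-acc” was kept or replaced, so treat the exact flag combination as a guess. The fallback branch is only there so the snippet does something sensible on a machine without the PGI compiler or the source file.

```shell
# Sketch only: rebuild tman.f90 with an explicit NVIDIA target.
# Assumption: -ta=nvidia is added alongside the original flags; the thread
# does not say whether -acc was kept or dropped.
build_cmd="pgfortran -mp -acc -ta=nvidia tman.f90 -Minfo -lblas"

if command -v pgfortran >/dev/null 2>&1 && [ -f tman.f90 ]; then
    # Compiler and source are present: build, then run the test program.
    $build_cmd && ./a.out
else
    # No PGI compiler or no source file here: just show the command.
    echo "would run: $build_cmd"
fi
```

If the 13.5 build then prints the “Multi-Device Parallel” line instead of the cuEventSynchronize error, that would be consistent with the problem being in the default target handling rather than in the pasted source.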

I’ve added TPR#19419 to address the problem.

Thanks,
Mat

Matt,

TPR 19419 has been fixed in the 13.10 release.

thanks,
dave