Hi Paul,
"if you’ve tested it already"
I hadn’t yet, since I was down in Austin at a conference (I sat next to the new head of your computing center), working on standardizing OpenACC benchmarks.
I just reran your code with 13.1, and it does appear that we fixed whatever the problem was. My timings are now approximately equal to the CUDA version. Overall, the solve time is down from 14.6 seconds to about 9.3.
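For anyone who wants to check the comparison, here is a minimal sketch of the arithmetic. The numbers are copied from the "Solve time" lines and the matvec kernel totals in the two listings in this post; the helper names `speedup` and `kernel_share` are made up for illustration and are not part of any PGI tool.

```python
# Figures copied by hand from the two PGI_ACC_TIME listings in this post.
# The dictionary and function names are illustrative only.
SOLVE_S = {"12.10": 14.624010, "13.1": 9.302466}          # "Solve time" in seconds
MATVEC_KERNEL_US = {"12.10": 7_431_620, "13.1": 6_971_615}  # matvec kernel total, microseconds


def speedup(old_s, new_s):
    """Ratio of the old solve time to the new one."""
    return old_s / new_s


def kernel_share(kernel_us, solve_s):
    """Fraction of the solve time spent inside a kernel (us converted to s)."""
    return kernel_us / (solve_s * 1e6)


if __name__ == "__main__":
    print(f"speedup 12.10 -> 13.1: {speedup(SOLVE_S['12.10'], SOLVE_S['13.1']):.2f}x")
    for version, solve_s in SOLVE_S.items():
        share = kernel_share(MATVEC_KERNEL_US[version], solve_s)
        print(f"{version}: matvec kernel is {share:.1%} of the solve time")
```

Read this way, the matvec kernel itself changes only slightly between the two compilers, so most of the improvement seems to come from time spent outside that kernel.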
Here is my PGI_ACC_TIME output from 12.10:
ACC_NOTIFY=0 CG_MAX_ITER=10000 OMP_NUM_THREADS=1 nice -19 ./cg_ser ../fidap011.mtx
This Version uses OpenACC
min 13 max 90 avg 65.688335 entries in a row
padded entries:0 (0.000000)
PARSE DONE!
RESIDUAL:3298031056.460293
ENTERING CG
LEAVING CG
First 10 values of the solution vector x = (0:3.457127e-01' 1:6.113136e-01' 2:6.628127e-01' 3:5.463548e-01' 4:3.840186e-01' 5:5.858319e-01' 6:5.232209e-01' 7:5.619878e-01' 8:4.634887e-01' 9:5.931586e-01' )
Max Iterations:10000
Iterations: 10000
Solve time: 14.624010
Accelerator Kernel Timing data
./solver.c
axpy
45: region entered 20000 times
time(us): total=445,465 init=822 region=444,643
kernels=186,463
w/o init: total=444,643 max=51,596 min=17 avg=22
48: kernel launched 20000 times
grid: [130] block: [128]
time(us): total=186,463 max=228 min=7 avg=9
./solver.c
vectorDot
29: region entered 20001 times
time(us): total=1,691,386 init=1,054 region=1,690,332
kernels=317,837
w/o init: total=1,690,332 max=14,750 min=2 avg=84
30: kernel launched 19965 times
grid: [130] block: [128]
time(us): total=317,837 max=186 min=9 avg=15
./solver.c
vectorDot
21: region entered 20001 times
time(us): total=3,305,567 init=794 region=3,304,773
kernels=469,003
w/o init: total=3,304,773 max=91,778 min=149 avg=165
23: kernel launched 20001 times
grid: [130] block: [128]
time(us): total=285,686 max=190 min=12 avg=14
24: kernel launched 20001 times
grid: [1] block: [256]
time(us): total=183,317 max=251 min=7 avg=9
./solver.c
vectorDot
19: region entered 20001 times
time(us): total=5,888,556 init=811 region=5,887,745
kernels=725
w/o init: total=5,887,745 max=106,772 min=271 avg=294
30: kernel launched 36 times
grid: [130] block: [128]
time(us): total=725 max=180 min=13 avg=20
./solver.c
cg
164: region entered 1 time
time(us): total=81,531 init= region=81,530
kernels=19
w/o init: total=81,530 max=81,530 min=81,530 avg=81,530
166: kernel launched 1 times
grid: [130] block: [128]
time(us): total=19 max=19 min=19 avg=19
./solver.c
nrm2
101: region entered 1 time
time(us): total=67,259
kernels=45
104: kernel launched 1 times
grid: [130] block: [128]
time(us): total=21 max=21 min=21 avg=21
105: kernel launched 1 times
grid: [1] block: [256]
time(us): total=24 max=24 min=24 avg=24
./solver.c
xpay
57: region entered 10001 times
time(us): total=270,678 init=359 region=270,319
kernels=91,370
w/o init: total=270,319 max=71,409 min=17 avg=27
60: kernel launched 10001 times
grid: [130] block: [128]
time(us): total=91,370 max=52 min=7 avg=9
./solver.c
matvec
76: region entered 10001 times
time(us): total=7,829,004 init=436 region=7,828,568
kernels=7,431,620
w/o init: total=7,828,568 max=281,892 min=740 avg=782
80: kernel launched 10001 times
grid: [16614] block: [128]
time(us): total=7,431,620 max=911 min=730 avg=743
./solver.c
cg
154: region entered 1 time
time(us): total=14,623,965
data=3,992
acc_init.c
acc_init
38: region entered 1 time
time(us): init=523,977
And here is the same run with 13.1:
ACC_NOTIFY=0 CG_MAX_ITER=10000 OMP_NUM_THREADS=1 nice -19 ./cg_ser ../fidap011.mtx
This Version uses OpenACC
min 13 max 90 avg 65.688335 entries in a row
padded entries:0 (0.000000)
PARSE DONE!
RESIDUAL:3298031056.460293
ENTERING CG
LEAVING CG
First 10 values of the solution vector x = (0:3.457127e-01' 1:6.113136e-01' 2:6.628127e-01' 3:5.463548e-01' 4:3.840186e-01' 5:5.858319e-01' 6:5.232209e-01' 7:5.619878e-01' 8:4.634887e-01' 9:5.931586e-01' )
Max Iterations:10000
Iterations: 10000
Solve time: 9.302466
Accelerator Kernel Timing data
./solver.c
vectorDot NVIDIA devicenum=0
time(us): 645,419
23: kernel launched 20001 times
grid: [130] block: [128]
device time(us): total=312,471 max=278 min=10 avg=15
elapsed time(us): total=458,380 max=285 min=19 avg=22
23: reduction kernel launched 20001 times
grid: [1] block: [256]
device time(us): total=166,978 max=174 min=6 avg=8
elapsed time(us): total=315,601 max=182 min=14 avg=15
30: kernel launched 20001 times
grid: [130] block: [128]
device time(us): total=165,970 max=172 min=6 avg=8
elapsed time(us): total=316,702 max=509 min=13 avg=15
./solver.c
axpy NVIDIA devicenum=0
time(us): 173,763
48: kernel launched 20000 times
grid: [130] block: [128]
device time(us): total=173,763 max=180 min=6 avg=8
elapsed time(us): total=327,764 max=2,092 min=13 avg=16
./solver.c
xpay NVIDIA devicenum=0
time(us): 81,667
60: kernel launched 10001 times
grid: [130] block: [128]
device time(us): total=81,667 max=33 min=7 avg=8
elapsed time(us): total=159,996 max=634 min=14 avg=15
./solver.c
matvec NVIDIA devicenum=0
time(us): 6,971,615
80: kernel launched 10001 times
grid: [16614] block: [128]
device time(us): total=6,971,615 max=1,007 min=692 avg=697
elapsed time(us): total=7,049,936 max=1,753 min=699 avg=704
./solver.c
nrm2 NVIDIA devicenum=0
time(us): 26
104: kernel launched 1 times
grid: [130] block: [128]
device time(us): total=17 max=17 min=17 avg=17
elapsed time(us): total=25 max=25 min=25 avg=25
104: reduction kernel launched 1 times
grid: [1] block: [256]
device time(us): total=9 max=9 min=9 avg=9
elapsed time(us): total=16 max=16 min=16 avg=16
./solver.c
cg NVIDIA devicenum=0
time(us): 2,935
154: data copyin reached 5 times
device time(us): total=2,865 max=1,870 min=17 avg=573
166: kernel launched 1 times
grid: [130] block: [128]
device time(us): total=16 max=16 min=16 avg=16
elapsed time(us): total=29 max=29 min=29 avg=29
223: data copyout reached 1 times
device time(us): total=54 max=54 min=54 avg=54