Mat,
I tried with both the 12.8 and 13.2 compilers. The code generated by the new compiler is much faster, but it is still three times slower than the CPU code. Execution times are listed below, all measured in seconds. For the accelerator code, no time is spent in data movement.
CPU           v12.8         v13.2
0.62484789    4.13616442    1.87525368
Compiler output from v12.8
convergencet.f90:
convergencet:
12, Accelerator kernel generated
12, CC 1.3 : 28 registers; 32 shared, 284 constant, 0 local memory bytes
CC 2.0 : 27 registers; 0 shared, 308 constant, 0 local memory bytes
13, !$acc loop gang ! blockidx%x
15, !$acc loop vector(256) ! threadidx%x
12, Generating present(t(:,:))
Generating present(bt(:,:))
Generating present(atn(:,:))
Generating present(ats(:,:))
Generating present(ate(:,:))
Generating present(atw(:,:))
Generating present(atp(:,:))
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
15, Loop is parallelizable
Compiler output from v13.2
convergencet.f90:
convergencet:
12, Generating present(t(:,:))
Generating present(bt(:,:))
Generating present(atn(:,:))
Generating present(ats(:,:))
Generating present(ate(:,:))
Generating present(atw(:,:))
Generating present(atp(:,:))
Accelerator kernel generated
13, !$acc loop gang ! blockidx%x
15, !$acc loop vector(256) ! threadidx%x
12, Generating NVIDIA code
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
Generating compute capability 3.0 binary
15, Loop is parallelizable
Profiler information:
=========Code 12.8=========
convergencet convergencet
12: region entered 10000 times
time(us): total=4,110,304 init=904 region=4,109,400
kernels=399,804
w/o init: total=4,109,400 max=865 min=404 avg=410
12: kernel launched 10000 times
grid: [128] block: [256]
time(us): total=309,510 max=37 min=28 avg=30
13: kernel launched 10000 times
grid: [2] block: [256]
time(us): total=90,294 max=15 min=8 avg=9
==========Code 13.2==========
convergencet NVIDIA devicenum=0
time(us): 393,772
12: kernel launched 10000 times
grid: [128] block: [256]
device time(us): total=315,369 max=59 min=29 avg=31
elapsed time(us): total=712,665 max=207 min=68 avg=71
12: reduction kernel launched 10000 times
grid: [2] block: [256]
device time(us): total=78,403 max=22 min=7 avg=7
elapsed time(us): total=464,016 max=129 min=45 avg=46
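For what it's worth, here is the arithmetic behind the 13.2 numbers I quote below (just a quick sanity check on the profile totals above, not part of my code):

```python
# Totals copied from the 13.2 profile above (all times in microseconds)
kernel_device  = 315_369 + 78_403   # device time: main kernel + reduction kernel
kernel_elapsed = 712_665 + 464_016  # elapsed (host-visible) time for the same launches
launches = 10_000 * 2               # 10000 region entries, two kernels per entry

print(kernel_device / 1e6)    # ~0.3938 s  (the kernel time I quote below)
print(kernel_elapsed / 1e6)   # ~1.1767 s  (the elapsed time I quote below)
print((kernel_elapsed - kernel_device) / launches)  # ~39 us of overhead per launch
```

So on average each launch seems to carry roughly 39 microseconds of host-side overhead on top of the device time.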
I have a few questions.
- Compiler 12.8 generates information about register, constant, shared memory, and local memory usage. Why does this information disappear in v13.2? Can this information help optimize the code?
- The actual kernel time for the code compiled by 12.8 is 0.3998 seconds; however, the region time is 4.1 seconds, which matches my measured time. Where is the majority of the time spent, besides the kernels?
- The actual kernel time for the code compiled by 13.2 is 0.3938 seconds, and the total elapsed time is 1.1767 seconds. This differs from my measured 1.8752 seconds. Why do they differ so much? And again, where has the remaining time (1.1767 - 0.3938 seconds) been spent?
Thanks,
Ping