Still no device-side profile, but at least this gives us the host-side timings.
Data transfer times look OK.
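While I don’t have the times from the 1080 for comparison, the times in Multigrid.for don’t look too bad. The biggest times are coming from Ult_inf_channel_b.for. The kernel at line 133 is taking ~37 seconds in total (about 3.7 seconds per launch), and the kernel at line 206 is taking ~80 seconds in total (about 5.4 seconds per launch on average, but up to 13 seconds for a single launch).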
Approximately how long did these run for on the 1080?
/home/johannesp/Test_Dirk_2/obj/../src/Ult_inf_channel_b.for
  f_dep_gf NVIDIA devicenum=0
    time(us): 3,287
    132: compute region reached 10 times
        133: kernel launched 10 times
            grid: [65535] block: [128]
            elapsed time(us): total=37,301,376 max=3,750,035 min=3,709,407 avg=3,730,137
    152: data region reached 30 times
        152: data copyin transfers: 120
            device time(us): total=796 max=70 min=2 avg=6
        821: data copyout transfers: 90
            device time(us): total=1,449 max=273 min=4 avg=16
    189: update directive reached 15 times
        189: data copyout transfers: 75
            device time(us): total=608 max=71 min=4 avg=8
    192: update directive reached 15 times
        192: data copyin transfers: 75
            device time(us): total=434 max=82 min=2 avg=5
    206: compute region reached 15 times
        206: kernel launched 15 times
            grid: [3-68] block: [128]
            elapsed time(us): total=80,640,705 max=13,058,157 min=66,917 avg=5,376,047
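
If it helps when comparing against the 1080 numbers, here is a minimal Python sketch that turns the two "elapsed time(us)" lines from the profile above into seconds per launch. The helper names (`parse_times`, `summarize`) are just illustrative, and the launch counts are copied by hand from the "kernel launched N times" lines; this is not part of any compiler tooling.

```python
import re

# Timing lines and launch counts copied from the profile output above.
PROFILE_LINES = {
    "line 133": ("elapsed time(us): total=37,301,376 max=3,750,035 "
                 "min=3,709,407 avg=3,730,137", 10),
    "line 206": ("elapsed time(us): total=80,640,705 max=13,058,157 "
                 "min=66,917 avg=5,376,047", 15),
}


def parse_times(line):
    """Return the key=value fields of a timing line as integers (microseconds)."""
    return {key: int(value.replace(",", ""))
            for key, value in re.findall(r"(\w+)=([\d,]+)", line)}


def summarize(line, launches):
    """Convert a timing line to seconds and a max/min spread."""
    t = parse_times(line)
    return {
        "total_s": t["total"] / 1e6,
        "avg_s": t["total"] / launches / 1e6,
        "max_over_min": t["max"] / t["min"],
    }


if __name__ == "__main__":
    for name, (line, launches) in PROFILE_LINES.items():
        s = summarize(line, launches)
        print(f"{name}: total {s['total_s']:.1f} s, "
              f"avg {s['avg_s']:.2f} s/launch, "
              f"max/min {s['max_over_min']:.0f}x")
```

The max/min ratio is the interesting part: the kernel at line 133 is nearly constant from launch to launch, while the one at line 206 varies by a factor of roughly 195 between its fastest and slowest launch. That may simply reflect the varying grid size ([3-68]), but it would be worth checking.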