I’ve been trying to get a benchmark running that performs a modified matrix multiplication using DO CONCURRENT, and I’m unsure why it is slower on a Grace-Hopper system than on an A100 GPU.

```fortran
do k = 1, 3
  call system_clock(c_start, c_rate)
  do concurrent (i = 1:N, j = 1:N)
    C(i,j) = sum(A(i,:)**2 * B(:,j))
  end do
  call system_clock(c_stop)
  associate ( &
    time  => real(c_stop - c_start) / c_rate, &
    flops => 3 * real(N)**3 &
  )
    write(*, '(F8.3, 4X, 2(E12.7, 4X), E12.7, 4X, I8)') &
      & time, flops, flops/time, sum(C), N
    ! sum(C) ensures the computation of C isn't optimized away
  end associate
end do
```
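As a sanity check on the printed numbers: the `flops` expression counts one square, one multiply, and roughly one add per element of the length-N dot product, i.e. ~3N per output element and 3N³ overall. A quick check (in Python for convenience), assuming N = 4096 as in the last column of the output:

```python
# Verify the benchmark's flop estimate (assumes N = 4096, matching the output).
N = 4096
# C(i,j) = sum(A(i,:)**2 * B(:,j)): N squares + N multiplies + ~N adds per element.
flops = 3 * N**3
print(f"{flops:.7e}")  # 2.0615843e+11, i.e. the .2061584E+12 flops column
```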

The same code is compiled with `-stdpar=gpu -Minfo` on both systems.
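For completeness, the full build line is just the compiler invocation with those flags applied to the source file (the file name `mmul.f90` here is illustrative):

```shell
# Build with GPU offload of DO CONCURRENT (NVHPC nvfortran).
nvfortran -stdpar=gpu -Minfo -o mmul mmul.f90
```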

On the node with an A100 (nvfortran-23.9), I get:

```
   1.032    .2061584E+12    .1997901E+12    .1101053E+11    4096
   0.577    .2061584E+12    .3570059E+12    .1101053E+11    4096
   0.572    .2061584E+12    .3603525E+12    .1101053E+11    4096
```

On the Grace-Hopper system (nvfortran-24.5), I get:

```
   8.365    .2061584E+12    .2464580E+11    .1148680E+11    4096
   5.600    .2061584E+12    .3681183E+11    .1148680E+11    4096
   5.596    .2061584E+12    .3683846E+11    .1148680E+11    4096
```

What could be causing the slowdown?

Additionally, I have created a blocked version that splits the work into 128 blocks. It is significantly faster on both systems, but still slow on the Grace-Hopper.

```fortran
do k = 1, 3
  call system_clock(c_start, c_rate)
  c = 0
  do concurrent (k0 = 1:N:BLOCK_SIZE)
    k1 = min(N, k0 + BLOCK_SIZE - 1)
    do concurrent (j0 = 1:N:BLOCK_SIZE, i0 = 1:N:BLOCK_SIZE)
      j1 = min(N, j0 + BLOCK_SIZE - 1)
      i1 = min(N, i0 + BLOCK_SIZE - 1)
      do concurrent (j = j0:j1, i = i0:i1)
        c(i,j) = c(i,j) + sum(a(i,k0:k1) * b(k0:k1,j))
      end do
    end do
  end do
  call system_clock(c_stop)
  associate ( &
    time  => real(c_stop - c_start) / c_rate, &
    flops => 2 * real(N)**3 &
  )
    write(*, '(F8.3, 4X, 2(E12.7, 4X), E12.7, 4X, I8)') &
      & time, flops, flops/time, sum(C), N
  end associate
end do
```

On the A100:

```
   4.111    .7036874E+14    .1711514E+14    .1073742E+10    32768
   3.364    .7036874E+14    .2092107E+14    .1073742E+10    32768
   3.362    .7036874E+14    .2093160E+14    .1073742E+10    32768
```

On the Grace-Hopper:

```
  14.131    .7036874E+14    .4979612E+13    .1717987E+11    32768
  11.247    .7036874E+14    .6256861E+13    .1717987E+11    32768
  11.248    .7036874E+14    .6255887E+13    .1717987E+11    32768
```
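For what it's worth, the flops column checks out here too: the blocked kernel computes a plain product (no `**2`), so the count drops to 2N³. A quick check, assuming N = 32768 as in the last column:

```python
# Sanity check of the blocked benchmark's flop count (assumes N = 32768).
N = 32768
flops = 2 * N**3  # one multiply + one add per inner-product element
print(f"{flops:.7e}")  # 7.0368744e+13, i.e. the .7036874E+14 column
```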

Is this an issue with how the system is set up? Is it a difference between the x86 and Arm architectures? What could I be missing? Thanks!