DO CONCURRENT matmul slow on Grace Hopper

I’ve been trying to run a benchmark that performs a modified matrix-multiplication operation using DO CONCURRENT, and I am unsure why it is slower on a Grace Hopper system than on an A100 GPU.

   do k = 1, 3
      call system_clock(c_start, c_rate)
      do concurrent (i=1:N, j=1:N)
         C(i,j) = sum(A(i,:)**2 * B(:,j))
      end do
      call system_clock(c_stop)
      associate ( &
           time => real(c_stop-c_start)/c_rate, &
           flops => 3*real(N)**3 &
           )
         write(*, '(F8.3, 4X, 2(E12.7, 4x), E12.7, 4x, I8)') &
             & time, flops, flops/time, sum(C), N
          ! sum(C) to ensure calculating to C isn't optimized away
      end associate
   end do

The same code is compiled with -stdpar=gpu -Minfo on both systems.
On the node with an A100 (nvfortran-23.9), I get:

1.032 .2061584E+12 .1997901E+12 .1101053E+11 4096
0.577 .2061584E+12 .3570059E+12 .1101053E+11 4096
0.572 .2061584E+12 .3603525E+12 .1101053E+11 4096

On the Grace Hopper system (nvfortran-24.5), I get:

8.365 .2061584E+12 .2464580E+11 .1148680E+11 4096
5.600 .2061584E+12 .3681183E+11 .1148680E+11 4096
5.596 .2061584E+12 .3683846E+11 .1148680E+11 4096

What could be causing the slowdown?

Additionally, I have created a blocked version that splits the work into 128 blocks; it is significantly faster on both systems but still slow on the Grace Hopper system.

   do k = 1, 3
      call system_clock(c_start, c_rate)
      c = 0
      do concurrent(k0=1:N:BLOCK_SIZE)
         k1 = min(N, k0 + BLOCK_SIZE - 1)
         do concurrent (j0=1:N:BLOCK_SIZE,i0=1:N:BLOCK_SIZE)
            j1 = min(N, j0 + BLOCK_SIZE - 1)
            i1 = min(N, i0 + BLOCK_SIZE - 1)
            do concurrent(j=j0:j1,i=i0:i1)
               c(i,j) = c(i,j) + sum(a(i,k0:k1)*b(k0:k1,j))
            end do
         end do
      end do
      call system_clock(c_stop)
      associate ( &
           time => real(c_stop-c_start)/c_rate, &
           flops => 2*real(N)**3 &
           )
         write(*, '(F8.3, 4X, 2(E12.7, 4x), E12.7, 4x, I8)') &
               & time, flops, flops/time, sum(C), N
      end associate
   end do

On the A100:

4.111 .7036874E+14 .1711514E+14 .1073742E+10 32768
3.364 .7036874E+14 .2092107E+14 .1073742E+10 32768
3.362 .7036874E+14 .2093160E+14 .1073742E+10 32768

On the Grace Hopper:

14.131 .7036874E+14 .4979612E+13 .1717987E+11 32768
11.247 .7036874E+14 .6256861E+13 .1717987E+11 32768
11.248 .7036874E+14 .6255887E+13 .1717987E+11 32768

Is this an issue with how the system is set up? Is it a difference between the x86 and ARM architectures? What could I be missing? Thanks!

I am suspicious of your time measurements; GPU activity is asynchronous with respect to the host. If you could post the full code, I could take a look.

Here’s the full version:

program main
   use, intrinsic :: iso_fortran_env, only: INT64
   implicit none

#ifndef N_VALUES
   integer, parameter :: N = 4096
#else // ifdef N_VALUES
   integer, parameter :: N = N_VALUES
#endif // N_VALUES

   real, dimension(N,N) :: A, B, C
   integer(kind=INT64) :: c_start, c_stop, c_rate
   integer :: i, j, k

   call random_number(A)
   call random_number(B)

   do k = 1, 3
      call system_clock(c_start, c_rate)
      do concurrent (i=1:N, j=1:N)
         C(i,j) = sum(A(i,:)**2 * B(:,j))
      end do
      call system_clock(c_stop)

      associate ( &
           time => real(c_stop-c_start)/c_rate, &
           flops => 3*real(N)**3 &
           )
         write(*, '(F8.3, 4X, 2(E12.7, 4x), E12.7, 4x, I8)') &
             & time, flops, flops/time, sum(C), N
      end associate
   end do
end program main

The blocked version follows the same structure.
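(For completeness: N can be overridden at compile time through the N_VALUES preprocessor macro; the file name and value below are just illustrative.)

nvfortran -stdpar=gpu -Minfo -DN_VALUES=8192 forum.F90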

I am aware that the time measurement will include the overhead of moving data to/from the accelerator, and this is fine: it is a real cost I have to account for when judging whether to use an accelerator. That said, if there is a way to time the data transfer/kernel setup, the kernel execution, and the transfer back/cleanup separately, that would be helpful as well. My goal is to learn the limits of DO CONCURRENT without CUDA calls, OpenMP, OpenACC, or MPI, with a future goal of finding which of these gives the most impact for the least effort.

I am getting very different results from you; it may be how the system is set up.
Try adding -gpu=mem:managed to your compile line.
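For reference, the full compile line with that flag would look like this (file name as in the transcript below):

nvfortran -stdpar=gpu -gpu=mem:managed -Minfo forum.F90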

nvfortran -V

nvfortran 24.5-0 linuxarm64 target on aarch64 Linux -tp neoverse-v2 
NVIDIA Compilers and Tools
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

nvfortran -stdpar=gpu  -Minfo forum.F90 
main:
     20, Generating NVIDIA GPU code
         20, Loop parallelized across CUDA thread blocks collapse(2) ! blockidx%x collapsed-innermost
               ! blockidx%x auto-collapsed
         21, Loop parallelized across CUDA threads(128) ! threadidx%x
             Generating implicit reduction(+:a$r)
     20, Generating implicit copyin(b(:,:)) [if not already present]
         Generating implicit copyout(c(:,:)) [if not already present]
         Generating implicit copyin(a(:,:)) [if not already present]
     21, sum reduction inlined
     29, sum reduction inlined
 ~]$ ./a.out 
   0.798    .2061584E+12    .2583257E+12    .1101053E+11        4096
   0.410    .2061584E+12    .5033302E+12    .1101053E+11        4096
   0.410    .2061584E+12    .5032356E+12    .1101053E+11        4096

You can look at what the code is doing with nsys (you can either generate a profile that can be visualized later in Nsight Systems with “nsys profile”, or also get a summary with “nsys nvprof”).
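For example, to generate a report file that can be opened in the Nsight Systems GUI (the report name here is just an example):

nsys profile -o report ./a.out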

$ nsys nvprof ./a.out
WARNING: a.out and any of its children processes will be profiled.

   0.733    .2061584E+12    .2811564E+12    .1101053E+11        4096
   0.412    .2061584E+12    .5008414E+12    .1101053E+11        4096
   0.411    .2061584E+12    .5010753E+12    .1101053E+11        4096
Generating '/tmp/nsys-report-32b3.qdstrm'
[1/7] [========================100%] report3.nsys-rep
[2/7] [========================100%] report3.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /global/home/users/mfatica/report3.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)          Name        
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  -------------------
     98.0    1,267,372,256          3  422,457,418.7  411,574,272.0  411,391,456  444,406,528  19,008,706.1  cuStreamSynchronize
      1.6       20,877,152          1   20,877,152.0   20,877,152.0   20,877,152   20,877,152           0.0  cuMemAllocManaged  
      0.1        1,331,488          1    1,331,488.0    1,331,488.0    1,331,488    1,331,488           0.0  cuModuleLoadDataEx 
      0.1        1,324,672          3      441,557.3       13,536.0       12,992    1,298,144     741,825.9  cuLaunchKernel     
      0.1        1,300,672          1    1,300,672.0    1,300,672.0    1,300,672    1,300,672           0.0  cuMemAlloc_v2      
      0.0          487,648          1      487,648.0      487,648.0      487,648      487,648           0.0  cuMemAllocHost_v2  
      0.0            1,600          3          533.3          320.0          192        1,088         484.6  cuCtxSetCurrent    

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)      Name    
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  -----------
    100.0    1,267,355,660          3  422,451,886.7  411,563,524.0  411,386,340  444,405,796  19,012,849.6  main_20_gpu

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: /global/home/users/mfatica/report3.sqlite does not contain GPU memory data.
[7/7] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: /global/home/users/mfatica/report3.sqlite does not contain GPU memory data.
Generated:
    /global/home/users/mfatica/report3.nsys-rep
    /global/home/users/mfatica/report3.sqlite

As you can see, the do concurrent loops each take around 0.42 s.
There is no data movement because the compiler is smart enough to allocate the arrays directly on the GPU in managed memory (note the cuMemAllocManaged call in the trace above).

If you use the GUI, you will see a picture like this one:

[Nsight Systems timeline screenshot]

You may also want to change to dynamic allocation; it is generally a bad idea to use static allocation, in particular with GPUs and if you are going to use multiple devices. In this case, it should make the slowdown you observed go away without passing additional compiler flags.
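A minimal sketch of that change, keeping the same benchmark loop as your posted program and only switching the arrays to allocatable (the N_VALUES preprocessor block is omitted here for brevity):

program main
   use, intrinsic :: iso_fortran_env, only: INT64
   implicit none

   integer, parameter :: N = 4096
   ! allocatable (heap) arrays instead of static ones
   real, allocatable, dimension(:,:) :: A, B, C
   integer(kind=INT64) :: c_start, c_stop, c_rate
   integer :: i, j, k

   allocate(A(N,N), B(N,N), C(N,N))
   call random_number(A)
   call random_number(B)

   do k = 1, 3
      call system_clock(c_start, c_rate)
      do concurrent (i=1:N, j=1:N)
         C(i,j) = sum(A(i,:)**2 * B(:,j))
      end do
      call system_clock(c_stop)

      associate ( &
           time => real(c_stop-c_start)/c_rate, &
           flops => 3*real(N)**3 &
           )
         write(*, '(F8.3, 4X, 2(E12.7, 4x), E12.7, 4x, I8)') &
             & time, flops, flops/time, sum(C), N
      end associate
   end do

   deallocate(A, B, C)
end program main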