Fortran OpenMP offloading painfully slow on NVIDIA architectures

I am currently trying to port a large portion of a Fortran code to GPU devices with OpenMP. I have a working version for AMD, specifically for the MI300A, which features unified shared memory. I achieve a good speedup on this platform given the simulation parameters I need to use. The exact same version can also be compiled to target NVIDIA platforms, with explicit data transfer directives activated. I use nvfortran with the options -O3 -mp=gpu -Minfo=mp -gpu=cc90 (I am targeting H100 GPUs). The issue is that the kernel is painfully slow, and I cannot pinpoint the cause even after profiling with nsys and ncu. I also made a fine-grained check of data transfers with NV_ACC_NOTIFY to make sure that no implicit, unwanted transfers occur.
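For reference, this is my understanding of how the NVHPC runtime interprets NV_ACC_NOTIFY (a bitmask; treat the exact values as an assumption to verify against the NVHPC documentation):

```shell
# Assumed NVHPC runtime bitmask: 1 = kernel launches,
# 2 = data uploads/downloads; 3 enables both (printed to stderr).
export NV_ACC_NOTIFY=3
./a.out
```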

For this reason, I created a minimal example that computes the addition of two arrays, an embarrassingly parallel operation. Even with an array size of 1e9, the GPU version is slower than the CPU one. Here is the dummy program.

program vector_addition
  use omp_lib
  implicit none

  integer, parameter :: n = 1000000000
  real, allocatable, dimension(:) :: a, b, c, c_cpu
  real :: start_time, end_time
  integer :: i

  ! Allocate arrays
  allocate(a(n), b(n), c(n), c_cpu(n))

  ! Initialize arrays
  call random_number(a)
  call random_number(b)

  ! ==========================================================
  !        Serial CPU execution
  ! ==========================================================
  write(*,*) 'Starting CPU computation...'
  call cpu_time(start_time)

  do i = 1, n
     c_cpu(i) = a(i) + b(i)
  end do

  call cpu_time(end_time)

  print *, 'Time taken for CPU: ', end_time - start_time, ' seconds'

  ! ==========================================================
  !        OpenMP Offload to GPU
  ! ==========================================================
  call cpu_time(start_time)

  !$omp target teams distribute parallel do map(to:a,b) map(from:c)
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  !$omp end target teams distribute parallel do

  call cpu_time(end_time)
  print *, 'Time taken for GPU offload: ', end_time - start_time, ' seconds'

  ! compare the results
  do i = 1, n
     if (c(i) /= c_cpu(i)) then
        print *, 'Mismatch at index ', i, ': CPU = ', c_cpu(i), ', GPU = ', c(i)
        exit
     end if
  end do

  ! Deallocate arrays
  deallocate(a, b, c, c_cpu)

end program vector_addition  

The output is the following:

Time taken for CPU:    0.9353199      seconds
Time taken for GPU offload:     1.690033      seconds

Do you have any idea why even such a simple case is not working? Am I missing any fundamental concept here?

Looks like it’s all in data movement. The kernel itself looks ok.

From Nsight Systems running on a GH200:

% nvfortran -Ofast -mp=gpu test.F90 ; nsys profile --stats=true a.out
Collecting data...
 Starting CPU computation...
 Time taken for CPU:    0.5192339      seconds
 Time taken for GPU offload:     1.895453      seconds
Generating '/tmp/nsys-report-ec2c.qdstrm'
[1/8] [========================100%] report1.nsys-rep
[2/8] [========================100%] report1.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/mcolgrove/tmp/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
... cut due to length ...

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)           Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -----------------------
    100.0        4,799,680          1  4,799,680.0  4,799,680.0  4,799,680  4,799,680          0.0  nvkernel_MAIN__F1L36_2_

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)           Operation
 --------  ---------------  -----  ------------  ------------  ----------  ----------  -----------  ----------------------------
     60.4       23,482,688      6   3,913,781.3       1,184.0         768  11,757,472  6,061,603.7  [CUDA memcpy Host-to-Device]
     39.6       15,380,928      1  15,380,928.0  15,380,928.0  15,380,928  15,380,928          0.0  [CUDA memcpy Device-to-Host]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)   Med (MB)   Min (MB)   Max (MB)   StdDev (MB)           Operation
 ----------  -----  ---------  ---------  ---------  ---------  -----------  ----------------------------
  8,000.001      6  1,333.333      0.000      0.000  4,000.000    2,065.591  [CUDA memcpy Host-to-Device]
  4,000.000      1  4,000.000  4,000.000  4,000.000  4,000.000        0.000  [CUDA memcpy Device-to-Host]

Enabling Unified Memory helps:

% nvfortran -Ofast -mp=gpu -gpu=mem:unified test.F90 ; nsys profile --stats=true a.out
Collecting data...
 Starting CPU computation...
 Time taken for CPU:    0.5200350      seconds
 Time taken for GPU offload:    0.7387049      seconds
Generating '/tmp/nsys-report-a795.qdstrm'
[1/8] [========================100%] report2.nsys-rep
[2/8] [========================100%] report2.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/mcolgrove/tmp/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
... cut due to length ...

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)           Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  -----------------------
    100.0      505,423,360          1  505,423,360.0  505,423,360.0  505,423,360  505,423,360          0.0  nvkernel_MAIN__F1L36_2_

There’s still some data movement in the kernel itself, but as you reuse data on the device, this will get amortized.
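For example (a sketch with a hypothetical iteration count), repeating the kernel is enough to see the amortization: with -gpu=mem:unified, pages migrate to the device on first touch, and subsequent launches reuse the data already resident there:

```fortran
! Sketch: the first iteration pays the page-migration cost;
! later iterations reuse data already on the device.
do iter = 1, 10
   !$omp target teams distribute parallel do
   do i = 1, n
      c(i) = a(i) + b(i)
   end do
   !$omp end target teams distribute parallel do
end do
```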

-Mat
