Fortran OpenMP offloading painfully slow on NVIDIA architectures

I am currently trying to port a large portion of a Fortran code to GPU devices with OpenMP. I have a working version for AMD, specifically for the MI300A, which features unified shared memory. I achieve a good speedup on this platform given the simulation parameters I need to use. The exact same version can also be compiled to target NVIDIA platforms, with explicit data transfer directives activated. I use nvfortran with the options -O3 -mp=gpu -Minfo=mp -gpu=cc90 (I am targeting H100 GPUs). The issue is that the kernel is painfully slow, and I cannot pinpoint the cause even after profiling with nsys and ncu. I also made a fine-grained check of data transfers with NV_ACC_NOTIFY to make sure that no implicit, unwanted transfers occur.
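For reference, this is my understanding of how the NVHPC runtime interprets NV_ACC_NOTIFY (a bitmask; treat the exact values as an assumption to verify against the NVHPC documentation):

```shell
# Assumed NVHPC runtime bitmask: 1 = kernel launches,
# 2 = data uploads/downloads; 3 enables both (printed to stderr).
export NV_ACC_NOTIFY=3
./a.out
```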

For this reason, I created a minimal example that computes the addition of two arrays, an embarrassingly parallel operation. Even with an array size of 1e9, the GPU version is slower than the CPU one. Here is the dummy program.

program vector_addition
  use omp_lib
  implicit none

  integer, parameter :: n = 1000000000
  real, allocatable, dimension(:) :: a, b, c, c_cpu
  real :: start_time, end_time
  integer :: i

  ! Allocate arrays
  allocate(a(n), b(n), c(n), c_cpu(n))

  ! Initialize arrays
  call random_number(a)
  call random_number(b)

  ! ==========================================================
  !        Serial CPU execution
  ! ==========================================================
  write(*,*) 'Starting CPU computation...'
  call cpu_time(start_time)

  do i = 1, n
     c_cpu(i) = a(i) + b(i)
  end do

  call cpu_time(end_time)

  print *, 'Time taken for CPU: ', end_time - start_time, ' seconds'

  ! ==========================================================
  !        OpenMP Offload to GPU
  ! ==========================================================
  call cpu_time(start_time)

  !$omp target teams distribute parallel do map(to:a,b) map(from:c)
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  !$omp end target teams distribute parallel do

  call cpu_time(end_time)
  print *, 'Time taken for GPU offload: ', end_time - start_time, ' seconds'

  ! compare the results
  do i = 1, n
     if (c(i) /= c_cpu(i)) then
        print *, 'Mismatch at index ', i, ': CPU = ', c_cpu(i), ', GPU = ', c(i)
        exit
     end if
  end do

  ! Deallocate arrays
  deallocate(a, b, c, c_cpu)

end program vector_addition  

The output is the following:

Time taken for CPU:    0.9353199      seconds
Time taken for GPU offload:     1.690033      seconds

Do you have any idea why even such a simple case is not working? Am I missing any fundamental concept here?

Looks like it’s all in data movement. The kernel itself looks ok.

From Nsight Systems running on a GH200:

% nvfortran -Ofast -mp=gpu test.F90 ; nsys profile --stats=true a.out
Collecting data...
 Starting CPU computation...
 Time taken for CPU:    0.5192339      seconds
 Time taken for GPU offload:     1.895453      seconds
Generating '/tmp/nsys-report-ec2c.qdstrm'
[1/8] [========================100%] report1.nsys-rep
[2/8] [========================100%] report1.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/mcolgrove/tmp/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
... cut due to length ...

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)           Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -----------------------
    100.0        4,799,680          1  4,799,680.0  4,799,680.0  4,799,680  4,799,680          0.0  nvkernel_MAIN__F1L36_2_

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)           Operation
 --------  ---------------  -----  ------------  ------------  ----------  ----------  -----------  ----------------------------
     60.4       23,482,688      6   3,913,781.3       1,184.0         768  11,757,472  6,061,603.7  [CUDA memcpy Host-to-Device]
     39.6       15,380,928      1  15,380,928.0  15,380,928.0  15,380,928  15,380,928          0.0  [CUDA memcpy Device-to-Host]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)   Med (MB)   Min (MB)   Max (MB)   StdDev (MB)           Operation
 ----------  -----  ---------  ---------  ---------  ---------  -----------  ----------------------------
  8,000.001      6  1,333.333      0.000      0.000  4,000.000    2,065.591  [CUDA memcpy Host-to-Device]
  4,000.000      1  4,000.000  4,000.000  4,000.000  4,000.000        0.000  [CUDA memcpy Device-to-Host]

Enabling Unified Memory helps:

% nvfortran -Ofast -mp=gpu -gpu=mem:unified test.F90 ; nsys profile --stats=true a.out
Collecting data...
 Starting CPU computation...
 Time taken for CPU:    0.5200350      seconds
 Time taken for GPU offload:    0.7387049      seconds
Generating '/tmp/nsys-report-a795.qdstrm'
[1/8] [========================100%] report2.nsys-rep
[2/8] [========================100%] report2.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/mcolgrove/tmp/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
... cut due to length ...

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)           Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  -----------------------
    100.0      505,423,360          1  505,423,360.0  505,423,360.0  505,423,360  505,423,360          0.0  nvkernel_MAIN__F1L36_2_

There’s still some data movement in the kernel itself, but as you reuse data on the device, this will get amortized.
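For example (a sketch with a hypothetical iteration count), repeating the kernel is enough to see the amortization: with -gpu=mem:unified, pages migrate to the device on first touch, and subsequent launches reuse the data already resident there:

```fortran
! Sketch: the first iteration pays the page-migration cost;
! later iterations reuse data already on the device.
do iter = 1, 10
   !$omp target teams distribute parallel do
   do i = 1, n
      c(i) = a(i) + b(i)
   end do
   !$omp end target teams distribute parallel do
end do
```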

-Mat
