Slower performance with GPU when using nvfortran, stdpar

PGI_C1 · September 21, 2024, 1:35pm

I installed nvhpc with CUDA 12.5 and tried to check the performance of the saxpy example included in the examples directory.
However, the GPU version with DO CONCURRENT was much slower than the naive DO loop.

How can I improve the parallelization with GPUs?

System Info:
RTX 4090 GPU
CUDA 12.5
nvhpc_2024_247
Ubuntu 22.04 LTS x86_64

Source code: [NVHPC Installed Directory]/Linux_x86_64/24.7/examples/stdpar/saxpy.f90
(I modified parameter n to 10000000)

Results:
9938 microseconds sequential
27308 microseconds parallel with stdpar
Test PASSED

Accelerator Kernel Timing data
/STORAGE_2/ph/PACKAGES/NVIDIA_HPC_SDK/Linux_x86_64/24.7/examples/stdpar/saxpy/saxpy.f90
saxpy_concurrent NVIDIA devicenum=0
time(us): 27,121
27: compute region reached 1 time
27: kernel launched 1 time
grid: [78125] block: [128]
device time(us): total=27,121 max=27,121 min=27,121 avg=27,121
elapsed time(us): total=27,175 max=27,175 min=27,175 avg=27,175
27: data region reached 2 times

(GPU usage was ~16% for a moment.)

MatColgrove · September 23, 2024, 4:10pm

This is a toy example showing very basic usage of DO CONCURRENT, and the slow performance is expected.

With Unified Memory, data gets copied to the GPU in the kernel where it’s first touched. Since there’s only the one GPU kernel, all the data movement time is being added to its runtime. In a real program, this data movement cost would get amortized across multiple kernels.
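
One way to see the amortization is to run the same loop several times and time each pass separately. The program below is only a minimal sketch, not the SDK's saxpy.f90: the program name, the repeat count nreps, and the timing code are made up for illustration, and it assumes a build with something like nvfortran -stdpar=gpu. If the data movement explanation holds, the first pass should carry the migration cost and the later passes should be much faster, but no timings are claimed here.

```fortran
! Hypothetical sketch: time repeated DO CONCURRENT passes over the same arrays.
! Names and nreps are illustrative assumptions, not taken from the SDK example.
program saxpy_repeat
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 10000000   ! same n as in the original post
  integer, parameter :: nreps = 5      ! assumed number of timed passes
  real, allocatable :: x(:), y(:)
  real :: a
  integer :: i, rep
  integer(int64) :: t0, t1, rate

  allocate(x(n), y(n))
  a = 2.0
  x = 1.0
  y = 0.0

  call system_clock(count_rate=rate)

  do rep = 1, nreps
    call system_clock(t0)
    ! With Unified Memory, x and y are expected to migrate on the first pass only.
    do concurrent (i = 1:n)
      y(i) = a * x(i) + y(i)
    end do
    call system_clock(t1)
    print '(a,i0,a,i0,a)', 'pass ', rep, ': ', &
          (t1 - t0) * 1000000_int64 / rate, ' microseconds'
  end do

  ! After nreps passes every element should equal nreps * a * 1.0.
  if (all(abs(y - real(nreps) * a) < 1.0e-4)) then
    print '(a)', 'Test PASSED'
  else
    print '(a)', 'Test FAILED'
  end if

  deallocate(x, y)
end program saxpy_repeat
```

Comparing the first pass with the later ones separates the one-time data movement from the kernel's own runtime, which is the number to compare against the sequential loop.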

PGI_C1 · September 23, 2024, 4:16pm

Thank you for the kind reply! I also found that a similar question had been asked before.
