Slower performance on GPU when using nvfortran stdpar (DO CONCURRENT)

I installed nvhpc with CUDA 12.5 and tried to check the performance of the saxpy example included in the examples directory.
However, the GPU run with DO CONCURRENT was much slower than the naive sequential DO loop.

How can I improve the GPU parallelization?

System Info:
RTX 4090 GPU
CUDA 12.5
nvhpc_2024_247
Ubuntu 22.04 LTS x86_64

Source code: [NVHPC Installed Directory]/Linux_x86_64/24.7/examples/stdpar/saxpy.f90
(I modified the parameter n to 10000000.)
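
For context, here is a minimal sketch of the kernel being timed, reconstructed from the routine name in the profile output below rather than copied from the SDK source (the actual example also contains a sequential DO-loop version used as the reference):

subroutine saxpy_concurrent(x, y, n, a)
  implicit none
  integer :: n, i
  real :: x(n), y(n), a
  ! Every iteration is independent, so nvfortran can offload this
  ! loop to the GPU when the file is compiled with -stdpar=gpu.
  do concurrent (i = 1:n)
    y(i) = a*x(i) + y(i)
  end do
end subroutine saxpy_concurrent

I built it with something like nvfortran -stdpar=gpu saxpy.f90 (I have not checked the exact flags in the example's Makefile); the per-kernel timing report below appears to be the runtime profile you get when NVCOMPILER_ACC_TIME=1 is set.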

Results:
9938 microseconds sequential
27308 microseconds parallel with stdpar
Test PASSED

Accelerator Kernel Timing data
/STORAGE_2/ph/PACKAGES/NVIDIA_HPC_SDK/Linux_x86_64/24.7/examples/stdpar/saxpy/saxpy.f90
saxpy_concurrent NVIDIA devicenum=0
time(us): 27,121
27: compute region reached 1 time
27: kernel launched 1 time
grid: [78125] block: [128]
device time(us): total=27,121 max=27,121 min=27,121 avg=27,121
elapsed time(us): total=27,175 max=27,175 min=27,175 avg=27,175
27: data region reached 2 times

(GPU usage was ~16% for a moment)

This is a toy example showing very basic usage of DO CONCURRENT, and the slow performance is expected.

With Unified Memory, data gets copied to the GPU by the kernel that first touches it. Since there's only the one GPU kernel, all of the data-movement time is being added to its runtime. In a real program, this data-movement cost would be amortized across multiple kernels.
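
A quick way to see this in the toy example itself is to call the kernel twice and time each call separately. The sketch below is only an illustration (the program name, variable names, and timing harness are made up; saxpy_concurrent is the kernel sketched earlier in the thread): the first call should carry the page-migration cost, while the second call runs on data that is already resident on the GPU.

program saxpy_warmup
  implicit none
  integer, parameter :: n = 10000000
  real, allocatable :: x(:), y(:)
  real :: a
  integer(kind=8) :: t0, t1, t2, rate

  allocate(x(n), y(n))
  x = 1.0; y = 2.0; a = 2.0

  call system_clock(t0, rate)
  call saxpy_concurrent(x, y, n, a)   ! pays the first-touch migration cost
  call system_clock(t1)
  call saxpy_concurrent(x, y, n, a)   ! data is already resident on the GPU
  call system_clock(t2)

  print *, 'first call  (us):', (t1 - t0) * 1000000_8 / rate
  print *, 'second call (us):', (t2 - t1) * 1000000_8 / rate

contains

  subroutine saxpy_concurrent(x, y, n, a)
    integer :: n, i
    real :: x(n), y(n), a
    do concurrent (i = 1:n)
      y(i) = a*x(i) + y(i)
    end do
  end subroutine saxpy_concurrent

end program saxpy_warmup

Compiled with nvfortran -stdpar=gpu, comparing the two numbers should show most of the ~27 ms going into the first call, which is the data movement rather than the compute.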

Thank you for the kind reply! Also, I found a similar question that was asked before.