I installed the NVIDIA HPC SDK (nvhpc) with CUDA 12.5 and measured the performance of the saxpy example included in the examples directory.
However, the GPU version using DO CONCURRENT was about 2.7 times slower than the plain sequential DO loop.
How can I improve the performance of the GPU-parallelized loop?
System Info:
RTX 4090 GPU
CUDA 12.5
nvhpc_2024_247
Ubuntu 22.04 LTS x86_64
Source code: [NVHPC Installed Directory]/Linux_x86_64/24.7/examples/stdpar/saxpy.f90
(I changed the parameter n to 10000000.)
Results:
9938 microseconds sequential
27308 microseconds parallel with stdpar
Test PASSED
Accelerator Kernel Timing data
/STORAGE_2/ph/PACKAGES/NVIDIA_HPC_SDK/Linux_x86_64/24.7/examples/stdpar/saxpy/saxpy.f90
  saxpy_concurrent  NVIDIA  devicenum=0
    time(us): 27,121
    27: compute region reached 1 time
        27: kernel launched 1 time
            grid: [78125]  block: [128]
            device time(us): total=27,121 max=27,121 min=27,121 avg=27,121
            elapsed time(us): total=27,175 max=27,175 min=27,175 avg=27,175
    27: data region reached 2 times
(GPU utilization briefly reached ~16%.)
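
A rough sanity check on the numbers above, written as a small Python sketch. It only does arithmetic on the reported timings. It assumes 4-byte reals and that saxpy touches each element three times (read x, read y, write y). Both are assumptions about the example, not something the output confirms, and the helper name effective_bandwidth_gb_s is made up for illustration.

```python
# Values copied from the program and profiler output in this post.
n = 10_000_000            # problem size after the modification
grid, block = 78_125, 128 # launch configuration reported for line 27
t_seq_us = 9_938          # sequential DO loop, wall time
t_par_us = 27_308         # DO CONCURRENT (stdpar), wall time
t_dev_us = 27_121         # device time reported for the kernel

# ASSUMPTION: single-precision data (4 bytes) and 2 reads + 1 write per element.
BYTES_PER_ELEMENT = 3 * 4


def effective_bandwidth_gb_s(n_elements, time_us):
    """Bytes moved divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    return n_elements * BYTES_PER_ELEMENT / (time_us * 1e-6) / 1e9


threads = grid * block                 # one thread per element if this equals n
slowdown = t_par_us / t_seq_us         # how much slower the stdpar run was
bw_seq = effective_bandwidth_gb_s(n, t_seq_us)
bw_gpu = effective_bandwidth_gb_s(n, t_dev_us)

print(f"threads launched: {threads} (n = {n})")
print(f"stdpar / sequential time ratio: {slowdown:.2f}")
print(f"effective bandwidth, sequential: {bw_seq:.1f} GB/s")
print(f"effective bandwidth, GPU kernel: {bw_gpu:.1f} GB/s")
```

Under these assumptions the launch covers exactly one thread per element, and the kernel moves data at only a few GB/s, lower than the sequential CPU loop. That suggests the measured kernel time may be dominated by something other than the saxpy arithmetic itself, such as data movement between host and device, but the output above does not prove this.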