I tested the stdpar examples jacobi/ and saxpy/ in nvidia/hpc_sdk/Linux_x86_64/24.11/examples. And I built them with -fast -stdpar flags. The results shows
447974 microseconds on parallel with do concurrent
37051 microseconds on sequential
for jacobi.f90, and
149 microseconds sequential
11889 microseconds parallel with stdpar
for saxpy.f90
So the speed of the parallel version is much slower than the sequential version. I hope to know why these would happen?
System OS: Ubuntu 22.04.5 LTS on WSL2
CPU: AMD R9-7940HX
GPU: RTX 4070 laptop, Driver Version: 560.94
CUDA version: 12.6
NVHPC SDK version: 24.11
The help information of stdpar shows that -stdpar=gpu is the default value.
-stdpar[=gpu|multicore]
Enable (ISO Fortran 2018) do-concurrent
gpu Enable Fortran do-concurrent acceleration on the GPU (default); please refer to -gpu for target specific options
multicore Enable Fortran do-concurrent acceleration on multicore; please refer to -gpu for target specific options
I have also tried -stdpar=gpu, the results are the same
These are very small functional examples not intended for performance evaluation. The kernels are very small and the data movement to/from the device dominates at about 99% of the device time. For large applications with more computation and device memory re-use, you’ll see the same performance benefit as the directive based solutions (OpenACC, OpenMP) and CUDA Fortran.