Parallel do concurrent is slower than the sequential version in the stdpar examples

I tested the stdpar examples jacobi/ and saxpy/ in nvidia/hpc_sdk/Linux_x86_64/24.11/examples, building them with the -fast -stdpar flags. The results show:

- jacobi.f90: 447974 microseconds parallel (do concurrent) vs. 37051 microseconds sequential
- saxpy.f90: 11889 microseconds parallel (stdpar) vs. 149 microseconds sequential

So the parallel version is much slower than the sequential one. I'd like to know why this happens.

System OS: Ubuntu 22.04.5 LTS on WSL2
CPU: AMD R9-7940HX
GPU: RTX 4070 laptop, Driver Version: 560.94
CUDA version: 12.6
NVHPC SDK version: 24.11

Did you use -stdpar=gpu or just -stdpar? I think in the latter case the compiler only generates host code.

The help information for -stdpar shows that -stdpar=gpu is the default:

-stdpar[=gpu|multicore]
                    Enable (ISO Fortran 2018) do-concurrent
    gpu             Enable Fortran do-concurrent acceleration on the GPU (default); please refer to -gpu for target specific options
    multicore       Enable Fortran do-concurrent acceleration on multicore; please refer to -gpu for target specific options

I have also tried -stdpar=gpu; the results are the same.

Hi Winstorm,

These are very small functional examples not intended for performance evaluation. The kernels are very small, and the data movement to/from the device dominates at about 99% of the device time. For large applications with more computation and device memory re-use, you'll see the same performance benefit as the directive-based solutions (OpenACC, OpenMP) and CUDA Fortran.
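One way to see the data-movement cost for yourself is to time the same kernel twice: under CUDA Unified Memory (the stdpar default), the first launch pays for host-to-device migration of the arrays, while the second runs on data already resident on the device. A minimal sketch (the array size, program name, and two-run timing approach are my own, not from the SDK examples):

```fortran
program saxpy_time
  implicit none
  integer, parameter :: n = 100000000   ! assumed size; large enough to see the effect
  real, parameter    :: a = 2.0
  real, allocatable  :: x(:), y(:)
  integer            :: i, iter
  integer(8)         :: t0, t1, rate

  allocate(x(n), y(n))
  x = 1.0
  y = 2.0

  ! Run the same kernel twice: the first launch includes the managed-memory
  ! migration to the GPU, the second measures (mostly) compute on
  ! device-resident data.
  do iter = 1, 2
     call system_clock(t0, rate)
     do concurrent (i = 1:n)
        y(i) = a * x(i) + y(i)
     end do
     call system_clock(t1)
     print '(a,i0,a,f0.1,a)', 'run ', iter, ': ', &
           real(t1 - t0) / real(rate) * 1.0e6, ' microseconds'
  end do
end program saxpy_time
```

Built with something like `nvfortran -fast -stdpar=gpu saxpy_time.f90`, the second run should be dramatically faster than the first, which is the gap your measurements are picking up.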

POT3D is a good example, as discussed in this article: Using Fortran Standard Parallel Programming for GPU Acceleration | NVIDIA Technical Blog

POT3D's source can be found at: GitHub - predsci/POT3D: POT3D: High Performance Potential Field Solver

-Mat

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.