Parallel do concurrent is slower than the sequential version in the stdpar examples

I tested the stdpar examples jacobi/ and saxpy/ in nvidia/hpc_sdk/Linux_x86_64/24.11/examples, building them with the -fast -stdpar flags. The results show:

- jacobi.f90: 447974 microseconds parallel (do concurrent) vs. 37051 microseconds sequential
- saxpy.f90: 11889 microseconds parallel (stdpar) vs. 149 microseconds sequential

So the parallel version is much slower than the sequential one. I'd like to know why this happens.

System OS: Ubuntu 22.04.5 LTS on WSL2
CPU: AMD R9-7940HX
GPU: RTX 4070 laptop, Driver Version: 560.94
CUDA version: 12.6
NVHPC SDK version: 24.11

Did you use -stdpar=gpu or just -stdpar? I think in the latter case the compiler only generates host code.

The help information for -stdpar shows that -stdpar=gpu is the default:

-stdpar[=gpu|multicore]
                    Enable (ISO Fortran 2018) do-concurrent
    gpu             Enable Fortran do-concurrent acceleration on the GPU (default); please refer to -gpu for target specific options
    multicore       Enable Fortran do-concurrent acceleration on multicore; please refer to -gpu for target specific options

I have also tried -stdpar=gpu; the results are the same.

Hi Winstorm,

These are very small functional examples not intended for performance evaluation. The kernels are very small, and the data movement to/from the device dominates at about 99% of the device time. For large applications with more computation and device memory re-use, you'll see the same performance benefit as the directive-based solutions (OpenACC, OpenMP) and CUDA Fortran.
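One way to see the data-movement cost for yourself is to time the same kernel twice: under CUDA Unified Memory (the stdpar default), the first launch pays for host-to-device migration of the arrays, while the second runs on data already resident on the device. A minimal sketch (the array size, program name, and two-run timing approach are my own, not from the SDK examples):

```fortran
program saxpy_time
  implicit none
  integer, parameter :: n = 100000000   ! assumed size; large enough to see the effect
  real, parameter    :: a = 2.0
  real, allocatable  :: x(:), y(:)
  integer            :: i, iter
  integer(8)         :: t0, t1, rate

  allocate(x(n), y(n))
  x = 1.0
  y = 2.0

  ! Run the same kernel twice: the first launch includes the managed-memory
  ! migration to the GPU, the second measures (mostly) compute on
  ! device-resident data.
  do iter = 1, 2
     call system_clock(t0, rate)
     do concurrent (i = 1:n)
        y(i) = a * x(i) + y(i)
     end do
     call system_clock(t1)
     print '(a,i0,a,f0.1,a)', 'run ', iter, ': ', &
           real(t1 - t0) / real(rate) * 1.0e6, ' microseconds'
  end do
end program saxpy_time
```

Built with something like `nvfortran -fast -stdpar=gpu saxpy_time.f90`, the second run should be dramatically faster than the first, which is the gap your measurements are picking up.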

POT3D is a good example, as discussed in this article: Using Fortran Standard Parallel Programming for GPU Acceleration | NVIDIA Technical Blog

POT3D's source can be found at: GitHub - predsci/POT3D: POT3D: High Performance Potential Field Solver

-Mat

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.