Does StdPar speed up native loops?

I have a question regarding the nvc++ compiler, which I'm using to compile my Monte Carlo pricing engine. I'm running the following native loop:

for (size_t PathIdx = 0; PathIdx < NumberOfPaths; PathIdx++)
{
    // Offset into the pre-generated random numbers for this path
    size_t NormalIdx = PathIdx * NumberOfSteps;
    float Spot = S0;
    float Vol = v0;
    float zero = 0.0f;
    for (size_t StepIdx = 0; StepIdx < NumberOfSteps; StepIdx++)
    {
        // Euler step for the Heston spot and variance processes
        Spot *= exp((r - 0.5 * Vol) * dt + sqrt(std::max(Vol, zero)) * sqrdt * SpotRandoms.at(NormalIdx + StepIdx));
        Vol += kappa * (vbar - Vol) * dt + zeta * sqrt(std::max(Vol, zero)) * sqrdt * VolRandoms.at(NormalIdx + StepIdx);
    }
}

I.e., nothing special: just simulating Heston price paths. SpotRandoms and VolRandoms are std::vectors that I generated beforehand. I compiled and ran this program with the following compilers and settings:

  • g++ compiler: ran in 5464 ms.
  • nvc++ compiler, no -stdpar flag: ran in 1586 ms.
  • nvc++ compiler, -stdpar=gpu flag: ran in 26 ms.
  • nvc++ compiler, -stdpar=multicore flag: ran in 34 ms.

I was unaware that just using the -stdpar flag would yield a speedup of native loops; I was only aware of the speedup of STL algorithms using the execution policies. Does anybody have a clear view of what's going on under the hood of the nvc++ compiler with regard to native loops? For example, are they indeed ported to the GPU? I profiled the application in Nsight Systems and the CUDA trace seems to indicate they are not, but I could not find a definitive answer in the compiler documentation.
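For reference, the execution-policy form I was referring to, and which I assumed would be needed for any speedup, looks roughly like this. This is an untested sketch using the same variables as above, assuming SpotRandoms and VolRandoms are std::vector<float>; it needs <algorithm>, <cmath>, <execution> and <numeric>:

// Sketch: the same Heston stepping expressed as a parallel algorithm with an
// execution policy, which is the form I expected would be needed for offload.
// PathIndices is a helper index vector, and the random numbers are read
// through raw pointers captured by value in the lambda.
std::vector<size_t> PathIndices(NumberOfPaths);
std::iota(PathIndices.begin(), PathIndices.end(), size_t{0});

const float* SpotZ = SpotRandoms.data();
const float* VolZ = VolRandoms.data();

std::for_each(std::execution::par_unseq, PathIndices.begin(), PathIndices.end(),
    [=](size_t PathIdx)
    {
        size_t NormalIdx = PathIdx * NumberOfSteps;
        float Spot = S0;
        float Vol = v0;
        for (size_t StepIdx = 0; StepIdx < NumberOfSteps; StepIdx++)
        {
            Spot *= std::exp((r - 0.5 * Vol) * dt + std::sqrt(std::max(Vol, 0.0f)) * sqrdt * SpotZ[NormalIdx + StepIdx]);
            Vol += kappa * (vbar - Vol) * dt + zeta * std::sqrt(std::max(Vol, 0.0f)) * sqrdt * VolZ[NormalIdx + StepIdx];
        }
    });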

Hi sanderkorteweg,

What other compiler flags are you using?

STDPAR is only engaged when you use the parallel STL algorithms such as transform, for_each, reduce, etc. with an execution policy. I assume you're not using these elsewhere, so my one thought is that -stdpar also enables a higher default optimization level (-O2), which includes auto-vectorization. If you're compiling without optimization or at a low optimization level (-O0/-O1), this could explain the difference.
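Just to illustrate the first point, only calls of this general shape are what -stdpar acts on (a minimal sketch, not taken from your code):

#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main()
{
    // Placeholder data; a parallel reduction like this is a candidate for
    // offload with -stdpar=gpu, while an equivalent hand-written
    // accumulation loop is not.
    std::vector<float> payoffs(1000000, 1.0f);
    float total = std::reduce(std::execution::par_unseq,
                              payoffs.begin(), payoffs.end(), 0.0f);
    std::printf("total = %f\n", total);
    return 0;
}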

Do you see the same improved performance if you use -O2, -O3, or -fast?

If not, can you please provide a minimal reproducing example so I can investigate?

-Mat

I'm encountering a similar issue. This Fortran program executes a simple saxpy loop:

program test
  implicit none

  real(8), dimension(:), allocatable :: x, y
  real(8), parameter :: a = 1.5
  integer, parameter :: N = 100000000
  integer, parameter :: M = 100
  integer :: i, j
  real(8) :: t1, t2

  allocate (x(N), source=1.5_8)
  allocate (y(N), source=0.0_8)

  call cpu_time(t1)
  !!!do j = 1, M
     do i = 1, N
        y(i) = y(i) + a * x(i)
     end do
  !!!end do
  call cpu_time(t2)
  print *, 'delta_t: ', t2 - t1

  print *, 'y(1): ', y(1)
end program test

Compiling with only -stdpar, I get delta_t: 0.119, compared to delta_t: 0.458 without any compiler flags. I checked the output of -Minfo=all, and with -stdpar I additionally get:

     11, Memory set idiom, loop replaced by call to __c_mset8
     12, Memory zero idiom, loop replaced by call to __c_mzero8

However, this seems to be irrelevant, as compiling with -Mnoidiom gives the same runtime.
I tried optimization levels from -O0 to -O3. Interestingly, at each level the runtime with -stdpar is slightly longer than without it. Moreover, none of these measurements matches those of the build without any -O option.
On top of that, I observed that the program compiled with -stdpar has a process running on the GPU. I found this by enabling the outer loop to increase the runtime and checking nvidia-smi while the program runs. GPU memory is allocated (around 400 MB), but GPU utilization stays at 0%. This contradicts the assumption that the loop is implicitly offloaded without any directives.

Hi Christian,

What's happening is that CUDA Unified Memory is being used for your allocate, which adds a bit of extra overhead and causes a CUDA context to be created on the device. However, the do loop itself is not being offloaded.

You can confirm this by profiling with Nsight Systems:

% nvfortran test.F90 -fast -stdpar ; nsys profile a.out
 delta_t:    7.8229904174804688E-002
 y(1):     2.250000000000000
Generating '/tmp/nsys-report-b2e8.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:
    /local/home/mcolgrove/report1.nsys-rep
% nsys stats report1.nsys-rep
Generating SQLite file report1.sqlite from report1.nsys-rep
Exporting 28339 events: [==================================================100%]
Processing [report1.sqlite] with [/proj/nv/Linux_x86_64/228535-dev/profilers/Nsight_Systems/host-linux-x64/reports/nvtx_sum.py]...
SKIPPED: report1.sqlite does not contain NV Tools Extension (NVTX) data.

Processing [report1.sqlite] with [/proj/nv/Linux_x86_64/228535-dev/profilers/Nsight_Systems/host-linux-x64/reports/osrt_sum.py]...

... cut due to length ...

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)    Max (ns)   StdDev (ns)            Name
 --------  ---------------  ---------  ------------  ------------  ---------  ----------  ------------  ----------------------
     75.2       20,700,855          2  10,350,427.5  10,350,427.5    188,948  20,511,907  14,370,502.1  cuMemAllocManaged
     15.4        4,245,978          1   4,245,978.0   4,245,978.0  4,245,978   4,245,978           0.0  cuMemAllocHost_v2
      8.6        2,364,451          1   2,364,451.0   2,364,451.0  2,364,451   2,364,451           0.0  cuMemAlloc_v2
      0.5          142,784        383         372.8         331.0        170       7,264         378.8  cuGetProcAddress_v2
      0.3           77,437          1      77,437.0      77,437.0     77,437      77,437           0.0  cuLibraryLoadData
      0.0            2,787          4         696.8         556.5        311       1,363         459.4  cuCtxSetCurrent
      0.0            1,112          1       1,112.0       1,112.0      1,112       1,112           0.0  cuInit
      0.0              311          1         311.0         311.0        311         311           0.0  cuModuleGetLoadingMode

... cut due to length ...

No kernel launches are shown in the profile.

-Mat

Hi Mat,

thanks for your answer. It's good to know that there are indeed no kernel launches. However, I still wonder which optimizations make the program compiled with -stdpar run faster, since -Minfo=all shows exactly the same output for both versions. I have also set OMP_NUM_THREADS=1 to make sure that no OpenMP parallelization is taking place.

Best regards,
Christian