Does StdPar speed up native loops?

I have a question regarding the nvc++ compiler that I’m using to compile my Monte Carlo pricing engine. I’m running the following native loop:

for (size_t PathIdx = 0; PathIdx < NumberOfPaths; PathIdx++) {
	int NormalIdx = PathIdx * NumberOfSteps;
	float Spot = S0;
	float Vol = v0;
	float zero = 0.0;
	for (size_t StepIdx = 0; StepIdx < NumberOfSteps; StepIdx++) {
		Spot *= exp((r - 0.5 * Vol) * dt + sqrt(std::max(Vol, zero)) * sqrdt * SpotRandoms[NormalIdx + StepIdx]);
		Vol += kappa * (vbar - Vol) * dt + zeta * sqrt(std::max(Vol, zero)) * sqrdt * VolRandoms[NormalIdx + StepIdx];
	}
}

I.e., nothing special, just simulating Heston price paths. SpotRandoms and VolRandoms are std::vectors that I generated beforehand. I compiled and ran this program with the following compilers and settings:

  • g++ compiler: ran in 5464 ms.
  • nvc++ compiler, no -stdpar flag: ran in 1586 ms.
  • nvc++ compiler, -stdpar=gpu flag: ran in 26 ms.
  • nvc++ compiler, -stdpar=multicore flag: ran in 34 ms.

I was unaware that just using the -stdpar flag would yield a speedup for native loops; I only knew about the speedup of STL algorithms using the execution policies. Does anybody have a clear view of what’s going on under the hood of the nvc++ compiler with regard to native loops? For example, are they indeed ported to the GPU? I have profiled the application in Nsight Systems and the CUDA trace seems to indicate not, but I could not find a definitive answer in the compiler documentation.

Hi sanderkorteweg,

What other compiler flags are you using?

STDPAR would only be enabled when using the parallel STL constructs such as transform, for_each, reduce, etc. I assume you’re not using these elsewhere, so my one thought is that -stdpar enables a higher optimization level (-O2), which includes auto-vectorization. If you’re compiling without optimization or with a low optimization level (-O0/-O1), this could explain the difference.

Do you see the same improved performance if you use -O2, -O3, or -fast?

If not, can you please provide a minimal reproducing example so I can investigate?


I encountered a similar issue. This Fortran program executes a simple saxpy kernel:

program test
  implicit none

  real(8), dimension(:), allocatable :: x, y
  real(8), parameter :: a = 1.5
  integer, parameter :: N = 100000000
  integer, parameter :: M = 100
  integer :: i, j
  real(8) :: t1, t2

  allocate (x(N), source=1.5d0)
  allocate (y(N), source=0.0d0)

  call cpu_time(t1)
  !!!do j = 1, M
     do i = 1, N
        y(i) = y(i) + a * x(i)
     end do
  !!!end do
  call cpu_time(t2)
  print *, 'delta_t: ', t2 - t1

  print *, 'y(1): ', y(1)
end program test

Compiling only with -stdpar, I get delta_t: 0.119, in contrast to delta_t: 0.458 without any compiler flag. I checked the output of -Minfo=all, and with -stdpar, I additionally get

     11, Memory set idiom, loop replaced by call to __c_mset8
     12, Memory zero idiom, loop replaced by call to __c_mzero8

However, this seems to be irrelevant as compiling with -Mnoidiom gives the same runtime.
I tried out optimization levels from -O0 to -O3. Interestingly, at every level the runtime with -stdpar is slightly longer than the one without it. Moreover, none of those measurements matches the runtime of the program compiled without any -O option.
On top of that, I observed that the program compiled with -stdpar has a process running on the GPU. I found this out by enabling the outer loop to increase the runtime and checking nvidia-smi while it runs. GPU memory is allocated (around 400 MB), but GPU utilization stays at 0%. This contradicts the assumption that the loop is implicitly offloaded without any directives.

Hi Christian,

What’s happening is that CUDA Unified Memory is being used for your allocate statements, which causes a bit of extra overhead and a CUDA context to be created on the device. However, the do loop itself is not being offloaded.

You can confirm this by using Nsight Systems:

% nvfortran test.F90 -fast -stdpar ; nsys profile a.out
 delta_t:    7.8229904174804688E-002
 y(1):     2.250000000000000
Generating '/tmp/nsys-report-b2e8.qdstrm'
[1/1] [========================100%] report1.nsys-rep
% nsys stats report1.nsys-rep
Generating SQLite file report1.sqlite from report1.nsys-rep
Exporting 28339 events: [==================================================100%]
Processing [report1.sqlite] with [/proj/nv/Linux_x86_64/228535-dev/profilers/Nsight_Systems/host-linux-x64/reports/]...
SKIPPED: report1.sqlite does not contain NV Tools Extension (NVTX) data.

Processing [report1.sqlite] with [/proj/nv/Linux_x86_64/228535-dev/profilers/Nsight_Systems/host-linux-x64/reports/]...

... cut due to length ...

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)    Max (ns)   StdDev (ns)            Name
 --------  ---------------  ---------  ------------  ------------  ---------  ----------  ------------  ----------------------
     75.2       20,700,855          2  10,350,427.5  10,350,427.5    188,948  20,511,907  14,370,502.1  cuMemAllocManaged
     15.4        4,245,978          1   4,245,978.0   4,245,978.0  4,245,978   4,245,978           0.0  cuMemAllocHost_v2
      8.6        2,364,451          1   2,364,451.0   2,364,451.0  2,364,451   2,364,451           0.0  cuMemAlloc_v2
      0.5          142,784        383         372.8         331.0        170       7,264         378.8  cuGetProcAddress_v2
      0.3           77,437          1      77,437.0      77,437.0     77,437      77,437           0.0  cuLibraryLoadData
      0.0            2,787          4         696.8         556.5        311       1,363         459.4  cuCtxSetCurrent
      0.0            1,112          1       1,112.0       1,112.0      1,112       1,112           0.0  cuInit
      0.0              311          1         311.0         311.0        311         311           0.0  cuModuleGetLoadingMode

... cut due to length ...

No kernel launches are shown in the profile.


Hi Mat,

Thanks for your answer. It’s good to know that there are indeed no kernel launches. However, I still wonder which optimizations make the program compiled with -stdpar run faster, since -Minfo=all shows exactly the same output in both versions. I have also set OMP_NUM_THREADS=1 to make sure that no OpenMP parallelization is done.

Best regards,