NVC++ & external CUDA Thrust conflicts for -stdpar offload

I think I’ve uncovered a subtle interaction between NVC++ and an external CUDA installation when using -stdpar=gpu offload.

We have a module environment as is typical on many HPC installations. When I purge my environment and load only nvhpc, -stdpar=gpu works as expected.

However, when a cuda module is loaded as well, -stdpar=gpu fails at runtime in various ways. Sometimes I get a runtime error: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered. Most troubling, sometimes no runtime error is reported at all, and a simple std::sort() just returns erroneous results.

I suspect this is an interaction with the Thrust detected in the external cuda installation. Any guidance is appreciated!

-Ben

nvhpc only, correct:

$ wget https://raw.githubusercontent.com/benkirk/paradigms_playground/master/parallel_stl_sort.C
$ module purge && module load nvhpc && module list
Currently Loaded Modules:
  1) ncarenv/22.10 (S)   2) craype/2.7.17 (S)   3) nvhpc/22.7

$ nvc++ -stdpar -o parallel_stl_sort parallel_stl_sort.C && ./parallel_stl_sort 
input: v.size()=500000000; 3499211612 581869302 3890346734 3586334585 545404204 4161255391 3922919429 949333985 2715962298 1323567403 418932835 ...
after sort: v.size()=500000000; 5 15 19 22 31 55 60 61 63 88 95 ...
after unique: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...
std::copy() / std::sort() / std::unique() / std::execution::seq: 43.127 sec. 
final: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...

input: v.size()=500000000; 3499211612 581869302 3890346734 3586334585 545404204 4161255391 3922919429 949333985 2715962298 1323567403 418932835 ...
after sort: v.size()=500000000; 5 15 19 22 31 55 60 61 63 88 95 ...
after unique: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...
std::copy() / std::sort() / std::unique() / std::execution::par: 0.632754 sec. 
final: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...

input: v.size()=500000000; 3499211612 581869302 3890346734 3586334585 545404204 4161255391 3922919429 949333985 2715962298 1323567403 418932835 ...
after sort: v.size()=500000000; 5 15 19 22 31 55 60 61 63 88 95 ...
after unique: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...
std::copy() / std::sort() / std::unique() / std::execution::par_unseq: 0.073256 sec. 
final: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...

nvhpc+cuda, incorrect:

$ wget https://raw.githubusercontent.com/benkirk/paradigms_playground/master/parallel_stl_sort.C
$ module purge && module load nvhpc cuda && module list
Currently Loaded Modules:
  1) ncarenv/22.10 (S)   2) craype/2.7.17 (S)   3) nvhpc/22.7   4) cuda/11.4.4

$ nvc++ -stdpar -o parallel_stl_sort parallel_stl_sort.C && ./parallel_stl_sort 
input: v.size()=500000000; 3499211612 581869302 3890346734 3586334585 545404204 4161255391 3922919429 949333985 2715962298 1323567403 418932835 ...
after sort: v.size()=500000000; 5 15 19 22 31 55 60 61 63 88 95 ...
after unique: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...
std::copy() / std::sort() / std::unique() / std::execution::seq: 42.8227 sec. 
final: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...

input: v.size()=500000000; 3499211612 581869302 3890346734 3586334585 545404204 4161255391 3922919429 949333985 2715962298 1323567403 418932835 ...
after sort: v.size()=500000000; 0 0 0 0 0 0 0 0 0 0 0 ...
after unique: v.size()=59328; 765600696 352808383 3641236997 4016398694 1279020192 465826551 864301009 822663315 3257882672 1989160727 4086747794 ...
==> ERROR: size mismatch from serial algorithm!
std::copy() / std::sort() / std::unique() / std::execution::par: 0.61413 sec. 
final: v.size()=59328; 765600696 352808383 3641236997 4016398694 1279020192 465826551 864301009 822663315 3257882672 1989160727 4086747794 ...

input: v.size()=500000000; 3499211612 581869302 3890346734 3586334585 545404204 4161255391 3922919429 949333985 2715962298 1323567403 418932835 ...
after sort: v.size()=500000000; 0 0 0 0 0 0 0 0 0 0 0 ...
after unique: v.size()=643648; 2435803498 2970823809 3485536073 2755831796 3868881694 2623710790 2458607871 3552076208 607421919 2528345273 2013025721 ...
==> ERROR: size mismatch from serial algorithm!
std::copy() / std::sort() / std::unique() / std::execution::par_unseq: 0.606054 sec. 
final: v.size()=643648; 2435803498 2970823809 3485536073 2755831796 3868881694 2623710790 2458607871 3552076208 607421919 2528345273 2013025721 ...
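
When two module sets behave differently like this, one way to narrow down the cause is to compare the CUDA-related environment variables in each. The following is only a sketch: `show_cuda_env` is a hypothetical helper name, not part of the module system or NVHPC, and it simply filters the output of `env`.

```shell
# Hypothetical helper: list environment variables whose name or value
# mentions "cuda" (case-insensitive), or say so if there are none.
show_cuda_env() {
    env | grep -i 'cuda' || echo "no CUDA-related variables set"
}

# Intended use: run once after `module load nvhpc` and once after
# `module load nvhpc cuda`, then compare the two listings.
show_cuda_env
```

Any variable that appears only in the second listing is a candidate for what the compiler might be picking up from the external CUDA installation.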

Hi Ben,

You’re most likely correct. I’m assuming the cuda/11.4.4 module sets the “CUDA_HOME” environment variable, in which case nvc++ will pick up the Thrust from CUDA 11.4. There are a few incompatibilities that we fixed in the Thrust we ship that didn’t get picked up by the main Thrust until later CUDA versions. Hence, this is not expected to work.

Please try unsetting the CUDA_HOME environment variable before compiling.

Note that as of the 22.9 release we no longer use CUDA_HOME; instead we use NVHPC_CUDA_HOME to allow users to switch to their own CUDA install. This should help eliminate these types of issues.

-Mat
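
If other tools in the same session still need CUDA_HOME, the variable can be dropped for the compile step alone rather than unset globally. The sketch below assumes a POSIX-style shell; `without_cuda_home` is a hypothetical helper name, not something NVHPC provides. It runs the given command in a subshell, so the caller's environment is left untouched.

```shell
# Hypothetical helper: run a command with CUDA_HOME removed from that
# command's environment only. The subshell keeps the unset local.
without_cuda_home() {
    if [ -n "${CUDA_HOME:-}" ]; then
        echo "note: ignoring CUDA_HOME=${CUDA_HOME} for: $*" >&2
    fi
    ( unset CUDA_HOME; "$@" )
}

# Usage with the compile line from this thread:
#   without_cuda_home nvc++ -stdpar -o parallel_stl_sort parallel_stl_sort.C
```

Per the reply above, releases from 22.9 on no longer consult CUDA_HOME, so a wrapper like this should only matter for older compilers such as 22.7.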

Absolutely correct, this resolves the issue!! Thanks a lot, and we’ll look forward to deploying 22.9+

$ echo $CUDA_HOME 
/glade/u/apps/dav/opt/cuda/11.4.0/
$ unset CUDA_HOME 
$ nvc++ -stdpar -o parallel_stl_sort parallel_stl_sort.C && ./parallel_stl_sort 
input: v.size()=500000000; 3499211612 581869302 3890346734 3586334585 545404204 4161255391 3922919429 949333985 2715962298 1323567403 418932835 ...
after sort: v.size()=500000000; 5 15 19 22 31 55 60 61 63 88 95 ...
after unique: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...
std::copy() / std::sort() / std::unique() / std::execution::seq: 49.7554 sec. 
final: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...

input: v.size()=500000000; 3499211612 581869302 3890346734 3586334585 545404204 4161255391 3922919429 949333985 2715962298 1323567403 418932835 ...
after sort: v.size()=500000000; 5 15 19 22 31 55 60 61 63 88 95 ...
after unique: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...
std::copy() / std::sort() / std::unique() / std::execution::par: 0.781871 sec. 
final: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...

input: v.size()=500000000; 3499211612 581869302 3890346734 3586334585 545404204 4161255391 3922919429 949333985 2715962298 1323567403 418932835 ...
after sort: v.size()=500000000; 5 15 19 22 31 55 60 61 63 88 95 ...
after unique: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...
std::copy() / std::sort() / std::unique() / std::execution::par_unseq: 0.0912849 sec. 
final: v.size()=471992679; 5 15 19 22 31 55 60 61 63 88 95 ...

Would it be possible to reference the required Thrust by installation prefix, e.g.

#include </full/path/from/the/install/path.h>

to avoid any environment variable fragility? Just an idea.

-Ben

Doubtful. The path depends on the installation directory, the NVHPC compiler version, and the CUDA version, so it can’t be hard-coded. Plus, this was a short-term problem, since the Thrust we shipped was ahead of the CUDA 11.4 version.
