NVC++ using external libraries

I have written a small program using std::for_each with std::execution::par_unseq in main, and it works fine.
Then I moved this part out of main into a class called by main, and it still works fine.
As the last step I moved this class into a library, also compiled with -stdpar. Both library and application compiled and linked, but at runtime the program crashes with a “segmentation fault”.

Are there any restrictions using libraries dealing with parallel macros?

Background: We have lots of libraries doing complex calculations, all written in pure C++ and compatible with the C++17 standard, which we now want to accelerate using the NVC++ compiler and GPUs.
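For readers following along, the pattern described above (a class in a library that wraps the parallel loop) might look roughly like the sketch below. The names Item, Calculator and the USE_STDPAR switch are assumptions made for illustration; they are not taken from the original program. Without -DUSE_STDPAR the loop runs serially, so the sketch builds with any C++17 compiler.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Opt in to the parallel policy with -DUSE_STDPAR (for example together with nvc++ -stdpar).
// Without it the same loop runs serially.
#ifdef USE_STDPAR
#include <execution>
#define LOOP_POLICY std::execution::par_unseq,
#else
#define LOOP_POLICY
#endif

// Hypothetical stand-in for the real work item.
struct Item {
    double in = 0.0;
    double out = 0.0;
};

// Hypothetical class that would live in the library.
class Calculator {
public:
    explicit Calculator(std::size_t n) : items(n) {
        for (std::size_t i = 0; i < n; ++i) items[i].in = static_cast<double>(i);
    }

    // The loop that was moved from main() into the class.
    void run() {
        std::for_each(LOOP_POLICY items.begin(), items.end(),
                      [](Item& it) { it.out = 2.0 * it.in + 1.0; });
    }

    std::vector<Item> items;
};
```

The application would then only construct a Calculator and call run(); whether the class is linked from a static or a dynamic library does not change this source code.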

Hi ubehr,

Are you creating static or dynamic libraries? Static should work fine, but we have a known issue with creating dynamic libraries that we’re investigating. We’ve added your issue to problem report TPR #29694 to track.

Thanks,
Mat

Hi Mat,

Thanks a lot for the quick reply. Yes, it was a dynamic library. So, for the moment I will use a static one.

Best regards

ubehr

Hi Mat,

This solution was working fine and I was able to run lots of performance tests with amazing results.

Now I have tried the same with a real world example from my company. Some small libraries.

Its central point was a simple loop.

Doing it in serial, everything is working well.

std::for_each(stationen.begin(), stationen.end(),
    [](ApplicationsPaar& ap) { CalculationBase(ap); });

Doing it in parallel, either “par” or “par_unseq”, the source code was compiled and linked into static libraries.

std::for_each(std::execution::par, stationen.begin(), stationen.end(),
    [](ApplicationsPaar& ap) { CalculationBase(ap); });

But calling them from a simple application leads to an error.

nvlink error : Undefined reference to ‘_ZN7StationDIEv’ in ‘hostpwd/ …

Looks like undefined symbols.

So I have checked the static library with “nm”.

Yes, these symbols are really undefined.

What’s going wrong here?

Both with serial loop and with parallel loop libraries are compiling and linking without any message.

But in the case of parallel loop this was not correct. There are undefined symbols.

Is it a problem with the method called? Following your documentation, a static method with a reference to a simple class as argument is allowed for parallel processes on GPU.

And in my former examples this idea was working well.

But not here, so I am confused.

The header-file:

class MY_LIB_PUBLIC CalculatorBase
{
public:
    CalculatorBase();
    CalculatorBase(const CalculatorBase& toCopy);
    virtual ~CalculatorBase();
    CalculatorBase& operator =(const CalculatorBase& av);

protected:
    void copyAttributes(const CalculatorBase& av);

public:
    void CalculationBaseSeriell();
    void CalculationBaseParallelCPU();
    void CalculationBaseParallelGPU();
    void CalculationIM();

public:
    static void CalculationBase(ApplicationsPaar& AP);

public:
    std::vector<ApplicationsPaar> stationen;
};

And the relevant part of the cpp file. The simple distance predictions are hidden behind the * operator (the last statement of the inner loop).

void CalculatorBase::CalculationBase(ApplicationsPaar& stationen)
{
    int i, ni;
    int s, n_s = stationen.S.STATION.size();
    int g, n_g = stationen.G.STATION.size();
    for (s = 0; s < n_s; s++)
    {
        // Build the geometry of station s on the S side.
        Station S = stationen.S.STATION[s];
        GeometrieBasis* GBs = NULL;
        if (S.sitePoint != NULL)
        {
            GBs = new GeometrieDatenKreis();
            (*(GeometrieDatenKreis*)GBs).S.Laenge_grad = S.sitePoint->SID_int_DECIMAL;
            (*(GeometrieDatenKreis*)GBs).S.Breite_grad = S.sitePoint->SID_LAT_DECIMAL;
            (*(GeometrieDatenKreis*)GBs).r_m = 0.0;
        }
        else if (S.siteCircle != NULL)
        {
            GBs = new GeometrieDatenKreis();
            (*(GeometrieDatenKreis*)GBs).S.Laenge_grad = S.siteCircle->SID_int_DECIMAL;
            (*(GeometrieDatenKreis*)GBs).S.Breite_grad = S.siteCircle->SID_LAT_DECIMAL;
            (*(GeometrieDatenKreis*)GBs).r_m = S.siteCircle->ST_AREA_RADIUS;
        }
        else if (S.siteVector != NULL)
        {
            switch (S.siteVector->SID_TYPE)
            {
            case SiteVector::Type::closed:
                ni = S.siteVector->SID_VECTOR.size();
                GBs = new GeometrieDatenKontur();
                for (i = 0; i < ni; i++)
                {
                    GeoKoordinate GK;
                    GK.Laenge_grad = S.siteVector->SID_VECTOR[i].SID_int_DECIMAL;
                    GK.Breite_grad = S.siteVector->SID_VECTOR[i].SID_LAT_DECIMAL;
                    (*(GeometrieDatenKontur*)GBs).PL.PL.push_back(GK);
                }
                break;
            case SiteVector::Type::list:
                ni = S.siteVector->SID_VECTOR.size();
                GBs = new GeometrieDatenPunktListe();
                for (i = 0; i < ni; i++)
                {
                    GeoKoordinate GK;
                    GK.Laenge_grad = S.siteVector->SID_VECTOR[i].SID_int_DECIMAL;
                    GK.Breite_grad = S.siteVector->SID_VECTOR[i].SID_LAT_DECIMAL;
                    (*(GeometrieDatenPunktListe*)GBs).PL.PL.push_back(GK);
                }
                (*(GeometrieDatenPunktListe*)GBs).r_m = 0.0;
                break;
            case SiteVector::Type::open:
                ni = S.siteVector->SID_VECTOR.size();
                GBs = new GeometrieDatenPolygon();
                for (i = 0; i < ni; i++)
                {
                    GeoKoordinate GK;
                    GK.Laenge_grad = S.siteVector->SID_VECTOR[i].SID_int_DECIMAL;
                    GK.Breite_grad = S.siteVector->SID_VECTOR[i].SID_LAT_DECIMAL;
                    (*(GeometrieDatenPolygon*)GBs).PL.PL.push_back(GK);
                }
                break;
            }
        }
        Geometrie Gs;
        Gs.pSDB = GBs;
        for (g = 0; g < n_g; g++)
        {
            // Build the geometry of station g on the G side.
            Station G = stationen.G.STATION[g];
            GeometrieBasis* GBg = NULL;
            if (G.sitePoint != NULL)
            {
                GBg = new GeometrieDatenKreis();
                (*(GeometrieDatenKreis*)GBg).S.Laenge_grad = G.sitePoint->SID_int_DECIMAL;
                (*(GeometrieDatenKreis*)GBg).S.Breite_grad = G.sitePoint->SID_LAT_DECIMAL;
                (*(GeometrieDatenKreis*)GBg).r_m = 0.0;
            }
            else if (G.siteCircle != NULL)
            {
                GBg = new GeometrieDatenKreis();
                (*(GeometrieDatenKreis*)GBg).S.Laenge_grad = G.siteCircle->SID_int_DECIMAL;
                (*(GeometrieDatenKreis*)GBg).S.Breite_grad = G.siteCircle->SID_LAT_DECIMAL;
                (*(GeometrieDatenKreis*)GBg).r_m = G.siteCircle->ST_AREA_RADIUS;
            }
            else if (G.siteVector != NULL)
            {
                switch (G.siteVector->SID_TYPE)
                {
                case SiteVector::Type::closed:
                    ni = G.siteVector->SID_VECTOR.size();
                    GBg = new GeometrieDatenKontur();
                    for (i = 0; i < ni; i++)
                    {
                        GeoKoordinate GK;
                        GK.Laenge_grad = G.siteVector->SID_VECTOR[i].SID_int_DECIMAL;
                        GK.Breite_grad = G.siteVector->SID_VECTOR[i].SID_LAT_DECIMAL;
                        (*(GeometrieDatenKontur*)GBg).PL.PL.push_back(GK);
                    }
                    break;
                case SiteVector::Type::list:
                    ni = G.siteVector->SID_VECTOR.size();
                    GBg = new GeometrieDatenPunktListe();
                    for (i = 0; i < ni; i++)
                    {
                        GeoKoordinate GK;
                        GK.Laenge_grad = G.siteVector->SID_VECTOR[i].SID_int_DECIMAL;
                        GK.Breite_grad = G.siteVector->SID_VECTOR[i].SID_LAT_DECIMAL;
                        (*(GeometrieDatenPunktListe*)GBg).PL.PL.push_back(GK);
                    }
                    (*(GeometrieDatenPunktListe*)GBg).r_m = 0.0;
                    break;
                case SiteVector::Type::open:
                    ni = G.siteVector->SID_VECTOR.size();
                    GBg = new GeometrieDatenPolygon();
                    for (i = 0; i < ni; i++)
                    {
                        GeoKoordinate GK;
                        GK.Laenge_grad = G.siteVector->SID_VECTOR[i].SID_int_DECIMAL;
                        GK.Breite_grad = G.siteVector->SID_VECTOR[i].SID_LAT_DECIMAL;
                        (*(GeometrieDatenPolygon*)GBg).PL.PL.push_back(GK);
                    }
                    break;
                }
            }
            Geometrie Gg;
            Gg.pSDB = GBg;
            // The distance prediction is hidden behind operator*.
            stationen.Abstandsmatrix[s][g] = Gs * Gg;
        }
    }
}

Best wishes

Uli

While I’m not sure, this looks like a constructor for the Station class. Probably as part of:

for (s = 0; s < n_s; s++)
{
    Station S = stationen.S.STATION[s];   // the copy constructor is getting called here
    GeometrieBasis* GBs = NULL;

How are Station’s constructors defined? I’m wondering if they aren’t presented in a way that lets the compiler automatically create the device versions.

-Mat

Hi Mat,

The station is defined by:

#ifndef __Station__
#define __Station__

#include "Receiver.h"
#include "Transmitter.h"
#include "…/GeoCalculator/Geometrie.h"
#include <string>
#include <vector>

class SitePoint;

class SiteCircle;

class SiteVector;

class MY_LIB_PUBLIC Station

{

public: enum SiteType { point, circle, vector };

public:

Station();

Station(SiteType ST);

Station(const Station& toCopy);

virtual ~Station();

Station& operator =(const Station& av);

public:

void ErzeugeGeometrie();

protected:

void copyAttributes(const Station& av);

void clear();

public:

std::string ST_REF;

std::string ST_NAME;

std::string ST_CALLSIGN;

std::vector Transmitters;

std::vector Receivers;

SitePoint* sitePoint;

SiteCircle* siteCircle;

SiteVector* siteVector;

std::string SID_DESC;

std::string SID_COUNTRY;

public:

Geometrie* G;

};

#endif

In the meantime I have found a way to make the code compile and link:

If I don’t use variables with types like “Station” in the static method called by the loop then this error message doesn’t occur anymore.

But it is allowed if it is a member of the object which is the in-parameter of that method.

But I don’t understand why types like that should not be allowed here but be allowed there.

Best regards

Still just guessing, but we don’t yet support virtual functions on the device, so it’s possible that the missing symbol is coming from the destructor. Does the destructor need to be virtual?
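As a side note for readers: whether a class carries virtual members can be checked at compile time with the standard type traits. The sketch below is only an illustration in plain C++, not an NVIDIA feature; PlainStation, PolymorphicStation and usable_without_virtuals are hypothetical names, and the real Station class has more members.

```cpp
#include <type_traits>

// Hypothetical value type: no virtual members, compiler-generated copy and destructor.
struct PlainStation {
    double lon = 0.0;
    double lat = 0.0;
};

// Hypothetical type mirroring the posted Station: the virtual destructor makes it polymorphic.
struct PolymorphicStation {
    virtual ~PolymorphicStation() = default;
    double lon = 0.0;
    double lat = 0.0;
};

// Compile-time guard for types that are created, copied or destroyed inside a parallel region.
template <class T>
constexpr bool usable_without_virtuals =
    !std::is_polymorphic<T>::value && !std::has_virtual_destructor<T>::value;

static_assert(usable_without_virtuals<PlainStation>, "PlainStation has no virtual members");
static_assert(!usable_without_virtuals<PolymorphicStation>, "the virtual destructor is detected");
```

Placing such a static_assert next to the parallel loop would flag a type that gains a virtual member later, instead of leaving it to an nvlink error.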

Hi Mat,

Thanks a lot. That was the real problem.

I have removed all “virtual” keywords and now it is compiling and linking well.

Perfect.

Best regards

Uli

Hi Mat,

I have found some really strange behavior in the NVC++ compiler. The source code shown below is an example of it.
A static method used inside a parallel for_each calls 16 other methods in an if…then…else cascade.
Here I have subdivided it into 4 if-blocks, each containing 4 more if-blocks.
But this is only a detail, because every variant leads to the same problem (an if with 16 subroutines, a switch with 16 cases, … all the same):

double CalculatorCheck::Distance(Geometry& g1, Geometry& g2)
{
double r = 1.0e10;
if (g1.T == Geometry::Type::Circle)
{
if (g2.T == Geometry::Type::Circle)
{
r = DistanceCircleCircle(g1,g2);
}
if (g2.T == Geometry::Type::Contour)
{
r = DistanceCircleContour(g1,g2);
}
if (g2.T == Geometry::Type::Polygon)
{
r = DistanceCirclePolygon(g1,g2);
}
if (g2.T == Geometry::Type::PointList)
{
r = DistanceCirclePointList(g1,g2);
}
}
if (g1.T == Geometry::Type::Contour)
{
if (g2.T == Geometry::Type::Circle)
{
r = DistanceContourCircle(g1,g2);
}
if (g2.T == Geometry::Type::Contour)
{
r = DistanceContourContour(g1,g2);
}
if (g2.T == Geometry::Type::Polygon)
{
r = DistanceContourPolygon(g1,g2);
}
if (g2.T == Geometry::Type::PointList)
{
r = DistanceContourPointList(g1,g2);
}
}
if (g1.T == Geometry::Type::Polygon)
{
if (g2.T == Geometry::Type::Circle)
{
r = DistancePolygonCircle(g1,g2);
}
if (g2.T == Geometry::Type::Contour)
{
r = DistancePolygonContour(g1,g2);
}
if (g2.T == Geometry::Type::Polygon)
{
r = DistancePolygonPolygon(g1,g2);
}
if (g2.T == Geometry::Type::PointList)
{
r = DistancePolygonPointList(g1,g2);
}
}
if (g1.T == Geometry::Type::PointList)
{
if (g2.T == Geometry::Type::Circle)
{
r = DistancePointListCircle(g1,g2);
}
if (g2.T == Geometry::Type::Contour)
{
r = DistancePointListContour(g1,g2);
}
if (g2.T == Geometry::Type::Polygon)
{
r = DistancePointListPolygon(g1,g2);
}
if (g2.T == Geometry::Type::PointList)
{
r = DistancePointListPointList(g1,g2);
}
}

if (r < 0.0)
{
    r = 0.0;
}

return r;

}

And the problem is the following error message:

NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unsupported operation

There was no additional information inside “Minfo”.

But if I use any three of these 16 subroutines, NVC++ compiles without any problem. Any attempt with 4 or more fails.
That suggests that none of these subroutines causes the “unsupported operation” problem by itself; it looks as if the size of the if-cascade is causing it.

Is there any restriction on if-cascades inside a parallel loop?
How can I overcome it?

Best regards
Uli
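For readers, the 4 x 4 dispatch described in this post can also be written as one switch over a combined index. This is only a readability sketch and is not claimed to change the compiler behavior reported above; Shape, Geo, pairDistance and the two handlers are hypothetical stand-ins with made-up bodies.

```cpp
// Hypothetical stand-ins for the geometry type and two of the sixteen handlers.
enum class Shape { Circle = 0, Contour = 1, Polygon = 2, PointList = 3 };

struct Geo {
    Shape T;
    double x;
};

inline double distanceCircleCircle(const Geo& a, const Geo& b) { return b.x - a.x; }
inline double distanceCircleContour(const Geo& a, const Geo& b) { return b.x - a.x - 1.0; }

// One switch over a combined index: 4 * first type + second type (0..15).
inline double pairDistance(const Geo& g1, const Geo& g2) {
    double r = 1.0e10;  // same "not computed" default as in the post
    const int pair = 4 * static_cast<int>(g1.T) + static_cast<int>(g2.T);
    switch (pair) {
        case 0: r = distanceCircleCircle(g1, g2); break;   // Circle, Circle
        case 1: r = distanceCircleContour(g1, g2); break;  // Circle, Contour
        // The remaining fourteen pairs would follow the same pattern.
        default: break;
    }
    if (r < 0.0) r = 0.0;  // clamp negative distances, as in the post
    return r;
}
```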

Sorry Uli, no idea what’s wrong. It doesn’t seem likely that number of if conditions would cause something like this, but if so, it’s not expected. Though the issue could be due to something else as well.

Would it be possible for you to create a complete small reproducer that we could use to determine the issue?

Thanks,
Mat

Hi Mat,

That might be complex.

Behind it there is a set of libraries, and the part used for parallel processing sits on top. So all these libraries are needed to compile it; conversely, extracting only the parallel part is not really possible.

I also have the feeling that it is not really the number of if clauses, but the number of levels of static functions calling other static functions:

Call from outside:

void CalculationCheckParallelGPU();

Level 1: First split into geometry-based and other predictions (the latter are of no interest here):

void CalculatorCheck::CalculationCheckParallelGPU()

{

if (isInit == false)

{

std::for_each(std::execution::par_unseq, CSListen.begin(), CSListen.end(),
    [](CSStationsListe& CSListe) { WerteGeometrieAus(CSListe); });

isInit=true;

}

std::for_each(std::execution::seq, CSListen.begin(), CSListen.end(),
    [](CSStationsListe& CSListe) { CalculationL1ParallelGPU(CSListe); });

}

Level 2:

static void WerteGeometrieAus(CSStationsListe& CSListe);

void CalculatorCheck::WerteGeometrieAus(CSStationsListe& CSListe)

{

double d = Abstand(CSListe.GeometrieTx,CSListe.GeometrieRx,CSListe.vd);

}

Level 3: There are 4 different geometries, so there are sixteen pairs of geometries which have to be handled here:

static double Abstand(Geometrie& G1, Geometrie& G2, std::vector& vd);

Level 4: Here are these 16 handlers:

static double AbstandKreisKreis(Geometrie& g1, Geometrie& g2, std::vector& vd3clear);

static double AbstandKreisKontur(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandKreisPolygon(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandKreisPunktListe(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandKonturKreis(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandKonturKontur(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandKonturPolygon(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandKonturPunktListe(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandPolygonKreis(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandPolygonKontur(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandPolygonPolygon(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandPolygonPunktListe(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandPunktListeKreis(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandPunktListeKontur(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandPunktListePolygon(Geometrie& g1, Geometrie& g2, std::vector& vd);

static double AbstandPunktListePunktListe(Geometrie& g1, Geometrie& g2, std::vector& vd);

Level 5,6,…: All functions of level 4 are calling some more static functions doing the specific prediction.

The problem occurs on level 3.

double CalculatorCheck::Abstand(Geometrie& g1, Geometrie& g2, std::vector& vd)

{

double r = 1.0e10;

if ((g1.T == Geometrie::Type::Kreis)&&(g2.T == Geometrie::Type::Kreis))

{

r = AbstandKreisKreis(g1,g2,vd);

}

if ((g1.T == Geometrie::Type::Kreis)&&(g2.T == Geometrie::Type::Kontur))

{

r = AbstandKreisKontur(g1,g2,vd);

}

if ((g1.T == Geometrie::Type::Kreis)&&(g2.T == Geometrie::Type::PolygonZug))

{

r = AbstandKreisPolygon(g1,g2,vd);

}

if ((g1.T == Geometrie::Type::Kreis)&&(g2.T == Geometrie::Type::PunktListe))

{

r = AbstandKreisPunktListe(g1,g2,vd);

}

}

But the same error also occurs when calling the functions one after another (which of course doesn’t make any sense … it’s only for clarification):

double CalculatorCheck::Abstand(Geometrie& g1, Geometrie& g2, std::vector& vd)

{

double r = 1.0e10;

r = AbstandKreisKreis(g1,g2,vd);

r = AbstandKreisKontur(g1,g2,vd);

r = AbstandKreisPolygon(g1,g2,vd);

r = AbstandKreisPunktListe(g1,g2,vd);

}

But everything compiles fine, without any problem, when calling only three of the level-4 functions, regardless of which three of the sixteen I choose.

(OK, again this restriction doesn’t make any sense; it’s only for clarification.)

double CalculatorCheck::Abstand(Geometrie& g1, Geometrie& g2, std::vector& vd)

{

double r = 1.0e10;

if ((g1.T == Geometrie::Type::Kreis)&&(g2.T == Geometrie::Type::Kreis))

{

r = AbstandKreisKreis(g1,g2,vd);

}

if ((g1.T == Geometrie::Type::Kreis)&&(g2.T == Geometrie::Type::Kontur))

{

r = AbstandKreisKontur(g1,g2,vd);

}

if ((g1.T == Geometrie::Type::Kreis)&&(g2.T == Geometrie::Type::PolygonZug))

{

r = AbstandKreisPolygon(g1,g2,vd);

}

}

It is worth mentioning that the same source code works fine if I compile the project with gcc/tbb (also on Linux) or with the Visual Studio compiler (on Windows).

Best wishes

Uli

Hi Mat,

Attached you will find a simplified example showing most of the problems we currently have.

Three cpp files and two h files.

And a script that shows how I compile it.

The first problem is the “unsupported method in acceleration part” message, which stops the compilation.

The second problem is the performance (same source code, but with the parameters “par” and “par_unseq” changed to “seq”).

It is worth mentioning that this source code runs on Windows with the Visual Studio compiler and on gcc/tbb (OK, “par_unseq” doesn’t have any impact there, it is treated like “par” … anyway).

The Windows part is running on a laptop with 8 cores. The serial prediction takes 192 s, while the parallel one takes 40 s, a factor of around 5 better. OK, a lot of data has to be transferred, so the theoretical limit of 8 (the number of cores) can’t be reached. But anyway, a really good result.

The Linux part is running on a GPU server, which is significantly more powerful hardware than my laptop.

2 GPUs:

NVIDIA GeForce RTX 3080 (10 GB RAM, 8,704 CUDA cores)

NVIDIA GeForce RTX 3080 (10 GB RAM, 8,704 CUDA cores)

Processor:

Intel Xeon W-2235 processor (12-core, 3.80-4.60 GHz)

Memory

64GB DDR4-RAM 2933MHz ECC Memory Reg. (4x16GB)

But compiled with gcc/tbb, it takes 132 s in serial mode and 208 s in parallel. First surprise: parallel mode is slower than serial. Next surprise: even the fastest mode on the much better hardware is a factor of 3 slower than my poor laptop.

We have seen the same behavior for similar programs that run on both gcc/tbb and NVC++. There, too, “par_unseq” mode was only a bit faster than “par”, but much slower than “seq”.

Only for an extremely reduced test (transferring a list of objects of ten “float” each, where each object performs billions of floating-point operations) have we seen the expected behavior: the parallel part is faster by about the number of cores, and the GPU part is faster by a factor of some hundreds or thousands.

So, what is wrong in my code?

Is it the way I use the parallel part?

Are some includes missing?

Is it the way I compile it?

Any advice is highly welcome. Thanks a lot in advance.

Best regards

Uli

TestNVC.7Z (6.02 KB)

Profile.7z (9.64 MB)

The only thing I needed to do to your code was to change “powl” to “pow” since we don’t have long double support on the device. This allowed it to compile for me.
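To make the powl-to-pow change concrete for readers, here is a minimal sketch. The helper powerLaw is a hypothetical name and is not part of the attached test program; the static_asserts only document which std::pow calls stay in double precision.

```cpp
#include <cmath>
#include <type_traits>

// Hypothetical helper that keeps the whole computation in double precision.
inline double powerLaw(double base, double exponent) {
    // std::pow with two double arguments returns double.
    return std::pow(base, exponent);
}

// Two double arguments: the result type is double.
static_assert(std::is_same<decltype(std::pow(2.0, 3.0)), double>::value,
              "pow(double, double) returns double");

// A single long double argument (note the L suffix) promotes the call to long double,
// which is the kind of arithmetic that powl performs.
static_assert(std::is_same<decltype(std::pow(2.0L, 3.0)), long double>::value,
              "pow(long double, double) returns long double");
```

So besides replacing powl itself, it may be worth checking for long double variables or literals with an L suffix that feed into such calls.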

Running on my V100 gets the following time:

% ./CudaTest
Zeit (seriell) t=207.912421s
Zeit (parallel / CPU) t=28.629315s
Zeit (parallel / GPU) t=28.676151s

Note that both par and par_unseq get offloaded, which is why the times are the same.

For a multicore CPU run (i.e. the flag -stdpar=multicore) I see the following times:

% time ./CudaTest
Zeit (seriell) t=188.972906s
Zeit (parallel / CPU) t=691.355047s
Zeit (parallel / GPU) t=715.606422s
1522.706u 78.034s 3:31.64 756.3%        0+0k 0+171488io 0pf+0w

Notice that the wall clock for “time” doesn’t match the time reported by the program? I suspect the timer you’re using is accumulating the time from all threads so the actual time is much lower. This might account for the difference you’re seeing.
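For readers who want to check this themselves, the sketch below measures the same piece of work with two clocks. It assumes, without having seen the timing code in the test program, that a processor-time clock such as std::clock might be in use there; Timing and measure are hypothetical names. On Linux, std::clock reports processor time for the whole process, summed over all threads, while std::chrono::steady_clock reports elapsed wall-clock time.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // processor time consumed by the whole process
};

// Runs the given callable once and reports both kinds of time.
template <class Work>
Timing measure(Work&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(wall_end - wall_start).count(),
            static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC};
}
```

With a multithreaded loop the cpu_seconds value can grow to several times the wall_seconds value, so a parallel run should be judged by the wall-clock figure.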

-Mat

Hi Mat,

Thanks a lot. Yes, with this little change from powl to pow the compiler problem is solved now. And the program is running well on our AI box.

But I can’t reproduce the effect of parallel processing you have shown.

150s for serial process and 163s for both parallel modes on our AI box.

So, to be sure that I have understood you correctly: did you also use this command to compile my code?

nvc++ -stdpar -O3 CudaTest.cpp ModellTest.cpp Profil.cpp -lstdc++ -lm -ldl -o CudaTest

Best regards

Uli

Above, I did not use “-lstdc++ -lm -ldl”, since those are implicitly added to the link and aren’t needed. But just in case, I re-ran with them, with no change. Note that I did update the code so it prints the times to the screen instead of just to the file, but that shouldn’t affect anything.

% nvc++ -stdpar -O3 CudaTest.cpp ModellTest.cpp Profil.cpp -lstdc++ -lm -ldl -o CudaTest -V21.7 -w
CudaTest.cpp:
ModellTest.cpp:
Profil.cpp:
% ./CudaTest
Zeit (seriell) t=188.346712s
Zeit (parallel / CPU) t=30.683846s
Zeit (parallel / GPU) t=28.610314s

I don’t have access to an RTX 3080, so I can’t check if the difference is due to hardware.

Hi Mat,

Thanks for the clarification. So next week I will update to the new compiler version 21.7 (currently we are using 21.2).

Best regards

Uli

Hi Mat,

(1)

We have now downloaded the latest available compiler version, i.e. 21.5. Unfortunately the performance problems are still the same as before with V21.2.

We are using it inside a Docker image from NVIDIA started that way:

(for compilation)

sudo docker run --gpus all -it --rm -v $(pwd):/host_pwd -w /host_pwd nvcr.io/nvidia/nvhpc:21.5-devel-cuda_multi-ubuntu20.04

(for delivery)

sudo docker run --gpus all -it --rm -v $(pwd):/host_pwd -w /host_pwd nvcr.io/nvidia/nvhpc:21.5-runtime-cuda11.0-ubuntu20.04

· Is this the correct way to start it?

· Or are there better options?

· Or is it better to work without a Docker image? (To be honest, we have tried to install NVC++ directly, without Docker, but it was never successful … too many things have to line up for that.)

The CUDA drivers on our AI box are V11.1, and this version is needed for our AI environment. So we would prefer not to change it.

(2)

Is it possible for you to run my test program on an RTX 3080 as well? To be sure that it is not a hardware problem …

Best regards

Uli

Since we focus on HPC, we only test on the Tesla products (V100, A100), so I don’t have an RTX 3080 to test. So while this could be a contributing factor, I doubt it. Given that your times are basically the same between the serial and parallel runs, it seems more likely that the code isn’t actually running on the GPU.

I do use our container on occasion when running on our Selene cluster, but unfortunately don’t have enough experience here to say if you’re using it correctly. Though, your commands match those we have in our docs (NVIDIA HPC SDK | NVIDIA NGC), so I’m going to assume that they’re fine.

Are you able to run interactively? If so, then it would be interesting to see if you can reproduce the issue (i.e. it’s not an issue with using the ‘runtime’ image), and if the GPUs are being recognized (i.e. run the ‘nvaccelinfo’ utility)