Class pointer crash in release mode only

I have a host caller class with a callback that creates a pointer to a class instance on the GPU. Inside this callback, a kernel is launched to construct the instance on the device; memory for the class pointer is properly allocated on the GPU beforehand.

Next, a method on this class is called via a second kernel and the result is copied back to the host.

All of this seems to work fine in debug mode but crashes in release mode. It occurs on both Windows and Linux, and it happens whether I compile the two .cu files as part of one executable or build them as a static (or even dynamic) library plus an executable that links against it.

The only way I can make it work in release mode is to bypass the caller class when launching the kernels, but unfortunately that is not an option for my application. I would assume I am not doing anything wrong in terms of GPU memory management, since everything works fine in debug mode.

I have come up with a simple example that replicates the problem, and I am including the various pieces of code below. To reproduce the issue, compile both .cu files on Windows or Linux into a single executable. Alternatively, compile StaticLib.cu as a static library and Caller.cu as the main program, linking against StaticLib.

Thank you very much in advance for your help and best regards.

function.cuh

#pragma once

#include "Caller.h"   // provides pointer_base, used by pointer_derived below

class function {

public:

    __device__ function() {}
    virtual __device__ ~function() {}

    virtual __device__ double calc(const double& x) const = 0;
};

class pointer_derived : public pointer_base
{
    function** m_p;

public:

    pointer_derived(function** p) : m_p(p) {}

    operator function** () { return m_p; }
};

StaticLib.h

#pragma once

#include "Caller.h"

double run_function(Caller* f, const double& x, bool use_local);

StaticLib.cu

#include "StaticLib.h"
#include "function.cuh"

class myfunction_local : public function {

public:

    __device__ myfunction_local() {}
    __device__ ~myfunction_local() {}

    __device__ double calc(const double& x) const
    {
        return x * x;
    }

};

__global__ void invoke_function(double* p, function** func, double x)
{
    *p = (*func)->calc(x);
}

__global__ void invoke_local_builder(function** ptr)
{
    *ptr = new myfunction_local();
}

double run_function(Caller* c, const double& x, bool use_local)
{
    function** func_p;

    cudaError_t cudaStatus = cudaMalloc(&func_p, sizeof(function*));

    pointer_derived ptr(func_p);

    if (use_local)
        invoke_local_builder<<<1,1>>>(ptr);
    else
        c->get_function(ptr);

    double* p;

    cudaStatus = cudaMalloc(&p, sizeof(double));

    invoke_function<<<1,1>>>(p,func_p,x);

    double res;

    cudaMemcpy(&res, p, sizeof(double), cudaMemcpyDeviceToHost);

    return res;
}
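
A minimal error-checking sketch, in case it is useful when reproducing this: the launches above do not check for errors, so a release-mode failure can pass silently. The check_cuda helper below is purely illustrative and only uses the standard cudaGetLastError / cudaDeviceSynchronize / cudaGetErrorString runtime calls.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper: report launch/configuration errors and any error
// raised while previously launched kernels were executing on the device.
static void check_cuda(const char* where)
{
    cudaError_t err = cudaGetLastError();          // launch / configuration errors
    if (err != cudaSuccess)
        std::fprintf(stderr, "%s: launch error: %s\n", where, cudaGetErrorString(err));

    err = cudaDeviceSynchronize();                 // errors from kernel execution
    if (err != cudaSuccess)
        std::fprintf(stderr, "%s: execution error: %s\n", where, cudaGetErrorString(err));
}

// Usage inside run_function, right after each launch, e.g.:
//     invoke_function<<<1,1>>>(p, func_p, x);
//     check_cuda("invoke_function");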

Caller.h

#pragma once

class pointer_base {

public:

    pointer_base() {}
    virtual ~pointer_base() {}

};

class Caller {

public:

    Caller() {}
    ~Caller() {}

    void get_function(pointer_base& ptr);

};

Caller.cu

#include "Caller.h"
#include "function.cuh"

class myfunction : public function {

public:

    __device__ myfunction() {}
    __device__ ~myfunction() {}

    __device__ double calc(const double& x) const
    {
        return x * x;
    }

};


__global__ void invoke_builder(function** ptr)
{
    *ptr = new myfunction();
}


void Caller::get_function(pointer_base& func_p)
{
    function** ptr = dynamic_cast<pointer_derived&>(func_p);
    invoke_builder<<<1,1>>>(ptr);

}

#include <iostream>
#include "StaticLib.h"

int main()
{
    Caller caller;

    // use_local = true  - works in both release and debug
    // use_local = false - crashes in release mode but works in debug mode

    bool use_local = true; 

    double res = run_function(&caller, 2.0, use_local);

    std::cout << res << std::endl;

}

Hi regisvan1,

Can you please provide additional information, such as the compiler and compiler version you’re using, the compiler flags being used, and the GPU you’re targeting?

I just tried your code with the nvcc 12.2 that we ship with the 23.9 NVHPC SDK, using “-O3” and targeting an H100 on Linux x86_64. It seemed to compile and run correctly.

% nvcc Caller.cu StaticLib.cu -O3 -gencode arch=compute_90,code=sm_90
% a.out
4

Thanks,
Mat

Hi Mat,

Thank you for getting back to me. If I use the same command as you on Linux (Ubuntu WSL2), the debug build runs but does not work either and returns 0 instead of 4:

nvcc StaticLib.cu Caller.cu -I. -I…/StaticLib -G -gencode arch=compute_90,code=sm_90

./a.out

0

If I do not specify gencode, only debug works:

nvcc StaticLib.cu Caller.cu -I. -I…/StaticLib -G
./a.out
4

In both cases, the release build does not work; it hangs or crashes.

Here is nvcc version:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

Thank you again for your help,

Regis

Also please note that you have to change the flag to false to get the crash:

bool use_local = true; // This does not crash

whereas this crashes in release mode:

bool use_local = false; // This crashes only in release mode

Both perform the same task, except that one goes through a class, which is what my use case requires.

Thank you.

What device are you using? Does the code work if you set the code generation to match the compute capability of your device?

I’m using an H100, which has compute capability 9.0, so that may not be suitable for your device. Without setting the “gencode” flag, nvcc will default to cc52, which will work on older devices.
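
If you’re not sure which compute capability your device has, you can run deviceQuery from the CUDA samples, or query it with a few lines of standard runtime API (just a sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }
    // prop.major / prop.minor map to -gencode arch=compute_XY,code=sm_XY
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}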

Sorry, I missed that and can now recreate the hang.

Looks like the code needs to use the device linker, nvlink, in order to resolve the Caller function. The code works for me after adding “-rdc=true”, which links the device code.

% nvcc Caller.cu StaticLib.cu -rdc=true -O3 -gencode arch=compute_90,code=sm_90; a.out
4

Hi Mat,

Thank you, this is very useful but there is still an issue with this.

I changed the implementation of the function slightly to return (x+0.2) * x when use_local = false and (x+0.3) * x when use_local = true.

Everything behaves as expected when building the two .cu files as part of a single executable, or as an executable plus a static library, but not when trying to use a shared library.

When using a dynamic library, the program returns the result of the local function even if use_local=false, which is really puzzling.

This is what I tried:

Executable

nvcc -rdc=true -c Caller.cu -o cuda64/release/Caller.o -O3
nvcc -rdc=true -c StaticLib.cu -o cuda64/release/StaticLib.o -O3
nvcc -rdc=true -o exe cuda64/release/Caller.o cuda64/release/StaticLib.o -O3
./exe
=> 4.4 <= which is correct

Executable with separate compilation/linking of device code

This is based on the information in this link: https://developer.nvidia.com/blog/separate-compilation-linking-cuda-device-code/

nvcc -rdc=true -c Caller.cu -o cuda64/release/Caller.o -O3
nvcc -rdc=true -c StaticLib.cu -o cuda64/release/StaticLib.o -O3
nvcc -dlink cuda64/release/Caller.o cuda64/release/StaticLib.o -o cuda64/release/Caller_link.o
g++ -o exe2 cuda64/release/Caller.o cuda64/release/StaticLib.o cuda64/release/Caller_link.o -O3 -lcudart_static
./exe2
=> 4.4 <= which is correct

Executable and static library

nvcc -rdc=true -c StaticLib.cu -o cuda64/release/StaticLib.o -O3
nvcc -rdc=true -c Caller.cu -o cuda64/release/Caller.o -O3
nvcc -rdc=true --lib -o libStaticLib.a cuda64/release/StaticLib.o
nvcc -rdc=true -o static_exe cuda64/release/Caller.o -O3 -L. -lStaticLib
./static_exe
=> 4.4 <= which is correct

Executable and dynamic library

nvcc --compiler-options '-fPIC' -rdc=true -c StaticLib.cu -o cuda64/release/StaticLib.o -O3
nvcc -rdc=true -c Caller.cu -o cuda64/release/Caller.o -O3
nvcc -rdc=true --shared -o libDynamicLib.so cuda64/release/StaticLib.o
nvcc -rdc=true -o dyn_exe cuda64/release/Caller.o -O3 -L. -lDynamicLib
LD_LIBRARY_PATH=. ./dyn_exe
=> 4.6 <= which is not right

Executable and static library with separate device code compilation/linking

Again, this is based on the information here: https://developer.nvidia.com/blog/separate-compilation-linking-cuda-device-code/

nvcc -rdc=true -c StaticLib.cu -o cuda64/release/StaticLib.o -O3
nvcc -rdc=true -dlink cuda64/release/StaticLib.o -o cuda64/release/StaticLib_link.o -O3
rm -f libStaticLib.a
ar rvs libStaticLib.a cuda64/release/StaticLib.o cuda64/release/StaticLib_link.o
nvcc -rdc=true -c Caller.cu -o cuda64/release/Caller.o -O3
nvcc -dlink cuda64/release/Caller.o -o cuda64/release/Caller_link.o -O3
g++ -o static_exe2 cuda64/release/Caller.o cuda64/release/Caller_link.o -O3 -L. -lStaticLib -lcudart_static
./static_exe2
=> 4.6 <= which is not right

What is even more puzzling is that if I remove the implementation of the local function and remove the flag, I get 4 as a result, with behavior that seems to be as follows:

(x+0.2)*x returns x*x = 4 instead of 4.4
(x+0.49)*x returns x*x = 4 instead of 4.98
(x+0.5)*x returns (x+0.5)*x = 5 as expected.

New code for Caller.cu and StaticLib.cu is below:

Caller.cu

#include "Caller.h"
#include "function.cuh"

class myfunction : public function {

public:

    __device__ myfunction() {}
    __device__ ~myfunction() {}

    __device__ double calc(const double& x) const
    {
        return (x+0.2) * x; // Changed to return something different
    }

};


__global__ void invoke_builder(function** ptr)
{
    *ptr = new myfunction();
}


void Caller::get_function(pointer_base& func_p)
{
    function** ptr = dynamic_cast<pointer_derived&>(func_p);
    invoke_builder<<<1,1>>>(ptr);

}

#include <iostream>
#include "StaticLib.h"

int main()
{
    Caller caller;

    bool use_local = false;

    double res = run_function(&caller, 2.0, use_local);

    std::cout << res << std::endl;

}

StaticLib.cu

#include "StaticLib.h"
#include "function.cuh"

class myfunction_local : public function {

public:

    __device__ myfunction_local() {}
    __device__ ~myfunction_local() {}

    __device__ double calc(const double& x) const
    {
        return (x+0.3) * x; // Changed local function to return something different
    }

};


__global__ void invoke_function(double* p, function** func, double x)
{
    *p = (*func)->calc(x);
}

__global__ void invoke_local_builder(function** ptr)
{
    *ptr = new myfunction_local();
}

double run_function(Caller* c, const double& x, bool use_local)
{
    function** func_p;

    cudaError_t cudaStatus = cudaMalloc(&func_p, sizeof(function*));

    pointer_derived ptr(func_p);

    if (use_local)
        invoke_local_builder<<<1,1>>>(ptr);
    else
        c->get_function(ptr);

    double* p;

    cudaStatus = cudaMalloc(&p, sizeof(double));

    invoke_function<<<1,1>>>(p,func_p,x);

    double res;

    cudaMemcpy(&res, p, sizeof(double), cudaMemcpyDeviceToHost);

    return res;
}

Hi Regis,

On my H100, I see correct results but can reproduce your erroneous results on an A100 and V100.

What appears to be happening is that in the failing case the “calc” in StaticLib.cu is being bound, while in the passing case the one in Caller.cu is bound.
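
One way to confirm which implementation ends up being invoked is to add a device-side printf to each calc override (sketch only, based on the two classes in your Caller.cu and StaticLib.cu; device-side printf needs <cstdio>):

// In Caller.cu (myfunction):
__device__ double calc(const double& x) const
{
    printf("calc from Caller.cu (myfunction)\n");
    return (x + 0.2) * x;
}

// In StaticLib.cu (myfunction_local):
__device__ double calc(const double& x) const
{
    printf("calc from StaticLib.cu (myfunction_local)\n");
    return (x + 0.3) * x;
}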

I support the NVHPC compilers (nvc, nvc++, and nvfortran) and can usually help with general CUDA C questions. However, explaining why this is occurring is beyond my area of expertise.

I’d suggest posting the follow-up question over on the CUDA forum to see if they can help.

-Mat

Thank you very much for your help, Mat. I have done as you recommended.