NVCC creates host symbols for device functions after explicit class template instantiation

I have a template class which I instantiate explicitly. In this case, nvcc creates host symbols for __device__ functions in the object file. Other host code can then link against these symbols, making the program exit with return code 1 immediately.

A reproducer is this test.cu file:

#include <cuda.h>

template <class T> class c
{
public:
  inline int test();
  __device__ int test2();
};

template <class T> int c<T>::test()
{
  return 0;
}

template <class T> __device__ int c<T>::test2()
{
  return 0;
}

template class c<int>;

Compiled with CUDA 12.6.1 via: nvcc -c test.cu -o test.o

Running nm -C -g test.o, I get:

                 U atexit
                 U __cudaInitModule
                 U __cudaRegisterFatBinary
                 U __cudaRegisterFatBinaryEnd
                 U __cudaUnregisterFatBinary
                 U exit
                 U __stack_chk_fail
0000000000000000 W c<int>::test()
0000000000000000 W c<int>::test2()

The test() function should be present as a host symbol, but test2() should not be.


Actually, after checking a bit, I see the same behavior for free functions as well, not just class member functions, as long as they are not defined inline.
E.g. __device__ int foo() { return 0; } creates a host symbol, which will exit the application with return code 1 when called.
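
For illustration, the pattern I am seeing (compiled with the same nvcc command as above):

#include <cuda.h>

__device__ int foo() { return 0; }         // host symbol appears in the object file
__device__ inline int bar() { return 0; }  // no host symbol observed
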
I assume this is intended. Is there a way to disable it?

Not sure what you mean by that. You haven’t actually provided a complete application that I can see here. Do you mean it will exit if you call the __device__ function from host code?

Because in typical usage, that would be a compile error.
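
For example, nvcc rejects a direct call like this at compile time:

#include <cuda.h>

__device__ int foo() { return 0; }

int main()
{
  return foo(); // error: calling a __device__ function from a __host__ function is not allowed
}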

Hi Robert,

sorry if my post was not fully clear. To give a bit more context: sure, calling the device function directly from host code gives a compile error. But in our case the problem arises when linking together object files or libraries that use the same function names for device and for host code.

Here are two full examples that fail: one for the template case using static linking, and one for the non-template case using shared libraries.

For convenience, I have also uploaded all files as a tarball here: link

Example 1:
main.cpp:

template <class T> class c
{
public:
  int foo();
};

int main(int, char**)
{
  c<int> x;
  x.foo();
  return 0;
}

test.cu:

#include <cuda.h>

template <class T> class c
{
public:
  __device__ inline int foo();
};

template <class T> __device__ inline int c<T>::foo()
{
  return 0;
}

template class c<int>;

test2.cpp:

#include <cstdio>

template <class T> class c
{
public:
  inline int foo();
};

template <class T> inline int c<T>::foo()
{
  printf("foo from test2.cpp\n");  
  return 0;
}

template class c<int>;

Compilation and execution: test.sh:

nvcc -c test.cu -o test.o --compiler-bindir gcc-13 -O0 -g
c++ -c test2.cpp -o test2.o -O0 -g
c++ -c main.cpp -o main.o -O0 -g
c++ -o test main.o test.o test2.o -L /opt/cuda/lib64 -lcuda -lcudart -g
./test
echo RETURN VALUE $?

Now, when I run it, I get:

qon@qon ~/tmp4/example_1 $ ./test.sh 
RETURN VALUE 1

And expected would be:

foo from test2.cpp
RETURN VALUE 0

and with nm I get:

qon@qon ~/tmp4/example_1 $ nm -C -U -g test.o
0000000000000000 W c<int>::foo()

Example 2
main.cpp:

int foo();

int main(int, char**)
{
  foo();
  return 0;
}

test.cu:

#include <cuda.h>

__device__ int foo()
{
  return 0;
}

test2.cpp:

#include <cstdio>

int foo()
{
  printf("foo from test2.cpp");  
  return 0;
}

test.sh

nvcc test.cu --shared -o libtest.so --compiler-bindir gcc-13 -O0 -g
c++ --shared test2.cpp -o libtest2.so -O0 -g
c++ main.cpp -o test -O0 -g -L. -ltest -ltest2 -L /opt/cuda/lib64 -lcuda -lcudart
LD_LIBRARY_PATH+=:. ./test
echo RETURN VALUE $?

Here too, the host symbol generated for the device function is called, exiting with return value 1:

qon@qon ~/tmp4/example_2 $ ./test.sh 
RETURN VALUE 1

And from nm I get:

qon@qon ~/tmp4/example_2 $ nm -C -U -g libtest.so
000000000000ac00 T foo()

So in both cases, nvcc created a host-visible symbol for the device function. I disassembled it, and it essentially calls exit() with return code 1, which matches exactly what happens at runtime. When I compile the same code with clang for CUDA instead of nvcc, it behaves as expected.
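
In other words, the host shadow nvcc generates behaves conceptually like this (a sketch based on the disassembly, not the literal generated code):

#include <cstdlib>

// conceptual equivalent of the host shadow emitted for a __device__ function
int foo() { std::exit(1); }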

That’s not permissible in CUDA.

Fair enough, but I am having the problem with different .so files, one of which is not even CUDA. Now, I agree that two identical host functions in two different non-CUDA shared libraries would also collide. But I would not have expected a device function in one library to affect host linking of another library.

Also, the behavior is not consistent: nvcc creates these host symbols for non-inline functions, but not for inline functions. Yet it suddenly creates them for inline template member functions if the class is explicitly instantiated. I don't really understand the purpose of this symbol. I would rather have linking fail when the host symbol is missing than link against something that just exits the application. It actually took me some time to understand what was going on.

A function that can be and is inlined does not behave like an ordinary function in numerous respects, so it doesn't surprise me at all that the symbol disappears in this case.

I suspect that the underlying behavior is arising from needs of the CUDA compilation trajectory. I won’t be able to explain every nuance.

You’re welcome to file a bug. It wouldn’t surprise me if the bug is not actionable due to some specific need of the CUDA compilation trajectory.

Thanks for the explanation. For my use case, I have started to use objcopy -L and a linker version script to demote all symbols except the ones I want to keep to LOCAL, which fully solves the problem for me.
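
For reference, a sketch of the approach (the symbol names and export pattern are illustrative, not my actual script):

# localize the colliding shadow symbol in the object file
# (c<int>::foo() mangles to _ZN1cIiE3fooEv with the GNU ABI)
objcopy -L _ZN1cIiE3fooEv test.o

# for the shared-library case: an illustrative version script that
# defaults everything to local and exports only the real host API
cat > libtest.map << 'EOF'
{ global: my_api_*; local: *; };
EOF
nvcc test.cu --shared -o libtest.so -Xlinker --version-script=libtest.map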

I am still a bit concerned about the inline functions becoming global symbols after explicit template instantiation.
I’ll file a bug report for this as you indicated.

This maps to ticket ID 4853553. We will post the conclusion here once it has been resolved internally.

Due to implementation constraints, ‘shadow’ __device__ functions and function template definitions are present in the code sent to the host compiler. This is required for some corner cases, e.g. when the address of a __device__ function initializes a template default argument (templates are usually preserved in the code sent to the host compiler).
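
A sketch of the kind of corner case meant here (illustrative only):

#include <cuda.h>

__device__ int dev_fn() { return 0; }

// Templates are preserved in the code sent to the host compiler, so the
// host compiler also sees this default argument; it therefore needs a
// (shadow) declaration of dev_fn -- which is why the shadow exists.
template <int (*F)() = dev_fn>
struct user_of { };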

As a result, a program cannot have the same symbol defined as __host__ in one translation unit and __device__ in another; i.e., the execution space annotations (__host__ / __device__) must match for all declarations of the symbol across translation units.

One workaround is to give the device functions internal linkage, by adding ‘static’ or placing them inside an unnamed namespace.
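
Applied to the second example above, that looks like this (sketch):

#include <cuda.h>

// internal linkage: no host-visible shadow symbol is exported, so the
// name can no longer collide with the host foo() in test2.cpp
static __device__ int foo()
{
  return 0;
}

// placing it in an unnamed namespace achieves the same:
// namespace { __device__ int foo() { return 0; } }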

Hope this helps. We are closing this as Not a Bug.