Debugging "invalid device function" error (compiled for correct arch)

My card definitely supports the arch I compiled for, sm_70. I am working on a large scientific program that uses relocatable device code (built under CMake), and I have not encountered this issue before. There is one small kernel that relies on an automatically generated device move constructor (from “= default”). Upon launch, cudaLaunchKernel returns error 98, i.e. “invalid device function”.

I’m really confused by this. If every source file was compiled for sm_70, how could I possibly get this error at run time? No other kernels seem to suffer from this. I’m stumped on how to debug this.

This other post investigated the generated fatbin files to verify that code for the correct arch was produced. I’ve done the same, and sm_70 is indeed what’s in the shared library I built:

thebeast% cuobjdump -ptx lib/libopenmc.so

Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

OK, nice. Now I dump the SASS for the whole shared library I’ve created. A related kernel that I’m NOT running into the issue with shows up fine. Here are its first few lines:

        Function : _ZN6openmc30run_move_constructor_on_deviceINS_19IncoherentElasticXSEEEvPT_
    .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;                    /* 0x00000a0000017a02 */
                                                                             /* 0x003fde0000000f00 */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;             /* 0x000000fffffff389 */
                                                                             /* 0x000fe200000e00ff */
        /*0020*/                   IADD3 R1, R1, -0x20, RZ ;                 /* 0xffffffe001017810 */
                                                                             /* 0x003fde0007ffe0ff */
        /*0030*/                   S2R R0, SR_LMEMHIOFF ;                    /* 0x0000000000007919 */
                                                                             /* 0x00321e0000003700 */
        /*0040*/                   ISETP.GE.U32.AND P0, PT, R1, R0, PT ;     /* 0x000000000100720c */
                                                                             /* 0x003fde0003f06070 */
        /*0050*/               @P0 BRA 0x70 ;                                /* 0x0000001000000947 */
                                                                             /* 0x003fde0003800000 */
        /*0060*/                   BPT.TRAP 0x1 ;                            /* 0x000000040000795c */
                                                                             /* 0x003fde0000300000 */
        /*0070*/                   IADD3 R0, R1, RZ, RZ ;                    /* 0x000000ff01007210 */

So, the one I’m looking for is run_move_constructor_on_device<IncoherentInelasticAE>. Grepping with a regex that should match its mangled name has not turned up anything. What’s strange is that the template instantiation that generated the SASS above (run_move_constructor_on_device<IncoherentElasticXS>) is in the same source file where run_move_constructor_on_device<IncoherentInelasticAE> should also be instantiated. A possible compiler bug?
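As a complementary check, I can ask the runtime on the host side whether it even has device code for this particular instantiation before launching it. This is just a sketch (it assumes the header declaring the kernel template and IncoherentInelasticAE is included); cudaFuncGetAttributes should report “invalid device function” when no device code exists for the instantiation on the current device:

// Host-side sanity check (sketch): query attributes of the exact
// instantiation. If the fatbin holds no device code for it, this returns
// cudaErrorInvalidDeviceFunction without launching anything.
cudaFuncAttributes attr;
cudaError_t err = cudaFuncGetAttributes(&attr,
  reinterpret_cast<const void*>(
    &run_move_constructor_on_device<IncoherentInelasticAE>));
if (err != cudaSuccess) {
  printf("no device code for this instantiation: %s\n", cudaGetErrorString(err));
}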

Here’s what the kernel looks like.

// This sets up vtables in device memory
template<typename T>
__global__ void run_move_constructor_on_device(T* __restrict__ dist)
{
  // TODO: it may be faster to allocate a buffer in memory here.
  static_assert(std::is_move_constructible<T>::value,
    "Polymorphic objects to be put on device must be move constructible.");
  char buffer[sizeof(T) + alignof(T)];
  char* aligned_buffer =
    buffer + alignof(T) - reinterpret_cast<intptr_t>(buffer) % alignof(T);
  T* tmp = new (aligned_buffer) T(std::move(*dist));
  new (dist) T(std::move(*tmp));
}

Yes, I know that polymorphism is not optimal here, but all this kernel does is set up the vtables on the device. I could also share some details on what “T” is, but I’m not sure it would help. Source code is here. Unfortunately it’s large, but I can provide guidance as necessary.
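For a rough picture of how this gets used, here is a simplified sketch of one plausible host-side flow (the helper name is made up and the details are illustrative, not the exact code in the repo): an already-constructed host object’s bytes are copied into device memory, and the kernel is launched once so the move constructor re-runs in device code and the vptr ends up pointing at the device-side vtable.

// Hypothetical helper illustrating the pattern; assumes the kernel template
// above is visible. Not the actual implementation from the repo.
template<typename T>
void copy_to_device_and_fix_vtable(const T& host_obj, T* dev_obj)
{
  // raw byte copy: the vptr inside dev_obj still refers to the host vtable
  cudaMemcpy(dev_obj, &host_obj, sizeof(T), cudaMemcpyHostToDevice);

  // run the move constructor in device code so the vptr becomes the
  // device-side one
  run_move_constructor_on_device<<<1, 1>>>(dev_obj);
  cudaDeviceSynchronize();
}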

PS: I need to push some new commits to the repo to reproduce this error.

thebeast% nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

As a first step, you should create a minimal code example which shows the issue. The kernel looks fine.

Hi @gavin.keith.ridley! I solved my problem by linking cudart into one of our .so libraries in addition to the libraries it was already linked into. Do you link cudart into all binaries that have CUDA code? Is run_move_constructor_on_device defined in a header file or a .cpp? Can you try the latter?
One issue I see with your code is that you don’t convert to void* inside placement new. Can you try void* aligned_buffer and new (reinterpret_cast<void*>(dist)) T(std::move(*tmp)); ?
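Something along these lines, just a sketch of the adjustment applied to your snippet:

// Sketch: keep the aligned pointer as void* and pass void* to placement new.
void* aligned_buffer =
  buffer + alignof(T) - reinterpret_cast<intptr_t>(buffer) % alignof(T);
T* tmp = new (aligned_buffer) T(std::move(*dist));
new (reinterpret_cast<void*>(dist)) T(std::move(*tmp));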
Another problem is that you seem to pass a host pointer dist to the kernel, and then move-construct something at that pointer. Are you sure that unified virtual memory is enabled? Or is dist a device pointer?
Finally, is it called in static initialization or destruction? CUDA hits undefined behavior when its functions are called or kernels are launched at those stages in the program.
What happens if you step-into this kernel in cuda-gdb?

Thanks Striker, but sometimes creating an MWE takes as much time as solving the problem itself. I’m unsure what the underlying cause might be, so isolating it into an MWE is something I was hoping to do after getting some guidance from others…

Thanks Serge for the tips! I did try linking in cudart, but it didn’t do the trick. Darn.

run_move_constructor_on_device is a template function in a header being instantiated in a source file.

I should indeed be converting to void* in placement new, thanks. That said, I have instantiated this template for many other classes without any issues, so I’m hesitant to think this is the root cause here.

dist is a device pointer from plain old cudaMalloc, so no worries there.

This is not called in the initialization or destruction of a static. If it were, the CUDA runtime would instead give a specific error reporting that it is not running, which I have run into before.

Stepping into this kernel with cuda-gdb is a fantastic idea… Going to try that!

OK, I have identified what I believe is a compiler bug… and here’s my fix for it. It’s extremely surprising: a simple preprocessor macro seems to determine whether device code gets generated for the template instantiation.

So, the template gets instantiated in this block of code:

ThermalData::ThermalData(hid_t group)
{
#ifdef __CUDA_ARCH__
  printf("WTF???\n");
  __trap();
#else

  // Coherent/incoherent elastic data
  if (object_exists(group, "elastic")) {
    // Read cross section data
    hid_t elastic_group = open_group(group, "elastic");

    // Read elastic cross section
    elastic_.xs = read_function(elastic_group, "xs");

    // Read angle-energy distribution
    hid_t dgroup = open_group(elastic_group, "distribution");
    std::string temp;
    read_attribute(dgroup, "type", temp);
    if (temp == "coherent_elastic") {
      auto xs = dynamic_cast<CoherentElasticXS*>(elastic_.xs.get());
      elastic_.distribution = make_unique<CoherentElasticAE>(*xs);
    } else {
      if (temp == "incoherent_elastic") {
        elastic_.distribution = make_unique<IncoherentElasticAE>(dgroup);
      } else if (temp == "incoherent_elastic_discrete") {
        auto xs = dynamic_cast<Tabulated1D*>(elastic_.xs.get());
        elastic_.distribution =
          make_unique<IncoherentElasticAEDiscrete>(dgroup, xs->x());
      }
    }

    close_group(elastic_group);
  }

  // Inelastic data
  if (object_exists(group, "inelastic")) {
    // Read type of inelastic data
    hid_t inelastic_group = open_group(group, "inelastic");

    // Read inelastic cross section
    inelastic_.xs = read_function(inelastic_group, "xs");

    // Read angle-energy distribution
    hid_t dgroup = open_group(inelastic_group, "distribution");
    std::string temp;
    read_attribute(dgroup, "type", temp);
    if (temp == "incoherent_inelastic") {
      inelastic_.distribution = make_unique<IncoherentInelasticAE>(dgroup);
    } else if (temp == "incoherent_inelastic_discrete") {
      auto xs = dynamic_cast<Tabulated1D*>(inelastic_.xs.get());
      inelastic_.distribution =
        make_unique<IncoherentInelasticAEDiscrete>(dgroup, xs->x());
    }

    close_group(inelastic_group);
  }
#endif
}

There’s no need to understand what’s going on here beyond the fact that ThermalData::ThermalData is host-only code, and that these make_unique<> calls end up instantiating that templated kernel and then launching it.

The REALLY CRAZY behavior I’ve observed is that simply removing the macro-based sanity check at the top of the constructor causes the CUDA compiler to generate the device code I expect! This is totally unexpected, because __CUDA_ARCH__ should never be set to anything in a host-only function…
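To make the pattern concrete, here is a stripped-down sketch of the structure as a hypothetical standalone reproducer (the Thing class and everything around it are made up; only the kernel template mirrors my real code). The comments note why the guard might matter: nvcc also preprocesses this translation unit for the device compilation pass with __CUDA_ARCH__ defined, so the guarded #else branch, the only place the kernel template gets instantiated, would not be seen by that pass.

// repro.cu -- hypothetical sketch of the structure, not the real code
#include <cstdio>
#include <new>
#include <utility>

struct Thing {
  virtual ~Thing() = default;  // polymorphic, like the real distributions
  Thing() = default;
  Thing(Thing&&) = default;    // defaulted move constructor, as in my case
  int x = 0;
};

// in the real code this kernel template lives in a header
template<typename T>
__global__ void run_move_constructor_on_device(T* dist)
{
  T tmp(std::move(*dist));
  new (static_cast<void*>(dist)) T(std::move(tmp));
}

// Host-only function guarded the way my constructor was. During the device
// compilation pass __CUDA_ARCH__ is defined, so the #else branch -- the only
// place the kernel template is instantiated -- is hidden from that pass.
void setup(Thing* dev_ptr)
{
#ifdef __CUDA_ARCH__
  // (sanity check in the real code: printf + __trap())
#else
  run_move_constructor_on_device<<<1, 1>>>(dev_ptr);
  cudaDeviceSynchronize();
#endif
}

int main()
{
  Thing* dev_ptr;
  cudaMalloc(&dev_ptr, sizeof(Thing));
  cudaMemset(dev_ptr, 0, sizeof(Thing));  // contents don't matter for this test
  setup(dev_ptr);
  std::printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(dev_ptr);
}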

Great. Now you could try to create a standalone reproducer; you will be asked for one when reporting this bug.

(Personally I have doubts that this is the root cause of the error. But it’s hard to tell without the full code)