cudaErrorInvalidDeviceFunction

Hi,

I am stuck on a cuda runtime error that I haven’t been able to figure out. I hope someone can point me in the right direction!

The error is:

CUDA error at ./src/beamform/SetBform.cu:127 code=98(cudaErrorInvalidDeviceFunction) “cudaPeekAtLastError()”

And I see from the online documentation the meaning of the error code:

The requested device function does not exist or is not compiled for the proper device architecture.

So I tried to verify that I compiled for the correct architecture. Here is nvidia-smi output on my laptop:

joe@westward$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
8.6

And output of cuobjdump of my executable:

joe@westward$ cuobjdump -lelf blueberry
ELF file 1: lto.sm_86.cubin
ELF file 2: blueberry.2.sm_86.cubin

I don’t know how to verify that assembly for the kernel I am calling exists in the executable, but the whole thing links so it seems to me it should be in there. I also tried to create a minimum example that reproduces the error, but when I write standalone code that calls my kernel it runs just fine. The error also exists if I comment out all the device code inside the kernel and just call an empty function. If I comment out the kernel call completely, the program runs and calls to other kernels seem to work as expected. This problem occurs both on my laptop and on a Jetson Orin AGX.

Thanks in advance for any help you can provide. Please let me know what other information I could provide that would be helpful in debugging this problem.

  • Joe

Jetson Orin AGX is compute capability 8.7. I’m not sure, but I wouldn’t expect an cc8.6 object to run on it.

You’re on your way to creating a minimal reproducible example. Continue to remove things from your main application until the problem disappears. That alone may instruct you as to the issue.

Note that the error code covers essentially 2 cases:

  • the requested device function does not exist
  • not compiled for the proper device architecture

Based on the info shared so far, the 2nd case doesn’t seem to be at issue here. The first case is harder to track down. Basically, a device function can fail to load if there is some other issue preventing it to load. The compilation and linking occur correctly, but a problem arises at runtime loading of the function.

I haven’t seen this very often but have seen it in a few cases where the kernel in question references a symbol that triggers a load failure. For example a __device__ symbol, statically allocated, which exceeds a limit.

I doubt that is the case here, so I’m only mentioning it as a general idea. I’m not suggesting that absolutely this is a device symbol issue. I don’t really know what the issue is, but that is as far as I can get based on what you’ve indicated.

The only other thing I can suggest to move it along is to continue to try to create a minimal example that can be shared. In the process of doing that, you may discover the root of the issue yourself.

You can do this also with cuobjdump tool. May want to read the documentation for it, or try cuobjdump -sass blueberry

Thank you @Robert_Crovella for pushing me to put more effort into a minimal reproducible example. The result follows. I now see that the problem occurs when I try to preallocate a pool of memory with global scope. What is your recommendation for how to organize this program to work around the error?

#include <cuda.h>
#include <helper_cuda.h>

template<class T, const unsigned int kCnt> class Pool {
 public:
  Pool() : top(kCnt-1) {
    for (unsigned int ii = 0; ii < kCnt; ii++) {
      index_stack[ii] = ii;
    }
  }

  T* get() {
    int return_index = index_stack[top];
    index_stack[top--] = -1;
    return &pool[return_index];
  }

  void release(T* t) {
    auto release_index = static_cast<int>(t-pool);
    index_stack[++top] = release_index;
  }

 private:
  int top;
  int index_stack[kCnt];
  T pool[kCnt];

};

__global__ void myKernel() {}

class MyClass {
 public: 
  MyClass() { checkCudaErrors(cudaMalloc((void**)&dat, 256*sizeof(float))); };
  ~MyClass() { checkCudaErrors(cudaFree(dat)); };
  void kernelCall() {
    myKernel<<<1, 1>>>();
    checkCudaErrors( cudaPeekAtLastError() );
    checkCudaErrors( cudaDeviceSynchronize() );
  };

 private:
  float *dat;
};

Pool<MyClass, 15> myClassPool;

int main() {
  MyClass* a = myClassPool.get();
  a->kernelCall();
  myClassPool.release(a);

  return 0;
}

Here are the build commands I am using:

mkdir -p ./build/objs
nvcc -I./src -I/usr/local/cuda-11.4/samples/common/inc/ -I/usr/local/cuda-11.4/targets/x86_64-linux/include -std=c++17 -dlto -rdc=true -arch=sm_86 -g -O0 -DDEBUG_BUILD -c ./src/main.cu -o build/objs/main.cu.o
nvcc -dlto -g -O0 -DDEBUG_BUILD -arch=sm_86 --device-link ./build/objs/main.cu.o --output-file build/objs/device_link.o
g++ ./build/objs/main.cu.o ./build/objs/device_link.o -o ./build/blueberry -Wl,--no-as-needed -ldl -Wl,-Bdynamic -L/usr/local/cuda-11.4/targets/x86_64-linux/lib -lcuda -Wl,-Bstatic -static-libstdc++ -pthread -lcudart_static -lrt -lcudadevrt

I also noticed that it doesn’t even matter if I actually use the Pool, only that it is declared. For example, the error still occurs if I change main() to:

int main() {
  MyClass* a = new MyClass;
  a->kernelCall();
  delete a;

  return 0;
}

Thanks in advance for any help!

I’m able to reproduce the invalid device function error, I happened to be using CUDA 11.4, it appears you are also.

I note that if I drop the -dlto switches from the first two lines of your compilation sequence, that the error disappears.

My suggestions:

  • retest with the latest available CUDA toolchain
  • if the problem persists, file a bug

I just tested it on CUDA 11.7 and the problem seems to have disappeared. Update to CUDA 11.7. (Is my suggestion.)

Thanks again for your excellent advice. I am ultimately developing for Jetson AGX Orin, and I’m not sure if 11.7 is supported for that platform. I will look into the possibility. Otherwise, I am hopeful that if I reorganize my code to avoid the use of global variables (perhaps a static Pool with function local scope, accessible via static methods) I might be able to avoid the bug.

I tried compiling my larger application without -dlto and I still have the error there, even though it works for my minimum example. Perhaps actually having multiple translation units makes a difference.

In any case I think you’ve put me on the right track!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.