cudaErrorInvalidDeviceFunction: kernels fail to load, and I'm stuck

Hello,

I am posting here because I have spent too much time searching this forum and the rest of the Internet without finding an answer to my problem.

So I have some software that loads data into device memory, runs a kernel, then fetches the data back. The “classic” use of a GPU.

But recently, after changing who knows what (I will come back to that later), none of my kernels would load.

Specifically, I get cudaErrorInvalidDeviceFunction when I try to launch them. To make sure it was not something else entirely, I called cudaFuncGetAttributes on the kernel and got the same error. I also tried commenting out all the code inside the kernel I’m calling, but that didn’t help either. So my guess is that something is preventing the kernel from loading.

The issue is that I did not change a thing in the kernels (or the device functions they call) between the moment it worked and the moment it stopped working. I can compare an earlier, working version with the broken one and I honestly don’t get what could have triggered that.

What I changed was: setting up memory pooling (which I disabled once I noticed the problem with the kernels), and grouping allocations and copies in the same places (and only that, no big reorganization of the code).

The code is running on an A1000 GPU with CUDA 12.5 installed (the compute capability is set to 86, i.e. 8.6).

Thanks for any help or suggestions

can you run a cuda sample code like vectorAdd successfully?

Could you post the build flags, and also try without LTO (link-time optimization), please?

Hello,

I tried running vectorAdd in the same compilation environment, and it failed with the same error. Running it outside of my build environment (from the cuda-samples git repo) did NOT fail. I guess the issue is with my build configuration, then…

Disabling LTO (with -fno-lto) didn’t help either.

Here are my build flags:

set(CMAKE_C_FLAGS_DEBUG "-O0 -g -DDEBUG -fno-lto")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -g -DDEBUG -fno-lto")
set(CMAKE_CUDA_FLAGS_DEBUG "-G -O0 -g -DDEBUG -fno-lto")

set(CMAKE_CUDA_ARCHITECTURES "86")

Following the success of the kernel run OUTSIDE of my project, I tried some things (disabling the compilation of other, unused files to see which one impacted the whole GPU)… and finally found out that it was my MemoryPool initialization (which was unused at that point) causing the issue.

So here’s my code sample:

DeviceMemoryPool *memoryPool = new DeviceMemoryPool(32 * 32);
dev_alloc_t DEVICE_ALLOC_NULL = {.buffer = nullptr, .alloc_size = 0, .memory_block_desc = nullptr};

DeviceMemoryPool::DeviceMemoryPool(size_t indiv_count) {

    size_t sizes[] = {
            128, 1024, 4096, 16384, 65536, 131072, 1048576
    };
    size_t amounts[] = {
            indiv_count * 4, indiv_count * 2, indiv_count * 2, indiv_count, indiv_count, indiv_count / 2, indiv_count / 100
    };

    // note: sizes[] and amounts[] each have 7 entries, so loop over all of them
    for(size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        add_block(sizes[i], amounts[i]);
    }
}

dev_alloc_block_t* DeviceMemoryPool::add_block(size_t buffer_size, size_t buffer_count) {
    size_t effective_buffer_size = buffer_size + (buffer_size % 64 == 0 ? 0 : 64 - (buffer_size % 64));
    size_t alloc_size = effective_buffer_size * buffer_count;

    printf("Adding block: %zu * %zu bytes \n", buffer_count, effective_buffer_size);

    void *block_ptr;
    CHECK_CUDA_CALL(cudaMalloc(&block_ptr, alloc_size));

    size_t freemap_size = (buffer_count + 31) / 32;
    auto freemap = (uint32_t*) malloc(freemap_size * sizeof(uint32_t));
    memset(freemap, 0, freemap_size * sizeof(uint32_t));

    auto memory_block = new dev_alloc_block_t {
        .block_ptr = block_ptr,
        .block_total_size = alloc_size,
        .buffer_size = effective_buffer_size,
        .free_map = freemap,
        .is_full = false
    };

    // Sorted insertion (based on buffer size): advance past all smaller
    // blocks, then insert before the first block of equal or greater size
    auto it = memory_blocks.begin();
    while(it != memory_blocks.end() && (*it)->buffer_size < effective_buffer_size) {
        ++it;
    }

    memory_blocks.insert(it, memory_block);
    return memory_block;
}

So this code somehow prevents further kernels from loading. Commenting out the constructor call (the first line of the sample) prevents the issue from occurring (commenting out the constructor’s body also works).

I am going to dig further to identify which specific line(s) cause the issue. More to come. Still, it is not obvious to me why it fails.

Thanks for the ideas (and any future help)

So, unsurprisingly, it is the cudaMalloc call in add_block(size_t, size_t) that triggers the issue.

Even one single call to it (as opposed to the series of calls made by the constructor) is enough to fail. The size of the buffer does not seem to have an impact (I tried several).

The thing that is surprising is that, if add_block is disabled, all the other cudaMalloc calls (which end up being triggered from the same .cu file) do not fail…

So I guess my problem is partly solved?

I figured out that doing a cudaMalloc from a global object’s constructor apparently breaks CUDA afterwards. So I just used another method to initialize things. This feels more like a workaround than an actual solution to the issue, but it works.

Although that behaviour does seem strange and probably unexpected, I didn’t really find any reference to it.

Generally speaking, the CUDA runtime is not properly available for use until the opening curly bracket of main(), and should not be used after the closing curly bracket of main().

If you attempt to make use of it outside of that domain, which can happen if you have CUDA runtime API calls in constructors or destructors of global-scope objects, then you may have trouble. So the general advice is: don’t do that. You can find similar reports, e.g. here and here

Speaking for myself, I was not able to deduce that you had that situation, or that the cudaMalloc call at issue was in a global-scope constructor. It’s tough to debug code problems without code.