When calling a kernel from within a kernel, I get undefined symbol: __fatbinwrap_f6e73cba_22_cuda_device_runtime_cu_945c48ec_33040

I’m porting some code to CUDA. It works fine if I have one simple kernel call like this
However if, inside that kernel call, I call another kernel, then I get this linker error:
lld-link : error : undefined symbol: __fatbinwrap_f6e73cba_22_cuda_device_runtime_cu_945c48ec_33040
2>>>> referenced by CUDAVideoSig.lib(CUDAVideoSig.device-link.obj):($LN16)
2>Done building project “VideoSig.vcxproj” – FAILED.
========== Build: 1 succeeded, 1 failed, 111 up-to-date, 0 skipped ==========
========== Build completed at 5:27 PM and took 11.477 seconds ==========

The code compiles and links if I change it so that no kernels are called inside the top-level kernel (doing everything on 1 thread).
How do I make it link properly when a kernel is called from within a kernel?
These are my compile options


And these are my linker options


I’d post screenshots of the code, but apparently, I’m limited to 2 screenshots per post.

cudavideosig.txt (5.7 KB)
Here is the full .cu file

Please don’t do that on these forums. The forums are designed to accept code which will be nicely formatted when pasted in as text and selected for code formatting. After pasting in the code, select the code in the edit window, then press the </> button at the top of the edit window, then save your changes.

This is also preferred for several reasons to posting code as an attachment.

As a diagnostic, my suggestions would be to:

  1. select a CDP (CUDA Dynamic Parallelism - the term given to the situation when you are calling a kernel from another kernel) project from the CUDA sample codes such as cdpSimplePrint
  2. make sure you can import the associated VS project and build the sample code as-is correctly in your environment
  3. drop your source code into the (only) .cu source file in that project, and then try building again.

A “typical” reason for the error you are reporting is not linking against cudadevrt.lib as described here (and in other posts.)

That doesn’t seem to be the issue based on what you have shown in your project configuration, but its hard to be certain based on the config screens. Since that is not obviously the case, I’m suggesting the steps above.

I have the samples. I tried to compile it but it is complaining about unsupported gpu architecture compute_100


I upgraded to CUDA 13.0 but I don’t think it’ll take effect until I restart my computer. I’m in the middle of a lengthy uninterruptible download so it’ll have to wait until that finishes. I’ll let you know how it goes.

The CUDA samples codes have recently converted to using CMake instead of the “ordinary” VS project method. Which is fine; the diagnostic I mentioned is still useful. I suspect it will work for you and if you are ok building things that way, then proceed.

But some time ago the samples also were set up to provide ordinary VS project files, so by using an older branch such as here you could do that as well, if you want something closer to what the method you posted here.

I got the CMake version (latest) compiling, but when I ran it, I got this

starting Simple Print (CUDA Dynamic Parallelism)
CUDA error at C:\Users\freel\source\repos\cuda-samples\Common\helper_cuda.h:816 code=801((null)) “cudaGetDeviceCount(&device_count)”

C:\Users\freel\Source\Repos\cuda-samples\out\build\x64-Debug\Samples\3_CUDA_Features\cdpSimplePrint\cdpSimplePrint.exe (process 13932) exited with code 1.
To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops.
Press any key to close this window . . .

So, operation not supported.

Weird. I didn’t make any modifications to the solution. I’m running on a fresh CUDA 13 install.
My machine has a GeForce RTX 4080 SUPER.

I agree its weird.

I guess if I were working on it, I would also try the deviceQuery sample code. If that also failed I would conclude that it is indicating an improper CUDA setup or broken CUDA install. My suspicion would revolve around the driver install, primarily.

I’m also assuming you’re not doing something unanticipated like trying to run this in virtual machine or virtual box, or some other setting like that.

DeviceQuery gave me this

C:\Users\freel\Source\Repos\cuda-samples\out\build\x64-Debug\Samples\1_Utilities\deviceQuery\deviceQuery.exe Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 801
 → (null)
Result = FAIL

C:\Users\freel\Source\Repos\cuda-samples\out\build\x64-Debug\Samples\1_Utilities\deviceQuery\deviceQuery.exe (process 41740) exited with code 1.
To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops.
Press any key to close this window . . .

DeviceQueryDrv.exe gave me this

C:\Users\freel\Source\Repos\cuda-samples\out\build\x64-Debug\Samples\1_Utilities\deviceQueryDrv\deviceQueryDrv.exe Starting…

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: “NVIDIA GeForce RTX 4080 SUPER”
CUDA Driver Version:                           12.6
CUDA Capability Major/Minor version number:    8.9
Total amount of global memory:                 16376 MBytes (17170956288 bytes)
(80) Multiprocessors, (128) CUDA Cores/MP:     10240 CUDA Cores
GPU Max Clock rate:                            2610 MHz (2.61 GHz)
Memory Clock rate:                             11501 Mhz
Memory Bus Width:                              256-bit
L2 Cache Size:                                 67108864 bytes
Max Texture Dimension Sizes                    1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)
Texture alignment:                             512 bytes
Maximum memory pitch:                          2147483647 bytes
Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
Run time limit on kernels:                     Yes
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Concurrent kernel execution:                   Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Disabled
CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA):      Yes
Device supports Managed Memory:                Yes
Device supports Compute Preemption:            Yes
Supports Cooperative Kernel Launch:            Yes
Supports MultiDevice Co-op Kernel Launch:      No
Device PCI Domain ID / Bus ID / location ID:   0 / 202 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS

C:\Users\freel\Source\Repos\cuda-samples\out\build\x64-Debug\Samples\1_Utilities\deviceQueryDrv\deviceQueryDrv.exe (process 33680) exited with code 0.
To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops.
Press any key to close this window . . .

It seems noteworthy that it wasn’t picking up on the 13.0 installation that I installed yesterday.
Perhaps I need to uninstall 12.6 to force it up to 13.0?

As for the machine itself, no weirdness. Just an absurdly overpowered home PC with 512GB of RAM and a 56 core Xeon processor. It’s running Windows 10

Without knowing how you installed things I wouldn’t be able to answer any of those questions. But certainly your current driver that supports up through CUDA 12.6 is not the right one to use with a CUDA 13.0 install. You can find a recent driver installer for your GPU here. If you use the CUDA download site you will note that the driver install is now a separate step and tool from the CUDA toolkit install. So if you installed CUDA toolkit 13 and didn’t do anything else, then you skipped/missed the driver install step. It is a separate step.

Using the separate driver install page seems to have fixed it. I can now run the simplePrint app and the devicequery with expected results. I’ll try reverse engineering the simplePrint to fix my own app and let you know if I have any more problems.

One possible approach: download the version of simplePrint that includes the VS project (that I linked), make sure you can build/run it, and then drop your code into the cu file in that project, and make sure it builds properly.

You can then compare that project settings to the one you showed at the beginning of this thread to find which differences matter.

Ok, so I’ve been working on this, and I’m running into a strange problem.

Suppose I have this code

template <class descriptor_type>
__global__ void bulk_nearest_neighbor(CUDAVideoSig<descriptor_type> const *const p_shrterSig, CUDAVideoSig<descriptor_type> const *const p_longerSig, size_t const numUsedFrames, ForwardMatchStruct *const p_indices)
{
    if (size_t const globalId = size_t(blockIdx.x) * size_t(blockDim.x) + size_t(threadIdx.x); globalId < numUsedFrames)
    {
        ForwardMatchStruct &matchStruct = p_indices[globalId];
        nearest_neighbor_match(p_shrterSig->get_keyframe(matchStruct.m_shrterIndex), p_longerSig->get_keyframe(matchStruct.m_longerIndex), &matchStruct.m_score);
    }
}

template <class descriptor_type>
__global__ void gpu_static_match(                           //
    CUDAVideoSig<descriptor_type> const *const p_shrterSig, //
    CUDAVideoSig<descriptor_type> const *const p_longerSig, //
    double const threshold,                                 //
    CUDAStaticVideoSigMatcher<descriptor_type> *pd_matcher  //
)
{
    assert(pd_matcher);
    assert(p_shrterSig->get_num_frames());
    assert(p_shrterSig->get_num_descriptors());
    assert(p_longerSig->get_num_frames());
    assert(p_longerSig->get_num_descriptors());
    assert(p_shrterSig->get_num_frames() <= p_longerSig->get_num_frames());
    double const shrterDuration = p_shrterSig->get_duration();
    double const skipRatio = pd_matcher->skip_ratio();
    double nextShrterPos = shrterDuration * skipRatio;
    size_t const numShrterFrames = p_shrterSig->get_num_frames();
    if (size_t currShrterIndex = p_shrterSig->get_index_nearest_time(nextShrterPos, 0); currShrterIndex < numShrterFrames)
    {
        double nextLongerPos = p_longerSig->get_duration() * skipRatio;
        size_t const numLongerFrames = p_longerSig->get_num_frames();
        if (size_t currLongerIndex = p_longerSig->get_index_nearest_time(nextLongerPos, 0); currLongerIndex < numLongerFrames)
        {
            double const keyFrameInterval = std::max<double>(std::numeric_limits<double>::epsilon(), shrterDuration * pd_matcher->key_frame_interval_ratio());
            {
                CudaVector<ForwardMatchStruct> indices;
                // iterate through the video and identify which frames must be compared to which frames
                for (;;)
                {
                    indices.push_back(ForwardMatchStruct{currShrterIndex, currLongerIndex, 0});
                    if ((currShrterIndex = p_shrterSig->get_index_nearest_time(nextShrterPos += keyFrameInterval, currShrterIndex)) >= numShrterFrames)
                    {
                        break;
                    }
                    // TODO experiment with calling get_index_nearest_time_iterate_forward and get_index_nearest_time_iterate_backward
                    if ((currLongerIndex = std::max(currLongerIndex + 1, p_longerSig->get_index_nearest_time(nextLongerPos += keyFrameInterval, currLongerIndex))) >= numLongerFrames)
                    {
                        break;
                    }
                }
                size_t const numUsedFrames = indices.size();
                ForwardMatchStruct *const p_indices = indices.data();
                // compute the scores
                #if 0
                for (size_t index = 0; index < numUsedFrames; ++index)
                {
                    ForwardMatchStruct &matchStruct = p_indices[index];
                    nearest_neighbor_match(p_shrterSig->get_keyframe(matchStruct.m_shrterIndex), p_longerSig->get_keyframe(matchStruct.m_longerIndex), &matchStruct.m_score);
                }
                #else
                bulk_nearest_neighbor<<<int((numUsedFrames + 1023) / 1024), int(std::min<size_t>(1024, numUsedFrames))>>>(p_shrterSig, p_longerSig, numUsedFrames, p_indices);
                #endif
                // average them
                double similarityResult = 0;
                for (size_t index = 0; index < numUsedFrames; ++index)
                {
                    similarityResult += p_indices[index].m_score;
                }
                *pd_matcher->result_ptr()=(similarityResult / numUsedFrames);
            }
            return;
        }
    }
    *pd_matcher->result_ptr() = (threshold);
}

Inside this code, you’ll notice an #if 0 bit. As it is now, it calls “bulk_nearest_neighbor”, which is another kernel. If I run the code like this, the value stored at *pd_matcher→result_ptr() is wrong.
If I switch it to use a standard for loop, the value is correct.
I’m trying to understand why when I call a kernel from another kernel, the result is different than if I had just called a for-loop inside that kernel.
Let me know if I need to show more code, but I think this is enough.

I did some more testing. It’s basically like the parent kernel gpu_static_match calls the child kernel bulk_nearest_neighbor but doesn’t wait for all the threads to finish. How do I tell the parent kernel to wait until the child kernel is actually finished? Keep in mind that this is all running on the device side.

With CDP 2.0 (i.e. modern CDP - CUDA Dynamic Parallelism - the act of calling a kernel from device code) that is basically not possible. But there are workarounds. You will find various threads discussing this notion, here are a few: 1 2 3

Yes, I did eventually discover the method of just calling another kernel immediately after on the same stream.
I assume this has considerable overhead

//the original parallel bit
bulk_nearest_neighbor<<<int((numUsedFrames + (n_block_dim - 1)) / n_block_dim), int(std::min<size_t>(n_block_dim, numUsedFrames)), 0, cudaStreamTailLaunch>>>(p_shrterSig, p_longerSig, numUsedFrames, p_indices);
//take the for loop that adds up the results and stuff it into this kernel
add_up_nearest_neighbor<<<1, 1, 0, cudaStreamTailLaunch>>>(numUsedFrames, p_indices, pd_matcher->result_ptr());

This produces the correct results, but it really annoys me that I haven’t yet found a better workaround.
Unfortunately, when I ran some performance tests, the CUDA version of this algorithm is about 4 times slower than the CPU variant. It remains to be seen whether I can just run the CPU variant and the GPU variant simultaneously and simply make the most of the hardware (since the GPU was sitting idle before).

For now, I’m facing a new challenge, It seems that any structs I store in device arrays need to be aligned to 32-byte boundaries. I don’t know why, exactly, but experimentation shows that if I don’t, I run into incorrect results. So I’ve got to re-tool a lot of things to ensure it all works. That’s a matter for another thread. Thank you for your help on this matter.