cuDNN Bug Report: Backend Graph API Conv+Bias Fusion Returns NOT_SUPPORTED
Severity: High - Prevents use of Graph API fusion features
Component: cuDNN Backend Graph API - Operation Fusion
Summary: cuDNN Backend Graph API successfully creates and builds operation graphs for fused convolution+bias operations, but all execution engines fail finalization with CUDNN_STATUS_NOT_SUPPORTED (error code 3000). Diagnostic testing confirms that convolution-only operations work correctly, indicating the issue is specific to fusion with virtual tensors.
This completely prevents use of the modern Graph API conv+bias fusion path, forcing a fallback to cuDNN's legacy API with separate operations, which carries various disadvantages (roughly a 10% performance penalty in my measurements, risk of deprecation, etc.).
Description of behavior: When creating a fused convolution+bias operation graph using virtual tensors, the operation graph builds successfully and the heuristics find compatible engines. However, no engine can finalize an execution plan, so no plan can be built and execution never proceeds.
ALL engines fail execution plan finalization with CUDNN_STATUS_NOT_SUPPORTED (3000).
=== Trying Engines ===
Testing engine 1/5...
Finalizing execution plan...
✗ Engine 1 finalize failed: 3000 (NOT_SUPPORTED)
Testing engine 2/5...
Finalizing execution plan...
✗ Engine 2 finalize failed: 3000 (NOT_SUPPORTED)
[... continues for all engines ...]
ERROR: All 5 engines failed to finalize
Last error code: 3000
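For context, the loop producing this log follows the documented backend-API pattern. Below is a condensed sketch (not my exact code; function and variable names are mine, and error handling/cleanup is trimmed for brevity):

```cpp
#include <cudnn.h>

// Run heuristics on a finalized operation graph and try to finalize an
// execution plan from each returned engine config. Error checks trimmed.
cudnnStatus_t tryFinalizeAnyPlan(cudnnHandle_t handle,
                                 cudnnBackendDescriptor_t opGraph,
                                 cudnnBackendDescriptor_t* planOut)
{
    cudnnBackendDescriptor_t heur;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &heur);
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
    cudnnBackendHeurMode_t mode = CUDNN_HEUR_MODE_INSTANT;
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_MODE,
                             CUDNN_TYPE_HEUR_MODE, 1, &mode);
    cudnnBackendFinalize(heur);

    // Engine-config descriptors must be pre-created before querying results.
    cudnnBackendDescriptor_t cfgs[8];
    for (int i = 0; i < 8; ++i)
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &cfgs[i]);
    int64_t count = 0;
    cudnnBackendGetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 8, &count, cfgs);

    cudnnStatus_t last = CUDNN_STATUS_NOT_SUPPORTED;
    for (int64_t i = 0; i < count; ++i) {
        cudnnBackendDescriptor_t plan;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                                 CUDNN_TYPE_HANDLE, 1, &handle);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &cfgs[i]);
        last = cudnnBackendFinalize(plan);  // <- every engine returns 3000 here
        if (last == CUDNN_STATUS_SUCCESS) { *planOut = plan; return last; }
        cudnnBackendDestroyDescriptor(plan);
    }
    return last;  // CUDNN_STATUS_NOT_SUPPORTED for the fused graph
}
```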
My environment setup:
Operating System: Windows 11 Pro (64-bit)
CUDA Toolkit: 12.4 (also tested: 12.9)
cuDNN Version: 9.0 (also tested: 9.13, 9.14)
GPU: NVIDIA RTX 4080 Desktop (16GB VRAM) and Laptop (8GB VRAM)
GPU Driver: 560.94 (latest stable as of report date)
Compute Capability: 8.9
Compiler: Visual Studio 2022, MSVC 19.39
Platform: MATLAB R2025a with C++ MEX interface
Memory Available: 8GB+ free GPU memory verified via nvidia-smi
Verification of correct installation: (1) CUDA installed at C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4; (2) cuDNN installed at C:\Program Files\NVIDIA\CUDNN\v9.0_cuda12.4; (3) all headers accessible (e.g., cudnn.h, cudnn_backend.h); (4) all libraries accessible (e.g., cudnn.lib, cudart.lib); (5) cuDNN handle creation succeeds; and (6) GPU accessible and verified functional.
Minimal Reproduction Code
The attached minimal_reproduction.cpp (see at the bottom of this post) demonstrates the issue with the simplest possible case: “cubic” dimensions (H=W=C_in=C_out=N, all = 16) to eliminate dimension-related confusion. I use a single 3x3 convolution with stride=1, padding=1, canonical NCHW memory layout with standard strides, and FP32 precision throughout.
Simple test configuration:
```cpp
// Input tensor: [N=16, C=16, H=16, W=16]
dims=[16,16,16,16], strides=[4096,256,16,1] // Canonical NCHW
// Weight tensor: [C_out=16, C_in=16, kH=3, kW=3]
dims=[16,16,3,3], strides=[144,9,3,1] // Canonical
// Virtual convolution output (enables fusion)
dims=[16,16,16,16], strides=[4096,256,16,1], isVirtual=true
// Bias tensor: [1, C_out=16, 1, 1]
dims=[1,16,1,1], strides=[16,1,1,1]
// Output tensor: [N=16, C=16, H=16, W=16]
dims=[16,16,16,16], strides=[4096,256,16,1]
// Convolution parameters
stride=[1,1], prePadding=[1,1], postPadding=[1,1], dilation=[1,1]
```
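For reference, the tensor descriptors above are built with the standard backend-API calls. A condensed sketch of the helper I use (the helper name and the omitted error checks are mine; the virtual flag is what enables fusion):

```cpp
#include <cudnn.h>

// Build one 4-D FP32 tensor descriptor for the backend graph.
cudnnBackendDescriptor_t makeTensor(const int64_t dims[4],
                                    const int64_t strides[4],
                                    int64_t uid, bool isVirtual)
{
    cudnnBackendDescriptor_t t;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &t);

    cudnnDataType_t dtype = CUDNN_DATA_FLOAT;
    int64_t alignment = 16;  // bytes; 16 is a common safe choice
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_DATA_TYPE,
                             CUDNN_TYPE_DATA_TYPE, 1, &dtype);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_DIMENSIONS,
                             CUDNN_TYPE_INT64, 4, dims);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_STRIDES,
                             CUDNN_TYPE_INT64, 4, strides);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_UNIQUE_ID,
                             CUDNN_TYPE_INT64, 1, &uid);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
                             CUDNN_TYPE_INT64, 1, &alignment);
    if (isVirtual) {
        // Virtual tensors never touch memory; they only link fused ops.
        bool v = true;
        cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_IS_VIRTUAL,
                                 CUDNN_TYPE_BOOLEAN, 1, &v);
    }
    cudnnBackendFinalize(t);
    return t;
}
```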
Operation graph:
```cpp
Operation 1: CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR
X (UID=100) + W (UID=101) → Virtual (UID=102)
Operation 2: CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR (ADD)
Virtual (UID=102) + Bias (UID=103) → Y (UID=104)
```
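In code, the two operations are wired up roughly as in the sketch below (condensed, not my exact code: all descriptor arguments are assumed created and finalized per the configuration above, the function name is mine, and error checks are trimmed):

```cpp
#include <cudnn.h>

// Build the conv-forward and pointwise-ADD operations for the fused graph.
void buildConvBiasOps(cudnnBackendDescriptor_t xTensor,
                      cudnnBackendDescriptor_t wTensor,
                      cudnnBackendDescriptor_t virtualTensor,
                      cudnnBackendDescriptor_t biasTensor,
                      cudnnBackendDescriptor_t yTensor,
                      cudnnBackendDescriptor_t convDesc,   // CONVOLUTION_DESCRIPTOR
                      cudnnBackendDescriptor_t pwAddDesc,  // POINTWISE_DESCRIPTOR, mode=ADD
                      cudnnBackendDescriptor_t* convOpOut,
                      cudnnBackendDescriptor_t* addOpOut)
{
    // Op 1: convolution forward, X(100) * W(101) -> virtual(102)
    cudnnBackendDescriptor_t convOp;
    cudnnBackendCreateDescriptor(
        CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR, &convOp);
    cudnnBackendSetAttribute(convOp, CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_X,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &xTensor);
    cudnnBackendSetAttribute(convOp, CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_W,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &wTensor);
    cudnnBackendSetAttribute(convOp, CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_Y,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &virtualTensor);
    cudnnBackendSetAttribute(convOp, CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_CONV_DESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &convDesc);
    double alpha = 1.0, beta = 0.0;
    cudnnBackendSetAttribute(convOp, CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
                             CUDNN_TYPE_DOUBLE, 1, &alpha);
    cudnnBackendSetAttribute(convOp, CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
                             CUDNN_TYPE_DOUBLE, 1, &beta);
    cudnnBackendFinalize(convOp);

    // Op 2: pointwise ADD, virtual(102) + bias(103) -> Y(104)
    cudnnBackendDescriptor_t addOp;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR, &addOp);
    cudnnBackendSetAttribute(addOp, CUDNN_ATTR_OPERATION_POINTWISE_PW_DESCRIPTOR,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &pwAddDesc);
    cudnnBackendSetAttribute(addOp, CUDNN_ATTR_OPERATION_POINTWISE_XDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &virtualTensor);
    cudnnBackendSetAttribute(addOp, CUDNN_ATTR_OPERATION_POINTWISE_BDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &biasTensor);
    cudnnBackendSetAttribute(addOp, CUDNN_ATTR_OPERATION_POINTWISE_YDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &yTensor);
    cudnnBackendFinalize(addOp);

    *convOpOut = convOp;
    *addOpOut = addOp;
}
```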
To isolate the issue, I implemented two tests with identical tensor configurations:
Test A: Convolution + Bias Fusion (FAILS)
Graph Structure: Conv operation → Virtual tensor → Bias operation → Real output. Virtual tensor enables fusion:
INSTANT mode: Found 3 engines (status: 0)
Retrieved 3 engine configs (status: 0)
Testing engine 1/3...
✗ Engine 1 finalize failed: 3000 (NOT_SUPPORTED)
Testing engine 2/3...
✗ Engine 2 finalize failed: 3000 (NOT_SUPPORTED)
Testing engine 3/3...
✗ Engine 3 finalize failed: 3000 (NOT_SUPPORTED)
ERROR: All 3 engines failed to finalize
Test B: Convolution Only, No Fusion (SUCCEEDS)
Graph Structure: Conv operation → Real output (no virtual tensor). No bias operation, no fusion:
INSTANT mode: Found 36 engines (status: 0)
Retrieved 36 engine configs (status: 0)
Testing engine 1/36...
✓ Engine 1 SUCCESS! Workspace: 42048 bytes (0.04 MB)
✓ Execution completed successfully
These two tests share the same tensor configuration; the only difference is fusion. 36 engines accept convolution-only, but 0 engines (out of the 3-5 returned) accept conv+bias fusion. This strongly suggests the issue is specific to the fusion path, not system setup, descriptor configuration, memory layout, or the Graph API in general (though I can't rule out a subtle mistake on my end).
Reproducibility:
I've reproduced this error across multiple machines and graphics cards, multiple CUDA/cuDNN versions, and multiple variations of the code driving the Graph API.
cuDNN/CUDA versions: cuDNN 9.0 with CUDA 12.4; cuDNN 9.13 with CUDA 12.4; cuDNN 9.14 with CUDA 12.9.
Hardware: RTX 4080 Desktop (16GB) and RTX 4080 Laptop (8GB).
Test configurations: the minimal “cubic” configuration (N=16, C=16, H=16, W=16) fails, as do various others, e.g., (N=256, C=64, H=32, W=32). I've also tried different stride patterns (canonical NCHW strides and custom MATLAB column-major strides; see the sketch below for how I compute the canonical case) and different heuristic modes (CUDNN_HEUR_MODE_INSTANT and CUDNN_HEUR_MODE_A; both find engines, all of which reject).
On the implementation side, I use proper RAII cleanup with no premature destruction, a canonical memory layout (explicit transpose to NCHW with standard strides), and dimension verification in my tests (confirmed correct parsing and propagation).
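For clarity, the canonical NCHW strides I refer to are computed as in this trivial helper (the name is mine):

```cpp
#include <cstdint>

// Canonical NCHW strides: innermost W is contiguous.
void nchwStrides(const int64_t d[4], int64_t s[4]) {
    s[3] = 1;                   // W
    s[2] = d[3];                // H: step over one row of W
    s[1] = d[2] * d[3];         // C: step over one H*W plane
    s[0] = d[1] * d[2] * d[3];  // N: step over one C*H*W image
}
```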
Despite all of this, it fails consistently in all tested scenarios with the same error.
Additional Context: Background and Goals
I'm writing these files to perform the forward (and eventually backward) pass for the linear convolution + bias operation in a single convolutional layer of a CNN. I am implementing greedy layerwise neural network training in MATLAB with: (1) large batch sizes (N ≥ 256, typically N ≥ 1024), (2) single precision (FP32) throughout, and (3) custom gradient computation (only w.r.t. parameters, not inputs).
My goal is to avoid automatic differentiation routines for the convolution's forward and backward passes, and instead call cuDNN directly from a cuDNN-based MEX implementation that computes those passes from the necessary quantities.
I could always fall back to the legacy API functions (e.g., cudnnConvolutionForward). But I want the newer Graph API instead for its benefits: conv+bias fusion, greater optimization potential than the legacy API, and most notably because the Graph API won't be deprecated and removed anytime soon, unlike the legacy API functions.
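For concreteness, the legacy-API fallback would look roughly like the sketch below, using cudnnConvolutionBiasActivationForward to fuse conv+bias in one call (descriptor setup omitted; the wrapper name is mine; note the documented requirement that the identity activation pairs with the IMPLICIT_PRECOMP_GEMM algorithm):

```cpp
#include <cudnn.h>

// Fused conv + bias via the legacy API. Per the cuDNN docs, when the
// activation is CUDNN_ACTIVATION_IDENTITY, the algorithm must be
// CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM.
cudnnStatus_t convBiasForwardLegacy(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t xDesc, const float* x,
    cudnnFilterDescriptor_t wDesc, const float* w,
    cudnnConvolutionDescriptor_t convDesc,
    cudnnTensorDescriptor_t biasDesc, const float* bias,
    cudnnTensorDescriptor_t yDesc, float* y,
    void* workspace, size_t workspaceBytes)
{
    const float alpha1 = 1.0f;  // scales the convolution result
    const float alpha2 = 0.0f;  // scales z (unused here: z aliases y)
    cudnnActivationDescriptor_t actDesc;
    cudnnStatus_t st = cudnnCreateActivationDescriptor(&actDesc);
    if (st != CUDNN_STATUS_SUCCESS) return st;
    st = cudnnSetActivationDescriptor(actDesc, CUDNN_ACTIVATION_IDENTITY,
                                      CUDNN_NOT_PROPAGATE_NAN, 0.0);
    if (st == CUDNN_STATUS_SUCCESS) {
        st = cudnnConvolutionBiasActivationForward(
            handle, &alpha1, xDesc, x, wDesc, w, convDesc,
            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
            workspace, workspaceBytes,
            &alpha2, yDesc, y,  // z: reuse y since alpha2 == 0
            biasDesc, bias, actDesc, yDesc, y);
    }
    cudnnDestroyActivationDescriptor(actDesc);
    return st;
}
```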
Why I believe this is a bug on NVIDIA's end:
- Basic functionality fails: conv+bias is the most fundamental fusion operation. Why is something this simple failing?
- Unanimous rejection: ALL engines reject, not just some.
- Convolution works: the same system/config succeeds without fusion.
- Proper implementation: the code follows cuDNN documentation and examples.
- Highly reproducible: it fails identically across versions, hardware, attempts, configurations, etc.
- No working examples: in researching this issue, I couldn't find any working Windows C++ examples of Graph API fusion.
Questions for NVIDIA / forum readers:
- Is conv+bias fusion supported on Windows with the Graph API in cuDNN 9.x?
- Are there undocumented requirements for virtual tensors or pointwise operations in fusion?
- Can you provide a working C++ example of Graph API conv+bias fusion on Windows?
- Is this a known issue in cuDNN 9.0-9.14 on Windows?
- Will this be fixed, or should I plan on the legacy API long-term? Since much of the legacy API is slated for deprecation and removal, I'm hoping the Graph API path can be fixed (or that someone can point out what I'm doing wrong on my end).
Request (ideally one of the following):
- Bug fix: resolution of the conv+bias fusion issue in a future cuDNN update
- Working example: a Windows C++ example demonstrating Graph API conv+bias fusion
- Documentation: clear documentation of any limitations or requirements I may have missed
- Workaround guidance: an official recommendation for fusion on Windows if the Graph API is not the solution
Contact Information
My primary email is bengabr1 at umbc dot edu; I can also be reached at gabrben11 at gmail dot com. I am happy to provide additional diagnostic information, test patches or proposed fixes, try alternative configurations you suggest, provide remote access for debugging if needed, or do anything else that would help resolve this issue.
Attachments (use these to verify the bug):
- minimal_reproduction.cpp - Standalone C++ reproduction case demonstrating both the failing fusion and the working conv-only path. While I've mostly been testing through MATLAB, this file is designed to compile and run independently of MATLAB; it is a self-contained test showing the bug.
  minimal_reproduction.txt (19.8 KB)
- compile_minimal_reproduction.bat - Windows batch file that sets up the Visual Studio environment and compiles the minimal reproduction with the correct paths; provided for easy reproduction by NVIDIA engineers or others on the forum.
  compile_minimal_reproduction.txt (1.7 KB)
- diagnostic_output.txt - Complete output showing the bug: Test 1 (conv+bias fusion) fails on ALL engines with error 3000, then Test 2 (conv only) succeeds with 36 engines (engine 1 works). This confirms the issue is specifically with fusion, not the Graph API in general.
  diagnostic_output.txt (1.6 KB)
Optional Files (available upon request):
- full_mex_implementation.zip - Complete MATLAB MEX implementation, available on request: my full implementation with all helper classes and utilities, test scripts showing reproducibility across configurations, and detailed diagnostic tests and memory-management code.
How to Reproduce
Quick Test (5 minutes):
1. Save minimal_reproduction.cpp and compile_minimal_reproduction.bat
2. Edit the batch file to match your CUDA/cuDNN paths
3. Run: compile_minimal_reproduction.bat
4. Run: minimal_reproduction.exe
5. Observe: Test 1 fails (all engines reject fusion), Test 2 succeeds (conv-only works)
Expected Output:
=== TEST 1: Convolution + Bias Fusion ===
Found 3 engines
Testing engine 1/3...
✗ Engine 1 finalize failed: 3000 (NOT_SUPPORTED=YES)
[... all engines fail ...]
✗✗✗ FAILURE: ALL 3 ENGINES REJECTED FUSION ✗✗✗
=== TEST 2: Convolution Only (No Fusion) ===
Found 36 engines
✓ Engine 1 SUCCESS!
✓ TEST PASSED
Thank you for your attention to this issue!