cuDNN Bug Report: Backend Graph API Conv+Bias Fusion Returns NOT_SUPPORTED

bengabr1 · October 13, 2025, 9:17pm

cuDNN Bug Report: Backend Graph API Conv+Bias Fusion Returns NOT_SUPPORTED

Severity: High - Prevents use of Graph API fusion features
Component: cuDNN Backend Graph API - Operation Fusion

Summary: cuDNN Backend Graph API successfully creates and builds operation graphs for fused convolution+bias operations, but all execution engines fail finalization with CUDNN_STATUS_NOT_SUPPORTED (error code 3000). Diagnostic testing confirms that convolution-only operations work correctly, indicating the issue is specific to fusion with virtual tensors.
This completely prevents use of modern Graph API conv+bias fusion, requiring fallback to cuDNN’s legacy API with separate operations (which has various disadvantages, such as ~10% performance penalty, risk of being depreciated, etc.).

Description of behavior: When creating a fused convolution+bias operation graph using virtual tensors, the operation graph is built successfully and the heuristics are able to find compatible engines. However, none of the engines are able to successfully finalize an execution plan, and thus everything else fails after that.

ALL engines fail execution plan finalization with CUDNN_STATUS_NOT_SUPPORTED (3000).

=== Trying Engines ===
Testing engine 1/5...
  Finalizing execution plan...
  ✗ Engine 1 finalize failed: 3000 (NOT_SUPPORTED)
Testing engine 2/5...
  Finalizing execution plan...
  ✗ Engine 2 finalize failed: 3000 (NOT_SUPPORTED)
[... continues for all engines ...]

ERROR: All 5 engines failed to finalize
Last error code: 3000

My environment setup:

Operating System: Windows 11 Pro (64-bit)
CUDA Toolkit: 12.4 (also tested: 12.9)
cuDNN Version: 9.0 (also tested: 9.13, 9.14)
GPU: NVIDIA RTX 4080 Desktop (16GB VRAM) and Laptop (8GB VRAM)
GPU Driver: 560.94 (latest stable as of report date)
Compute Capability: 8.9
Compiler: Visual Studio 2022, MSVC 19.39
Platform: MATLAB R2025a with C++ MEX interface
Memory Available: 8GB+ free GPU memory verified via nvidia-smi

here is my verification of correct installation of everything: (1) CUDA installed at: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4, (2) cuDNN installed at: C:\Program Files\NVIDIA\CUDNN\v9.0_cuda12.4' , (3) All headers accessible, e.g., cudnn.h, cudnn_backend.h, (4) All libraries accessible, e.g., cudnn.lib, cudart.lib`, (5) cuDNN handle creation succeeds, and (6) GPU accessible and verified functional.

Minimal Reproduction Code

The attached minimal_reproduction.cpp (see at the bottom of this post) demonstrates the issue with the simplest possible case: “cubic” dimensions (H=W=C_in=C_out=N, all = 16) to eliminate dimension-related confusion. I use a single 3x3 convolution with stride=1, padding=1, canonical NCHW memory layout with standard strides, and FP32 precision throughout.

Simple test configuration:

cpp

// Input tensor: [N=16, C=16, H=16, W=16]
dims=[16,16,16,16], strides=[4096,256,16,1]  // Canonical NCHW

// Weight tensor: [C_out=16, C_in=16, kH=3, kW=3]  
dims=[16,16,3,3], strides=[144,9,3,1]  // Canonical

// Virtual convolution output (enables fusion)
dims=[16,16,16,16], strides=[4096,256,16,1], isVirtual=true

// Bias tensor: [1, C_out=16, 1, 1]
dims=[1,16,1,1], strides=[16,1,1,1]

// Output tensor: [N=16, C=16, H=16, W=16]
dims=[16,16,16,16], strides=[4096,256,16,1]

// Convolution parameters
stride=[1,1], prePadding=[1,1], postPadding=[1,1], dilation=[1,1]

Operation graph:

cpp

Operation 1: CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR
  X (UID=100) + W (UID=101) → Virtual (UID=102)
  
Operation 2: CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR (ADD)
  Virtual (UID=102) + Bias (UID=103) → Y (UID=104)

To isolate the issue, I implemented two tests with identical tensor configurations:

Test A: Convolution + Bias Fusion (FAILS)

Graph Structure: Conv operation → Virtual tensor → Bias operation → Real output. Virtual tensor enables fusion:

INSTANT mode: Found 3 engines (status: 0)
Retrieved 3 engine configs (status: 0)
Testing engine 1/3...
  ✗ Engine 1 finalize failed: 3000 (NOT_SUPPORTED)
Testing engine 2/3...
  ✗ Engine 2 finalize failed: 3000 (NOT_SUPPORTED)
Testing engine 3/3...
  ✗ Engine 3 finalize failed: 3000 (NOT_SUPPORTED)

ERROR: All 3 engines failed to finalize

Test B: Convolution Only, No Fusion (SUCCEEDS)

Graph Structure: Conv operation → Real output (no virtual tensor). No bias operation, no fusion:

INSTANT mode: Found 36 engines (status: 0)
Retrieved 36 engine configs (status: 0)
Testing engine 1/36...
  ✓ Engine 1 SUCCESS! Workspace: 42048 bytes (0.04 MB)
✓ Execution completed successfully

These two tests have the same configuration; the only difference is fusion. 36 engines accept convolution-only, but 0 engines (out of 3-5) accept conv+bias fusion. This seems to strongly indicate that the issue is specifically with the fusion path, not with system setup, descriptor configuration, memory layout, or graph API in general (although I’m not positive on this).

Reproducibility I’ve observed of the error:

I’ve reproduced this error across multiple different computers / graphics cards, different versions of CUDA/cuDNN, different variations of the code that calls the graph API / cuDNN functionalities, etc.

cuDNN Versions: cuDNN 9.0 with CUDA 12.4 , also tried cuDNN 9.13 with CUDA 12.4 , also tried cuDNN 9.14 with CUDA 12.9 .

Hardware: RTX 4080 Desktop (16GB) , also tried RTX 4080 Laptop (8GB) .

Different test configurations: notably failed the minimal “cubic” configuration (N=16, C=16, H=16, W=16), but I’ve tried various others as well, e.g., (N=256, C=64, H=32, W=32). I’ve also tried different stride patterns, such as the canonical NCHW strides , and custom strides (MATLAB column-major). I’ve also tried different heuristic modes, such as CUDNN_HEUR_MODE_INSTANT and CUDNN_HEUR_MODE_A (both find engines, all reject).

I’ve also implemented proper RAII cleanup and no premature destruction, and used a canonical memory layout (explicit transpose to NCHW with standard strides), and I’ve used dimension verification in my tests (confirmed correct parsing and propagation).

Despite all of this, it fails consistently in all tested scenarios with the same error.

Additional Context: background of my goals:

I’m writing these files to perform the forward (and eventually backward) pass for the linear convolution + bias operation in a single convolutional layer in a CNN. I am implementing greedy layerwise neural network training in MATLAB with: (1) Large batch sizes (N ≥ 256, typically N ≥ 1024), (2) Single precision (FP32) throughout, and (3) custom gradient computation (only w.r.t. parameters, not inputs).

My goal is to not rely on automatic differentiation routines that need to pass through the forward and backward pass of the linear convolution, instead using cuDNN functionalities (cuDNN-based MEX implementation for these functionalities) that directly calculate the forward and backward pass using the necessary quantities.

I could always go to using the legacy API functions for this (e.g. “cudnnConvolutionForward”). But I want to use the newer graph-API functionalities instead for their benefits, such as conv+bias fusion, but also because of optimization potential vs legacy API, and most notably because the graph-API won’t be depreciated and removed anytime soon, unlike the legacy API functions.

Why I believe this is a bug on NVIDIA’s end:

Basic functionality fails - Conv+bias is the most fundamental fusion operation. Why is something this simple failing?
Unanimous rejection - ALL engines reject, not just some.
Convolution works - Same system/config succeeds without fusion.
Proper implementation - Follows cuDNN documentation and examples.
Highly reproducible - Fails identically across versions, hardware, attempts, configurations, etc…
No working examples - In researching this issue, I couldn’t find any Windows C++ examples of working Graph API fusion…

Questions for NVIDIA / forum readers:

Is conv+bias fusion supported on Windows with Graph API in cuDNN 9.x?
Are there undocumented requirements for virtual tensors or pointwise operations in fusion?
Can you provide a working example of Graph API conv+bias fusion on Windows in C++?
Is this a known issue in cuDNN 9.0-9.14 on Windows?
Will this be fixed, or should I continue with legacy API long-term? It seems like you will depreciate and remove much of the legacy API, so I’m hoping the graph-API will be fixed (or I can see what I’m doing wrong on my end).

Request (ideally one of the following):

Bug Fix: Resolution of the conv+bias fusion issue in a future cuDNN update
Working Example: Windows C++ example demonstrating Graph API conv+bias fusion
Documentation: Clear documentation of any limitations or requirements I may have missed
Workaround Guidance: Official recommendation for fusion on Windows if Graph API is not the solution

Contact Information

My primary email is bengabr1 at umbc dot edu. But I can also be accessed at gabrben11 at gmail dot com. I am happy to provide additional diagnostic information, test patches or proposed fixes, try alternative configurations you suggest, provide remote access for debugging if needed, or anything else you think would be helpful for me to assist in solving this issue.

Attachments (use these to verify the bug):

minimal_reproduction.cpp - Standalone C++ reproduction case. Demonstrates both failing fusion and working conv-only. While I’ve mostly been testing this in MATLAB, these files (including this one) is designed to be compiled and run independently of MATLAB. This is a self-contained test showing the bug.

minimal_reproduction.txt (19.8 KB)
compile_minimal_reproduction.bat - Windows batch file for compilation. This sets up Visual Studio environment, compiles the minimal reproduction with correct path, and is provided for easy reproduction by NVIDIA engineers or others on the forum.

compile_minimal_reproduction.txt (1.7 KB)
diagnostic_output.txt - Complete output showing the bug. Shows Test 1 (Conv+Bias Fusion): ALL engines fail with error 3000, then Test 2 (Conv Only): SUCCESS with 36 engines, engine 1 works. This confirms issue is specifically with fusion, not Graph API in general.

diagnostic_output.txt (1.6 KB)

Optional Files (available upon request):

full_mex_implementation.zip - Complete MATLAB MEX implementation. If requested, this will provide my full implementation with all helper classes and utilities. It will also include test scripts showing reproducibility across configurations, and contains detailed diagnostic tests and memory management code.

How to Reproduce

Quick Test (5 minutes):

Save minimal_reproduction.cpp and compile_minimal_reproduction.bat
Edit batch file to match your CUDA/cuDNN paths
Run: compile_minimal_reproduction.bat
Run: minimal_reproduction.exe
Observe: Test 1 fails (all engines reject fusion), Test 2 succeeds (conv-only works)

Expected Output:

=== TEST 1: Convolution + Bias Fusion ===
Found 3 engines
Testing engine 1/3...
  ✗ Engine 1 finalize failed: 3000 (NOT_SUPPORTED=YES)
[... all engines fail ...]
✗✗✗ FAILURE: ALL 3 ENGINES REJECTED FUSION ✗✗✗

=== TEST 2: Convolution Only (No Fusion) ===
Found 36 engines
✓ Engine 1 SUCCESS!
✓ TEST PASSED

Thank you for your attention to this issue!

bengabr1 · November 3, 2025, 9:56pm

EDIT: this issue has been solved!

The issue was that my codes were trying to use NCHW layout, but conv+bias fusion at least for my setup does not support this layout, it must be in NHWC layout.

If I kept the NCHW layout, it finds 8 engines but retrieves 0 compatible engines. But if I use NHWC layout, it still finds 8 engines but now at least 1 of the engines are compatible.

See the attached “test_variant_1_nhwc.cpp” for a test that shows conv+bias fusion works as long as you use NHWC.

This post has been resolved.
test_variant_1_nhwc.txt (13.1 KB)

Topic		Replies	Views
Cudnn backend api for fused op cuDNN cudnn	8	2247	September 13, 2021
Cudnn fused conv+bias cuDNN	3	2109	December 9, 2021
cuDNN error when run on Windows GPU-Accelerated Libraries	0	1857	October 24, 2014
Fuse Operators cuDNN	6	2463	July 21, 2021
Cudnn conv+bias fusion using backend cuDNN	1	797	March 31, 2023
Question regarding fusion engine in cuDNN frontend library cuDNN	2	1768	August 30, 2021
cuDNN v6.0 failure of filter and workspace initialization for 3D convolution (CUDNN_STATUS_NOT_SUPPORTED) GPU-Accelerated Libraries	0	1669	May 3, 2017
Variations in heuristics for selecting conv algorithm in different APIs cuDNN cudnn	3	978	October 12, 2021
cuDNN v8 backend API for Convolution cuDNN	11	1933	August 21, 2020
cuDNN functions undocumented return value GPU-Accelerated Libraries	0	1315	July 6, 2015

cuDNN Bug Report: Backend Graph API Conv+Bias Fusion Returns NOT_SUPPORTED

cuDNN Bug Report: Backend Graph API Conv+Bias Fusion Returns NOT_SUPPORTED

My environment setup:

Minimal Reproduction Code

Test A: Convolution + Bias Fusion (FAILS)

Test B: Convolution Only, No Fusion (SUCCEEDS)

Reproducibility I’ve observed of the error:

Additional Context: background of my goals:

Why I believe this is a bug on NVIDIA’s end:

Questions for NVIDIA / forum readers:

Request (ideally one of the following):

Contact Information

Attachments (use these to verify the bug):

Optional Files (available upon request):

How to Reproduce

Related topics