cuDNN Bug Report: Conv3d Performance Regression with bfloat16/float16 on H100

Summary

Conv3d operations exhibit a severe performance regression (on the order of 16,000x slower) when using bfloat16 or float16 inputs compared to float32 on H100 GPUs, for input shapes commonly used in Vision Transformer patch embedding layers.

Environment

  • GPU: NVIDIA H100 PCIe (Compute Capability 9.0)
  • CUDA Version: 12.8
  • cuDNN Version: 9.10.2 (91002)
  • PyTorch Version: 2.9.0+cu128
  • OS: Ubuntu 24.04 (Linux)
  • Driver Version: 570.153.02

Minimal Reproducer

import torch
import time

def benchmark_conv3d(dtype, warmup=3, iterations=10):
    """Benchmark Conv3d with specified dtype."""
    # Conv3d configuration matching Qwen3-VL vision encoder patch_embed
    conv = torch.nn.Conv3d(
        in_channels=3,
        out_channels=1024,
        kernel_size=(2, 16, 16),
        stride=(2, 16, 16),
        bias=True
    ).cuda()
    
    if dtype != torch.float32:
        conv = conv.to(dtype)
    
    # Input shape: 64 images × 144 patches = 9216 batch elements
    # Each element: 3 channels × 2 temporal × 16×16 spatial
    x = torch.randn(9216, 3, 2, 16, 16, dtype=dtype, device='cuda')
    
    # Warmup
    for _ in range(warmup):
        with torch.no_grad():
            _ = conv(x)
        torch.cuda.synchronize()
    
    # Benchmark
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        with torch.no_grad():
            _ = conv(x)
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iterations
    
    return elapsed

if __name__ == "__main__":
    print("Conv3d Benchmark: [9216, 3, 2, 16, 16] -> [9216, 1024, 1, 1, 1]")
    print("=" * 60)
    
    results = {}
    for dtype, name in [
        (torch.float32, "float32"),
        (torch.float16, "float16"),
        (torch.bfloat16, "bfloat16"),
    ]:
        try:
            results[name] = elapsed = benchmark_conv3d(dtype)
            print(f"{name:>10}: {elapsed*1000:>10.2f} ms")
        except Exception as e:
            print(f"{name:>10}: ERROR - {e}")

    # Calculate and display the regression from the timings collected above
    # (avoids re-running the multi-second half-precision benchmarks)
    if "float32" in results and "bfloat16" in results:
        print("=" * 60)
        print(f"Regression factor (bfloat16/float32): {results['bfloat16'] / results['float32']:.0f}x slower")

Observed Results

Data Type    Time per Forward Pass    Relative Performance
float32      2.2 ms                   1.0x (baseline)
float16      35,621 ms                16,191x slower
bfloat16     ~35,000 ms               ~16,000x slower

Expected Behavior

Half-precision operations (bfloat16/float16) should be at least as fast as float32 on H100, which has native Tensor Core support for these data types.

Impact

This issue affects production workloads using Vision-Language Models, specifically:

  • Qwen3-VL (Alibaba) - Uses this exact Conv3d configuration in Qwen3VLVisionPatchEmbed
  • Qwen2-VL - Same architecture
  • Any ViT-style model that processes patches through Conv3d with large batch dimensions

The workaround (forcing float32 for Conv3d) increases memory usage and prevents full utilization of H100’s half-precision capabilities.
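
For context, here is a minimal sketch of how a patch-embed layer of this kind produces the problematic input shape; the class and parameter names are illustrative, not copied from the Qwen implementations.

import torch

class PatchEmbed3D(torch.nn.Module):
    """Illustrative ViT-style patch embed; names and structure are assumptions,
    not the actual Qwen3VLVisionPatchEmbed code."""
    def __init__(self, in_channels=3, embed_dim=1024,
                 temporal_patch_size=2, patch_size=16):
        super().__init__()
        self.proj = torch.nn.Conv3d(
            in_channels, embed_dim,
            kernel_size=(temporal_patch_size, patch_size, patch_size),
            stride=(temporal_patch_size, patch_size, patch_size),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: [num_patches, 3, 2, 16, 16], where num_patches is the total
        # patch count across the batch (e.g. 64 images x 144 patches = 9216),
        # so the conv sees a very large batch dimension and a 1x1x1 spatial output
        return self.proj(patches).flatten(1)   # -> [num_patches, embed_dim]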

Analysis

The issue appears to be in cuDNN’s algorithm selection heuristics. When the combination of:

  1. Large batch dimension (9216)
  2. Small spatial output (1×1×1)
  3. Half-precision dtype (bf16/fp16)

…is encountered, cuDNN selects a catastrophically slow algorithm.
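
A sketch of one way to test this hypothesis: enable PyTorch's cuDNN autotuner, which times every available algorithm for the exact shape instead of trusting the heuristic pick (results not included in this report).

import torch

# Diagnostic sketch: cudnn.benchmark=True asks cuDNN to time all available
# algorithms for this shape rather than relying on its heuristic choice.
torch.backends.cudnn.benchmark = True

conv = torch.nn.Conv3d(3, 1024, kernel_size=(2, 16, 16),
                       stride=(2, 16, 16), bias=True).cuda().bfloat16()
x = torch.randn(9216, 3, 2, 16, 16, dtype=torch.bfloat16, device="cuda")

with torch.no_grad():
    _ = conv(x)   # first call pays the autotuning cost
    torch.cuda.synchronize()
    _ = conv(x)   # subsequent calls reuse the best algorithm found
    torch.cuda.synchronize()

If the second call is fast under benchmark mode, the heuristic pick is to blame; if it is still tens of seconds, the problem is deeper than algorithm selection.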

Evidence that this is an algorithm selection issue (not a fundamental limitation):

  • float32 with identical shapes runs in milliseconds (see the table above), not tens of seconds
  • H100 Tensor Cores natively support bf16/fp16
  • The mathematical operations are identical; because the kernel size equals the stride, this Conv3d reduces to a single GEMM over flattened patches (see the sketch after this list)
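
Because the kernel size equals the stride and matches each patch's extent, the whole operation is equivalent to one GEMM over flattened patches. A sketch of that equivalence (illustrative only, not proposed as the fix):

import torch

conv = torch.nn.Conv3d(3, 1024, kernel_size=(2, 16, 16),
                       stride=(2, 16, 16), bias=True).cuda()
x = torch.randn(9216, 3, 2, 16, 16, device="cuda")

with torch.no_grad():
    ref = conv(x)                                     # [9216, 1024, 1, 1, 1]
    # The same computation expressed as one matrix multiply over flattened patches:
    w = conv.weight.reshape(1024, -1)                 # [1024, 3*2*16*16]
    gemm = x.reshape(9216, -1) @ w.t() + conv.bias    # [9216, 1024]
    print(torch.allclose(ref.flatten(1), gemm, atol=1e-3))  # expected: True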

Workaround

Keep the Conv3d weights in float32, run it with autocast disabled, and cast the output back to the working dtype:

# Slow (36 seconds): conv weights and input both in bfloat16
output = conv(x.bfloat16())

# Fast (0.07 seconds): keep the conv weights in float32, run the conv with
# autocast disabled, then cast the result back to bfloat16
with torch.autocast(device_type="cuda", enabled=False):
    output = conv(x.float()).to(torch.bfloat16)
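
In model code, one way to apply this is a small wrapper around the patch-embed conv; a minimal sketch (the class is mine, not taken from any of the affected models):

import torch

class Float32Conv3dWrapper(torch.nn.Module):
    """Illustrative workaround: keep the Conv3d weights in float32 and cast
    the output back to the caller's dtype, regardless of autocast."""
    def __init__(self, conv: torch.nn.Conv3d):
        super().__init__()
        self.conv = conv.float()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out_dtype = x.dtype
        with torch.autocast(device_type="cuda", enabled=False):
            return self.conv(x.float()).to(out_dtype)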

Request

Please investigate the algorithm selection logic for Conv3d with:

  • Large batch dimensions (>1000)
  • 1×1×1 spatial output
  • Half-precision inputs

The heuristic should select an algorithm at least as fast as the one chosen for float32 inputs with the same shapes.

Additional Information

Happy to provide additional profiling data or Nsight traces, or to run specific diagnostic commands if helpful.
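
For example, here is a sketch of how I would capture cuDNN's own engine-selection log while running the reproducer; the environment variable names are taken from the cuDNN 9 logging documentation as I understand it, and must be set before torch loads libcudnn.

import os

# Sketch: enable cuDNN API/engine logging before torch initializes cuDNN
# (variable names assumed from the cuDNN 9 logging docs)
os.environ["CUDNN_LOGLEVEL_DBG"] = "3"
os.environ["CUDNN_LOGDEST_DBG"] = "cudnn_engine_log.txt"

import torch  # imported after the env vars on purpose

conv = torch.nn.Conv3d(3, 1024, kernel_size=(2, 16, 16),
                       stride=(2, 16, 16), bias=True).cuda().bfloat16()
x = torch.randn(9216, 3, 2, 16, 16, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    _ = conv(x)
torch.cuda.synchronize()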

Disclaimer: an LLM helped me write this issue more clearly.