### Describe the issue
Hello, when I run inference on input data of shape (70000, 16), I get the following error message:
2023-09-04 10:54:38.941019182 [E:onnxruntime:Model, cuda_call.cc:116 CudaCall] CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
2023-09-04 10:54:38.941211047 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_2' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
terminate called after throwing an instance of 'Ort::Exception'
what(): Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_2' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
It looks like the cuDNN-backed BatchNormalization in onnxruntime doesn't support inputs of this size. Reducing the number of channels did not help, but once I reduced the number of rows (data points per channel) below 50000, it worked.
Is there an exact limit on the input size for batch normalization? And how can I work around it if I really need to run inference on data of this scale?
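One possible workaround, sketched here rather than verified on the failing setup, is to split the input along the first dimension. In inference mode, batch normalization uses fixed running statistics, so each row is normalized independently and the chunked result should match the full one. The helper names `batch_norm_inference` and `run_in_chunks` are hypothetical, the NumPy function only stands in for the ONNX Runtime call, and the chunk size of 40000 is an assumption based on the observation that sizes below 50000 work.

```python
import numpy as np

def batch_norm_inference(x, scale, bias, mean, var, eps=1e-5):
    """Reference inference-mode batch norm over an (N, C) array (stand-in for the real model)."""
    return scale * (x - mean) / np.sqrt(var + eps) + bias

def run_in_chunks(run_fn, x, max_rows):
    """Apply run_fn to row slices of at most max_rows rows and concatenate the results."""
    outputs = [
        run_fn(x[start:start + max_rows])
        for start in range(0, x.shape[0], max_rows)
    ]
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
x = rng.random((68000, 16), dtype=np.float32)
scale = rng.random(16, dtype=np.float32)
bias = rng.random(16, dtype=np.float32)
mean = rng.random(16, dtype=np.float32)
var = rng.random(16, dtype=np.float32) + 0.5  # keep the variance away from zero

def bn(rows):
    return batch_norm_inference(rows, scale, bias, mean, var)

full = bn(x)                                  # one pass over all rows
chunked = run_in_chunks(bn, x, max_rows=40000)  # 40000 is an assumed safe size
```

With the real model, `run_fn` would wrap `ort_session.run`. The exported model would then need a variable first dimension, for example via the `dynamic_axes` argument of `torch.onnx.export`, since the repro exports a fixed input shape.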
### To reproduce
This issue happens with both the C++ and Python APIs.
Here is a repro case in Python:
```
import numpy as np
import torch
import torch.nn as nn
import onnx
import onnxruntime as ort
# 1. Generate random data
data = np.random.rand(68000, 16).astype(np.float32)
# 2. Define a simple model in PyTorch with a batch normalization layer
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.batch_norm = nn.BatchNorm1d(16)

    def forward(self, x):
        return self.batch_norm(x)

model = SimpleModel().cuda()
# Convert data to PyTorch tensor
tensor_data = torch.tensor(data).cuda()
# 3. Export the PyTorch model to ONNX format
torch.onnx.export(model, tensor_data, "simple_model.onnx", verbose=True, input_names=['input'], output_names=['output'])
# 4. Perform inference using ONNX Runtime
ort_session = ort.InferenceSession("simple_model.onnx", providers=['CUDAExecutionProvider'])
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(tensor_data)}
ort_outs = ort_session.run(None, ort_inputs)
print(ort_outs[0])
```
Output:
graph(%input : Float(68000, 16, strides=[16, 1], requires_grad=0, device=cuda:0),
%batch_norm.weight : Float(16, strides=[1], requires_grad=1, device=cuda:0),
%batch_norm.bias : Float(16, strides=[1], requires_grad=1, device=cuda:0),
%batch_norm.running_mean : Float(16, strides=[1], requires_grad=0, device=cuda:0),
%batch_norm.running_var : Float(16, strides=[1], requires_grad=0, device=cuda:0)):
%output : Float(68000, 16, strides=[16, 1], requires_grad=1, device=cuda:0) = onnx::BatchNormalization[epsilon=1.0000000000000001e-05, momentum=0.90000000000000002](%input, %batch_norm.weight, %batch_norm.bias, %batch_norm.running_mean, %batch_norm.running_var) # /home/dell/anaconda3/envs/pointseg/lib/python3.8/site-packages/torch/nn/functional.py:2282:0
return (%output)
2023-09-04 11:03:58.430848248 [E:onnxruntime:Default, cuda_call.cc:116 CudaCall] CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
2023-09-04 11:03:58.430958361 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_0' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
Traceback (most recent call last):
File "/home/dell/CLionProjects/NewSpconvOp/test_cases/bn_test.py", line 34, in <module>
ort_outs = ort_session.run(None, ort_inputs)
File "/home/dell/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 217, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_0' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
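To pin down the row count at which the failure starts, the repro could be wrapped in a probe and bisected. The sketch below shows only the search. `largest_passing` and `fake_probe` are hypothetical names, the threshold inside `fake_probe` is made up for illustration, and the search assumes the behaviour is monotonic (once a size fails, all larger sizes fail too).

```python
from typing import Callable

def largest_passing(works: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the largest n in [lo, hi) for which works(n) is True.

    Assumes works(lo) is True, works(hi) is False, and that failures
    are monotonic in n.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(mid):
            lo = mid   # mid still works, so the limit is at or above mid
        else:
            hi = mid   # mid fails, so the limit is below mid
    return lo

# Stand-in probe for illustration only; the threshold here is invented.
def fake_probe(n_rows: int) -> bool:
    return n_rows <= 12345

limit = largest_passing(fake_probe, lo=1, hi=70000)
```

A real probe would export the model for `n_rows` rows, call `ort_session.run`, and return `False` when the call raises the exception shown in the traceback above.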
### Urgency
_No response_
### Platform
Linux
### OS Version
Ubuntu 18.04
### ONNX Runtime Installation
Built from Source
### ONNX Runtime Version or Commit ID
1.15
### ONNX Runtime API
C++
### Architecture
X64
### Execution Provider
CUDA
### Execution Provider Library Version
CUDA 11.4 RTX 3060
### Model File
_No response_
### Is this a quantized model?
No