A100 nsight compute profiling error "cuDNN error: CUDNN_STATUS_INTERNAL_ERROR"

I want to profile vgg.py on an A100 GPU with the Nsight Compute CLI. The command I used is below:
sudo /usr/local/cuda-11.0/nsight-compute-2020.1.2/target/linux-desktop-glibc_2_11_3-x64/ncu --export "temp_result" --force-overwrite --target-processes all --kernel-regex-base function --launch-skip-before-match 0 --sampling-interval auto --sampling-buffer-size 33554432 --cache-control all --clock-control base --apply-rules yes --metrics smsp__sass_data_bytes_mem_shared --page details --csv /opt/conda/bin/python /home/vgg.py

The error is:
Traceback (most recent call last):
  File "/home/vgg.py", line 161, in <module>
    main()
  File "/home/vgg.py", line 75, in main
    _ = model(dummy_input_batch)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/models/resnet.py", line 249, in forward
    return self._forward_impl(x)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/models/resnet.py", line 232, in _forward_impl
    x = self.conv1(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 440, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 3, 224, 224], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(3, 64, kernel_size=[7, 7], padding=[3, 3], stride=[2, 2], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [3, 3, 0]
    stride = [2, 2, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x5560d2a70980
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 3, 224, 224,
    strideA = 150528, 50176, 224, 1,
output: TensorDescriptor 0x5560ce2d4c00
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 112, 112,
    strideA = 802816, 12544, 112, 1,
weight: FilterDescriptor 0x5560d154efa0
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 3, 7, 7,
Pointer addresses:
    input: 0x7f783d400000
    output: 0x7f7841660000
    weight: 0x7f7840600000

I recommend trying this again with the latest available Nsight Compute version, which includes many bug fixes and new features. You don't need to use the version shipped with your CUDA 11.0 toolkit, even when building your apps with that toolkit. You can download the latest standalone version from the NVIDIA Nsight Compute page on the NVIDIA Developer site.
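If it helps to confirm which Nsight Compute build is actually being picked up after installing a newer one, here is a minimal Python sketch. It assumes that `ncu --version` prints a dotted version number somewhere in its output; the helper names `parse_ncu_version` and `installed_ncu_version` are made up for this example and are not part of any NVIDIA tool.

```python
import re
import shutil
import subprocess


def parse_ncu_version(text):
    """Return the first dotted version number in `text` as a tuple of ints, or None.

    Assumption: the version appears as digits separated by dots,
    e.g. "2020.1.2" in a path or in the output of `ncu --version`.
    """
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_ncu_version(ncu_path="ncu"):
    """Run `<ncu_path> --version` and parse the result; None if ncu is not found."""
    exe = shutil.which(ncu_path)
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    return parse_ncu_version(result.stdout + result.stderr)


if __name__ == "__main__":
    # Version embedded in the toolkit path used in the original command.
    bundled = parse_ncu_version("nsight-compute-2020.1.2")
    found = installed_ncu_version()
    print("bundled with CUDA 11.0 toolkit path:", bundled)
    print("ncu found on PATH:", found)
    if found is not None and found > bundled:
        print("A newer standalone ncu is on PATH; profile with that one.")
```

Because the original command calls the toolkit's `ncu` by its full path, a newer install is only used if that path in the command is changed to point at it.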

Thanks for the help!