CUDA 11.2 + cuDNN 8.1.1

GPU: RTX 4000

OS: Ubuntu 20.04

FP16:

```
./conv_sample -mathType1 -filterFormat1 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x
Executing: conv_sample -mathType1 -filterFormat1 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x
Using format CUDNN_TENSOR_NHWC (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)
Testing single precision
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.353999 sec,
Testing half precision (math in single precision)
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.00474882 sec,
```

INT8:

```
./conv_sample -mathType1 -filterFormat2 -dataType2 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x
Executing: conv_sample -mathType1 -filterFormat2 -dataType2 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x
Using format CUDNN_TENSOR_NCHW_VECT_C (for single and double precision tests use a different format)
Testing int8x4 (math in int32)
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.327955 sec,
Testing int8x32 (math in int32)
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.00628805 sec,
```

I believe the RTX 4000 has INT8 tensor cores, so why is INT8 (0.00628805 sec for int8x32) slower than FP16 (0.00474882 sec for half precision) on the same convolution?
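
To put the two fast timings on the same scale, here is a rough sketch that converts them into effective throughput. It assumes each `elapsed` value covers exactly one forward convolution and counts 2 ops per multiply-accumulate; both are my assumptions, not something the sample output confirms. The variable names are only for illustration.

```python
# Dimensions taken from the conv_sample command lines above.
n, c, h, w = 32, 32, 300, 300      # -n -c -h -w
k, r, s = 32, 3, 3                 # -k -r -s
pad_h, pad_w = 1, 1                # -pad_h -pad_w
u, v = 1, 1                        # -u -v (strides)

# Standard convolution output size; matches the logged 300 x 300.
out_h = (h + 2 * pad_h - r) // u + 1
out_w = (w + 2 * pad_w - s) // v + 1

# Assumption: 2 ops (multiply + add) per MAC.
macs = n * k * out_h * out_w * c * r * s
ops = 2 * macs

# Elapsed times copied from the logs above.
# Assumption: each one covers a single forward convolution.
fp16_sec = 0.00474882
int8x32_sec = 0.00628805

fp16_tops = ops / fp16_sec / 1e12
int8x32_tops = ops / int8x32_sec / 1e12
slowdown = int8x32_sec / fp16_sec

print(f"output dims: {out_h} x {out_w}")
print(f"FP16    effective throughput: {fp16_tops:.2f} TFLOPS")
print(f"INT8x32 effective throughput: {int8x32_tops:.2f} TOPS")
print(f"INT8x32 / FP16 time ratio:    {slowdown:.2f}x")
```

Under those assumptions INT8x32 takes about 1.32x as long as FP16 here, even though INT8 tensor-core throughput would normally be expected to be higher than FP16.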