Int8 is 30% slower than fp16 in cudnn_samples_v8/conv_sample

CUDA 11.2 + cuDNN 8.1.1
GPU: RTX 4000
OS: Ubuntu 20.04

FP16:

./conv_sample -mathType1 -filterFormat1   -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x

Executing: conv_sample -mathType1 -filterFormat1 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x
Using format CUDNN_TENSOR_NHWC (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)
Testing single precision
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.353999 sec,  
Testing half precision (math in single precision)
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.00474882 sec,

INT8:

./conv_sample -mathType1 -filterFormat2 -dataType2 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x
Executing: conv_sample -mathType1 -filterFormat2 -dataType2 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x
Using format CUDNN_TENSOR_NCHW_VECT_C (for single and double precision tests use a different format)
Testing int8x4 (math in int32)
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.327955 sec,  
Testing int8x32 (math in int32)
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.00628805 sec,

I believe the RTX 4000 has INT8 tensor cores, so why is INT8 slower than FP16?
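For reference, the slowdown in the title can be computed directly from the elapsed times printed above (a quick sketch; the two values are the `Testing half precision` and `Testing int8x32` conv timings):

```python
# Elapsed times reported by conv_sample above (seconds)
fp16_time = 0.00474882   # "Testing half precision" conv
int8_time = 0.00628805   # "Testing int8x32" conv

ratio = int8_time / fp16_time
print(f"int8x32 takes {ratio:.2f}x the fp16 time (~{(ratio - 1) * 100:.0f}% slower)")
```

So int8x32 is roughly 32% slower than fp16 here, despite the lower precision.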

ping

@shshao Yes! It's a known issue in PyTorch.
For the RTX 4000 you should use CUDA 11.8 (officially supported) and cuDNN 8.7!!


Thank you so much! Would you mind giving more details for future reference, such as which GPUs this happens on and how CUDA 11.8 + cuDNN 8.7 fixed it?