According to page 59 of the cuDNN v6.0 user guide, the DP4A configuration requires:

xDesc and wDesc: CUDNN_DATA_INT8x4
I want to implement convolutions with DP4A using the above configuration. My inputs and weights are INT8. However, the configuration requires inputs and weights to be 32-bit values, each packing four 8-bit numbers, i.e. INT8x4 (CUDNN_DATA_INT8x4).

To convert my inputs and weights into the INT8x4 format, I am using the cudnnTransformTensor() API.
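For reference, here is roughly how I do the conversion (a simplified sketch; the buffer names and dimensions are placeholders, and error checking is omitted):

```
#include <cudnn.h>

/* Sketch: repack INT8 NCHW data into INT8x4 NCHW_VECT_C.
 * n, c, h, w and the device buffers are placeholders; c must be a
 * multiple of 4 for the vectorized layout. */
void toInt8x4(cudnnHandle_t handle, const void* srcInt8, void* dstInt8x4,
              int n, int c, int h, int w)
{
    cudnnTensorDescriptor_t srcDesc, dstDesc;
    cudnnCreateTensorDescriptor(&srcDesc);
    cudnnCreateTensorDescriptor(&dstDesc);

    /* The same logical tensor described twice: plain INT8 in NCHW... */
    cudnnSetTensor4dDescriptor(srcDesc, CUDNN_TENSOR_NCHW,
                               CUDNN_DATA_INT8, n, c, h, w);
    /* ...and four channels packed per 32-bit element. */
    cudnnSetTensor4dDescriptor(dstDesc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, n, c, h, w);

    float alpha = 1.0f, beta = 0.0f;
    cudnnTransformTensor(handle, &alpha, srcDesc, srcInt8,
                         &beta, dstDesc, dstInt8x4);

    cudnnDestroyTensorDescriptor(srcDesc);
    cudnnDestroyTensorDescriptor(dstDesc);
}
```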
I have two problems with this configuration:
1. I only get a 3x speedup with DP4A when the output yDesc is of type CUDNN_DATA_FLOAT. However, the other configuration in the user guide, INT8x4_CONFIG, in which yDesc is also CUDNN_DATA_INT8x4, gives a 4x speedup for convolutions.

Is this expected, or should I see a 4x improvement here too?

If not, and I want the 4x improvement, I have to produce CUDNN_DATA_INT8x4 outputs, which means the pooling layers have to be written to accept these packed 32-bit numbers. How can this be done? (See the sketch below for what I imagine this would look like.)
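For example, I imagine pooling directly on the packed data would need a custom kernel that treats each 32-bit element as a char4 and pools each of the four lanes independently. A sketch I put together (not tested; assumes 2x2 max pooling with stride 2, even h and w, and nc4 = n * c/4):

```
#include <cuda_runtime.h>

__device__ signed char max4(signed char a, signed char b,
                            signed char c, signed char d)
{
    return (signed char)max(max((int)a, (int)b), max((int)c, (int)d));
}

/* 2x2 max pooling (stride 2) over NCHW_VECT_C / INT8x4 data.
 * Each element holds four packed int8 channels, pooled lane by lane. */
__global__ void maxPool2x2Int8x4(const char4* __restrict__ in,
                                 char4* __restrict__ out,
                                 int nc4, int h, int w)
{
    int ow = w / 2, oh = h / 2;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nc4 * oh * ow) return;

    int ox = idx % ow;
    int oy = (idx / ow) % oh;
    int nc = idx / (ow * oh);

    /* Top-left corner of the 2x2 input window. */
    const char4* p = in + (nc * h + 2 * oy) * w + 2 * ox;
    char4 a = p[0], b = p[1], c = p[w], d = p[w + 1];

    char4 r;
    r.x = max4(a.x, b.x, c.x, d.x);
    r.y = max4(a.y, b.y, c.y, d.y);
    r.z = max4(a.z, b.z, c.z, d.z);
    r.w = max4(a.w, b.w, c.w, d.w);
    out[idx] = r;
}

/* Launch example:
 *   int total = nc4 * (h / 2) * (w / 2);
 *   maxPool2x2Int8x4<<<(total + 255) / 256, 256>>>(dIn, dOut, nc4, h, w);
 */
```

Is something like this the right direction, or does cuDNN offer a supported path for pooling on INT8x4 data?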
2. When I call cudnnTransformTensor() to convert my inputs and weights from INT8 to INT8x4, the call takes a considerable amount of time in nvprof, which reduces my 3x improvement to only 2x.

In theory, DP4A should give a 4x improvement, but this overhead keeps me from reaching it.
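One idea I am considering, to at least amortize the cost: transform the weights only once at network-setup time, and keep activations in the vectorized layout between layers, so only the first input tensor ever needs a per-batch transform. Roughly (descriptor and buffer names are placeholders):

```
float alpha = 1.0f, beta = 0.0f;

/* Weights: transformed once at setup time and cached, so the cost
 * disappears from the per-batch timeline. */
cudnnTransformTensor(handle, &alpha, wDescInt8, wInt8,
                     &beta, wDescInt8x4, wInt8x4);

while (haveBatches) {
    /* Per batch, only the network input needs repacking... */
    cudnnTransformTensor(handle, &alpha, xDescInt8, xInt8,
                         &beta, xDescInt8x4, xInt8x4);
    /* ...then run cudnnConvolutionForward() etc., keeping every
     * intermediate activation in NCHW_VECT_C so no further
     * transforms are needed between layers. */
}
```

Is this a reasonable way to deal with the overhead, or is there a better approach?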
Please let me know how this can be dealt with.
Thank you very much in advance!