According to page 59 of the cuDNN v6.0 user guide, the DP4A configuration requires:

xDesc and wDesc: CUDNN_DATA_INT8x4
I want to implement convolutions with DP4A using the above configuration. My inputs and weights are INT8. However, the configuration requires inputs and weights to be 32-bit values, each packing four 8-bit numbers, i.e. INT8x4 (CUDNN_DATA_INT8x4).

To convert my inputs and weights into the INT8x4 format, I am using the cudnnTransformTensor() API.
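For reference, here is roughly how I do the conversion (a simplified sketch; the buffer names and dimensions are placeholders, and error checking is omitted):

```
#include <cudnn.h>

/* Sketch: repack INT8 NCHW data into INT8x4 NCHW_VECT_C.
 * n, c, h, w and the device buffers are placeholders; c must be a
 * multiple of 4 for the vectorized layout. */
void toInt8x4(cudnnHandle_t handle, const void* srcInt8, void* dstInt8x4,
              int n, int c, int h, int w)
{
    cudnnTensorDescriptor_t srcDesc, dstDesc;
    cudnnCreateTensorDescriptor(&srcDesc);
    cudnnCreateTensorDescriptor(&dstDesc);

    /* The same logical tensor described twice: plain INT8 in NCHW... */
    cudnnSetTensor4dDescriptor(srcDesc, CUDNN_TENSOR_NCHW,
                               CUDNN_DATA_INT8, n, c, h, w);
    /* ...and four channels packed per 32-bit element. */
    cudnnSetTensor4dDescriptor(dstDesc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, n, c, h, w);

    float alpha = 1.0f, beta = 0.0f;
    cudnnTransformTensor(handle, &alpha, srcDesc, srcInt8,
                         &beta, dstDesc, dstInt8x4);

    cudnnDestroyTensorDescriptor(srcDesc);
    cudnnDestroyTensorDescriptor(dstDesc);
}
```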
I have two problems with this configuration:
1. I only get a 3x speedup with DP4A when the output yDesc is of type CUDNN_DATA_FLOAT. However, the other configuration in the user guide, INT8x4_CONFIG, in which yDesc is also CUDNN_DATA_INT8x4, gives a 4x speedup for convolutions.

Is this expected, or should I see a 4x improvement here too?

If not, and I want the 4x improvement, I have to produce CUDNN_DATA_INT8x4 outputs, which means the pooling layers have to be written to accept these packed 32-bit numbers. How can this be done? (See the sketch below for what I imagine this would look like.)
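For example, I imagine pooling directly on the packed data would need a custom kernel that treats each 32-bit element as a char4 and pools each of the four lanes independently. A sketch I put together (not tested; assumes 2x2 max pooling with stride 2, even h and w, and nc4 = n * c/4):

```
#include <cuda_runtime.h>

__device__ signed char max4(signed char a, signed char b,
                            signed char c, signed char d)
{
    return (signed char)max(max((int)a, (int)b), max((int)c, (int)d));
}

/* 2x2 max pooling (stride 2) over NCHW_VECT_C / INT8x4 data.
 * Each element holds four packed int8 channels, pooled lane by lane. */
__global__ void maxPool2x2Int8x4(const char4* __restrict__ in,
                                 char4* __restrict__ out,
                                 int nc4, int h, int w)
{
    int ow = w / 2, oh = h / 2;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nc4 * oh * ow) return;

    int ox = idx % ow;
    int oy = (idx / ow) % oh;
    int nc = idx / (ow * oh);

    /* Top-left corner of the 2x2 input window. */
    const char4* p = in + (nc * h + 2 * oy) * w + 2 * ox;
    char4 a = p[0], b = p[1], c = p[w], d = p[w + 1];

    char4 r;
    r.x = max4(a.x, b.x, c.x, d.x);
    r.y = max4(a.y, b.y, c.y, d.y);
    r.z = max4(a.z, b.z, c.z, d.z);
    r.w = max4(a.w, b.w, c.w, d.w);
    out[idx] = r;
}

/* Launch example:
 *   int total = nc4 * (h / 2) * (w / 2);
 *   maxPool2x2Int8x4<<<(total + 255) / 256, 256>>>(dIn, dOut, nc4, h, w);
 */
```

Is something like this the right direction, or does cuDNN offer a supported path for pooling on INT8x4 data?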
2. When I call cudnnTransformTensor() to convert my inputs and weights from INT8 to INT8x4, the call takes a considerable amount of time in nvprof, which reduces my 3x improvement to only 2x.

In theory, DP4A should give a 4x improvement, but this overhead keeps me from reaching it.
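One idea I am considering, to at least amortize the cost: transform the weights only once at network-setup time, and keep activations in the vectorized layout between layers, so only the first input tensor ever needs a per-batch transform. Roughly (descriptor and buffer names are placeholders):

```
float alpha = 1.0f, beta = 0.0f;

/* Weights: transformed once at setup time and cached, so the cost
 * disappears from the per-batch timeline. */
cudnnTransformTensor(handle, &alpha, wDescInt8, wInt8,
                     &beta, wDescInt8x4, wInt8x4);

while (haveBatches) {
    /* Per batch, only the network input needs repacking... */
    cudnnTransformTensor(handle, &alpha, xDescInt8, xInt8,
                         &beta, xDescInt8x4, xInt8x4);
    /* ...then run cudnnConvolutionForward() etc., keeping every
     * intermediate activation in NCHW_VECT_C so no further
     * transforms are needed between layers. */
}
```

Is this a reasonable way to deal with the overhead, or is there a better approach?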
Please let me know how this can be dealt with.
Thank you very much in advance!