How to reduce the time spent transforming tensors with cudnnTransformTensor() in cuDNN v6.0?

Hi

Page 59 of the cuDNN v6.0 user guide lists the following DP4A configuration:

Config:
INT8x4_EXT_CONFIG

xDesc and wDesc:
CUDNN_DATA_INT8x4

convDesc:
CUDNN_DATA_INT32

yDesc:
CUDNN_DATA_FLOAT

I want to implement convolutions with DP4A using the above configuration. My inputs and weights are of type ‘int8’, but this configuration requires them to be of type ‘int8x4’, i.e. four 8-bit numbers packed into one 32-bit word.
To convert my inputs and weights into the ‘int8x4’ format, I am using the cudnnTransformTensor() API.
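
For readers unfamiliar with the ‘int8x4’ layout, here is a minimal CPU-side sketch of the repacking that this conversion is understood to perform: plain NCHW int8 data is regrouped so that 4 consecutive channels of the same pixel sit next to each other in memory (logical shape N x C/4 x H x W x 4). The function name is hypothetical and the code is only a reference for checking results, not cuDNN's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper: repack an int8 tensor from NCHW into a
 * layout with vectors of 4 channels, logical shape [N][C/4][H][W][4].
 * The channel count must be a multiple of 4. */
void pack_nchw_to_nchw_vect_c4(const int8_t *src, int8_t *dst,
                               int n, int c, int h, int w)
{
    assert(c % 4 == 0);
    for (int in = 0; in < n; ++in)
        for (int ic = 0; ic < c; ++ic)
            for (int ih = 0; ih < h; ++ih)
                for (int iw = 0; iw < w; ++iw) {
                    /* Source index in NCHW order. */
                    size_t s = (((size_t)in * c + ic) * h + ih) * w + iw;
                    /* Destination index: channel group, pixel, then lane. */
                    size_t d = ((((size_t)in * (c / 4) + ic / 4) * h + ih)
                                * w + iw) * 4 + (size_t)(ic % 4);
                    dst[d] = src[s];
                }
}
```

Comparing the GPU result against such a reference on a small tensor is one way to confirm that the descriptors passed to cudnnTransformTensor() describe the data as intended.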

I have 2 problems with the above configuration:

  1. I only get a 3x speedup with DP4A when the output yDesc is of type CUDNN_DATA_FLOAT. However, another configuration given in the user guide, ‘INT8x4_CONFIG’, in which yDesc is also of type CUDNN_DATA_INT8x4, gives a 4x speedup for convolutions.
    Is this expected, or should I expect a 4x improvement here too?
    If not, and I want the 4x improvement, I get CUDNN_DATA_INT8x4 outputs, which means the pooling layers have to be written to accept these packed 32-bit values. How can this be done?

  2. When I call cudnnTransformTensor() to convert my inputs and weights from INT8 to INT8x4 format, the call takes a considerable amount of time in nvprof, which reduces my 3x improvement to only 2x.
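
To put a number on the overhead in point 2, the two speedups can be turned into a time share. This is a small sketch with a hypothetical helper name; it assumes both speedups are measured against the same baseline and that the only extra cost is the transform.

```c
#include <assert.h>

/* Hypothetical helper: fraction of the INT8 path's total time that is spent
 * outside the convolution itself (here: in the tensor transform).
 * conv_only_speedup: speedup of the convolution alone over the baseline.
 * net_speedup:       speedup of convolution plus transform over the baseline. */
double transform_time_fraction(double conv_only_speedup, double net_speedup)
{
    assert(conv_only_speedup > 0.0 && net_speedup > 0.0);
    assert(net_speedup <= conv_only_speedup);
    /* Normalize the baseline time to 1. */
    double conv_time  = 1.0 / conv_only_speedup;
    double total_time = 1.0 / net_speedup;
    return (total_time - conv_time) / total_time;
}
```

With the figures quoted here (3x for the convolution alone, 2x overall), the transform accounts for about one third of the INT8 path's run time.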

Theoretically, I should expect a 4x improvement when implementing DP4A, but this overhead prevents me from achieving it.
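
The theoretical 4x figure comes from DP4A performing four 8-bit multiply-accumulates per instruction into a 32-bit accumulator. The following is a CPU emulation sketch of that arithmetic, for illustration only; the function names are hypothetical and this is not the GPU intrinsic itself.

```c
#include <stdint.h>

/* Pack four signed 8-bit values into one 32-bit word, a in the lowest byte. */
uint32_t pack_int8x4(int8_t a, int8_t b, int8_t c, int8_t d)
{
    return (uint32_t)(uint8_t)a |
           ((uint32_t)(uint8_t)b << 8) |
           ((uint32_t)(uint8_t)c << 16) |
           ((uint32_t)(uint8_t)d << 24);
}

/* Emulated signed 4-way dot product with accumulate:
 * returns acc + x0*w0 + x1*w1 + x2*w2 + x3*w3. */
int32_t dp4a_emulated(uint32_t x, uint32_t w, int32_t acc)
{
    for (int i = 0; i < 4; ++i) {
        /* Extract byte i and reinterpret it as a signed 8-bit value
         * (assumes the usual two's-complement conversion). */
        int8_t xi = (int8_t)(uint8_t)(x >> (8 * i));
        int8_t wi = (int8_t)(uint8_t)(w >> (8 * i));
        acc += (int32_t)xi * (int32_t)wi;
    }
    return acc;
}
```

Because one such step replaces four separate multiply-adds, 4x is an upper bound on arithmetic throughput; any time spent on layout conversion or memory traffic eats into it.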

Please let me know how this can be dealt with.

Thank you again very much in advance!

Hi,

Were you able to get INT8 convolution working? I got it working for INT8_CONFIG, but for INT8_EXT_CONFIG I am getting junk results. As I understand the documentation, the code stays the same as for INT8_CONFIG except for the output descriptor configuration. My output descriptor configuration is below. I am using a batch size of 1 and 4 output features. Any idea what might be going wrong?

/*create and initialize output tensor*/
cudnnTensorDescriptor_t output_descriptor;
checkCUDNN(cudnnCreateTensorDescriptor(&output_descriptor));
checkCUDNN(cudnnSetTensor4dDescriptor(output_descriptor,
                                     /*format=*/CUDNN_TENSOR_NHWC,
                                     /*datatype=*/CUDNN_DATA_FLOAT,
                                     /*batch_size=*/1,
                                     /*channels=*/4,
                                     /*image_height=*/image.rows,
                                     /*image_width=*/image.cols));

Thank you.
Hanumanth

I got it working with INT8_EXT_CONFIG as well. I was not allocating memory properly for the output buffer.
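
The post above does not say what exactly was wrong with the allocation. One plausible cause, assumed here purely for illustration, is sizing the output buffer for 1-byte INT8 elements when the EXT configuration writes 4-byte floats. A sketch of the size calculation (hypothetical helper name):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes needed for an n x c x h x w output tensor
 * whose elements are 32-bit floats (CUDNN_DATA_FLOAT). */
size_t output_bytes_float(int n, int c, int h, int w)
{
    assert(n > 0 && c > 0 && h > 0 && w > 0);
    return (size_t)n * (size_t)c * (size_t)h * (size_t)w * sizeof(float);
}
```

The dimensions themselves are best taken from cudnnGetConvolution2dForwardOutputDim() rather than from the input image, since they only match the input size for particular padding and stride settings.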

Thank you.

Hi,
How did you use cudnnTransformTensor() to convert weights?
cudnnTransformTensor() takes cudnnTensorDescriptor_t, but weightsDesc has type cudnnFilterDescriptor_t.

https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnTransformTensor

cudnnStatus_t cudnnTransformTensor(
    cudnnHandle_t                  handle,
    const void                    *alpha,
    const cudnnTensorDescriptor_t  xDesc,
    const void                    *x,
    const void                    *beta,
    const cudnnTensorDescriptor_t  yDesc,
    void                          *y)

Hi,

I have the same question. Did you figure it out?

Thank you.

Hi CUDALIKE,

I figured out how to use cudnnTransformTensor() to transform the kernel (weights), and it works for me. If you want, I can post a fuller code snippet.

Just describe the weights with tensor descriptors instead of a filter descriptor, like this:

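/* Source: weights as a plain 4-D int8 tensor (l.n x l.c x l.size x l.size). */
cudnnTensorDescriptor_t src_weights_desc;
cudnnCreateTensorDescriptor(&src_weights_desc);
cudnnSetTensor4dDescriptor(src_weights_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_INT8, l.n, l.c, l.size, l.size);

/* Destination: same dimensions, packed as int8x4 (l.c must be a multiple of 4). */
cudnnTensorDescriptor_t dst_weights_desc;
cudnnCreateTensorDescriptor(&dst_weights_desc);
cudnnSetTensor4dDescriptor(dst_weights_desc, CUDNN_TENSOR_NCHW_VECT_C, CUDNN_DATA_INT8x4, l.n, l.c, l.size, l.size);

/* Scaling factors: y = one * x + zero * y, i.e. a pure layout conversion. */
float one = 1;
float zero = 0;
cudnnStatus_t transform_status;
transform_status =
    cudnnTransformTensor(
        cudnn_handle(),
        &one,
        src_weights_desc,
        l.weights_int8_gpu,          /* int8 weights in NCHW order */
        &zero,
        dst_weights_desc,
        l.weights_int8_int8x4_gpu);  /* output buffer, same size in bytes */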