Hi,
I am trying to upgrade my code to use the new batch normalization API, and I am running into several problems.
First, these APIs do not seem to work at all with NCHW tensors. The documentation says: “When the tensor layout is NCHW, higher performance can be obtained when HW-packed tensors are used for x, dy, dx.” My experience is that when the tensor layout is NCHW, all of the new APIs fail with CUDNN_STATUS_NOT_SUPPORTED. My tensor has N = 512, C = 128, H = 15, W = 15.
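For example, this is roughly what I do for the NCHW case (a simplified sketch; the handle, the scale/bias descriptor, and the activation descriptor are set up elsewhere, and error checking is omitted):

// Sketch of the NCHW case that fails for me (FP16 data, N=512, C=128, H=15, W=15).
cudnnTensorDescriptor_t xDesc;
cudnnCreateTensorDescriptor(&xDesc);
cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                           512, 128, 15, 15);

size_t workspaceSize = 0;
cudnnStatus_t status = cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize(
    handle,
    CUDNN_BATCHNORM_SPATIAL_PERSISTENT,
    CUDNN_BATCHNORM_OPS_BN_ACTIVATION,
    xDesc,                    // xDesc
    xDesc,                    // zDesc (same shape and layout)
    xDesc,                    // yDesc
    bnScaleBiasMeanVarDesc,   // derived with cudnnDeriveBNTensorDescriptor
    activationDesc,           // ReLU
    &workspaceSize);
// With NCHW descriptors this already returns CUDNN_STATUS_NOT_SUPPORTED for me,
// as does cudnnBatchNormalizationForwardTrainingEx itself.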
A bigger problem is that cudnnBatchNormalizationForwardTrainingEx fails with CUDNN_STATUS_EXECUTION_FAILED. The strange part is that the same code works on my Linux machine with a Titan V, but fails on my Windows 10 machine with an RTX 2080 Ti. The log of the failing API call is below.
I! CuDNN (v7402) function cudnnBatchNormalizationForwardTrainingEx() called:
i! handle: type=cudnnHandle_t; streamId=0000000000000002;
i! mode: type=cudnnBatchNormMode_t; val=CUDNN_BATCHNORM_SPATIAL_PERSISTENT (2);
i! bnOps: type=cudnnBatchNormOps_t; val=CUDNN_BATCHNORM_OPS_BN_ACTIVATION (1);
i! alpha: type=CUDNN_DATA_FLOAT; val=1.000000;
i! beta: type=CUDNN_DATA_FLOAT; val=0.000000;
i! xDesc: type=cudnnTensorDescriptor_t:
i! dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i! nbDims: type=int; val=4;
i! dimA: type=int; val=[512,128,15,15];
i! strideA: type=int; val=[28800,1,1920,128];
i! xData: location=dev; addr=000000131C600000;
i! zDesc: type=cudnnTensorDescriptor_t:
i! dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i! nbDims: type=int; val=4;
i! dimA: type=int; val=[512,128,15,15];
i! strideA: type=int; val=[28800,1,1920,128];
i! zData: location=dev; addr=000000131E400000;
i! yDesc: type=cudnnTensorDescriptor_t:
i! dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i! nbDims: type=int; val=4;
i! dimA: type=int; val=[512,128,15,15];
i! strideA: type=int; val=[28800,1,1920,128];
i! yData: location=dev; addr=000000131E400000;
i! bnScaleBiasMeanVarDesc: type=cudnnTensorDescriptor_t:
i! dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i! nbDims: type=int; val=4;
i! dimA: type=int; val=[1,128,1,1];
i! strideA: type=int; val=[128,1,1,1];
i! bnScaleData: location=dev; addr=000000131BE00600;
i! bnBiasData: location=dev; addr=000000131BE00800;
i! exponentialAverageFactor: type=double; val=0.990000;
i! resultRunningMeanData: location=dev; addr=000000131BE00A00;
i! resultRunningVarianceData: location=dev; addr=000000131BE00C00;
i! epsilon: type=double; val=0.001000;
i! saveMean: location=dev; addr=000000131BE06E00;
i! saveInvVariance: location=dev; addr=000000131BE07000;
i! activationDesc: type=cudnnActivationDescriptor_t:
i! coef: type=double; val=0.000000;
i! mode: type=cudnnActivationMode_t; val=CUDNN_ACTIVATION_RELU (1);
i! reluNanOpt: type=cudnnNanPropagation_t; val=CUDNN_NOT_PROPAGATE_NAN (0);
i! workspace: location=dev; addr=0000001361E00000;
i! workSpaceSizeInBytes: type=unsigned long long; val=592108;
i! reserveSpace: location=dev; addr=NULL_PTR;
i! reserveSpaceSizeInBytes: type=unsigned long long; val=0;
i! Time: 2019-02-07T16:59:40.464476 (0d+0h+0m+3s since start)
i! Process=7600; Thread=19520; GPU=0; Handle=000002A0695F8D70; StreamId=0000000000000002.
I! CuDNN (v7402) function cudnnGetStream() called:
i! handle: type=cudnnHandle_t; streamId=0000000000000002;
i! Time: 2019-02-07T16:59:40.464476 (0d+0h+0m+3s since start)
i! Process=7600; Thread=19520; GPU=NULL; Handle=NULL; StreamId=NULL.
CUDNN_STATUS_EXECUTION_FAILED
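For reference, the call itself looks roughly like this (a simplified sketch matching the log above; the descriptors are NHWC/FP16 as shown in the log, buffer allocation and error checking are omitted, and the workspace size comes from cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize):

float alpha = 1.0f, beta = 0.0f;

cudnnStatus_t status = cudnnBatchNormalizationForwardTrainingEx(
    handle,
    CUDNN_BATCHNORM_SPATIAL_PERSISTENT,
    CUDNN_BATCHNORM_OPS_BN_ACTIVATION,
    &alpha, &beta,
    xDesc, xData,                        // input tensor, NHWC, FP16
    zDesc, zData,                        // z tensor (same buffer as y in my code)
    yDesc, yData,                        // output tensor
    bnScaleBiasMeanVarDesc,
    bnScaleData, bnBiasData,
    0.99,                                // exponentialAverageFactor
    resultRunningMeanData, resultRunningVarianceData,
    0.001,                               // epsilon
    saveMean, saveInvVariance,
    activationDesc,                      // ReLU, CUDNN_NOT_PROPAGATE_NAN
    workspace, workspaceSizeInBytes,     // 592108 bytes in this run
    nullptr, 0);                         // no reserve space
// This returns CUDNN_STATUS_SUCCESS on the Titan V / Linux machine and
// CUDNN_STATUS_EXECUTION_FAILED on the RTX 2080 Ti / Windows machine.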
The most disappointing part is that, even on Linux where the new API works, the old NCHW API still produces faster training than the new NHWC API. My network consists of 3x3 convolution layers, each followed by batch normalization. The NHWC convolutions are much faster than the NCHW ones, but NHWC batch normalization is so much slower that the layout switch is not worth it overall, and the NCHW code with the old API ends up faster. Is this performance behavior expected?
Thanks for any help.