I have noticed that cudnnSetTensor and cudnnAddTensor are about 50 times slower for NHWC packed tensors than for NCHW packed tensors. As a workaround, I had to describe the same buffer with a fake NCHW tensor descriptor to make them fast. NVIDIA, please fix this: it should be very easy to do.
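For anyone who wants to try the same workaround, here is a minimal sketch of one way to pick the fake dimensions. It assumes the tensor is fully packed, and it is not necessarily how my descriptor or anyone else's was built. The idea is that a packed NHWC buffer of logical shape (N, H, W, C) has exactly the same memory layout as a packed NCHW tensor whose four dimensions are declared as (N, H, W, C) in that order. The names `Shape`, `fake_nchw_dims`, and `layouts_match` are hypothetical helpers, not cuDNN API; the code only checks the index arithmetic on the host and does not call cuDNN.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Logical shape of a fully packed NHWC tensor (hypothetical helper type).
struct Shape { int n, h, w, c; };

// Linear offset of element (n, h, w, c) in a fully packed NHWC buffer.
std::size_t nhwc_offset(const Shape& s, int n, int h, int w, int c) {
    return ((static_cast<std::size_t>(n) * s.h + h) * s.w + w) * s.c + c;
}

// Linear offset of element (i0, i1, i2, i3) in a fully packed NCHW tensor
// whose dimensions, in NCHW order, are d[0..3].
std::size_t nchw_offset(const std::array<int, 4>& d,
                        int i0, int i1, int i2, int i3) {
    return ((static_cast<std::size_t>(i0) * d[1] + i1) * d[2] + i2) * d[3] + i3;
}

// Hypothetical helper: the four dimensions to declare as NCHW so that the
// fake descriptor covers the same bytes as the real NHWC tensor.
std::array<int, 4> fake_nchw_dims(const Shape& s) {
    return {s.n, s.h, s.w, s.c};
}

// Exhaustively check that both indexing schemes address the same element.
bool layouts_match(const Shape& s) {
    const std::array<int, 4> d = fake_nchw_dims(s);
    for (int n = 0; n < s.n; ++n)
        for (int h = 0; h < s.h; ++h)
            for (int w = 0; w < s.w; ++w)
                for (int c = 0; c < s.c; ++c)
                    if (nhwc_offset(s, n, h, w, c) != nchw_offset(d, n, h, w, c))
                        return false;
    return true;
}
```

The four values from `fake_nchw_dims` would then go to cudnnSetTensor4dDescriptor as n, c, h, w together with CUDNN_TENSOR_NCHW. For cudnnSetTensor, and for cudnnAddTensor when both tensors have the same shape, the operation is element-wise, so the relabeled dimensions should not change the result. A per-channel bias would have to be redescribed as well, since the channel axis becomes the last dimension of the fake descriptor; I have not verified that case here.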
The batch-normalization functions are also extremely slow with NHWC. I’d love a fast version of those, too.