We are working on migrating our convolutional layer implementation from the legacy cuDNN API (e.g. cudnnConvolutionBiasActivationForward) to the cuDNN Frontend API.
However, we are experiencing severe problems with the new API.
Internally our tensors are stored in NCHW layout, so we use this layout with cuDNN as well. We never encountered any problems with the legacy API and this layout. With the new API, however, some configurations simply fail with NCHW layout, reporting "no engine configuration found". A simple example is a stand-alone sigmoid activation on a tensor of shape [N=2, C=1, H=1, W=2]. We could only work around this by changing the shape of the tensor (which is acceptable here, given the element-wise nature of the activation operation).
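For reference, this is roughly how we build the failing stand-alone sigmoid graph (a minimal sketch using the cuDNN Frontend v1.x graph API, with assumed FP32 data types and the handle/execute boilerplate omitted, not our exact code):

```cpp
#include <cudnn_frontend.h>
namespace fe = cudnn_frontend;

// Minimal sketch: stand-alone sigmoid on an NCHW tensor [N=2, C=1, H=1, W=2].
fe::graph::Graph graph;
graph.set_io_data_type(fe::DataType_t::FLOAT)
     .set_intermediate_data_type(fe::DataType_t::FLOAT)
     .set_compute_data_type(fe::DataType_t::FLOAT);

int64_t n = 2, c = 1, h = 1, w = 2;
auto X = graph.tensor(fe::graph::Tensor_attributes()
                          .set_name("X")
                          .set_dim({n, c, h, w})
                          .set_stride({c * h * w, h * w, w, 1}));  // packed NCHW strides

auto Y = graph.pointwise(X, fe::graph::Pointwise_attributes()
                                .set_mode(fe::PointwiseMode_t::SIGMOID_FWD));
Y->set_output(true);

// For these dims/strides the plan-building stage fails with
// "no engine configuration found" unless we reshape the tensor first.
// (validate() / build_operation_graph(handle) / create_execution_plans({fe::HeurMode_t::A})
//  / check_support(handle) / build_plans(handle) follow here.)
```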
The same problem occurs with the slice operation, where unfortunately changing the shape of the tensor is not an option.
Most of these problems vanish if we use NHWC layout with cuDNN Frontend (we tried that because tensor cores are optimized for this layout), though not all of them; for example, the activation only seems to reliably find engines for 3-dimensional tensors. Still, this is a big issue for us, as a move to NHWC would imply a major rewrite of our codebase.
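As an aside, in case it matters: since the graph API expresses layout purely through strides (dims stay in NCHW order), switching a tensor to NHWC on the cuDNN side only means changing its strides. A small sketch of the two variants we compared (the helper functions are our own, assuming fully packed tensors):

```cpp
#include <cstdint>
#include <vector>

// Dims are always passed to cuDNN in NCHW order; the physical layout is encoded
// in the strides. These helpers are our own, not part of cuDNN Frontend.
std::vector<int64_t> nchw_strides(int64_t n, int64_t c, int64_t h, int64_t w) {
    return {c * h * w, h * w, w, 1};   // packed NCHW
}

std::vector<int64_t> nhwc_strides(int64_t n, int64_t c, int64_t h, int64_t w) {
    return {h * w * c, 1, w * c, c};   // packed NHWC (dims still ordered N, C, H, W)
}
```

The stride change itself is trivial; the rewrite effort comes from the rest of our codebase assuming NCHW memory order, so the data would have to be transposed at the boundaries or all layers converted.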
Another problem is performance. With NCHW layout, essentially all operations are slower with cuDNN Frontend than with the legacy API, even though we implemented autotuning on top of the Frontend. For example, we have a convolution fused with a sigmoid activation that is up to a factor of 20 slower than the legacy API. We can work around this by splitting it into two separate graphs, but that introduces unnecessary additional memory usage.
The configuration for this graph is the following:
Input: N=1, C=256, H=80, W=128
Number of filters: 90
Kernel Size: 3x3
Padding: 1
Activation: Sigmoid
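For completeness, here is roughly how we set up that fused graph (again a sketch with the v1.x graph API; data types and the plan-building/autotuning sequence are simplified assumptions, not our exact production code):

```cpp
#include <cudnn_frontend.h>
namespace fe = cudnn_frontend;

// Sketch: conv (256 -> 90 filters, 3x3, padding 1) fused with sigmoid, NCHW strides.
fe::graph::Graph graph;
graph.set_io_data_type(fe::DataType_t::FLOAT)
     .set_intermediate_data_type(fe::DataType_t::FLOAT)
     .set_compute_data_type(fe::DataType_t::FLOAT);

int64_t n = 1, c = 256, h = 80, w = 128, k = 90, r = 3, s = 3;

auto X = graph.tensor(fe::graph::Tensor_attributes()
                          .set_name("X")
                          .set_dim({n, c, h, w})
                          .set_stride({c * h * w, h * w, w, 1}));   // NCHW
auto W = graph.tensor(fe::graph::Tensor_attributes()
                          .set_name("W")
                          .set_dim({k, c, r, s})
                          .set_stride({c * r * s, r * s, s, 1}));   // KCRS

auto conv_out = graph.conv_fprop(X, W,
                                 fe::graph::Conv_fprop_attributes()
                                     .set_padding({1, 1})
                                     .set_stride({1, 1})
                                     .set_dilation({1, 1}));

auto Y = graph.pointwise(conv_out, fe::graph::Pointwise_attributes()
                                       .set_mode(fe::PointwiseMode_t::SIGMOID_FWD));
Y->set_output(true);

// We then validate(), build_operation_graph(handle),
// create_execution_plans({fe::HeurMode_t::A, fe::HeurMode_t::FALLBACK}),
// check_support(handle), build_plans(handle), and time the resulting plans
// to pick the fastest one.
```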
What is your advice on how to proceed? Will the situation regarding NCHW layout improve in cuDNN Frontend (or rather in the underlying new cuDNN graph backend), or is a rewrite towards NHWC our only option?