Fuse Operators

Hi, all.
I’m a beginner cuDNN developer and I’m trying to fuse operators for inference.
When going through the documentation (Developer Guide :: NVIDIA Deep Learning cuDNN Documentation), I learned that this might be done by using the backend API. (Please correct me if I’m wrong; I’m not 100% sure because the seemingly relevant functions are in the training library.)
Although this post provides a good high-level overview, I’m still not sure how to actually implement operator fusion.
For example, it seems like the following functions might be necessary.

  • cudnnCreateFusedOpsPlan()
  • cudnnCreateFusedOpsVariantParamPack()
  • cudnnCreateFusedOpsConstParamPack()
  • cudnnMakeFusedOpsPlan()
  • cudnnFusedOpsExecute()

Are these sufficient to get the right results?
Also, I’m not sure how to pass the input/output tensor information and detailed parameters for convolution, relu, etc.
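
To make the question concrete, here is the rough call sequence I’ve pieced together from the API reference. It is completely untested; the attribute labels are my guesses from cudnn.h, and all descriptor creation and error checking are omitted, so please correct anything that’s off.

```cpp
#include <cudnn.h>

// Guessed fused conv + scale/bias + activation flow for inference. I suspect
// some required attributes (e.g. the *_PLACEHOLDER hints) are still missing.
cudnnStatus_t fused_conv_act(cudnnHandle_t handle,
                             cudnnTensorDescriptor_t xDesc, void* x,
                             cudnnFilterDescriptor_t wDesc, void* w,
                             cudnnTensorDescriptor_t yDesc, void* y,
                             cudnnConvolutionDescriptor_t convDesc,
                             cudnnActivationDescriptor_t actDesc,
                             void* workspace, size_t workspaceSize) {
    const cudnnFusedOps_t op = CUDNN_FUSED_CONV_SCALE_BIAS_ADD_ACTIVATION;

    // 1) Const param pack: descriptors that stay fixed across executions.
    cudnnFusedOpsConstParamPack_t constPack;
    cudnnCreateFusedOpsConstParamPack(&constPack, op);
    cudnnSetFusedOpsConstParamPackAttribute(constPack, CUDNN_PARAM_XDESC, xDesc);
    cudnnSetFusedOpsConstParamPackAttribute(constPack, CUDNN_PARAM_WDESC, wDesc);
    cudnnSetFusedOpsConstParamPackAttribute(constPack, CUDNN_PARAM_YDESC, yDesc);
    cudnnSetFusedOpsConstParamPackAttribute(constPack, CUDNN_PARAM_CONV_DESC, convDesc);
    cudnnSetFusedOpsConstParamPackAttribute(constPack, CUDNN_PARAM_ACTIVATION_DESC, actDesc);

    // 2) Plan: finalized against the const params; reports the workspace it needs.
    cudnnFusedOpsPlan_t plan;
    size_t planWorkspaceSize = 0;  // in a real program, size `workspace` from this
    cudnnCreateFusedOpsPlan(&plan, op);
    cudnnMakeFusedOpsPlan(handle, plan, constPack, &planWorkspaceSize);

    // 3) Variant param pack: the actual device pointers, which may change per call.
    cudnnFusedOpsVariantParamPack_t varPack;
    cudnnCreateFusedOpsVariantParamPack(&varPack, op);
    cudnnSetFusedOpsVariantParamPackAttribute(varPack, CUDNN_PTR_XDATA, x);
    cudnnSetFusedOpsVariantParamPackAttribute(varPack, CUDNN_PTR_WDATA, w);
    cudnnSetFusedOpsVariantParamPackAttribute(varPack, CUDNN_PTR_YDATA, y);
    cudnnSetFusedOpsVariantParamPackAttribute(varPack, CUDNN_PTR_WORKSPACE, workspace);
    cudnnSetFusedOpsVariantParamPackAttribute(
        varPack, CUDNN_SCALAR_SIZE_T_WORKSPACE_SIZE_IN_BYTES, &workspaceSize);

    // 4) Run the fused operation, then clean up.
    cudnnStatus_t status = cudnnFusedOpsExecute(handle, plan, varPack);
    cudnnDestroyFusedOpsVariantParamPack(varPack);
    cudnnDestroyFusedOpsConstParamPack(constPack);
    cudnnDestroyFusedOpsPlan(plan);
    return status;
}
```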

I went through the cuDNN sample code to find an example, but unfortunately I had no luck.
If anyone can provide a hint, feedback, an idea, or a code sample, it would be greatly appreciated.

Thank you in advance!

Hi Sunggg, cuDNN engineer here! What you have listed are our older fusion APIs (from the v7 era). If you are just getting started, we suggest starting with the new v8 graph API via the frontend C++ wrapper for better support and convenience.
You can find the wrapper and fusion code samples here

(search for run_conv_bias_add_activation in cudnn-frontend/conv_sample.cpp at main · NVIDIA/cudnn-frontend · GitHub)
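
To give a feel for the shape of the code, here is a trimmed-down sketch of a conv + bias + activation graph (tensor and convolution descriptor construction omitted, and some builder details simplified; the run_conv_bias_add_activation sample above is the complete, authoritative version):

```cpp
#include <array>
#include <cudnn_frontend.h>

// Simplified outline: build one operation node per op and stitch them into a
// single operation graph via the intermediate (virtual) tensors.
void conv_bias_act_graph(cudnnHandle_t handle,
                         cudnn_frontend::Tensor& xTensor,    // input
                         cudnn_frontend::Tensor& wTensor,    // filter
                         cudnn_frontend::Tensor& bTensor,    // bias
                         cudnn_frontend::Tensor& afterConv,  // virtual intermediate
                         cudnn_frontend::Tensor& afterBias,  // virtual intermediate
                         cudnn_frontend::Tensor& yTensor,    // output
                         cudnn_frontend::ConvDesc& convDesc) {
    // Pointwise descriptors for the bias add and the ReLU.
    auto addDesc = cudnn_frontend::PointWiseDescBuilder()
                       .setMode(CUDNN_POINTWISE_ADD)
                       .setMathPrecision(CUDNN_DATA_FLOAT)
                       .build();
    auto reluDesc = cudnn_frontend::PointWiseDescBuilder()
                        .setMode(CUDNN_POINTWISE_RELU_FWD)
                        .setMathPrecision(CUDNN_DATA_FLOAT)
                        .build();

    // Convolution, then bias add, then ReLU, chained through virtual tensors.
    auto convOp = cudnn_frontend::OperationBuilder(
                      CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR)
                      .setxDesc(xTensor).setwDesc(wTensor).setyDesc(afterConv)
                      .setcDesc(convDesc).setAlpha(1.0f).setBeta(0.0f)
                      .build();
    auto biasOp = cudnn_frontend::OperationBuilder(
                      CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR)
                      .setxDesc(afterConv).setbDesc(bTensor).setyDesc(afterBias)
                      .setpwDesc(addDesc)
                      .build();
    auto reluOp = cudnn_frontend::OperationBuilder(
                      CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR)
                      .setxDesc(afterBias).setyDesc(yTensor)
                      .setpwDesc(reluDesc)
                      .build();

    // Build the operation graph. From here the flow is the same as in the
    // sample: query heuristics for an engine config, finalize an execution
    // plan, and execute it with a variant pack holding the device pointers.
    std::array<cudnn_frontend::Operation const*, 3> ops = {&convOp, &biasOp, &reluOp};
    auto opGraph = cudnn_frontend::OperationGraphBuilder()
                       .setHandle(handle)
                       .setOperationGraph(ops.size(), ops.data())
                       .build();
    (void)opGraph;
}
```

The intermediate tensors (afterConv, afterBias above) are created with the virtual flag set, which tells cuDNN they never need to be materialized in global memory.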

Let us know if you have more questions!

Finally :) Thank you so much!
This is exactly what I’ve been looking for.

One quick question.
Say we provide the operations we want to fuse.
Is there a guarantee that all of them will be fused, or does the engine perform some kind of pattern matching and try its best?
For example, if we provide (conv, batchnorm, relu, add),
I’m wondering whether the engine would fuse only conv+batchnorm+relu and leave add alone, since it might only support the fused combination conv+batchnorm+relu.

Appreciate your help.

Hi Sunggg,

Because batchnorm involves several reductions and broadcasts across multiple CUDA blocks, it currently cannot be fused in its entirety onto the end of a convolution.

For more details on supported fusion combinations in v8.1.0, refer to our release notes:
https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-810
If a pattern is within the supported set, we’ll fuse everything into one kernel.
Note that we’ll keep adding support for more patterns as we develop, so we may support a different way of fusing BN into conv in a future release.

Thank you for the reference.

While checking the reference, I found this:

With runtime op fusion, the engine can generate and compile fused tensor-core kernels on the fly for the specified operation graph during the execution plan finalization stage. Some of the operation graph patterns supported in this preview are: convolution or matrix multiplication operation with arbitrary combination of one or more pointwise operations, and reduction operations fused onto the output tensor.

Although this gives a high-level idea of the supported patterns, I’d like a stronger guarantee that all the operations I provide are fused.
Is there any way to check whether all operations were successfully fused?
If not, is there a complete list of supported patterns?

Thank you so much for your help. It is a great help!

Our engineers have been working on the documentation. In the meantime, you can refer to the recent release notes for fusion support:
https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-811
We have also released a new version of the C++ frontend that includes a number of fusion samples; we recommend you try it out.
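
In the meantime, one practical probe (rough sketch below, trimmed from what the frontend samples do, with a helper name of our own): build the operation graph and check whether any engine config finalizes into an execution plan. If none does on your GPU and cuDNN version, that exact pattern isn’t currently supported as a single fused graph. Note that even a successful plan may map to more than one kernel depending on the chosen engine.

```cpp
#include <cudnn_frontend.h>

// Returns true if at least one engine config can finalize an execution plan
// for the whole operation graph on this handle's device.
bool whole_graph_has_a_plan(cudnnHandle_t handle,
                            cudnn_frontend::OperationGraph& opGraph) {
    // Ask the heuristics for candidate engine configs for this graph.
    auto heuristics = cudnn_frontend::EngineHeuristicsBuilder()
                          .setOperationGraph(opGraph)
                          .setHeurMode(CUDNN_HEUR_MODE_INSTANT)
                          .build();
    auto& configs = heuristics.getEngineConfig(heuristics.getEngineConfigCount());

    for (auto& cfg : configs) {
        try {
            auto plan = cudnn_frontend::ExecutionPlanBuilder()
                            .setHandle(handle)
                            .setEngineConfig(cfg)
                            .build();
            if (plan.get_status() == CUDNN_STATUS_SUCCESS) {
                return true;  // at least one engine accepts the whole graph
            }
        } catch (cudnn_frontend::cudnnException&) {
            // This engine config cannot run the graph; try the next candidate.
        }
    }
    return false;  // pattern not supported as a single fused graph here
}
```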