Cudnn fused conv+bias

I’m trying to implement a working fused conv+bias operation via the backend API, using the example provided in another topic (Cudnn backend api for fused op - #8 by gautamj), but finalizing the execution plan always returns CUDNN_STATUS_NOT_SUPPORTED. In our production code I have a workaround that builds a conv+add+add graph (with zero alpha2 on the first add), but with only two operations (just conv + add) we get the same error.

Can you suggest what might be wrong?

Tested on:
card - Tesla T4/GTX 1080ti
cudnn - 8.2.2/8.2.4
cuda - 11.1/11.4 (with LD_PRELOAD for

fuseOpDemo.cpp (20.7 KB)

Also I have a few questions:

  1. We actively use inference with CUDA streams, and very often, when backendExecute is called with the same plan from multiple streams, we get wrong convolution results unless backendExecute is protected with a mutex. Is there a requirement not to call backendExecute with the same plan from multiple streams in parallel?
  2. Where can I find descriptions of the engine knobs (what each knob does)? Some engines have knobs like “EDGE”; is there documentation for specific engines?
  3. Is there any information on the input/output tensor data types and formats each engine requires (for example, that engine_5 will not work with NCHW format or with the float data type)? With some engines we lose precision even though the numerical notes on the execution plan were taken into account; maybe there is some other hint?
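
To make question 1 concrete, here is a minimal sketch of the mutex workaround using only the C++ standard library. `SharedPlan`, `executeSerialized`, and `runWorkers` are hypothetical names, and the counter is an assumed stand-in for the real `cudnnBackendExecute(handle, plan, variantPack)` call, so this shows the locking pattern only, not actual cuDNN usage.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a finalized execution plan. A real plan is an
// opaque cudnnBackendDescriptor_t; the counter only makes calls observable.
struct SharedPlan {
    std::mutex guard;     // serializes every execute call on this plan
    long executions = 0;  // plain (non-atomic) state, protected by guard
};

// Hypothetical wrapper: real code would call cudnnBackendExecute here
// instead of incrementing a counter.
void executeSerialized(SharedPlan& plan) {
    std::lock_guard<std::mutex> lock(plan.guard);
    ++plan.executions;
}

// Start `threads` workers that each execute the shared plan `iterations`
// times, then return the total number of executions that were recorded.
long runWorkers(SharedPlan& plan, int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&plan, iterations] {
            for (int i = 0; i < iterations; ++i) executeSerialized(plan);
        });
    }
    for (auto& worker : pool) worker.join();
    return plan.executions;
}
```

With the lock in place, `runWorkers(plan, 4, 1000)` records every one of the 4000 calls. The cost is that executions on the shared plan no longer overlap, which is why the question of whether the lock is really required matters.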

Thank you in advance.

Hi @lxq,

Thank you for using our API and posting here. The runtime fusion engine that I used in the forum post Cudnn backend api for fused op - #8 by gautamj is only supported on Volta and later GPUs, so the GTX 1080 Ti (Pascal) is too old. The issue with T4 is that it only supports the half data type, and the file uses the float data type for all the tensors.
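
As a rough illustration of the "Volta and later" condition, the sketch below encodes it as a compute-capability check. Volta is compute capability 7.0, the GTX 1080 Ti is 6.1, and the T4 is 7.5. The names `ComputeCapability` and `supportsRuntimeFusion` are hypothetical, and the threshold is taken from the statement above rather than from an official support matrix; in real code the two numbers would come from `cudaGetDeviceProperties`.

```cpp
#include <cassert>

// Compute capability of a device, as the CUDA runtime reports it in the
// major/minor fields of cudaDeviceProp.
struct ComputeCapability {
    int ccMajor;
    int ccMinor;
};

// Hypothetical helper: "Volta and later" reduces to major version >= 7.
// This is an assumption drawn from the reply, not a cuDNN API.
constexpr bool supportsRuntimeFusion(ComputeCapability cc) {
    return cc.ccMajor >= 7;
}

constexpr ComputeCapability kGtx1080Ti{6, 1};  // Pascal
constexpr ComputeCapability kTeslaT4{7, 5};    // Turing
```

Here `supportsRuntimeFusion(kGtx1080Ti)` is false and `supportsRuntimeFusion(kTeslaT4)` is true. Passing this check is necessary but not sufficient: the tensors still have to use a data type the engine accepts.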

The corrected file is attached: fuseOpDemo_turing.cpp (20.7 KB). This should work on T4.
You can also try the other fusion samples in our public repository.

For your questions:
We are working on 1 and 2 and will provide an update.
A new feature in the upcoming 8.3.0 release is error reporting, which gives much more informative errors, such as data type and format issues, and might help you with 3.

Thank you for the example!
We investigated the problem with multithreaded calls to backendExecute further and found that it seems necessary to create the execution plan on separate handles, i.e. each cudaStream needs its own execution plan and we can’t reuse a plan created once. When we use a plan created once (e.g. on a single cuDNN handle), the results mismatch when the plan is executed from multiple threads.
Can you tell us whether we really need to create a separate cudnnHandle and execution plan for each stream when the plan is executed in parallel? I did not find information about this in the documentation.
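
The per-stream arrangement described here can be sketched as follows, again with the standard library only. `StreamContext`, `executeOn`, and `runPerStream` are hypothetical names; in real code each context would presumably hold a `cudaStream_t`, a `cudnnHandle_t` bound to it with `cudnnSetStream`, and an execution plan finalized on that handle. Whether cuDNN actually requires this separation is exactly the open question.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for per-stream state (stream, handle, plan).
struct StreamContext {
    int streamId = 0;
    long executions = 0;  // touched only by the thread that owns this context
};

// Hypothetical: stands in for cudnnBackendExecute on this context's own plan.
void executeOn(StreamContext& ctx) { ++ctx.executions; }

// Build one context per stream and drive each from its own thread.
// No mutex is needed because no context is shared between threads.
std::vector<StreamContext> runPerStream(int streams, int iterations) {
    assert(streams > 0 && iterations >= 0);
    std::vector<StreamContext> contexts(static_cast<std::size_t>(streams));
    for (std::size_t s = 0; s < contexts.size(); ++s) {
        contexts[s].streamId = static_cast<int>(s);
    }

    std::vector<std::thread> pool;
    for (auto& ctx : contexts) {
        pool.emplace_back([&ctx, iterations] {
            for (int i = 0; i < iterations; ++i) executeOn(ctx);
        });
    }
    for (auto& worker : pool) worker.join();
    return contexts;
}
```

Compared with guarding one shared plan with a mutex, this trades extra setup and memory (one plan per stream) for executions that can proceed in parallel.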

Hi @lxq, can you try the latest cuDNN 8.3.1 release? We have fixed an issue that we suspect caused the mismatches you observed. However, we are still developing a better testing method to check whether other multi-thread issues remain. We will know with more certainty soon.