I have tried a single convolution op and a single bias (pointwise add) op. Only the convolution returns the correct result; the bias op throws the same error.
I have also tried [cudnn-frontend](https://github.com/NVIDIA/cudnn-frontend). It also throws some errors; the report is attached below.
Thanks for your interest in trying out cuDNN fusion! There might be several issues here:
1. Can you install CUDA 11.2u1 or later, and make sure libnvrtc.so is visible in your LD_LIBRARY_PATH? Also make sure you use cuDNN 8.1.1 or later, compiled against CUDA 11.2u1 or later.
2. Since we generate fusion kernels targeting tensor cores, the input/output conv channel counts need to be a multiple of 8 if you use fp16 tensors, or a multiple of 4 if you use fp32 tensors.
3. I see in your example you are using fp32 tensors. This is currently only supported on Ampere GPUs (through TF32 tensor cores); these hardware units are not available on Turing GPUs.
4. I see in your example you are using NCHW layout (judging from the way you compute strides); however, NHWC (i.e. channels-last) layout is needed to utilize tensor cores.

If you make sure of (1), you should be able to run the fusion samples without issue. For (2)-(4), you can follow the examples in the fusion sample.
The devptrs and uids are incorrect (you can refer to the provided implementation).
The workspace needs to be allocated and provided to the variant pack.
I assume you want the outputData tensor to be bound to Y, which should be the final output of convBias. I have modified the implementation to reflect that. fuseOpDemo.cpp (20.7 KB)
I'm also attaching a working code snippet that I created by modifying the initial fusedOpDemo.cpp. I have marked all the changes in the code with comments beginning with "CUDNN : ". Let us know how that code works for you.