Hi Sunggg, thanks for trying out the C++ frontend and the runtime fusion functionality. These are good questions!

- Is there any clarification about these?

1.1 When do we use .setVirtual() for TensorBuilder?

A tensor is virtual if it is an intermediate tensor that is produced by one node and does not need to be written out to memory.

For example, in the conv-relu case, if you only need the final result and not the conv output, you should set the conv output tensor to be virtual.
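The effect of a virtual tensor can be illustrated on the CPU, without cuDNN. The following is only a sketch in plain C++: `conv_at`, `conv_relu_unfused`, and `conv_relu_fused` are hypothetical names, and a 1D cross-correlation stands in for the real convolution. In the fused version the conv output lives only in a local variable, which is roughly what marking it virtual means.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// 1D "valid" cross-correlation of x with w at output position i.
// Stand-in for the conv node; this is not how cuDNN computes it.
float conv_at(const std::vector<float>& x, const std::vector<float>& w, std::size_t i) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < w.size(); ++k) acc += x[i + k] * w[k];
    return acc;
}

// Unfused: the conv output is a real tensor that gets written to conv_out.
// Assumes x.size() >= w.size().
std::vector<float> conv_relu_unfused(const std::vector<float>& x,
                                     const std::vector<float>& w,
                                     std::vector<float>& conv_out) {
    const std::size_t n = x.size() - w.size() + 1;
    conv_out.assign(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) conv_out[i] = conv_at(x, w, i);
    std::vector<float> y(n);
    for (std::size_t i = 0; i < n; ++i) y[i] = std::max(conv_out[i], 0.0f);
    return y;
}

// Fused: the conv output is "virtual". It only exists in a local variable
// and is never stored, so only the final result y is written out.
std::vector<float> conv_relu_fused(const std::vector<float>& x,
                                   const std::vector<float>& w) {
    const std::size_t n = x.size() - w.size() + 1;
    std::vector<float> y(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float conv_val = conv_at(x, w, i);  // intermediate, not written out
        y[i] = std::max(conv_val, 0.0f);
    }
    return y;
}
```

Both functions return the same `y`; the fused one just skips the extra buffer for the conv output.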

1.2. What’s the difference between alpha, alpha2 and beta?

For a pointwise node, if it is a unary math operator or an activation forward, the equation is

`Y = op(alpha1 * X)`

If it is a binary math operator:

`Y = op(alpha1 * X, alpha2 * B)`

If it is an activation backward:

`dX = activation_bwd_op(alpha1 * dY, alpha2 * X)`

In the runtime fusion engine we only support `alpha1 == alpha2 == 1`.
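To make the role of the scaling factors concrete, here is a per-element sketch of the three pointwise equations in plain C++. The function names are hypothetical and this is not cuDNN code; the defaults of 1 mirror the only values the runtime fusion engine accepts.

```cpp
// Unary math operator or activation forward: Y = op(alpha1 * X)
template <typename Op>
float pointwise_unary(Op op, float x, float alpha1 = 1.0f) {
    return op(alpha1 * x);
}

// Binary math operator: Y = op(alpha1 * X, alpha2 * B)
template <typename Op>
float pointwise_binary(Op op, float x, float b,
                       float alpha1 = 1.0f, float alpha2 = 1.0f) {
    return op(alpha1 * x, alpha2 * b);
}

// Activation backward: dX = activation_bwd_op(alpha1 * dY, alpha2 * X)
template <typename BwdOp>
float pointwise_act_bwd(BwdOp bwd_op, float dy, float x,
                        float alpha1 = 1.0f, float alpha2 = 1.0f) {
    return bwd_op(alpha1 * dy, alpha2 * x);
}
```

For example, `pointwise_binary([](float a, float b) { return a + b; }, x, b)` is a plain add, because both scaling factors default to 1.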

For a convolution node the equation is

`Y = alpha * conv(X, W) + beta * Y`

For fusion use cases, we require beta to always be 0, so that no operation reads the tensor it writes and each operation in the graph adheres to static single assignment form.

1.3. What’s the difference between .setdyDesc() <-> .setyDesc() or .setdxDesc() <-> .setxDesc()? I’m trying to understand this statement.

    auto act_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR)
                      .setdyDesc(after_conv_tensor)
                      .setxDesc(bwd_act_x_tensor)
                      .setdxDesc(after_activation_tensor)
                      .setpwDesc(actDesc)
                      .build();

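See the equations above: for each operation in the forward pass, the input activation is X and the output activation is Y.

For each operation in the backward pass, the input gradient is dY and the output gradient is dX.

In DL, as you probably know, people usually write “dE/dX” as “dX” and “dE/dY” as “dY”, where E is the loss.

So for an activation backward node, .setdyDesc() takes the incoming gradient, .setxDesc() takes the forward-pass input, and .setdxDesc() takes the gradient that the node produces.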

- Is it possible to fuse conv2d+bias+relu or conv2d+bias or conv2d+relu? conv2d+add+bias+relu in the sample code worked, but those cases gave me the following error: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed

Can you dump the API log as described in the cuDNN Developer Guide (Developer Guide :: NVIDIA Deep Learning cuDNN Documentation)?

The runtime fusion engine has some requirements regarding datatype, layout (always NHWC), and tensor shape (the input and output channel counts C and K need to be a multiple of 4 for FP32 tensors, 8 for FP16 or BF16, and 16 for INT8).

If the requirements are met, then it should work.
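A quick way to rule out the shape requirement before building the plan is a small pre-check like the one below. This is just a sketch: `DataType`, `required_channel_multiple`, and `channels_ok` are hypothetical helpers, not part of cuDNN or the frontend, and they only encode the channel multiples listed above.

```cpp
// Hypothetical enum for this sketch; cuDNN has its own cudnnDataType_t.
enum class DataType { FP32, FP16, BF16, INT8 };

// Channel multiple required by the runtime fusion engine, per the list above.
int required_channel_multiple(DataType t) {
    switch (t) {
        case DataType::FP32: return 4;
        case DataType::FP16:
        case DataType::BF16: return 8;
        case DataType::INT8: return 16;
    }
    return 16;  // conservative fallback, not reachable for the values above
}

// True if both the input channel count C and the output channel count K
// satisfy the alignment requirement for the given datatype.
bool channels_ok(DataType t, int c, int k) {
    const int m = required_channel_multiple(t);
    return c % m == 0 && k % m == 0;
}
```

For instance, an FP16 conv with C = 4 fails this check, which is one possible reason for the plan failing to finalize.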

I see you are using cuDNN 8.2.1. Please also let us know which GPU you are using.

- Is it possible to serialize/deserialize the generated plan to minimize its overhead for the execution as TensorRT does?

For the same session, you can always cache the execution plan and reuse it for many iterations to amortize the compilation cost.
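The caching pattern could look roughly like this. It is a generic sketch, not frontend code: `Plan` is a hypothetical stand-in for a built execution plan, and the string key is an assumption (in practice you would key on whatever identifies the fused graph, such as shapes and datatypes).

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a built execution plan.
struct Plan {
    std::string description;
};

class PlanCache {
public:
    // Returns the cached plan for `key`, calling `build` only on the first
    // request so the compilation cost is paid once per distinct key.
    const Plan& get_or_build(const std::string& key,
                             const std::function<Plan()>& build) {
        auto it = plans_.find(key);
        if (it == plans_.end()) {
            it = plans_.emplace(key, build()).first;
            ++builds_;
        }
        return it->second;
    }

    // Number of times a plan actually had to be built.
    int builds() const { return builds_; }

private:
    std::unordered_map<std::string, Plan> plans_;
    int builds_ = 0;
};
```

Calling `get_or_build` with the same key on every iteration then only triggers the expensive build on the first one.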

We are discussing internally whether to support serialization/deserialization. It might be possible to support it later this year.

- What is drelu? Is this Dual Rectified Linear Units?

dRelu is the ReLU backward operation.
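As a CPU reference of what that means, here is a minimal sketch in plain C++ (hypothetical function names, not the cuDNN kernels). The forward pass is `Y = max(X, 0)`, and the backward pass passes the gradient through only where the forward input was positive.

```cpp
#include <cstddef>
#include <vector>

// ReLU forward: Y = max(X, 0), element by element.
std::vector<float> relu_fwd(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] > 0.0f ? x[i] : 0.0f;
    return y;
}

// ReLU backward (dRelu): dX = dY where X > 0, and 0 elsewhere.
// Assumes dy and x have the same size.
std::vector<float> relu_bwd(const std::vector<float>& dy,
                            const std::vector<float>& x) {
    std::vector<float> dx(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) dx[i] = x[i] > 0.0f ? dy[i] : 0.0f;
    return dx;
}
```

Note that the backward function takes the forward input X alongside dY, matching the activation backward equation above.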