When cudnnFusedOps_t is set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD, the cuDNN documentation attaches the following graph to explain it:
From this graph, we believe that wgrad-opt computes the gradient of the weights from the input y1 and the gradient of the output, dy, where y1 is the forward scale-bias-ReLU result of the previous conv output x.
However, we are confused: doesn't recomputing y1 from x waste time unnecessarily?
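To make our reading of the diagram concrete, here is a minimal NumPy sketch of the data flow we think wgrad-opt performs (a 1-D "conv" with no padding or stride for simplicity; the function and variable names are ours, not part of the cuDNN API):

```python
import numpy as np

def wgrad_opt(x, scale, bias, dy, kw):
    """Our understanding of the fused wgrad data flow.

    x     : saved output of the previous conv (1-D array)
    scale : per-element scale (scalar here for simplicity)
    bias  : per-element bias (scalar here for simplicity)
    dy    : gradient of the conv output (length len(x) - kw + 1)
    kw    : kernel width
    """
    # Step 1: recompute y1 = ReLU(scale * x + bias) from the saved x
    # (this is the recomputation we suspect is redundant)
    y1 = np.maximum(scale * x + bias, 0.0)
    # Step 2: weight gradient dw[k] = sum_i y1[i + k] * dy[i]
    dw = np.array([np.dot(y1[k:k + len(dy)], dy) for k in range(kw)])
    return dw
```

This is only a sketch of the dependency structure (dw needs both y1 and dy), not of the fused kernel itself.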
We are also unsure how to compute the gradients of scale and bias during training, and how to obtain dy after the backward pass of BN.
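For reference, this is how we currently understand the backward pass of the scale-bias-ReLU portion alone (a plain NumPy sketch, independent of cuDNN; shapes and names are our own assumptions):

```python
import numpy as np

def scale_bias_relu_forward(x, scale, bias):
    """Forward: y1 = ReLU(scale * x + bias).

    x: (N, C); scale, bias: (C,) per-channel parameters.
    Returns y1 and the pre-activation z (needed for backward).
    """
    z = x * scale + bias
    return np.maximum(z, 0.0), z

def scale_bias_relu_backward(dy1, x, scale, z):
    """Backward: given dy1 = dL/dy1, compute dscale, dbias, dx."""
    dz = dy1 * (z > 0)            # ReLU gate: grad flows where z > 0
    dscale = (dz * x).sum(axis=0)  # per-channel scale gradient
    dbias = dz.sum(axis=0)         # per-channel bias gradient
    dx = dz * scale                # gradient passed back to x
    return dscale, dbias, dx
```

If this is right, dscale and dbias come from reducing dy1 (gated by the ReLU mask) over the batch, and dx is what would feed the rest of the BN backward; what we cannot tell from the documentation is where cuDNN expects these reductions to happen relative to the fused wgrad op.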
Suppose the training network is:
We need to compute the gradient of the conv weights and the gradients of scale/bias.
If we use CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD to compute dw, do we first need to obtain dy, i.e., the result of the backward pass through bn-scale-bias-relu …?