Behavior of cudnnSoftmaxBackward

Hi, I have been struggling to understand how cudnnSoftmaxBackward works. I found the documentation unclear, and I would like to confirm my understanding of its behavior. Would it be possible to improve the documentation to clarify the points below? For completeness, I am using the cuDNN API doc: and cuDNN version

According to the API doc, among other arguments, cudnnSoftmaxBackward takes 3 tensors: yData and dy as inputs, and dx as output. All 3 of the tensors must have the same dimensions.

The first clarification: The documentation for cudnnSoftmaxBackward states: “This routine computes the gradient of the softmax function.” I find this misleading because, mathematically, the softmax function maps a vector to a vector, so its derivative is a Jacobian matrix, not a tensor with the same shape as the input. Upon investigation, it seems that cudnnSoftmaxBackward does not return that derivative itself, but rather applies it to the input dy (the incoming gradient) and returns the result as dx.
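To make the point about applying the derivative to dy concrete, here is a minimal sketch in plain Python (not cuDNN code). The helper names `softmax_jacobian` and `softmax_backward` are hypothetical, and y is assumed to already be a softmax output. It builds the full Jacobian of softmax, contracts it with dy, and compares that against the closed form dx_j = y_j * (dy_j - sum_i y_i * dy_i), which is the standard textbook expression for the softmax backward pass.

```python
def softmax_jacobian(y):
    """Jacobian J[i][j] = d y_i / d x_j = y_i * (delta_ij - y_j), where y = softmax(x)."""
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]


def softmax_backward(y, dy):
    """Contract the Jacobian with dy without materializing it."""
    dot = sum(yi * dyi for yi, dyi in zip(y, dy))
    return [yj * (dyj - dot) for yj, dyj in zip(y, dy)]


y = [0.2, 0.3, 0.5]      # assumed to be a softmax output (sums to 1)
dy = [1.0, -2.0, 0.5]    # incoming gradient dL/dy

J = softmax_jacobian(y)
# dx_j = sum_i dy_i * J[i][j]
dx_full = [sum(J[i][j] * dy[i] for i in range(len(y))) for j in range(len(y))]
dx_fast = softmax_backward(y, dy)
```

Both routes give the same dx, and dx has the same shape as y and dy, which matches the requirement that all three tensors share their dimensions.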

The second clarification: To correctly get the gradient of the softmax function (applied to dy), I have to first preprocess the input tensor y by using cudnnSoftmaxForward, and then provide the result as input to cudnnSoftmaxBackward. This is NOT stated anywhere in the docs.

You can check this easily with y and dy as 4D tensors with n=c=h=w=1. The output tensor dx should be:
dx = (1-softmax(y)) * softmax(y) * dy
but instead it is
dx = (1-y) * y * dy
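The single-element case can be reproduced with a small plain-Python sketch (the function name `softmax_backward` is hypothetical, and this is the textbook formula, not cuDNN's source). It assumes the backward pass computes dx_j = y_j * (dy_j - sum_i y_i * dy_i), which for one element reduces to (1 - y) * y * dy. Note that a true softmax over a single element always equals 1, so feeding a genuine softmax output here would give dx = 0.

```python
def softmax_backward(y, dy):
    """Textbook softmax backward: dx_j = y_j * (dy_j - sum_i y_i * dy_i)."""
    dot = sum(yi * dyi for yi, dyi in zip(y, dy))
    return [yj * (dyj - dot) for yj, dyj in zip(y, dy)]


# Arbitrary value passed as y (not a softmax output): (1 - 0.3) * 0.3 * 2.0
dx_raw = softmax_backward([0.3], [2.0])

# Genuine softmax output of a single element is 1.0, so the gradient vanishes
dx_true = softmax_backward([1.0], [2.0])
```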

Hi @yet41,
This is the correct behavior; it is how cuDNN computes softmaxBwd.

y = softmax(x) is calculated by softmaxForward.
We don’t need to calculate another softmax(y).
y is the input to softmaxBackward; it is also the output of softmaxForward.
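A quick way to check this forward-then-backward chain numerically is sketched below in plain Python. The helpers `softmax_forward`, `softmax_backward`, and `loss` are hypothetical stand-ins for illustration, not cuDNN calls, and the backward formula is the textbook one. The forward output y is passed straight into the backward step, and the result is compared against a central finite-difference gradient of L(x) = sum_i dy_i * softmax(x)_i.

```python
import math


def softmax_forward(x):
    """Numerically stable softmax over a 1-D list."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_backward(y, dy):
    """Takes the forward OUTPUT y, not the original input x."""
    dot = sum(yi * dyi for yi, dyi in zip(y, dy))
    return [yj * (dyj - dot) for yj, dyj in zip(y, dy)]


x = [0.5, -1.0, 2.0]
dy = [1.0, -2.0, 0.5]

y = softmax_forward(x)          # forward pass output
dx = softmax_backward(y, dy)    # y is fed in as-is, no second softmax


def loss(x_vals):
    """Scalar whose gradient with respect to y is exactly dy."""
    return sum(g * s for g, s in zip(dy, softmax_forward(x_vals)))


eps = 1e-6
dx_numeric = []
for j in range(len(x)):
    x_plus = list(x)
    x_minus = list(x)
    x_plus[j] += eps
    x_minus[j] -= eps
    dx_numeric.append((loss(x_plus) - loss(x_minus)) / (2 * eps))
```

If dx and dx_numeric agree to within finite-difference error, the backward step is consuming y correctly.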