cudnnConvolutionBiasActivationForward generates wrong half-precision results for group == 32

It seems that cudnnConvolutionBiasActivationForward supports half precision and grouped convolution, but not both at the same time.
Has anyone else encountered this, or is there something I missed that is needed for half-precision grouped convolution? (All of the following tests were run with the same piece of code.)

layout: NCHW
input: 1x32x112x112
filter: 32x1x3x3, stride: 1,1, padding: 1,1, group: 32
output: 1x32x112x112
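
For anyone who wants to reproduce the check, here is a minimal sketch of a CPU reference for this kind of grouped convolution. It is not the actual code used for the tests in this post. It assumes cross-correlation (no filter flip), dilation 1, the same stride and padding in both dimensions, and it applies the bias but no activation. The name grouped_conv2d_ref and its parameter list are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: grouped 2D cross-correlation plus bias.
// Layouts: input N x C x H x W, filter K x (C/groups) x R x S, bias K.
// Assumes dilation 1 and that C and K are both divisible by groups.
std::vector<float> grouped_conv2d_ref(
    const std::vector<float>& input,
    const std::vector<float>& filter,
    const std::vector<float>& bias,
    int N, int C, int H, int W,
    int K, int R, int S,
    int stride, int pad, int groups)
{
    const int c_per_group = C / groups;  // input channels seen by each filter
    const int k_per_group = K / groups;  // output channels per group
    const int out_h = (H + 2 * pad - R) / stride + 1;
    const int out_w = (W + 2 * pad - S) / stride + 1;

    std::vector<float> output(static_cast<std::size_t>(N) * K * out_h * out_w);

    for (int n = 0; n < N; ++n) {
        for (int k = 0; k < K; ++k) {
            const int g = k / k_per_group;  // group this output channel belongs to
            for (int oh = 0; oh < out_h; ++oh) {
                for (int ow = 0; ow < out_w; ++ow) {
                    float acc = bias[k];
                    for (int c = 0; c < c_per_group; ++c) {
                        const int ic = g * c_per_group + c;  // absolute input channel
                        for (int r = 0; r < R; ++r) {
                            const int ih = oh * stride - pad + r;
                            if (ih < 0 || ih >= H) continue;  // zero padding
                            for (int s = 0; s < S; ++s) {
                                const int iw = ow * stride - pad + s;
                                if (iw < 0 || iw >= W) continue;  // zero padding
                                const std::size_t in_idx =
                                    ((static_cast<std::size_t>(n) * C + ic) * H + ih) * W + iw;
                                const std::size_t f_idx =
                                    ((static_cast<std::size_t>(k) * c_per_group + c) * R + r) * S + s;
                                acc += input[in_idx] * filter[f_idx];
                            }
                        }
                    }
                    const std::size_t out_idx =
                        ((static_cast<std::size_t>(n) * K + k) * out_h + oh) * out_w + ow;
                    output[out_idx] = acc;
                }
            }
        }
    }
    return output;
}
```

For the configuration above, the call would use N=1, C=32, H=112, W=112, K=32, R=3, S=3, stride=1, pad=1, groups=32, so each output channel reads exactly one input channel.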

With half precision, cudnnConvolutionBiasActivationForward generates correct output only for output channel 0; for all other channels the results are completely different from the single-precision output (the single-precision output has been verified against a separate CPU implementation).
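
A per-channel comparison is one way to see exactly which output channels are off. The sketch below is only an illustration, not the code used for these tests. It assumes both outputs are NCHW and that the half-precision output has already been converted to float on the host. The names per_channel_max_abs_diff and first_bad_channel are hypothetical, and the tolerance is something you would have to choose for your own data.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute difference per output channel between two N x K x plane
// buffers, where plane = H * W. The maximum is taken over all N images.
std::vector<float> per_channel_max_abs_diff(
    const std::vector<float>& a,
    const std::vector<float>& b,
    int N, int K, int plane)
{
    std::vector<float> diffs(K, 0.0f);
    for (int n = 0; n < N; ++n) {
        for (int k = 0; k < K; ++k) {
            const std::size_t base =
                (static_cast<std::size_t>(n) * K + k) * plane;
            for (int i = 0; i < plane; ++i) {
                const float d = std::fabs(a[base + i] - b[base + i]);
                diffs[k] = std::max(diffs[k], d);
            }
        }
    }
    return diffs;
}

// Index of the first channel whose error exceeds tol, or -1 if none does.
int first_bad_channel(const std::vector<float>& diffs, float tol)
{
    for (std::size_t k = 0; k < diffs.size(); ++k) {
        if (diffs[k] > tol) return static_cast<int>(k);
    }
    return -1;
}
```

With the symptom described above, first_bad_channel would be expected to return 1 for the half-precision group == 32 output, since only channel 0 matches.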

Other tests I ran:
With group == 1, half-precision cudnnConvolutionBiasActivationForward generates correct results, with acceptable error compared to the single-precision results.
With group == 32, single precision generates correct results compared to the CPU implementation.

In addition, I have tested cudnn-9.0-windows10-x64-v7, cudnn-9.2-windows10-x64-v7.1, and cudnn-9.2-windows10-x64-v7.2.1.38, with no luck. All three versions generate the same incorrect results for half precision with group == 32.