I’m z-score normalizing N 1-D vectors, each with 500 elements. Each element is fp16, and N may equal the grid size of my epilogue kernel. I currently cast the inputs to fp32 and feed them into “cudnnBatchNormalizationForwardTraining” to perform the z-score normalization; the outputs are then converted back to fp16 (__half2) for my epilogue kernel.
Do you think it makes sense to feed fp16 inputs into cuDNN for higher throughput? Would that actually improve the throughput of the cuDNN normalization, and can it be done without losing precision or range?
Please note that this is a non-ML workload; I’m only using cuDNN here as a way to perform z-score normalization.