[cuBLASDx] Support for MM where input type != output type?

Hi,

Thanks for releasing cuBLASDx!

We noticed that the Precision operator takes a single type, which applies to both the inputs and the output. Is it possible to use cuBLASDx for matrix multiplication (MM) where the input and output types differ? For example, FP16 x FP16 → FP32, which is a common mixed-precision mode on CUDA Tensor Cores.

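If not, does cuBLASDx round to the output precision before or after the MM accumulation? In other words, is the accumulator itself kept in the lower precision, or is the sum accumulated in higher precision and rounded once at the end? This affects the rounding error of the operation.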
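
To make the concern concrete, here is a minimal NumPy sketch of the two behaviours for a single dot product. This is not cuBLASDx code and says nothing about what the library actually does; the function names `dot_fp16_accumulate` and `dot_fp32_accumulate` are made up for illustration.

```python
import numpy as np

def dot_fp16_accumulate(a, b):
    """Keep the accumulator in FP16, i.e. round after every multiply-add."""
    acc = np.float16(0.0)
    for x, y in zip(a, b):
        acc = np.float16(acc + x * y)
    return acc

def dot_fp32_accumulate(a, b):
    """Accumulate in FP32 and round to FP16 only once, at the end."""
    acc = np.float32(0.0)
    for x, y in zip(a, b):
        acc = np.float32(acc + np.float32(x) * np.float32(y))
    return np.float16(acc)

# Inner dimension K = 4096, all inputs exactly 1.0 in FP16.
k = 4096
a = np.ones(k, dtype=np.float16)
b = np.ones(k, dtype=np.float16)

print(dot_fp16_accumulate(a, b))  # 2048.0
print(dot_fp32_accumulate(a, b))  # 4096.0
```

With an FP16 accumulator the sum stalls at 2048, because 2049 is not representable in FP16 and rounds back to 2048. Accumulating in FP32 and rounding once gives the exact result, 4096. So even when the final output is FP16, the answer depends on where the rounding happens.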

Thanks!