Support for the TF32 type and mixed precision

Dear OptiX Development Team,

I am writing to inquire about planned support for the TF32 (TensorFloat-32) data type in the neural rendering routines of the next OptiX release (specifically, optixCoopVecMatMul and related functions). Although TF32 hardware support (Tensor Cores) is available and the format is well suited to deep learning - offering FP32 range with roughly FP16-equivalent precision - the following input/output combination is not natively supported in OptiX:

| inputType | inputInterpretation | matrixElementType | biasElementType | outputType |
|-----------|---------------------|-------------------|-----------------|------------|
| FP32      | TF32                | FP32              | FP32            | FP32       |

The current reliance on FLOAT16 is often insufficient for differentiable neural rendering (DNR) and inverse problems, where the narrow dynamic range of FP16 frequently leads to gradient underflow or unstable numerical results.

Are there plans to introduce native TF32 support for the input vector, bias, and matrix elements in the next release of OptiX? This feature would significantly enhance the numerical stability and performance of high-fidelity neural rendering applications.

Thank you for your consideration.

Hi @LowLevelKB,

That’s a good question. I can’t comment on future support, but I can tell you that we will take your comment as a vote in favor of TF32 support in the cooperative vectors API. I can also tell you that the very next release of OptiX will not have TF32 support yet (which has no bearing on whether we support TF32 in the future).

Bear in mind that the cooperative vectors APIs are cross-platform and are undergoing standards review, which means it could take some time before support for new formats or types appears.


David.

Thank you for your answer. I will look forward to it. By the way, as a temporary workaround for now, one can split the rendering into a sequence of multiple optixLaunch calls (each producing a new batch of sorted hit primitives in the hit buffer), followed by separate pure CUDA kernel invocations responsible for “shading”, where nvcuda::wmma is used to compute the outputs of consecutive neural network layers loaded into shared memory.
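
For anyone attempting this, a minimal sketch of such a wmma-based shading kernel, assuming an sm_80+ GPU and a single 16x16x8 TF32 tile computed by one warp (the kernel name, layout, and fixed dimensions are illustrative, not from the post above):

```cuda
// Sketch: one warp computes a 16x16x8 TF32 matrix-multiply-accumulate
// for a small fully connected layer, as a stand-in for optixCoopVecMatMul.
// Compile with nvcc -arch=sm_80 or newer.
#include <mma.h>
using namespace nvcuda;

// Dimensions are fixed by the TF32 wmma tile shape (m16n16k8).
__global__ void shade_layer(const float* __restrict__ A,   // 16x8 activations
                            const float* __restrict__ B,   // 8x16 weights
                            float* __restrict__ C) {       // 16x16 outputs
    wmma::fragment<wmma::matrix_a, 16, 16, 8,
                   wmma::precision::tf32, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 8,
                   wmma::precision::tf32, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 8);    // leading dimension of A is 8
    wmma::load_matrix_sync(b, B, 16);   // leading dimension of B is 16

    // FP32 inputs must be explicitly rounded to TF32 before the MMA.
    for (int i = 0; i < a.num_elements; ++i)
        a.x[i] = wmma::__float_to_tf32(a.x[i]);
    for (int i = 0; i < b.num_elements; ++i)
        b.x[i] = wmma::__float_to_tf32(b.x[i]);

    wmma::mma_sync(acc, a, b, acc);
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```

In a real renderer the weights would typically be staged into shared memory once per block and the tile loop would walk the hit buffer produced by the preceding optixLaunch.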