There is no 32-bit integer multiply available in any current TC unit. The “widest” integer format supported is an 8-bit multiply with a 32-bit accumulate.
At first glance I wouldn’t say TF32 is a “substitute”, but I guess if you restrict the range carefully, it might be. TF32 is a 19-bit floating-point format: it has the same mantissa width as FP16 and the same exponent width as FP32. To the extent that their definitions overlap, it is bitwise/storage-format compatible with FP32.
The supported TC formats for multiplication are FP64, TF32, FP16, BF16, FP8, INT8 (8-bit integer), INT4 (4-bit integer), and INT1 (1-bit integer).
Not all formats are supported on all TC units. V100, for example, supports FP16 only. H100 supports the entire list above.
I guess the integer range (without gaps) that could be supported by TF32 (or FP16, for that matter) would be -2048 → +2048: both formats have a 10-bit mantissa, so integers of magnitude up to 2^11 = 2048 are exactly representable with no gaps.
You would, among other things, need to make sure that all inputs, partial results, and final results of multiply and accumulate fit within that range.
It seems like you simply restated your question without taking into account anything I said.
I have no idea what that is. A google search turns up nothing. It’s not part of CUDA C++ from what I can tell.
The largest range you can have is -2048 to +2048. Maybe with some fiddling you could remap that to 0 to 4096. 4096 is 2^12. So no, you cannot have a range of 2^18 without gaps. You don’t get to use all 19 bits (or 16 bits, in the case of FP16) when storing an integer in a floating-point (TF32 or FP16) quantity. Even if you think you could with some grand bit-mapping scheme, you won’t be able to pump any such scheme through the TC unit and get sensible results. The limit for direct integer storage is set by the number of mantissa bits, nothing else.
Just convert int to float. If any input or output is greater than +2048 or less than -2048, things are broken. Otherwise, the TF32 format and the FP32 format (i.e. float) are the same. No fancy conversion is needed.