How to use uint32_t in tensor core?

I can't find any uint32_t support in the tensor cores… I guess the best choice for me is TF32?

I found the code below:

#pragma unroll
for (int i = threadIdx.x; i < degree_m; i += blockDim.x) {
    shared_M[i] = cvt_utof_tf32(shared_M[i]);
}

Is this a clever way to do it? Thank you!!! (I guess not that clever?)

There is no 32-bit integer multiply available in any current TC unit. The “widest” integer format supported is 8-bit multiply, with 32-bit accumulate.
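To make "8-bit multiply, with 32-bit accumulate" concrete, here is a host-side scalar sketch (my own reference function, not a CUDA API) of what an INT8 tensor-core MMA computes per output element:

```cpp
#include <cstdint>
#include <cstddef>

// Scalar reference for one output element of an INT8 tensor-core MMA:
// 8-bit operands are multiplied, and the products are accumulated in a
// 32-bit integer, so individual products (at most 128*128) never
// overflow the accumulator over realistic dot-product lengths.
int32_t int8_dot(const int8_t* a, const int8_t* b, size_t k, int32_t acc) {
    for (size_t i = 0; i < k; ++i)
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
    return acc;
}
```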

At first glance I wouldn’t say TF32 is a “substitute”, but I guess if you restrict the range carefully, it might be. TF32 is a 19-bit floating point format. It has the same mantissa format as FP16 and the same exponent format as FP32. To the extent that there is overlap of definition, it is bitwise compatible/storage format compatible with FP32.
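The storage compatibility can be demonstrated on the host. This sketch (my own helper, not a CUDA API; the hardware conversion rounds rather than truncates, but the layout is the same) drops an FP32 value to TF32 precision by keeping the sign bit, the 8 exponent bits, and the top 10 mantissa bits:

```cpp
#include <cstdint>
#include <cstring>

// Emulate TF32 precision on the host: keep the FP32 sign (1 bit),
// exponent (8 bits), and top 10 mantissa bits; zero the low 13 bits.
// The result is still a valid FP32 number, which is why TF32 is
// storage-compatible with FP32.
float truncate_to_tf32(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    bits &= 0xFFFFE000u;   // clear the 13 low mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}
```

Note that with a 10-bit mantissa, integers survive this truncation only up to 2^11 = 2048; 2049 already loses its low bit, which previews the range discussion below.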

The supported TC formats for multiplication are FP64, TF32, FP16, BF16, FP8, INT8 (8 bit integer), INT4 (4-bit integer), and INT1 (1-bit integer).

Not all formats are supported on all TC units. V100, for example, supports FP16 only. H100 supports the entire list above.

I guess the integer range (without gaps) that could be supported by TF32 (or FP16 for that matter) would be something like -2048 → +2048

You would, among other things, need to make sure that all inputs, partial results, and final results of multiply and accumulate fit within that range.

# cat t66.cu
#include <cuda_fp16.h>
#include <iostream>

int main(){

  for (int i = -2060; i < 2060; i++) if (__half2int_rz(__int2half_rz(i)) != i) std::cout << "mismatch at: " <<  i << std::endl;
}
# nvcc -o t66 t66.cu
# ./t66
mismatch at: -2059
mismatch at: -2057
mismatch at: -2055
mismatch at: -2053
mismatch at: -2051
mismatch at: -2049
mismatch at: 2049
mismatch at: 2051
mismatch at: 2053
mismatch at: 2055
mismatch at: 2057
mismatch at: 2059
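The pattern in the output (only odd values past 2048 mismatch) follows from the 10-bit mantissa: representable values in [2^11, 2^12) are spaced 2 apart, so round-toward-zero drops the low bit. A host sketch of that round-trip for small non-negative inputs (my own model, not the CUDA intrinsic):

```cpp
// Model of an FP16/TF32 round-toward-zero integer round-trip for small
// non-negative inputs: once the leading bit of x sits above mantissa
// bit 10, the bits that don't fit are simply dropped.
int half_rz_roundtrip(int x) {
    int e = 0;                        // position of the leading set bit
    for (int t = x; t > 1; t >>= 1) ++e;
    if (e <= 10) return x;            // fits entirely in the mantissa
    int drop = e - 10;                // mantissa bits that don't fit
    return (x >> drop) << drop;       // round toward zero
}
```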

Oh… So using this directly:

#pragma unroll
for (int i = threadIdx.x; i < degree_m; i += blockDim.x) {
    shared_M[i] = cvt_utof_tf32(shared_M[i]);
}

can have a range of 2^18??? Or do I need to somehow split the integer into TF32 format?

Maybe something like this:

float convertUnsignedIntToTF32(uint32_t x) {
    if (x >= (1 << 18)) {
        std::cerr << "Out of range for exact representation in TF32" << std::endl;
        return -1.0f;
    }

    uint32_t lower = x & 0x3FF;   // low 10 bits
    uint32_t upper = x >> 10;     // remaining high bits

    float y = (float)lower + (float)upper * std::pow(2, 10);

    return y;
}

It seems like you simply restated your question without taking into account anything I said.

I have no idea what that is. A google search turns up nothing. It’s not part of CUDA C++ from what I can tell.

The largest range you can have is -2048 to +2048. Maybe with some fiddling you could remap that to 0 to 4096. 4096 is 2^12. So no, you cannot have a range of 2^18, without gaps. You don’t get to use all 19 bits (or 16 bits in the case of FP16) when storing an integer in a floating-point (TF32 or FP16) quantity. Even if you think you could with some grand bit-mapping scheme, you won’t be able to pump any such scheme through the TC unit with sensible results. The range limit for direct integer storage is indicated by the number of mantissa bits, not anything else.

Just convert int to float. If the input or output is greater than 2048, or less than -2048, things are broken. Otherwise, the TF32 format and FP32 format (i.e. float format) are the same. Fancy conversion not needed.
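As a sketch of "just convert int to float" with the range check made explicit (my own hypothetical helper, not a CUDA function):

```cpp
#include <cassert>

// Hypothetical helper: convert a signed integer to float for TF32
// tensor-core input. Integers in [-2048, 2048] are exactly
// representable with a 10-bit mantissa; outside that, "things are
// broken", so fail loudly rather than silently lose bits.
float int_to_tc_float(int x) {
    assert(x >= -2048 && x <= 2048);
    return static_cast<float>(x);
}
```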
