Mapping PTX to F2IP SASS Instruction for Efficient FP32 to INT8 Conversion

I am currently developing a custom fused int8 Conv2d kernel and I am facing an optimization challenge. During the process of profiling my kernel with Nsight Compute, I observed a performance bottleneck related to the conversion from FP32 to INT8.

Upon examining the kernels generated by TensorRT, I noticed that it uses the F2IP (FP32 Down-Convert to Integer and Pack) SASS instruction for this purpose, which is more efficient than F2I (FP32 Down-Convert to Integer) that my kernel is using.

However, I have been unable to determine the specific PTX instruction or sequence of instructions that would lead the CUDA compiler to generate this F2IP SASS instruction.

I would greatly appreciate any guidance or insights the community could offer on this matter. Is there a known PTX instruction or sequence of instructions that could result in the F2IP SASS instruction being used? Or are there any recommended strategies for optimizing the FP32 to INT8 conversion process in my kernel?

Thank you very much in advance!

What does packed mean in the context of F2IP? Did you try something like cvt.rni.s8.f32

I have already attempted to use cvt.rni.s8.f32, but it seems to compile into the F2I instruction. According to the CUDA Binary Utilities documentation (CUDA Binary Utilities), F2IP stands for “FP32 Down-Convert to Integer and Pack”, although it doesn’t provide additional details about this instruction.

From my observations, the Nsight Compute profiling data shows that for the same workload, the number of F2IP instructions in the TensorRT kernel is only half that of the F2I instructions in my kernel. This leads me to speculate that the F2IP instruction might be capable of casting two float values into integers within a single instruction, thereby halving the total number of necessary conversion instructions.

__global__ 
void kernel(int* out,float f1, float f2) {
    int a,b,c;
    asm("cvt.rni.s32.f32 %0, %1;" : "=r"(a) : "f"(f1));
    asm("cvt.rni.s32.f32 %0, %1;" : "=r"(b) : "f"(f2));
    asm("cvt.pack.sat.s8.s32.b32 %0, %1, %2, 0;" : "=r"(c) : "r"(a), "r"(b));
    out[0] = c;
}

When I compile above code for arch=sm_89, cuobjdump shows F2IP

/*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;      /* 0x00000a00ff017624 */
                                                                                /* 0x000fc400078e00ff */
  /*0010*/                   IMAD.MOV.U32 R5, RZ, RZ, c[0x0][0x168] ;     /* 0x00005a00ff057624 */
                                                                          /* 0x000fe200078e00ff */
  /*0020*/                   MOV R2, c[0x0][0x160] ;                      /* 0x0000580000027a02 */
                                                                          /* 0x000fe20000000f00 */
  /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                 /* 0x0000460000047ab9 */
                                                                          /* 0x000fe20000000a00 */
  /*0040*/                   MOV R3, c[0x0][0x164] ;                      /* 0x0000590000037a02 */
                                                                          /* 0x000fe40000000f00 */
  /*0050*/                   F2IP.S8.F32.NTZ R5, R5, c[0x0][0x16c], RZ ;  /* 0x00005b0005057a43 */
                                                                          /* 0x000fca00000014ff */
  /*0060*/                   STG.E [R2.64], R5 ;                          /* 0x0000000502007986 */
                                                                          /* 0x000fe2000c101904 */
  /*0070*/                   EXIT ;                                       /* 0x000000000000794d */
                                                                          /* 0x000fea0003800000 */
  /*0080*/                   BRA 0x80;                                    /* 0xfffffff000007947 */
2 Likes

Thank you for your advice! I’ve tested this code on my machine with arch=sm_80 and I confirmed it generate the F2IP instruction as expected. This is exactly what I needed. Your help is greatly appreciated!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.