Peak performance of integer operations

Hi,

I am working on a video codec using GPUs. Since most of the operations in video encoding/decoding are integer, I would like to know the peak performance for integer ops on Nvidia GPUs.
Similar questions have appeared before:

https://devtalk.nvidia.com/default/topic/367268/?comment=2633452
https://devtalk.nvidia.com/default/topic/368314/?comment=2640390
But they are 7 years old.

Also, I am aware of the table regarding peak instruction throughput in the programming guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions__throughput-native-arithmetic-instructions

However, some table cells are labeled “Multiple instructions”, which is vague.
Also, it is quite disappointing that the gap between peak floating-point and integer ops keeps growing: 1/3 on Fermi, 1/6 on Kepler, and “I need to benchmark” on Maxwell.
So could anyone from Nvidia say what these “Multiple instructions” counts actually are, why this trend exists, and what in the architecture limits peak integer ops?

Best regards

CUDA-Z will report the integer GIOPS of your device, or you can write your own benchmark code.

The GTX 780 Ti is the best Nvidia GPU for integer GIOPS, at about 1,140 (via CUDA-Z).

http://cuda-z.sourceforge.net/
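If you want to roll your own benchmark, a minimal sketch might look like the following. The kernel name, grid dimensions, and iteration count are illustrative assumptions; a serious benchmark would also pin clocks and subtract loop overhead:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread runs a long chain of dependent 32-bit integer
// multiply-adds; the dependency chain keeps the compiler from
// optimizing the work away.
__global__ void imad_bench(int *out, int iters)
{
    int a = threadIdx.x + 1;
    int b = blockIdx.x + 3;
    for (int i = 0; i < iters; ++i) {
        a = a * b + i;  // one multiply-add per iteration
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;  // force a side effect
}

int main()
{
    const int blocks = 1024, threads = 256, iters = 100000;
    int *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    imad_bench<<<blocks, threads>>>(d_out, iters);  // warm-up launch
    cudaEventRecord(start);
    imad_bench<<<blocks, threads>>>(d_out, iters);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One integer multiply-add per iteration per thread.
    double ops = (double)blocks * threads * iters;
    printf("%.1f integer GOPS\n", ops / (ms * 1e6));

    cudaFree(d_out);
    return 0;
}
```

Note this measures multiply-add throughput; counting the multiply and add as two operations would double the reported number, which is one reason published GIOPS figures are hard to compare.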

“Multiple instructions” means there is no native instruction to perform that operation, and instead the compiler emits an instruction sequence to perform the operation. It is typically on the order of 5-50 instructions. This can vary from operation to operation, architecture to architecture, and even compiler version to compiler version. If you want to find out what it is for a specific case, create a small test code, compile it, and then dump the machine code using

nvcc -arch=sm_35 -o mycode mycode.cu
cuobjdump -sass mycode

32-bit integer add runs at approximately the same throughput as the corresponding floating-point operations on all architectures, so I guess your concern is primarily about 32-bit integer multiply.

The underlying architecture for all GPUs must efficiently support graphics operations, where single-precision floating point is the dominant operation. The mix of all other functional units is adjusted architecturally to provide the best cost/performance/power behavior on currently existing codes. Compute codes are also considered in this, but again there is no simple explanation for why the integer multiply ratios change, except to say that it is an ongoing tuning process to optimize the chip for certain graphics and compute targets.

Stated another way, the chip architects felt that having 1:1 parity between integer and floating-point performance was not the best use of silicon budgets (area, power, cost). This judgment necessarily rests on some sort of analysis over a broad base of codes. Obviously pure integer codes would like to see 1:1 parity, and in fact would probably like an integer GPU, with twice as much integer throughput and zero floating-point throughput, to take the analogy to the extreme. Such a chip would not be viable as a graphics device.

Since it seems like perhaps the only case in question is integer multiply on cc 5.0, I went ahead and did a comparison for a single integer multiply on cc 3.5 and cc 5.0.

kernel:

__global__ void mult_test(const int a, const int b, int *c){
  *c = a*b;
}

cc3.5:

.
        /*0008*/                   MOV R1, c[0x0][0x44];        /* 0x64c03c00089c0006 */
        /*0010*/                   MOV R0, c[0x0][0x144];       /* 0x64c03c00289c0002 */
        /*0018*/                   MOV R2, c[0x0][0x148];       /* 0x64c03c00291c000a */
        /*0020*/                   MOV R3, c[0x0][0x14c];       /* 0x64c03c00299c000e */
        /*0028*/                   IMUL R0, R0, c[0x0][0x140];  /* 0x61c01800281c0002 */
        /*0030*/                   ST.E [R2], R0;               /* 0xe4800000001c0800 */
        /*0038*/                   EXIT;                        /* 0x18000000001c003c */

cc 5.0:

.
        /*0008*/                   MOV R1, c[0x0][0x20];                /* 0x4c98078000870001 */
        /*0010*/                   MOV R6, c[0x0][0x144];               /* 0x4c98078005170006 */
        /*0018*/                   MOV R3, c[0x0][0x140];               /* 0x4c98078005070003 */
                                                                        /* 0x001f8400fc2007e1 */
        /*0028*/                   XMAD R0, R6, R3, RZ;                 /* 0x5b007f8000370600 */
        /*0030*/                   XMAD.MRG R5, R6, R3.H1, RZ;          /* 0x5b007fa800370605 */
        /*0038*/                   MOV R2, c[0x0][0x148];               /* 0x4c98078005270002 */
                                                                        /* 0x001fc400fc4007e4 */
        /*0048*/                   MOV R3, c[0x0][0x14c];               /* 0x4c98078005370003 */
        /*0050*/                   XMAD.PSL.CBCC R0, R6.H1, R5.H1, R0;  /* 0x5b30001800570600 */
        /*0058*/                   STG.E [R2], R0;                      /* 0xeedc200000070200 */
                                                                        /* 0x001f8000ffe007ff */
        /*0068*/                   EXIT;                                /* 0xe30000000007000f */

So the single-instruction integer multiply (IMUL) on cc 3.5 gets expanded to a short sequence on cc 5.0: three XMAD instructions in this case. Note that 64-bit integer operations get expanded to multiple-instruction sequences on all architectures.
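You can verify the 64-bit case the same way with a 64-bit variant of the test kernel (the kernel name is made up for illustration); dumping it with cuobjdump -sass should show a multi-instruction sequence of partial multiplies and adds-with-carry on every architecture:

```cuda
// 64-bit version of the multiply test. There is no native 64-bit
// integer multiply instruction, so the compiler emits a sequence of
// 32-bit partial products combined with carry propagation.
__global__ void mult_test64(const long long a, const long long b,
                            long long *c)
{
    *c = a * b;
}
```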

I just tested it on a Titan X (Maxwell) GPU with compute capability 5.2 and CUDA toolkit 7.5: the number of “multiple instructions” for a 32-bit integer multiplication is 5.