Peak performance of integer operations

Hi,

I am working on a video codec using GPUs. Since most of the operations in video encoding/decoding are integer, I would like to know the peak performance for integer ops on Nvidia GPUs.
Similar questions have appeared before:

https://devtalk.nvidia.com/default/topic/367268/?comment=2633452
https://devtalk.nvidia.com/default/topic/368314/?comment=2640390
But they are 7 years old.

Also, I am aware of the table regarding peak instruction throughput in the programming guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions__throughput-native-arithmetic-instructions

However, some table cells are labeled “Multiple instructions”, which is vague.
Also, it is quite disappointing that the gap between peak floating-point and integer ops keeps growing: 1/3 on Fermi, 1/6 on Kepler, and “I need to benchmark” on Maxwell.
So could anyone from Nvidia say what these “Multiple instructions” counts actually are, why this trend exists, and what in the architecture limits peak integer ops?

Best regards

CUDA-Z will report the integer GIOPS of your device, or you can write your own benchmark code.

The GTX 780 Ti is the best Nvidia GPU for integer GIOPS, at about 1,140 (via CUDA-Z).

http://cuda-z.sourceforge.net/
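If you want to roll your own benchmark, a minimal sketch might look like the following. The kernel name, grid dimensions, and iteration count are illustrative assumptions; a serious benchmark would also pin clocks and subtract loop overhead:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread runs a long chain of dependent 32-bit integer
// multiply-adds; the dependency chain keeps the compiler from
// optimizing the work away.
__global__ void imad_bench(int *out, int iters)
{
    int a = threadIdx.x + 1;
    int b = blockIdx.x + 3;
    for (int i = 0; i < iters; ++i) {
        a = a * b + i;  // one multiply-add per iteration
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;  // force a side effect
}

int main()
{
    const int blocks = 1024, threads = 256, iters = 100000;
    int *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    imad_bench<<<blocks, threads>>>(d_out, iters);  // warm-up launch
    cudaEventRecord(start);
    imad_bench<<<blocks, threads>>>(d_out, iters);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One integer multiply-add per iteration per thread.
    double ops = (double)blocks * threads * iters;
    printf("%.1f integer GOPS\n", ops / (ms * 1e6));

    cudaFree(d_out);
    return 0;
}
```

Note this measures multiply-add throughput; counting the multiply and add as two operations would double the reported number, which is one reason published GIOPS figures are hard to compare.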

“Multiple instructions” means there is no native instruction to perform that operation, and instead the compiler emits an instruction sequence to perform the operation. It is typically on the order of 5-50 instructions. This can vary from operation to operation, architecture to architecture, and even compiler version to compiler version. If you want to find out what it is for a specific case, create a small test code, compile it, and then dump the machine code using

nvcc -arch=sm_35 -o mycode mycode.cu
cuobjdump -sass mycode

32-bit integer add runs at approximately the same throughput as the corresponding floating-point operations on all architectures, so I guess your concern is primarily about 32-bit integer multiply.

The underlying architecture for all GPUs must efficiently support graphics operations, where single-precision floating point is the dominant operation. The mix of all other functional units is adjusted architecturally to provide the best cost/performance/power behavior on currently existing codes. Compute codes are also considered in this, but again there is no simple explanation for why the integer multiply ratios change, except to say that it is an ongoing tuning process to optimize the chip for certain graphics and compute targets.

Stated another way, the chip architects felt that having 1:1 parity between integer and floating-point performance was not the best use of silicon budgets (area, power, cost). This judgment necessarily rests on some sort of analysis over a broad base of codes. Obviously pure integer codes would like to see 1:1 parity, and in fact would probably like an integer GPU, with twice as much integer throughput and zero floating-point throughput, to take the analogy to the extreme. Such a chip would not be viable as a graphics device.

Since it seems like perhaps the only case in question is integer multiply on cc 5.0, I went ahead and did a comparison for a single integer multiply on cc 3.5 and cc 5.0.

kernel:

__global__ void mult_test(const int a, const int b, int *c){
  *c = a*b;
}

cc3.5:

.
        /*0008*/                   MOV R1, c[0x0][0x44];        /* 0x64c03c00089c0006 */
        /*0010*/                   MOV R0, c[0x0][0x144];       /* 0x64c03c00289c0002 */
        /*0018*/                   MOV R2, c[0x0][0x148];       /* 0x64c03c00291c000a */
        /*0020*/                   MOV R3, c[0x0][0x14c];       /* 0x64c03c00299c000e */
        /*0028*/                   IMUL R0, R0, c[0x0][0x140];  /* 0x61c01800281c0002 */
        /*0030*/                   ST.E [R2], R0;               /* 0xe4800000001c0800 */
        /*0038*/                   EXIT;                        /* 0x18000000001c003c */

cc 5.0:

.
        /*0008*/                   MOV R1, c[0x0][0x20];                /* 0x4c98078000870001 */
        /*0010*/                   MOV R6, c[0x0][0x144];               /* 0x4c98078005170006 */
        /*0018*/                   MOV R3, c[0x0][0x140];               /* 0x4c98078005070003 */
                                                                        /* 0x001f8400fc2007e1 */
        /*0028*/                   XMAD R0, R6, R3, RZ;                 /* 0x5b007f8000370600 */
        /*0030*/                   XMAD.MRG R5, R6, R3.H1, RZ;          /* 0x5b007fa800370605 */
        /*0038*/                   MOV R2, c[0x0][0x148];               /* 0x4c98078005270002 */
                                                                        /* 0x001fc400fc4007e4 */
        /*0048*/                   MOV R3, c[0x0][0x14c];               /* 0x4c98078005370003 */
        /*0050*/                   XMAD.PSL.CBCC R0, R6.H1, R5.H1, R0;  /* 0x5b30001800570600 */
        /*0058*/                   STG.E [R2], R0;                      /* 0xeedc200000070200 */
                                                                        /* 0x001f8000ffe007ff */
        /*0068*/                   EXIT;                                /* 0xe30000000007000f */

So the single-instruction integer multiply (IMUL) on cc 3.5 gets expanded to a short sequence on cc 5.0: three XMAD instructions in this case. Note that 64-bit integer operations get expanded to multiple-instruction sequences on all architectures.
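You can verify the 64-bit case the same way with a 64-bit variant of the test kernel (the kernel name is made up for illustration); dumping it with cuobjdump -sass should show a multi-instruction sequence of partial multiplies and adds-with-carry on every architecture:

```cuda
// 64-bit version of the multiply test. There is no native 64-bit
// integer multiply instruction, so the compiler emits a sequence of
// 32-bit partial products combined with carry propagation.
__global__ void mult_test64(const long long a, const long long b,
                            long long *c)
{
    *c = a * b;
}
```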

I just tested it on a Titan X (Maxwell) GPU with compute capability 5.2 and CUDA toolkit 7.5: the number of “multiple instructions” for a 32-bit integer multiplication is 5.