“Multiple instructions” means there is no native instruction for that operation, so the compiler emits an instruction sequence to perform it instead. The sequence is typically on the order of 5-50 instructions, but this can vary from operation to operation, architecture to architecture, and even compiler version to compiler version. If you want to find out what it is for a specific case, write a small test code, compile it, and then dump the machine code using
cuobjdump -sass mycode
32-bit integer add has approximately the same throughput as the corresponding floating-point operation on all architectures. So I guess your concern is primarily around the 32-bit integer multiply.
The underlying architecture for all GPUs must efficiently support graphics operations, where single-precision floating point is the dominant operation. The mix of all other functional units is adjusted architecturally to provide the best cost/performance/power consumption behavior on the currently existing codes. Compute codes are also considered in this, but there is no simple explanation for why the integer multiply ratios change, except to say that it is an ongoing tuning process to optimize the chip for certain graphics and compute targets.
Stated another way, the chip architects felt that having 1:1 parity between integer and floating-point performance was not the best use of the silicon budgets (area, power, cost). This judgement is necessarily based on some sort of analysis over a broad base of codes. Obviously pure integer codes would like to see 1:1 parity, and in fact would probably like an integer GPU with twice as much integer throughput and zero floating-point throughput, to take the argument to the extreme. Such a chip would not be viable as a graphics device.
Since it seems like perhaps the only case in question is integer multiply on cc 5.0, I went ahead and did a comparison for a single integer multiply on cc 3.5 and cc 5.0.
kernel:
__global__ void mult_test(const int a, const int b, int *c){
    *c = a*b;
}
cc 3.5:
.
/*0008*/ MOV R1, c[0x0][0x44]; /* 0x64c03c00089c0006 */
/*0010*/ MOV R0, c[0x0][0x144]; /* 0x64c03c00289c0002 */
/*0018*/ MOV R2, c[0x0][0x148]; /* 0x64c03c00291c000a */
/*0020*/ MOV R3, c[0x0][0x14c]; /* 0x64c03c00299c000e */
/*0028*/ IMUL R0, R0, c[0x0][0x140]; /* 0x61c01800281c0002 */
/*0030*/ ST.E [R2], R0; /* 0xe4800000001c0800 */
/*0038*/ EXIT; /* 0x18000000001c003c */
cc 5.0:
.
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ MOV R6, c[0x0][0x144]; /* 0x4c98078005170006 */
/*0018*/ MOV R3, c[0x0][0x140]; /* 0x4c98078005070003 */
/* 0x001f8400fc2007e1 */
/*0028*/ XMAD R0, R6, R3, RZ; /* 0x5b007f8000370600 */
/*0030*/ XMAD.MRG R5, R6, R3.H1, RZ; /* 0x5b007fa800370605 */
/*0038*/ MOV R2, c[0x0][0x148]; /* 0x4c98078005270002 */
/* 0x001fc400fc4007e4 */
/*0048*/ MOV R3, c[0x0][0x14c]; /* 0x4c98078005370003 */
/*0050*/ XMAD.PSL.CBCC R0, R6.H1, R5.H1, R0; /* 0x5b30001800570600 */
/*0058*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/* 0x001f8000ffe007ff */
/*0068*/ EXIT; /* 0xe30000000007000f */
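So the single-instruction integer multiply (IMUL) on cc 3.5 gets expanded to about 4 instructions on cc 5.0: three XMAD instructions, plus one extra MOV because the second operand is now loaded into a register first instead of being read directly from the constant bank. Note that all 64-bit integer operations, on all architectures, get expanded to multi-instruction sequences.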
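To make that expansion a bit more concrete, here is a rough host-side C++ sketch (it should also compile as CUDA host code) of the arithmetic behind it. It assumes XMAD is essentially a 16-bit x 16-bit multiply with a 32-bit add, and it does not try to model the exact semantics of the .MRG, .PSL or .CBCC modifiers. The function names are my own, and the 64-bit add only illustrates the general idea of building a wide operation from 32-bit pieces; it is not a claim about the exact sequence the compiler emits.

```cpp
#include <cassert>
#include <cstdint>

// Low 32 bits of a*b, built only from 16x16-bit products and 32-bit adds.
// This is the kind of work a 16-bit multiply-add unit can do per instruction.
static inline uint32_t mul32_from_16bit_parts(uint32_t a, uint32_t b) {
    const uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    const uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    const uint32_t low   = a_lo * b_lo;                // lands at bit 0
    const uint32_t cross = a_lo * b_hi + a_hi * b_lo;  // lands at bit 16
    // a_hi * b_hi would land at bit 32 and above, so it cannot affect
    // the low 32 bits and is simply never computed.
    return low + (cross << 16);                        // wraps modulo 2^32
}

// 64-bit add built from two 32-bit adds with an explicit carry.
static inline uint64_t add64_from_32bit_parts(uint64_t x, uint64_t y) {
    const uint32_t x_lo = static_cast<uint32_t>(x);
    const uint32_t x_hi = static_cast<uint32_t>(x >> 32);
    const uint32_t y_lo = static_cast<uint32_t>(y);
    const uint32_t y_hi = static_cast<uint32_t>(y >> 32);

    const uint32_t lo    = x_lo + y_lo;                // may wrap
    const uint32_t carry = (lo < x_lo) ? 1u : 0u;      // wrapped => carry out
    const uint32_t hi    = x_hi + y_hi + carry;
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```

The low 32 bits of a product are the same for signed and unsigned operands, so the same decomposition covers the int multiply in the test kernel. Three partial products are needed, which lines up with the three XMAD instructions in the cc 5.0 dump.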