Integer addition throughput benchmark

Hi.

I’m trying to write a benchmark for testing the throughput of integer addition on Jetson TK1.
In theory, I should be able to get 160 IPC on Jetson’s single SMX.

Initially, I was having trouble getting the code to work.
nvcc kept collapsing any sequence of integer addition instructions down to a single instruction.
So I added
-O0 -Xcicc -O0 -Xptxas -O0
to nvcc to remove all optimizations.
The resulting code seems to work: the cuobjdump -sass output for the executable shows the right number of IADD instructions.

However, I’m not getting 160 IPC (or equivalently 136.32 GIops @ 852 MHz).
I’m only getting 80.7 GIops with ILP=2,3,4 and 92.2 GIops with ILP=1.
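
(For reference, the conversion I'm using is throughput = adds/clock * clock rate: 160 * 0.852 GHz = 136.32 GIops. Working backwards, 80.7 GIops is roughly 95 adds/clock and 92.2 GIops is roughly 108 adds/clock.)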

I’ve also tried turning the optimizations back on and mixing in XOR instructions so that the compiler won’t optimize away the additions.
I assumed I should get the same performance, since XOR also has a throughput of 160 IPC.
However, although cuobjdump shows that I’m getting the right number of IADD and LOP32I.XOR instructions, my performance is the same as before.

I would appreciate any help on this; I’ve searched the forums but haven’t seen a good answer for why an integer-add IPC of 160 is hard to reach on a Kepler SMX.

Thank you.

p.s.
Here is a snippet of my integer kernel (no XOR) with ILP==2, TYPE==int:

__global__ void int_kernel_2_1(int num_threads, int num_streams, TYPE* in, TYPE* out)
{
    uint tid = threadIdx.x + blockIdx.x * blockDim.x;
    TYPE x, y;

    if (tid < num_threads) {
        x = in[tid];
        y = in[num_threads + tid];

        // ILP == 2: two independent chains of dependent adds
        x = x + 5;
        y = y + 5;
        x = x + 5;
        y = y + 5;
        x = x + 5;
        y = y + 5;
        x = x + 5;
        y = y + 5;

        out[tid] = x;
        out[num_threads + tid] = y;
    }
}
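
For reference, this is roughly how I’m timing it and converting to GIops; the grid/block sizes, iteration count, and ADDS_PER_THREAD value below are placeholders for illustration rather than my exact harness:

// Sketch of the host-side timing (sizes and counts are placeholders).
#include <cstdio>
#include <cuda_runtime.h>

#define TYPE int
#define ADDS_PER_THREAD 8   // the ILP==2 kernel above does 4 adds on x + 4 on y

__global__ void int_kernel_2_1(int num_threads, int num_streams, TYPE* in, TYPE* out);

int main()
{
    const int num_threads = 1 << 20;   // placeholder thread count
    const int iterations  = 1000;      // placeholder repetition count

    TYPE *d_in, *d_out;
    cudaMalloc((void**)&d_in,  2 * num_threads * sizeof(TYPE));
    cudaMalloc((void**)&d_out, 2 * num_threads * sizeof(TYPE));

    dim3 block(256);
    dim3 grid((num_threads + block.x - 1) / block.x);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i)
        int_kernel_2_1<<<grid, block>>>(num_threads, 2, d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double total_adds = (double)num_threads * ADDS_PER_THREAD * iterations;
    printf("%.2f GIops\n", total_adds / (ms * 1e-3) / 1e9);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}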

I’m not sure what the pipeline depth of Kepler is, but it’s at least as deep as Maxwell’s, which is 6. This means that without an ILP of at least the pipeline depth, the SMX is going to have to do some heavy context switching to fill the dependency stalls. Context switches are cheap but not free. You want to be able to get some decent amount of actual ILP in before switching to another warp and getting your performance from TLP.

I’d try much bigger values of ILP, like 16 or 32. Oh, and I’ve found the -O0 ptxas option tends to order the instructions the way you want, but it also fails to cull a lot of nonsense instructions. Though it sounds like this may not be an issue for such straightforward code.
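
Something like this is what I mean by cranking up the ILP without writing every variable out by hand; the template parameter and repetition count are placeholders, and note the constant adds will still get folded at full optimization unless you keep the -O0 flags or mix in the XOR trick:

// Sketch only: ILP independent accumulators give ILP independent add chains.
// ILP and the repetition count are placeholder values, not benchmarked ones.
template <int ILP>
__global__ void int_kernel_ilp(int num_threads, const int* in, int* out)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= (unsigned int)num_threads) return;

    int v[ILP];
    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        v[i] = in[i * num_threads + tid];

    #pragma unroll
    for (int r = 0; r < 64; ++r) {      // arbitrary repetition count
        #pragma unroll
        for (int i = 0; i < ILP; ++i)   // ILP chains, no cross-dependencies
            v[i] = v[i] + 5;            // gets constant-folded at full opt;
    }                                   // keep -O0 or mix in the XOR trick

    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        out[i * num_threads + tid] = v[i];
}
// launched e.g. as int_kernel_ilp<16><<<grid, block>>>(num_threads, d_in, d_out);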

At least with all-immediate secondary operands you’re not dealing with any register bank conflict issues. It always helps to see the full SASS to figure these issues out.

Compile flags:
-gencode arch=compute_32,code=sm_32 --ptxas-options=-v -m32 -Xcompiler -mfloat-abi=hard

This is what’s in the main body:

__global__ void int_kernel_2_256(int num_threads, int num_streams, TYPE* in, TYPE* out)
{
    uint tid = threadIdx.x + blockIdx.x * blockDim.x;
    TYPE x, y;

    if (tid < num_threads) {
        x = in[tid];
        y = in[num_threads + tid];

        x = x + CONST;
        x = x ^ 0xf0f0f0f0;
        y = y + CONST;
        y = y ^ 0xf0f0f0f0;

        // (this add/XOR pattern repeats many more times in the full source,
        //  as the SASS below shows)

        out[tid] = x;
        out[num_threads + tid] = y;
    }
}

and in the objdump:

Function : _Z16int_kernel_2_256iiPiS_

.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                        /* 0x08a000b0a0a08c00 */
/*0008*/    MOV R1, c[0x0][0x44];                       /* 0x64c03c00089c0006 */
/*0010*/    S2R R0, SR_CTAID.X;                         /* 0x86400000129c0002 */
/*0018*/    S2R R3, SR_TID.X;                           /* 0x86400000109c000e */
/*0020*/    IMAD R2, R0, c[0x0][0x28], R3;              /* 0x51080c00051c000a */
/*0028*/    ISETP.GE.U32.AND P0, PT, R2, c[0x0][0x140], PT;   /* 0x5b601c00281c081e */
/*0030*/    @P0 BRA.U 0x12d8;                           /* 0x120000095000023c */
/*0038*/    @!P0 ISCADD R4, R2, c[0x0][0x148], 0x2;     /* 0x60c0080029200812 */
                                                        /* 0x08a0b8a010b8a010 */
/*0048*/    @!P0 IADD R0, R2, c[0x0][0x140];            /* 0x6080000028200802 */
/*0050*/    @!P0 LD R4, [R4];                           /* 0xc400000000201010 */
/*0058*/    @!P0 ISCADD R3, R0, c[0x0][0x148], 0x2;     /* 0x60c008002920000e */
/*0060*/    @!P0 LD R3, [R3];                           /* 0xc400000000200c0c */
/*0068*/    @!P0 IADD R5, R4, 0x5;                      /* 0xc080000002a01015 */
/*0070*/    @!P0 LOP32I.XOR R4, R5, 0xf0f0f0f0;         /* 0x2278787878201410 */
/*0078*/    @!P0 IADD R5, R3, 0x5;                      /* 0xc080000002a00c15 */
                                                        /* 0x08a09c80a010a010 */
/*0088*/    @!P0 IADD R3, R4, 0x5;                      /* 0xc080000002a0100d */
/*0090*/    @!P0 LOP32I.XOR R4, R5, 0xf0f0f0f0;         /* 0x2278787878201410 */
/*0098*/    @!P0 IADD R4, R4, 0x5;                      /* 0xc080000002a01011 */
/*00a0*/    @!P0 LOP32I.XOR R3, R3, 0xf0f0f0f0;         /* 0x2278787878200c0c */
/*00a8*/    @!P0 LOP32I.XOR R4, R4, 0xf0f0f0f0;         /* 0x2278787878201010 */
/*00b0*/    @!P0 IADD R3, R3, 0x5;                      /* 0xc080000002a00c0d */
/*00b8*/    @!P0 IADD R4, R4, 0x5;                      /* 0xc080000002a01011 */
                                                        /* 0x08a09c80a010a010 */
/*00c8*/    @!P0 LOP32I.XOR R3, R3, 0xf0f0f0f0;         /* 0x2278787878200c0c */
/*00d0*/    @!P0 LOP32I.XOR R4, R4, 0xf0f0f0f0;         /* 0x2278787878201010 */
/*00d8*/    @!P0 IADD R3, R3, 0x5;                      /* 0xc080000002a00c0d */
/*00e0*/    @!P0 IADD R4, R4, 0x5;                      /* 0xc080000002a01011 */
/*00e8*/    @!P0 LOP32I.XOR R3, R3, 0xf0f0f0f0;         /* 0x2278787878200c0c */
/*00f0*/    @!P0 LOP32I.XOR R4, R4, 0xf0f0f0f0;         /* 0x2278787878201010 */
/*00f8*/    @!P0 IADD R3, R3, 0x5;                      /* 0xc080000002a00c0d */
                                                        /* 0x08a09c80a010a010 */
/*0108*/    @!P0 IADD R4, R4, 0x5;                      /* 0xc080000002a01011 */
/*0110*/    @!P0 LOP32I.XOR R3, R3, 0xf0f0f0f0;         /* 0x2278787878200c0c */
/*0118*/    @!P0 LOP32I.XOR R4, R4, 0xf0f0f0f0;         /* 0x2278787878201010 */
/*0120*/    @!P0 IADD R3, R3, 0x5;                      /* 0xc080000002a00c0d */
/*0128*/    @!P0 IADD R4, R4, 0x5;                      /* 0xc080000002a01011 */
/*0130*/    @!P0 LOP32I.XOR R3, R3, 0xf0f0f0f0;         /* 0x2278787878200c0c */
/*0138*/    @!P0 LOP32I.XOR R4, R4, 0xf0f0f0f0;         /* 0x2278787878201010 */
                                                        /* 0x08a09c80a010a010 */
/*0148*/    @!P0 IADD R3, R3, 0x5;                      /* 0xc080000002a00c0d */
/*0150*/    @!P0 IADD R4, R4, 0x5;                      /* 0xc080000002a01011 */
/*0158*/    @!P0 LOP32I.XOR R3, R3, 0xf0f0f0f0;         /* 0x2278787878200c0c */
/*0160*/    @!P0 LOP32I.XOR R4, R4, 0xf0f0f0f0;         /* 0x2278787878201010 */
/*0168*/    @!P0 IADD R3, R3, 0x5;                      /* 0xc080000002a00c0d */
/*0170*/    @!P0 IADD R4, R4, 0x5;                      /* 0xc080000002a01011 */
/*0178*/    @!P0 LOP32I.XOR R3, R3, 0xf0f0f0f0;         /* 0x2278787878200c0c */

/*1290*/    @!P0 IADD R3, R3, 0x5;                      /* 0xc080000002a00c0d */
/*1298*/    @!P0 IADD R4, R4, 0x5;                      /* 0xc080000002a01011 */
/*12a0*/    @!P0 ISCADD R2, R2, c[0x0][0x14c], 0x2;     /* 0x60c0080029a0080a */
/*12a8*/    @!P0 LOP32I.XOR R3, R3, 0xf0f0f0f0;         /* 0x2278787878200c0c */
/*12b0*/    @!P0 ISCADD R0, R0, c[0x0][0x14c], 0x2;     /* 0x60c0080029a00002 */
/*12b8*/    @!P0 LOP32I.XOR R4, R4, 0xf0f0f0f0;         /* 0x2278787878201010 */
                                                        /* 0x08000000b81000b8 */
/*12c8*/    @!P0 ST [R2], R3;                           /* 0xe40000000020080c */
/*12d0*/    @!P0 ST [R0], R4;                           /* 0xe400000000200010 */
/*12d8*/    MOV RZ, RZ;                                 /* 0xe4c03c007f9c03fe */
/*12e0*/    EXIT;                                       /* 0x18000000001c003c */
/*12e8*/    BRA 0x12e8;                                 /* 0x12007ffffc1c003c */
/*12f0*/    NOP;                                        /* 0x85800000001c3c02 */
/*12f8*/    NOP;                                        /* 0x85800000001c3c02 */

Scott, your Maxwell analyses on different threads are full of gems! You just dropped another when you said “Context switches are cheap but not free. You want to be able to get some decent amount of actual ILP in before switching to another warp and getting your performance from TLP.”

This surprises me! The NVidia whitepapers and descriptions of TLP abstract the details away, but they all seem to imply or even explicitly state that context switches between warps (even warps from other blocks) are free. There are many scheduler diagrams from NVidia and others reinforcing this idea. Of course, even those discussions often abstract away details (especially about dual-issue, which adds extra complexity to the discussion).

Can you expand more on what you mean by context-switch overhead? As an anticipatory guess, maybe each of the 4 schedulers can only hold a few warps in a “ready queue”, and when those are exhausted the scheduler has to “context switch” to grab new warps from an SMM-level pool, or perhaps steal and exchange ownership with one of its sibling schedulers. Those ideas are just made up, but they would fit the idea of zero context-switch overhead with exceptions, and your discussion is talking about the cost of those exceptions.

@lemonherb, my morning coffee idea was to use “add.sat.s32” opcodes instead of a simple add. The “saturation” feature should squelch any constant optimizations. Looks like it worked and the SASS looks right.

The nvprof profiler reports an IPC of 4.05 on Maxwell and about 3.53 on sm_35 (GK208) and sm_30 (GK104) devices.

An IPC of 4.05 equates to ~128 add.sat’s per clock on Maxwell… which is the max integer ops/clock. Impressive that Maxwell made it so easy.

However, an IPC of ~3.5 is ~112 integer ops/clock on Kepler. That’s still a long way from an IPC of 5 and 160 integer ops/clock. Maybe someone can tweak the code.

I didn’t try it on my Jetson TK1 board because I’m waiting for CUDA 6.5.

Here’s the source code.
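
(The full listing isn’t reproduced here, but the core of it is just a saturating add issued through inline PTX, roughly like the sketch below; the ILP/REPS values and the add_sat wrapper name are placeholders, not the real benchmark.)

// Sketch of the add.sat.s32 trick: the saturating add can't be constant-folded,
// so the dependency chains survive at full optimization.
// ILP and REPS are placeholder values.
__device__ __forceinline__ int add_sat(int a, int b)
{
    int d;
    asm("add.sat.s32 %0, %1, %2;" : "=r"(d) : "r"(a), "r"(b));
    return d;
}

#define ILP  4
#define REPS 256

__global__ void sat_add_kernel(const int* in, int* out)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    int v[ILP];
    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        v[i] = in[tid] + i;              // independent starting values

    #pragma unroll
    for (int r = 0; r < REPS; ++r) {
        #pragma unroll
        for (int i = 0; i < ILP; ++i)    // ILP independent add.sat chains
            v[i] = add_sat(v[i], 1);
    }

    int acc = 0;
    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        acc = acc + v[i];
    out[tid] = acc;                      // keep the results live
}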

Well, compared to a CPU thread or process context switch, they are essentially “free”. But if you look at the actual throughput of code like the above, it clearly doesn’t run at full speed.

From my Maxwell assembler work I haven’t 100% confirmed this, but I’m pretty sure at least one of the control flags is a yield hint. Ptxas seems to set it mainly on instructions that need to stall 5 or more clocks because of a dependency, which implies that yielding for small stall counts isn’t worth the cost. I haven’t really examined enough code to confirm this hunch, though.

Also, code like the above from lemonherb, but with an ILP of just one (each instruction dependent on the last) runs at about 56% of full throughput on Maxwell (but maybe faster on Kepler?).

And finally, in my sgemm code, scattering “yield” flags throughout the highly independent instructions actually makes it run faster. I believe this is because it doesn’t let any one warp rush too far ahead and get stuck at a barrier. Keeping all your warps equally active keeps things more balanced and gives you the best chance of TLP when you hit a latency that needs to be filled (like a memory load).

But this is all just educated guessing. There could be something more subtle at play here that I’m not aware of. With the assembler I wrote, though, you should now be able to carefully construct some benchmarks to test these hypotheses. For example, write some code that you know has no hidden stalls (like register bank conflicts or instruction-fetch stalls) and measure the clock values from all threads. Then introduce something you know will force a context switch and see whether all the clocks are accounted for, or whether some are lost in the process of the switch.
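
Something along these lines is the kind of probe I have in mind; the workload in the middle is just a placeholder for whatever instruction mix you’re testing (with the usual constant-folding caveats), and the array names are made up:

// Per-warp clock probe: record start/stop clocks for every warp and dump
// them so you can see whether cycles go missing when a switch is forced.
__global__ void clock_probe(const int* in, int* out,
                            unsigned int* warp_start, unsigned int* warp_stop)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int x = in[tid];                     // runtime data, so nothing folds away
    int y = in[tid] >> 1;

    unsigned int t0 = (unsigned int)clock();

    #pragma unroll
    for (int i = 0; i < 256; ++i) {      // placeholder workload under test
        x = x + y;                       // same add+XOR mix as the kernels
        x = x ^ y;                       // above so it can't be folded
    }

    unsigned int t1 = (unsigned int)clock();

    if ((threadIdx.x & 31) == 0) {       // one record per warp
        unsigned int warp = tid / 32;
        warp_start[warp] = t0;
        warp_stop[warp]  = t1;
    }
    out[tid] = x;                        // keep the result live
}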

Also, I do know that you can never assume your code will run without switches until it hits a hint, stall, or barrier. A switch can happen at any time for reasons internal to the hardware, and you have to take this into account when manually scheduling your own code (that is, don’t take shortcuts with your stall counts by assuming they’re covered by the latency of some other instruction).

Oh, and lemonherb: try higher ILP values and see how that affects things. You want to get at least beyond the pipeline depth of Kepler, which I’m pretty sure is more than 6.

First of all, thanks for all the help, everyone.

I’ve tried pushing my ILP to 8 and I am still getting ~3.55 IPC on the TK1.
I tested allanmac’s code with ILP==16, 4 repetitions per loop, and 16K iterations (the cuobjdump output suggests the loop wasn’t unrolled), and got 3.66 IPC.

These are both slight improvements over the IPC of 3.39 I get when ILP == 2, but still not close to the peak IPC of 5.
The highest I’ve observed is still 3.875 when ILP == 1, which is close to the theoretical peak of 4 when no ILP is used.

I’d really appreciate any help on this matter.

Thank you!