Integer addition throughput benchmark

Hi.

I’m trying to write a benchmark to test the throughput of integer addition on the Jetson TK1.
In theory, I should be able to get 160 IPC on the Jetson’s SMX.

Initially, I was having trouble getting the code to work.
nvcc folded any sequence of integer additions down to a single instruction.
So I added
-O0 -Xcicc -O0 -Xptxas -O0
to nvcc to remove all optimizations.
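For reference, the full compile line is something like this (assuming sm_32 for the TK1; the file name is just a placeholder):

nvcc -arch=sm_32 -O0 -Xcicc -O0 -Xptxas -O0 int_bench.cu -o int_bench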
The resulting code seems to work: I verified with cuobjdump -sass on the executable that it generates the right number of IADD instructions.
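(Counting them is easy with, e.g., cuobjdump -sass int_bench | grep -c IADD.)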

However, I’m not getting 160 IPC (or equivalently 136.32 GIops @ 852 MHz).
I’m only getting 80.7 GIops with ILP=2,3,4 and 92.2 GIops with ILP=1.
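(For reference: 92.2 GIops ÷ 0.852 GHz ≈ 108 ops/clock and 80.7 ÷ 0.852 ≈ 95 ops/clock, versus the peak of 160.)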

So I have two questions:

  1. How do I write a benchmark that can reach 160 IPC for integer add throughput?
  2. Why does ILP=1 get better performance than ILP > 1? Is it perhaps because I turned off nvcc optimizations?

I would appreciate any help on this; I’ve searched the forums but haven’t seen a good answer for why an integer-add IPC of 160 is hard to reach on a Kepler SMX.

Thank you.

p.s.
Here is a snippet of my kernel with ILP==2, TYPE==int:

__global__ void int_kernel_2_1(int num_threads, int num_streams, TYPE* in, TYPE* out)
{
    uint tid = threadIdx.x + blockIdx.x * blockDim.x;
    TYPE x, y;

    if (tid < num_threads) {
        x = in[tid];
        y = in[num_threads + tid];

        x = x + 5;
        y = y + 5;
        x = x + 5;
        y = y + 5;
        x = x + 5;
        y = y + 5;
        ....
        x = x + 5;
        y = y + 5;

        out[tid] = x;
        out[num_threads + tid] = y;
    }
}
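For completeness, I’m timing with CUDA events and converting to GIops roughly like this (a sketch, not my exact harness; NUM_ADDS stands for the number of add statements per thread in the kernel):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int num_threads = 1 << 20;
    const int NUM_ADDS    = 256;   // placeholder: adds per thread in the kernel

    // in[] holds two values per thread (for ILP=2); contents don't matter
    // for a throughput test, so the buffers are left uninitialized.
    int *in, *out;
    cudaMalloc(&in,  2 * num_threads * sizeof(int));
    cudaMalloc(&out, 2 * num_threads * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int block = 256;
    int grid  = (num_threads + block - 1) / block;

    int_kernel_2_1<<<grid, block>>>(num_threads, 2, in, out);   // warm-up
    cudaEventRecord(start);
    int_kernel_2_1<<<grid, block>>>(num_threads, 2, in, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);

    // total adds / elapsed time, scaled to giga-ops
    double giops = (double)num_threads * NUM_ADDS / (ms * 1e6);
    printf("%.1f GIops\n", giops);

    cudaFree(in);
    cudaFree(out);
    return 0;
}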

You should dump your executable or .cubin with cuobjdump or nvdisasm.

A quick guess is that all those constant additions will be folded into two IADD ops.
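That is, without -O0 a sequence like

x = x + 5;
x = x + 5;
x = x + 5;

gets folded into the equivalent of a single x = x + 15, i.e., one IADD per value in the SASS.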

It does, which is why I added
-O0 -Xcicc -O0 -Xptxas -O0
to nvcc.
This disables the optimization, and generates the right # of IADD instructions (verified using cuobjdump -sass).

The problem is, I’m only getting < 3.875 warp instructions per cycle, instead of the theoretical 5 (160 ops/clock ÷ 32 threads per warp).
Also, strangely, the IPC with ILP==1 is HIGHER than with ILP==2,3,4.

Sorry, I somehow missed that you dumped the code with cuobjdump!

The impact of “register bank conflicts” has been discussed in other forum postings. With optimizations off, ptxas assigns registers naively, so instruction operands may land in the same register bank and stall issue.

So disabling optimizations might be counterproductive.

Perhaps you could throw in some AND/OR/XOR/MIN/MAX ops to keep the compiler from optimizing away your benchmark? As I’m sure you know, the bitwise/compare ops are also listed at 160 ops/clock.

I’ll look into the register bank conflict issue and see if I can fix it manually (probably not).
If that doesn’t work, I’ll try adding in those instructions.
Since they’re presumably implemented on the same hardware, they should cost about the same as integer add in terms of energy.

Thanks, as always, allanmac!

Actually, never mind.
Mixing in bitwise operations and turning optimizations back on gives me the same performance as before.

So, something else is causing the performance bottleneck.
I am not sure why this is happening, since this is very simple, fully unrolled code.

__global__ void int_kernel_2_1(int num_threads, int num_streams, TYPE* in, TYPE* out)
{
    uint tid = threadIdx.x + blockIdx.x * blockDim.x;
    TYPE x, y;

    if (tid < num_threads) {
        x = in[tid];
        y = in[num_threads + tid];

        x = x + CONST;
        x = x ^ 0xF0F0F0F0;
        y = y + CONST;
        y = y ^ 0xF0F0F0F0;
        x = x + CONST;
        x = x ^ 0xF0F0F0F0;
        y = y + CONST;
        y = y ^ 0xF0F0F0F0;
        ....
        x = x + CONST;
        x = x ^ 0xF0F0F0F0;
        y = y + CONST;
        y = y ^ 0xF0F0F0F0;

        out[tid] = x;
        out[num_threads + tid] = y;
    }
}
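For what it’s worth, an equivalent loop form with a data-dependent chain should also resist folding even with optimizations on (a sketch; ITERS is a placeholder compile-time constant):

__global__ void int_kernel_loop(int num_threads, TYPE* in, TYPE* out)
{
    uint tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < num_threads) {
        TYPE x = in[tid];
        TYPE y = in[num_threads + tid];

        #pragma unroll
        for (int i = 0; i < ITERS; ++i) {
            x = x + y;   // each step depends on the previous result,
            y = y ^ x;   // so nothing can be constant-folded away
        }

        out[tid] = x;
        out[num_threads + tid] = y;
    }
}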

If anybody has any idea how to reach the stated throughput of 160 integer adds per cycle, please help!

Thank you.

@lemonherb, I responded to your post over in the Programming & Performance forum.

I can reach an integer IPC of 3.53 on a Kepler GK208 (~2x a TK1). That’s ~112 integer ops/clock, which is still pretty far from 160.

A Maxwell GPU easily reaches its advertised peak performance of 128 integer ops/clock.

I haven’t tried it on the TK1.