Achieving Peak Compute Performance on Kepler

I have a question regarding getting the theoretical peak on a Kepler machine.
I’ve been trying to create a microbenchmark that can hit the specified peak of 4.3 TFLOP/s on a K40c.

The benchmark kernel looks like this:

__global__ void
testKernel (int n, float *in, float *out)
{
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ float sm[512];

    float r = (float) (gid & 0x00000003);
    float s = (float) (gid & 0x000000C0);
    float t = (float) (gid & 0x00000006);
    float u = (float) (gid & 0x00000009);

    r += r * 0.01f;
    s += s * 0.02f;
    t += t * 0.03f;
    u += u * 0.04f;
    // ... the four lines above repeated ~256 times

    sm[threadIdx.x] = r + s + t + u;
}

in the main function:

nThreads = 16777216;
tbSize = 512;
gridSize = (nThreads + tbSize - 1) / tbSize;

clock_gettime (CLOCK_MONOTONIC_RAW, &tStart);
testKernel <<<gridSize, tbSize>>> (nThreads, d_In, d_Out);
cudaDeviceSynchronize ();
clock_gettime (CLOCK_MONOTONIC_RAW, &tEnd);

Now, I am only getting ~3.2 TFLOP/s instead of the specified 4.3, and I have no idea why.

nvprof reports:
achieved_occupancy: 0.94
ipc: 5
stall_inst_fetch: 15%
stall_exec_dependency: 19%

Now, with this many threads, occupancy, and ILP, shouldn’t there be no stalls?
What should be the theoretical max IPC? How do we calculate this?
Does anybody have any code that can actually hit peak on either a K40 or a GTX Titan?

Thank you.

It may be that Kepler can't actually sustain 4.3 TFLOP/s of FP32 throughput, likely because of register-file bandwidth limits. NVIDIA does not go into low-level architecture details, but one model of the Kepler SMX consists of 4 "quads", each with a scheduler and its own pool of registers. If each register file can only supply 3 operands per tick, that's just enough to feed 32 SP cores the 3 arguments an FMADD needs, but no more. With 4 quads, that's only 128 of the 192 SPs fed per cycle, giving about 2/3 of the theoretical peak, which is roughly in line with your results.

Others have tried and failed to reach the 4.3 TF limit too.

This paper goes into some detail on this issue (check out table 2):

Lai now works for NVIDIA, and if you dump the SASS from the cuBLAS library you can find an implementation that uses all of the techniques discussed in that paper (plus a double-buffering approach made possible by the additional registers in sm_35), which leads me to believe he likely wrote it. Anyway, that cuBLAS SGEMM implementation (sgemm_sm35_ldg_nt_128x8x128x16x16) is probably about as fast as anyone is going to get Kepler to go (with useful FFMA instructions, anyway). You can even see no-ops inserted into the code, presumably to prevent over-saturating the FP pipeline.

cuobjdump -sass -fun sgemm_sm35_ldg_nt_128x8x128x16x16 cublas_device.lib

With Maxwell, I've noticed it's much easier to reach the theoretical FFMA throughput (so long as you can wrestle with the quirks of ptxas and keep it from senselessly allocating additional registers or completely reorganizing carefully placed load and compute instructions meant to hide latency).

Thank you for your replies.
At least it looks like I’m not missing something obvious.