I have a question about reaching the theoretical peak on a Kepler machine.
I’ve been trying to write a microbenchmark that hits the specified single-precision peak of 4.3 TFLOP/s on a K40c.
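For reference, this is how I understand the 4.3 TFLOP/s figure to be derived. It is only a sketch: the helper name peakSpFlops is made up for illustration, and the inputs (15 SMX units, 192 CUDA cores per SMX, 745 MHz base clock) are the published K40 figures rather than values queried from the device.

```cpp
#include <cassert>

// Theoretical single-precision peak in FLOP/s.
// Assumption: every CUDA core retires one FMA per cycle,
// and one FMA is counted as 2 floating-point operations.
double peakSpFlops(int smCount, int coresPerSm, double clockHz)
{
    assert(smCount > 0 && coresPerSm > 0 && clockHz > 0.0);
    const double flopsPerFma = 2.0;
    return smCount * static_cast<double>(coresPerSm) * flopsPerFma * clockHz;
}
```

Calling peakSpFlops(15, 192, 745e6) gives 4.2912e12, i.e. the ~4.3 TFLOP/s in the specs. If the card runs at a boost clock, the attainable peak would be correspondingly higher.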
The kernel looks like this:
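__global__ void testKernel (int n, float *in, float *out)
{
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ float sm[512];   // one slot per thread; sized to match tbSize = 512
    // Four independent values per thread, to provide ILP
    float r = (float) (gid & 0x00000003);
    float s = (float) (gid & 0x000000C0);
    float t = (float) (gid & 0x00000006);
    float u = (float) (gid & 0x00000009);
    // Each line should compile to a single FMA (x = x * c + x)
    r += r * 0.01f;
    s += s * 0.02f;
    t += t * 0.03f;
    u += u * 0.04f;
    // … repeated ~256 times
    // Store the result so the compiler cannot discard the arithmetic
    sm[threadIdx.x] = r + s + t + u;
}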
In the main function:
nThreads = 16777216;
tbSize = 512;
gridSize = (nThreads + tbSize - 1) / tbSize;
clock_gettime (CLOCK_MONOTONIC_RAW, &tStart);
testKernel <<<gridSize, tbSize>>> (nThreads, d_In, d_Out);
cudaDeviceSynchronize ();   // kernel launches are asynchronous; wait before stopping the timer
clock_gettime (CLOCK_MONOTONIC_RAW, &tEnd);
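To be explicit about how the two timestamps turn into a FLOP/s number, here is a minimal host-side sketch. The helper names elapsedSeconds and achievedFlops are hypothetical. It assumes the host has waited for the kernel to finish (for example with cudaDeviceSynchronize()) before tEnd is taken, and that each thread executes 4 x 256 = 1024 FMAs, taking the "~256" repetitions at face value and ignoring the final additions.

```cpp
#include <cassert>
#include <time.h>

// Wall-clock seconds between two clock_gettime() samples.
double elapsedSeconds(const timespec &tStart, const timespec &tEnd)
{
    return static_cast<double>(tEnd.tv_sec - tStart.tv_sec)
         + static_cast<double>(tEnd.tv_nsec - tStart.tv_nsec) * 1e-9;
}

// Achieved FLOP/s, counting each FMA as 2 floating-point operations.
// fmaPerThread is an assumption about the kernel body (4 chains x ~256 repeats).
double achievedFlops(long long nThreads, int fmaPerThread, double seconds)
{
    assert(seconds > 0.0);
    return static_cast<double>(nThreads) * fmaPerThread * 2.0 / seconds;
}
```

With nThreads = 16777216 and 1024 FMAs per thread, the launch performs 2^35 (about 3.44e10) floating-point operations, so achievedFlops(nThreads, 1024, elapsedSeconds(tStart, tEnd)) is the figure to compare against the peak.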
However, I am only getting ~3.2 TFLOP/s instead of the specified 4.3 TFLOP/s, and I have no idea why. The profiler reports:
achieved occupancy is 0.94
ipc is 5
stall_inst_fetch is 15%
stall_exec_dependency is 19%
With this many threads, this occupancy, and four independent chains of ILP per thread, shouldn’t there be no stalls at all?
What should be the theoretical max IPC? How do we calculate this?
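A rough way to frame the IPC question, which may well be wrong. It assumes that ipc is reported per SM in warp-level instructions, that an SMX has 192 FP32 cores, and that a warp is 32 threads wide. Under those assumptions the FP32 units are saturated at 192 / 32 = 6 FMA warp instructions per cycle. The function names below are made up for illustration.

```cpp
#include <cassert>

// Warp-level FMA instructions per cycle needed to keep every FP32 core busy.
double fmaWarpInstrPerCycle(int coresPerSm, int warpSize)
{
    assert(coresPerSm > 0 && warpSize > 0);
    return static_cast<double>(coresPerSm) / warpSize;
}

// Upper bound on FLOP/s implied by a measured IPC, under the optimistic
// assumption that every executed instruction is an FMA.
double flopsBoundFromIpc(double ipc, double saturatingIpc, double peakFlops)
{
    assert(saturatingIpc > 0.0);
    double fraction = ipc / saturatingIpc;
    if (fraction > 1.0) fraction = 1.0;   // cannot exceed the FP32 peak
    return fraction * peakFlops;
}
```

With the measured ipc of 5, flopsBoundFromIpc(5.0, fmaWarpInstrPerCycle(192, 32), 4.2912e12) comes out at roughly 3.58 TFLOP/s. If this reasoning holds, an IPC of 5 cannot reach 4.3 TFLOP/s even when every instruction is an FMA.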
Does anybody have any code that can actually hit peak on either a K40 or a GTX Titan?