Achieving Peak Compute Performance on Kepler

lemonherb · March 18, 2014, 6:25am

Hi.
I have a question regarding getting the theoretical peak on a Kepler machine.
I’ve been trying to create a microbenchmark that can hit the specified peak of 4.3 TFLOP/s on a K40c.

The benchmark goes like this

global void
testKernel (int n, float *in, float *out)
{
int gid = threadIdx.x + blockIdx.x * blockDim.x;
shared float sm[512];

float r = (float) (gid & 0x00000003);
float s = (float) (gid & 0x000000C0);
float t = (float) (gid & 0x00000006);
float u = (float) (gid & 0x00000009);

r += r * 0.01f;
s += s * 0.02f;
t += t * 0.03f;
u += u * 0.04f;
… repeated ~256 times

sm[threadIdx.x] = r + s + t + u;
}

in the main function:

nThreads = 16777216;
tbSize = 512;
gridSize = (n + tbSize - 1) / tbSize;

clock_gettime (CLOCK_MONOTONIC_RAW, &tStart);
testKernel <<<gridSize, tbSize>>> (nThreads, d_In, d_Out);
cudaDeviceSynchronize ();
clock_gettime (CLOCK_MONOTONIC_RAW, &tEnd);

Now, I am only getting ~ 3.2 TFLOP/s instead of 4.3 as specified in the specs and I have no idea why.

nvprof says
achieved occupancy is 0.94
ipc is 5
stall_inst_fetch is 15%
stall_exec_dependency is 19%

Now, with this many threads, occupancy, and ILP, shouldn’t there be no stalls?
What should be the theoretical max IPC? How do we calculate this?
Does anybody have any code that can actually hit peak on either a K40 or a GTX Titan?

Thank you.

SPWorley · March 18, 2014, 5:02pm

It may be that Kepler can’t actually sustain a 4.3 TFlop fp32 throughput, likely because of register throughput limits. NVidia does not go into low level architecture details, but one model of the Kepler SMX consists of 4 “quads”, each with a scheduler and its own pool of registers. If that register file can only issue 3 registers per tick, then that’s just enough to feed 32 SP cores the 3 arguments needed for an FMADD but no more. With 4 quads, that’s only 128 SPs out of 192, giving you about 2/3 of the theoretical peak, matching your results.

Others have tried and failed to reach the 4.3 TF limit too.

scottgray · March 18, 2014, 7:19pm

This paper goes into some detail on this issue (check out table 2):

Lai now works for nvidia and if you dump the sass from the cublas lib you can find an implementation that uses all of the techniques that are discussed in that paper (plus a double buffering approach made available with the additional registers in sm35)… which leads me to believe he likely wrote it. Anyway, that cublas sgemm implemenation (sgemm_sm35_ldg_nt_128x8x128x16x16) is probably about as fast as anyone is going to be able to get Kepler to perform (with useful ffma instructions anyway). You can see there are even no-ops added to the code to prevent over saturation of the fp pipeline (presumably).

cuobjdump -sass -fun sgemm_sm35_ldg_nt_128x8x128x16x16 cublas_device.lib

With Maxwell, I’ve noticed it’s much easier to get to the theoretical ffma throughput (so long as you can wrestle with the quirks of ptxas and not have it senselessly allocate additional registers or have it completely reorganize carefully placed load and compute instructions designed to hide latency).

lemonherb · March 18, 2014, 8:10pm

Thank you for your replies.
At least it looks like I’m not missing something obvious.

Topic		Replies	Views
peak computational throughput CUDA Programming and Performance	3	902	December 24, 2015
Calculation of the maximum performance of the Tesla K40 GPU (general discussion) CUDA Programming and Performance	3	3709	January 15, 2015
Can't reach peak performance CUDA Programming and Performance	12	6584	January 26, 2012
What would happen with my program on Kepler with 1024 threads, 8 blocks, 32 reg max? CUDA Programming and Performance	16	4286	June 13, 2012
About GPU peak performance CUDA Programming and Performance	6	1596	August 29, 2023
SGEMM performance of current Kepler GPUs? CUDA Programming and Performance	14	4734	July 25, 2014
why the Tesla T4 peak performance test result mismatch with the official doc CUDA Programming and Performance	8	2468	October 19, 2019
Measuring FLOPS CUDA Programming and Performance	8	14165	January 19, 2010
Power throttling observed with GPU stress test that calls cublas CUDA Programming and Performance	11	1672	October 20, 2023
Optimizing cuBlas in kernels CUDA Programming and Performance	3	709	April 9, 2015

Achieving Peak Compute Performance on Kepler

Related topics