Can't reach peak performance

I’m trying to reach the peak performance of a single SM with the code below. The peak is somewhere around 25 GFLOPS (GTX 275, GT200 arch.), but this code gives at most 8 GFLOPS.

__global__ void new_ker(float *x)
{
    int index = threadIdx.x + blockIdx.x*blockDim.x;

    float a = x[index], b = 1.01f;  // initial values reconstructed; the originals were left uninitialized

    //No. of blocks = 1
    //Threads per block = 512 (I'm using GTX 275 - GT200 Arch.)
    #pragma unroll 2048
    for(int i = 0; i < LOOP; i++){  // LOOP is a compile-time trip-count macro
        a = a*b + b;                // one multiply-add per iteration (the a*c + b form discussed below)
    }

    x[index] = a;
}


I don’t want to increase ILP in the code. Any ideas why it’s not reaching peak?

Unroll less aggressively. 2048 instructions of 64 bits each are 16 kB of instructions, but the first-level instruction cache is only 4 kB. So try [font=“Courier New”]#pragma unroll 500[/font] or less.

Okay, done. But still no peak. Nice link by the way.

You don’t seem to be doing an FMA operation?

FMA is defined as:

a = a + b*c ;

You are doing:

a = a*c + b; (not FMA)

Unless I’m getting sleepy at the computer here :)

The FMA instruction as defined in the PTX manual is the 4-operand variety, where this distinction doesn’t matter. I assume this means the underlying hardware also implements FMA4, though I haven’t looked at cuobjdump output to check.

Actually, it is taken as FMAD.

Here’s a PTX snippet

    mov.f32     %f49, 0f43d3accd;       // 423.35

    mov.f32     %f50, 0f40228f5c;       // 2.54

    mov.f32     %f51, 0f43d3accd;       // 423.35

    mov.f32     %f52, 0f40228f5c;       // 2.54

    mov.f32     %f53, 0f43d3accd;       // 423.35

    mad.f32     %f54, %f53, %f1, %f52;

    mad.f32     %f55, %f51, %f54, %f50;

    mad.f32     %f56, %f49, %f55, %f48;

    mad.f32     %f57, %f47, %f56, %f46;

    mad.f32     %f58, %f45, %f57, %f44;

    mad.f32     %f59, %f43, %f58, %f42;

    mad.f32     %f60, %f41, %f59, %f40;

    mad.f32     %f61, %f39, %f60, %f38;

    mad.f32     %f62, %f37, %f61, %f36;

    mad.f32     %f63, %f35, %f62, %f34;

    mad.f32     %f64, %f33, %f63, %f32;

    mad.f32     %f65, %f31, %f64, %f30;

    mad.f32     %f66, %f29, %f65, %f28;

Actually, I’m performing Volkov’s experiments. I’m not entirely sure whether I’ve chosen the right parameters, e.g. LOOP, the unroll factor, etc.

Ah, you are right about the FMAD.

I recently wrote a performance test for the GF104 architecture, and there it was possible to reach 98% of peak performance with enough ILP. What happens if you increase the number of blocks (you are currently using only one)? Do you reach near-peak performance then?

Okay. If multiple blocks are used, peak can be reached (see this post). But I wanted the peak for a single SM, as in Vasily Volkov’s experiments. I’m not getting the same graph he got: higher-ILP versions of the code deviate from the peak.

Using the self-contained example code with the corrected FLOP/s calculation that was posted by talonmies in the stackoverflow thread, I get 99.7% of theoretical peak performance with just a single block on the 9400m in the notebook I’m currently writing on.

What operating system do you use? What happens if you run the executable multiple times in direct succession, i.e. with [font=“Courier New”]./a.out; ./a.out; ./a.out[/font] on Linux/Mac OS? You might be affected by power-saving modes that take a short moment to clock the GPU back up to full speed after being idle.

Yeah, I can achieve that. But the current problem is that the higher-ILP versions go beyond the peak at higher thread counts per block.

I use Fedora 15. I get almost the same execution times if I run it multiple times.

I suspect that the parameters are wrong. I’ve been varying them, but it isn’t looking good.

But that code you linked to isn’t running multiple blocks - it is running 1 block per multiprocessor (and the card in question has 30 multiprocessors, thus 30 blocks), which seems to be exactly what you are asking for.

Problem solved! The kernel code for the higher-ILP versions was not getting launched properly because of aggressive optimization by the compiler.