Can't reach peak performance

I’m trying to reach the peak performance of a single SM with the code below. The peak is somewhere around 25 GFLOPS (GTX 275, GT200 arch.), but this code gives at most 8 GFLOPS.

__global__ void new_ker(float *x)
{
    int index = threadIdx.x + blockIdx.x*blockDim.x;

    float a = x[index], b = 1.01f;  // initial values reconstructed; the originals were left uninitialized

    //No. of blocks = 1
    //Threads per block = 512 (I'm using GTX 275 - GT200 Arch.)
    #pragma unroll 2048
    for(int i = 0; i < LOOP; i++){  // LOOP is a compile-time trip-count macro
        a = a*b + b;                // one multiply-add per iteration (the a*c + b form discussed below)
    }

    x[index] = a;
}


I don’t want to increase ILP in the code. Any ideas why it’s not reaching peak?

Unroll less aggressively. 2048 instructions of 64 bits each are 16 kB of instructions, but the first-level instruction cache is only 4 kB. So try [font=“Courier New”]#pragma unroll 500[/font] or less.

Okay, done. But still no peak. Nice link by the way.

You don’t seem to be doing an FMA operation?

FMA is defined as:

a = a + b*c ;

You are doing:

a = a*c + b; (not FMA)

Unless I’m getting sleepy at the computer here :)

The FMA instruction as defined in the PTX manual is the 4-operand variety, where this distinction doesn’t matter. I assume this means the underlying hardware also implements FMA4, though I haven’t looked at cuobjdump output to check.

Actually, it is taken as FMAD.

Here’s a PTX snippet

    mov.f32     %f49, 0f43d3accd;       // 423.35

    mov.f32     %f50, 0f40228f5c;       // 2.54

    mov.f32     %f51, 0f43d3accd;       // 423.35

    mov.f32     %f52, 0f40228f5c;       // 2.54

    mov.f32     %f53, 0f43d3accd;       // 423.35

    mad.f32     %f54, %f53, %f1, %f52;

    mad.f32     %f55, %f51, %f54, %f50;

    mad.f32     %f56, %f49, %f55, %f48;

    mad.f32     %f57, %f47, %f56, %f46;

    mad.f32     %f58, %f45, %f57, %f44;

    mad.f32     %f59, %f43, %f58, %f42;

    mad.f32     %f60, %f41, %f59, %f40;

    mad.f32     %f61, %f39, %f60, %f38;

    mad.f32     %f62, %f37, %f61, %f36;

    mad.f32     %f63, %f35, %f62, %f34;

    mad.f32     %f64, %f33, %f63, %f32;

    mad.f32     %f65, %f31, %f64, %f30;

    mad.f32     %f66, %f29, %f65, %f28;

Actually, I’m performing Volkov’s experiments. I’m not entirely sure whether I’ve chosen the right parameters, e.g. LOOP, the unroll factor, etc.

Ah, you are right about the FMAD.

I recently wrote a performance test for the GF104 architecture, and there it was possible to reach 98% of peak performance with enough ILP. What happens if you increase the number of blocks (you are currently using only one)? Do you reach near-peak performance then?

Okay. If multiple blocks are used, peak can be reached (see this post). But I wanted the peak for a single SM, as in Vasily Volkov’s experiments. I’m not getting the same graph he got: higher-ILP versions of the code deviate from the peak.

Using the self-contained example code with the corrected FLOP/s calculation that was posted by talonmies in the stackoverflow thread, I get 99.7% of theoretical peak performance with just a single block on the 9400m in the notebook I’m currently writing on.

What operating system do you use? What happens if you run the executable multiple times in direct succession, i.e. with [font=“Courier New”]./a.out; ./a.out; ./a.out[/font] on Linux/Mac OS? You might be affected by power-saving modes that take a short moment to clock the GPU back up to full speed after being idle.

Yeah, I can achieve that. But the current problem is that the higher-ILP versions go beyond the peak at higher thread counts per block.

I use Fedora 15. I get almost the same execution times if I run it multiple times.

I suspect that the parameters are wrong. I’ve been varying them, but it isn’t looking good.

But that code you linked to isn’t running multiple blocks - it is running 1 block per multiprocessor (and the card in question has 30 multiprocessors, thus 30 blocks), which seems to be exactly what you are asking for.

Problem solved! The kernel code for the higher-ILP versions was not getting launched properly because of aggressive optimization by the compiler.