unable to get maximum multiply add throughput on Fermi CC 2.1

Hi. I have this piece of code to test if I can get the maximum multiply-add throughput (my real microbenchmark isn’t this naive), but can’t to get the 48 instructions/(clock cycle * #SM) that CC 2.1 devices are capable of. I’m only getting 416 Gops/s, which is about 2/3 of the peak. I’ve read how CC 2.1 devices schedule instructions, but don’t see any bottleneck. However, if I change all the multiply-adds to multiplies, I get 582 Gops/s, which is much closer to 632 Gops/s peak.

What could be gating the throughput?

__global__ void FloatMultiplyAddBench()
{
 float r[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
  
  #pragma unroll 8
  for (int i = 0; i < iterations; i += 16)
  {
    r[0] = r[15] * r[14] + r[13];
	r[1] = r[14] * r[13] + r[12];
	r[2] = r[13] * r[12] + r[11];
	r[3] = r[12] * r[11] + r[10];
	r[4] = r[11] * r[10] + r[9];
	r[5] = r[10] * r[9] + r[8];
	r[6] = r[9] * r[8] + r[7];
	r[7] = r[8] * r[7] + r[6];
	r[8] = r[7] * r[6] + r[5];
	r[9] = r[6] * r[5] + r[4];
	r[10] = r[5] * r[4] + r[3];
	r[11] = r[4] * r[3] + r[2];
	r[12] = r[3] * r[2] + r[1];
	r[13] = r[2] * r[1] + r[0];
	r[14] = r[1] * r[0] + r[15];
	r[15] = r[0] * r[15] + r[14];
  }

  // store values to prevent compiler from marking them as dead
  __shared__ float scratch;
  if (threadIdx.x == 0)
  {
    scratch = r[0] + r[1] + r[2] + r[3] + r[4] + r[5] + r[6] + r[7] + r[8] + r[9] + r[10] + r[11] + r[12] + r[13] + r[14] + r[15];
  }
}

Won’t there be read-after-write dependencies near the middle of that sequence?

Perhaps you can rotate the register indices a little to raise the possible ILP to 4:

0 C 8 4
1 D 9 5
2 E A 6
3 F B 7
4 0 C 8
5 1 D 9
6 2 E A
7 3 F B
8 4 0 C
9 5 1 D
A 6 2 E
B 7 3 F
C 8 4 0
D 9 5 1
E A 6 2
F B 7 3

Just a guess!

  • Won't there be read-after-write dependencies near the middle of that sequence
  • Good eye. I was careless with the register assignment. I tried your sequence and one with even a longer dependence, but the speed only goes from 416 to 440 Gops/s, which is only 70% of the theoretical peak.

    I tried the same code on a GTX 680 and it get much closer to the peak: 1398 Gops/s, which is 90% of the peak, which is good enough for me.

    I still would like to know why the GTX 560 (SM 2.1) isn’t getting that much. I’ve checked that the compiler didn’t do anything silly by looking at the PTX, and I also measured #instructions executed in the profiler which is within a few % of what I expect. The profiler also says the IPC is 2 (half warps/cycle). Assuming the ideal is 3 IPC (48 ops/cycle) that suggests the problem could be instructions aren’t being dual issued, which is definitely needed by SM2.1 to keep all 48 FPUs busy.

    The other thing I was thinking are register bank conflicts. I can’t find the post anymore, but I remember a few years ago, someone said it can happen but very rarely. I thought that might be what’s happening here since for SM2.1, there’s that awkward ratio of 48 ALUs to 2 warp schedulers, so if they didn’t allocate the registers right, there could be a scheduling conflict?

    Just a thought, but how are you measuring execution time? CUDA event timers are probably most accurate. But if the time delta is small, it’s easy to have quantization errors in timing itself.
    This could probably be tested by simply comparing speeds with say 1000, 2000, 4000, and 8000 blocks to make sure the timings are consistent.

    Thought #2 is to make sure to have a “warm up” repetition to make sure that the instruction cache is populated. Execute the kernel once before you start timing, then time a second invocation.

    Something similar to:

    kernel<<< ... >>>(); // warm up launch
    
        cudaEventRecord(startEvent,0);
        kernel<<< ... >>>(); // real bench
        cudaEventRecord(stopEvent,0);
        cudaEventSynchronize(stopEvent);
    
        cudaEventElapsedTime(&ms, startEvent, stopEvent);
    

    Thought #3: That final “use the registers or the compiler will optimize them away” is obviously necessary. But that evaluation inside the final statement may be needlessly expensive (perhaps trivial depending how many main body iterations you have, but in benching, you never know.)
    The trick I like to use is to use not a “threadIdx.x==0” test but a nonsense test that always fails but the compiler doesn’t know any better, like threadIdx.x==999999 or if r[0]==0.37363636

    @Uncle Joe, I spent a lot of time last year tuning a compute-intensive kernel on both sm_20 and sm_21 Fermi devices and… I never saw the missing 1/3rd performance gain! The results were nearly identical on both SM types despite my thinking that it was a good candidate for sm_21’s improved throughput. So I’ve been puzzled by this as well.

    One thing that I did notice and welcomed was that compiling for sm_21 produced ever so slightly better performance on an sm_20 device. :)

    OK, I’ve got the GTX 560 to get 604 Gops/s (96%) - very good, but the register access sequence is practically useless (see below). For now, I conclude the register file bandwidth is the bottleneck.

    SPWorley, that’s pretty clever to use threadIdx.x == 999999. I changed my check to threadIdx.y == 1, which should have the same effect. I’m using CUDA events to do timing as you should. My execution configuration is 128 blocks and 256 threads/block and I was already using a warm up call.

    float r[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    #pragma unroll 8
      for (int i = 0; i < iterations; i += 16)
      {
        // 604 Gops/s
        r[0] = r[0] * r[0] + r[0];
        r[1] = r[1] * r[1] + r[1];
        r[2] = r[2] * r[2] + r[2];
        r[3] = r[3] * r[3] + r[3];
        r[4] = r[4] * r[4] + r[4];
        r[5] = r[5] * r[5] + r[5];
        r[6] = r[6] * r[6] + r[6];
        r[7] = r[7] * r[7] + r[7];
        r[8] = r[8] * r[8] + r[8];
        r[9] = r[9] * r[9] + r[9];
        r[10] = r[10] * r[10] + r[10];
        r[11] = r[11] * r[11] + r[11];
        r[12] = r[12] * r[12] + r[12];
        r[13] = r[13] * r[13] + r[13];
        r[14] = r[14] * r[14] + r[14];
        r[15] = r[15] * r[15] + r[15];
      }
    
      // store values to prevent compiler from marking them as dead
      __shared__ float scratch;
      if (threadIdx.y == 1)
      {
        // don't use a loop to index r or else data will no longer be in registers
        scratch = r[0] + r[1] + r[2] + r[3] + r[4] + r[5] + r[6] + r[7] + r[8] + r[9] + r[10] + r[11] + r[12] + r[13] + r[14] + r[15];
      }
    

    Excellent! So I wonder when this peak might ever be reached in practice?

    Maybe 3-register ops (vs. FFMA’s max of 4) might still achieve the peak? Or an FFMA that only uses 3 or 2 different registers? e.g. a=a*b+c