How to get more Gflops? :)

Hi mates!

I reckon that most of you got interested in CUDA because of its computing capabilities. I have a GeForce 8600 GT, and I found that the theoretical maximum of my card is 114 GFLOPS - so I am trying to climb that mountain :)!

I have already written a simple program, but unfortunately I “reached” only 10 GFLOPS… what a shame, my goodness!!


This is the kernel of the program (the whole program is attached):

// Block & thread indexes
int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;

// Index in A of the first element processed by each block
int aBegin = BLOCK_SIZE * bx + (by * wA);

// First thread in block records the start time
if (tx == 0)
    timer[bx] = clock();

float Csub = 0;

// Declaration of the shared memory vector As
// (AS(i, j) is assumed to be the usual SDK-style macro, #define AS(i, j) As[i][j],
//  defined in the attached full program)
__shared__ float As[BLOCK_SIZE][1];

// Load the block from device memory to shared memory
AS(ty, tx) = A[aBegin + wA * ty + tx];

// Synchronize to make sure the data is loaded
__syncthreads();

// Computations
for (int k = 0; k < 1000; k++)
    Csub += AS(ty, tx) + AS(ty, tx);

// Synchronize to make sure that the preceding computation is done
__syncthreads();

// Write the result to device memory
C[BLOCK_SIZE * bx + (by * wA) + tx] = Csub;

// Last thread in block records the stop time
if (tx == BLOCK_SIZE - 1)
    timer[bx + gridDim.x] = clock();

PLEASE SEND ME ANY SUGGESTIONS ON HOW TO CHANGE THE CODE TO GET MORE FLOPS!!

[1] How do you measure the flops? I do it this way:
The vector AS has 4194304 elements and the loop runs for 1000 iterations, which makes 4194304000 operations. The measured time is about 410 ms, so the simple calculation is (spelled out in the sketch below):
4194304000 operations / 410 ms ≈ 10 GFLOPS

  • is there something wrong in my thinking?
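For reference, here is that arithmetic as a tiny host-side sketch (the numbers are the ones quoted above; nothing CUDA-specific in it):

    #include <stdio.h>

    int main(void)
    {
        const double elements   = 4194304.0;   // size of vector AS
        const double iterations = 1000.0;      // inner-loop trip count
        const double time_s     = 0.410;       // measured kernel time, 410 ms

        const double ops    = elements * iterations;   // 4.194304e9 operations
        const double gflops = ops / time_s / 1e9;      // ~10.2 GFLOPS

        printf("%.1f GFLOPS\n", gflops);
        return 0;
    }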

I would be grateful as a swan for any piece of advice :))

Yunior

Compile to PTX and study the output.

Obviously the compiler is doing something stupid in your inner loop, probably fetching from the array every iteration instead of keeping the value in a register. Also see if unrolling helps (the compiler never unrolls loops by itself).
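For example, dump the intermediate code with nvcc -ptx yourfile.cu and look at the inner loop. A rough sketch of what a hand-unrolled version of the loop from the first post could look like (the register name val is introduced here purely for illustration):

    // Read the operand once into a register, then unroll the inner loop by hand.
    float val  = AS(ty, tx);              // one shared-memory read, kept in a register
    float Csub = 0.0f;

    for (int k = 0; k < 1000; k += 4)     // unrolled by 4; 1000 is divisible by 4
    {
        Csub += val + val;
        Csub += val + val;
        Csub += val + val;
        Csub += val + val;
    }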

You’ll need to unroll the loop to get close to peak performance. The following code gets about 315 GFlops on my machine:

/*
 * Copyright 1993-2007 NVIDIA Corporation.  All rights reserved.
 *
 * NOTICE TO USER:
 *
 * This source code is subject to NVIDIA ownership rights under U.S. and
 * international Copyright laws.  Users and possessors of this source code
 * are hereby granted a nonexclusive, royalty-free license to use this code
 * in individual and commercial software.
 *
 * NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
 * CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
 * IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH
 * REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
 * IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
 * OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
 * OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
 * OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE
 * OR PERFORMANCE OF THIS SOURCE CODE.
 *
 * U.S. Government End Users.  This source code is a "commercial item" as
 * that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting of
 * "commercial computer software" and "commercial computer software
 * documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995)
 * and is provided to the U.S. Government only as a commercial end item.
 * Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through
 * 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
 * source code with only those rights set forth herein.
 *
 * Any use of this source code in individual and commercial software must
 * include, in the user documentation and internal comments to the code,
 * the above Disclaimer and U.S. Government End Users Notice.
 */

/*
    This sample is intended to measure the peak computation rate of the GPU in GFLOPS
    (giga floating point operations per second).

    It executes a large number of multiply-add operations, writing the results to
    shared memory. The loop is unrolled for maximum performance.

    Depending on the compiler and hardware it might not take advantage of all the
    computational resources of the GPU, so treat the results produced by this code
    with some caution.
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <cutil.h>

#define NUM_SMS (16)
#define NUM_THREADS_PER_SM (384)
#define NUM_THREADS_PER_BLOCK (192)
#define NUM_BLOCKS ((NUM_THREADS_PER_SM / NUM_THREADS_PER_BLOCK) * NUM_SMS)
#define NUM_ITERATIONS 10

// 128 MAD instructions (eight copies of a 16-MAD helper macro)
#define FMAD16(a, b) \
      a = b * a + b; \
      b = a * b + a; \
      a = b * a + b; \
      b = a * b + a; \
      a = b * a + b; \
      b = a * b + a; \
      a = b * a + b; \
      b = a * b + a; \
      a = b * a + b; \
      b = a * b + a; \
      a = b * a + b; \
      b = a * b + a; \
      a = b * a + b; \
      b = a * b + a; \
      a = b * a + b; \
      b = a * b + a;

#define FMAD128(a, b) \
      FMAD16(a, b) \
      FMAD16(a, b) \
      FMAD16(a, b) \
      FMAD16(a, b) \
      FMAD16(a, b) \
      FMAD16(a, b) \
      FMAD16(a, b) \
      FMAD16(a, b)

__shared__ float result[NUM_THREADS_PER_BLOCK];

__global__ void gflops()
{
    float a = result[threadIdx.x];  // this ensures the mads don't get compiled out
    float b = 1.01f;

    for (int i = 0; i < NUM_ITERATIONS; i++)
    {
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
        FMAD128(a, b);
    }

    result[threadIdx.x] = a + b;
}

int main(int argc, char** argv)
{
    CUT_DEVICE_INIT();

    // warmup
    gflops<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();
    CUDA_SAFE_CALL( cudaThreadSynchronize() );

    // execute kernel
    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer( &timer));
    CUT_SAFE_CALL( cutStartTimer( timer));

    gflops<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();

    CUDA_SAFE_CALL( cudaThreadSynchronize() );
    CUT_SAFE_CALL( cutStopTimer( timer));
    float time = cutGetTimerValue( timer);

    // output results
    printf( "Time: %f (ms)\n", time);

    // 128 MADs per FMAD128, 2 flops per MAD, 16 FMAD128 calls per loop iteration
    const int flops = 128 * 2 * 16 * NUM_ITERATIONS * NUM_BLOCKS * NUM_THREADS_PER_BLOCK;
    printf("Gflops: %f\n", (flops / (time / 1000.0f)) / 1e9 );

    CUT_SAFE_CALL( cutDeleteTimer( timer));
    CUT_EXIT(argc, argv);
}

What about the fact that whenever I run THE SAME PROGRAM, with the same instructions, the kernel execution time is different every time?

Thx,

Y.

Simon, you are a star!

Thank you for your reply. I decided to strike while the iron is hot and ask you
a couple of questions.

[1] Do I understand correctly that SM stands for Single Multiprocessor,
and SMS for Single MultiprocessorS?

[2] Can you tell me what kind of machine you work on?

(As I said, I am working on a GeForce 8600 - 2 multiprocessors - and
when I execute your program I get 74 GFLOPS.)

[3] I reckon that 74 GFLOPS can still be improved if I work on the
parameters. Can you tell me how you worked out:
- NUM_THREADS_PER_SM (384)
- NUM_THREADS_PER_BLOCK (192)
- the reasoning behind computing NUM_BLOCKS as
  ((NUM_THREADS_PER_SM / NUM_THREADS_PER_BLOCK) * NUM_SMS)?
  Is it some proven way to calculate it?

My '3 weeks of experience' says that my card works most efficiently with
NUM_THREADS_PER_BLOCK = 128.

So please tell me how you work out those parameters.

[4] I repeat the question, you may call it a 'stupid one',
but why does a program with the same number of instructions take a
different amount of time each run? I am surprised, because each instruction
takes its own specific number of cycles, so by that reasoning
it should always take the same amount of time, shouldn't it?

Thank you very much for reply,

Yunior

A small note on [3]: The optimal number of threads per block will vary from kernel to kernel (for a variety of reasons) and possibly even from dataset to dataset (for reasons of the memory access pattern used). The only way to really know what is optimal is to benchmark every time.

As for [4]: Are you measuring the time of a single kernel call? Then what you are seeing is noise from trying to measure a tiny amount of time. You should run enough kernel calls to chalk up 10 seconds of data and take the average. Another reason why you would see different times is that the very first call to the kernel will cause it to be compiled and uploaded to the device, incurring a huge overhead. You should call the kernel once to “warm it up” and then make the calls you time.
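For what it's worth, here is a minimal sketch of that measurement pattern, reusing the cutil timer calls from the code posted earlier in this thread (nLaunches is just a name picked here):

    // Warm up once (first-launch overhead), then time many launches and average.
    gflops<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();              // warmup
    CUDA_SAFE_CALL( cudaThreadSynchronize() );

    const int nLaunches = 100;
    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer(&timer) );
    CUT_SAFE_CALL( cutStartTimer(timer) );

    for (int i = 0; i < nLaunches; i++)
        gflops<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();
    CUDA_SAFE_CALL( cudaThreadSynchronize() );                    // wait for all launches

    CUT_SAFE_CALL( cutStopTimer(timer) );
    float avgMs = cutGetTimerValue(timer) / nLaunches;            // average time per launch
    CUT_SAFE_CALL( cutDeleteTimer(timer) );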

MisterAnderson42, thank you for the reply and the advice - I had never warmed up a kernel :) before an action, and it really works!

Maybe you would like to help with another thing:

If I have a statement like this (it is from the program that I attached at the beginning of this discussion):

C[BLOCK_SIZE * bx + (by * wA) + tx] = Csub;

which is more efficient:

(a) Should I write “BLOCK_SIZE * bx + (by * wA) + tx” everywhere I use it as an index into the matrix, or

(b) should I create an extra variable:

int index = BLOCK_SIZE * bx + (by * wA) + tx;

and use it as C[index]?

I know you may say that it is not important, but if you run about 100000 iterations and it must be computed 100000 times, it may have an effect.

ttfn,

Yunior

You’re welcome, glad I could help. About your latest question, there is no substitute for experimentation :) But I will add this comment: if you write it in the form of (a), the compiler is capable of changing it into (b) for you. It’s called common subexpression elimination (CSE). Many people on this forum will disagree, but I suggest you trust the compiler to do these kinds of things for you. In my code, I spent a few days trying a number of things to tinker with the compiler and try to get faster code. In most cases, the original code was the best. And in the few cases where I was able to increase performance a little bit, the code was turned into a big nasty mess that was hard to work with.

Unfortunately, every rule has its exceptions. You DO need to make sure that you don’t read values from global memory (or via a texture fetch) more than once. The CSE in the compiler doesn’t seem to optimize those away (probably because the global memory value could have changed between reads). And then there is the exception of loop unrolling, which you have already been exposed to in this thread.
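To make both points concrete, a small sketch (g_data, D and x are names made up here purely for illustration):

    // (1) Index arithmetic: writing the same expression twice is fine, the
    //     compiler's CSE will compute it once anyway.
    C[BLOCK_SIZE * bx + (by * wA) + tx] = Csub;
    D[BLOCK_SIZE * bx + (by * wA) + tx] = Csub;   // same index expression, no penalty

    // (2) Global memory reads: do NOT rely on CSE here; read once into a register.
    float x = g_data[tx];          // one global read, kept in a register
    Csub += x * x + x;             // reuse the register instead of re-reading g_data[tx]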

The architects here sometimes use the terminology SM (streaming multiprocessor) and SP (stream processor), but in the CUDA documentation we typically just use “multiprocessor” and “processor”.

The numbers I gave were on a GeForce 8800 GTX (16 multiprocessors).

The GeForce 8600 has 4 multiprocessors, not 2 (essentially 1/4 of an 8800), so 74 GFLOPS sounds about right:

4 multiprocessors * 8 processors/multiprocessor * 2 flops/cycle * 1.19GHz = 76GFlops.
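If you want to compute that estimate programmatically, a sketch along these lines should work (cudaDeviceProp reports clockRate in kHz; the multiProcessorCount field may not be available in the very earliest toolkits, in which case hard-code it):

    cudaDeviceProp prop;
    CUDA_SAFE_CALL( cudaGetDeviceProperties(&prop, 0) );

    // 8 scalar processors per multiprocessor, 2 flops per MAD per clock
    double peakGflops = prop.multiProcessorCount * 8.0 * 2.0
                      * (prop.clockRate * 1e3) / 1e9;
    printf("Estimated peak: %.1f GFLOPS\n", peakGflops);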

You’re not going to get much better than 74GFlops, but see the programming guide and the occupancy calculator documentation on how to tune the number of threads per block.

As “MisterAnderson” has been trying to explain to you, there is always some variation when timing - make sure to “warm up” your kernel and time a large number of iterations and take the average.

The really last 2 questions on this topic:

[1] I insist that I have 2 multiprocessors, because when I run the deviceQuery program it shows me that I have 2 multiprocessors; what is more, there are 2 multiprocessors listed in the row for the GeForce 8600 GT in the programming guide, in the table of all cards that support CUDA.

Where is the mistake in my thinking?

[2] In the occupancy calculator, if I choose the GPU G84 (because I know that this is the type of my GPU), it computes everything for 4 multiprocessors as well…??

I reckon that this question is connected with [1], but I keep saying that the GeForce 8600 GT has 2 multiprocessors - and if that is true, the occupancy calculator will not help me… .

Where is the mistake in my thinking?

All best,

Yunior

The mistake (confirmed by NVIDIA guys) is in the Programming Guide.

The 8600 GT and 8600 GTS both have 4 multiprocessors. NVIDIA’s product page says exactly this.

BTW, deviceQuery says nothing about the number of multiprocessors. Where did you find it in its output?

The issue of 4 vs 2 multiprocessors is borne of something between lie, marketing, and irrelevance. In fact, the 8800 GTX has 8 multiprocessors, not 16. Architectural charts of the G80 posted on various tech sites show that each of the 8 multiprocessors contains 16 scalar processors. If you remember, various parts of the Programming Guide also mention that instructions get executed in half-warps of 16 threads. However, NVIDIA claims 16 MPs in various pieces of marketing, probably because that was the number of shader pipelines in previous-gen parts. NVIDIA can’t be seen halving its pipelines, can they? Of course not. Even though the new multiprocessors are twice as wide as the old shaders.

This thread actually brings us to another piece of NVIDIA FUD. 315 and 74 GFLOPS are actually very, very impressive numbers that hit true theoretical performance to within a few percent. Yet the advertising claims figures 50% higher. It turns out that the extra flops are only available with some sort of special fetch (or something) and aren’t accessible from CUDA (yet?). Go figure.

OK - I am a big wally!! I do not know why I kept thinking that I have 2 multiprocessors (maybe because the NVIDIA programming guide said it, and I have treated it as a bible :) ever since I started working with CUDA).

As for deviceQuery, it was a ‘bloody mirage’ - I was more than sure that it said 2 multiprocessors.

Thank you for the helping hand and for leading me down the right path: Simon, AndreiB, MisterAnderson42, alex_dubinsky.

Yunior

It’s nothing magical. I believe it is stated clearly in the documentation (been a while since I looked) that the extra GFLOPS come from the texture interpolation, which is accessible from CUDA. If someone could design an ideal algorithm with the perfect balance of interpolated texture reads and FLOPs, I would guess that the 500GFLOPs theoretical number could be reached. Also note that this number is used extensively in marketing, but isn’t overly pushed by the guide or in the forums. If you look at figure 1-1 in the programming guide, the graph clearly shows “only” 340 GFLOPs.

The issue of the number of multiprocessors in the device can be tested empirically. I’ve done it indirectly. Just run performance tests on a kernel that performs a fair amount of computation (at least > 0.1ms in a single block) and plot the kernel execution time vs. number of blocks executed. The time will “stair-step” up when more multiprocessors are activated. The numbers in my performance tests are consistent with 16 multiprocessors on the 8800 GTX and the number of blocks per multiprocessor from the occupancy calculator.
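A rough sketch of that stair-step test, reusing the gflops kernel and cutil timer calls from earlier in this thread (maxBlocks is just a name picked here):

    const int maxBlocks = 64;

    for (int blocks = 1; blocks <= maxBlocks; blocks++)
    {
        gflops<<<blocks, NUM_THREADS_PER_BLOCK>>>();              // warmup at this size
        CUDA_SAFE_CALL( cudaThreadSynchronize() );

        unsigned int timer = 0;
        CUT_SAFE_CALL( cutCreateTimer(&timer) );
        CUT_SAFE_CALL( cutStartTimer(timer) );
        gflops<<<blocks, NUM_THREADS_PER_BLOCK>>>();
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        CUT_SAFE_CALL( cutStopTimer(timer) );

        // Plot time vs. blocks: the curve steps up each time another
        // multiprocessor's worth of blocks is added.
        printf("%d blocks: %f ms\n", blocks, cutGetTimerValue(timer));
        CUT_SAFE_CALL( cutDeleteTimer(timer) );
    }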

Counting interpolation as flops would have been really dirty. Actually, it is some sort of new programmable multiply that can occur in line with interpolation (see ref.). However, it is not accessible from DX/OGL, and I’m pretty certain not from CUDA either. Come to think of it, counting plain bilinear, which is accessible, would have been less dirty.

The number of ALUs per multiprocessor is 8x2. I’m not sure what the real practical significance of this is, but that is the grouping which has its own register file and probably the grouping that shares SRAM and other resources. From a “pure” point of view, that is a multiprocessor, even if it may not matter. Also, I can’t say I know why it’s 8x2 instead of 16 (it’s probably related to your result). But I think that fact is the explanation for the “typo” in the Guide.

Source:

http://www.beyond3d.com/content/reviews/1/11

The author doesn’t list his references, but in a thread he pointedly claims he has ‘confirmed’ the points we’re discussing:

http://forum.beyond3d.com/showthread.php?t=37228 (post #5)

Why would they build in a new, programmable multiply and not use it in either CUDA or OpenGL/DX? It sounds quite redundant then - maybe it’s broken in the first place?

I have a question… :)

Is there any connection between the number of multiprocessors and the number of times we unroll the loop to get the best performance?
For example, Simon has 16 multiprocessors and he used FMAD128 16 times?

Thx

Y.

There isn’t a connection. You should always unroll loops in CUDA as much as you possibly can, as the compiler will not do it.

:’( Yeah, loop unrolling is fine if you like to get crazy with the code, but for complex projects it is practically unfeasible. The point is, why can’t we get a cleverer compiler instead of wasting our time doing this by hand? I noticed that loop unrolling flags do not work for other compilers either (gcc etc.). Why do programmers add AI capabilities to games but not to serious stuff like compilers?

I have a macro processor to unroll loops…
http://forums.nvidia.com/index.php?showtopic=42636
We could do simple stuff like this ourselves and not have to rely on the NVIDIA compiler group.

BTW: that thing enabled me to use unrolled loops and the driver API in my moderately big project, improving performance by 10% or so.
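For simple cases you do not even need a separate tool; a couple of nested preprocessor macros in the spirit of the FMAD128 macro above go a long way (REPEAT2/4/8/16 are just names made up here):

    // Minimal repeat-macro sketch for unrolling a fixed number of statements.
    #define REPEAT2(x)   x x
    #define REPEAT4(x)   REPEAT2(x)  REPEAT2(x)
    #define REPEAT8(x)   REPEAT4(x)  REPEAT4(x)
    #define REPEAT16(x)  REPEAT8(x)  REPEAT8(x)

    // Example use inside a kernel (a and b are live float registers):
    //     REPEAT16( a = b * a + b; b = a * b + a; )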