Unable to reach full fp32 throughput on Titan X

Hi,
I have written this benchmark and couldn't get the full fp32 throughput mentioned here: GPU Archives | NVIDIA Blog

#include <iostream>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define USECPSEC 1000000ULL
#define ITER 1024*1024*512
#define SSZ 512
#define BSZ 7*16

unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__global__ void DoFAdd(float *in1d, float* outd) {
    int tx = threadIdx.x;
    float in1 = in1d[tx];
    float out = outd[tx];
    for (int i = 0; i < ITER; i++) {
      out += in1;
    }
    outd[tx] = out;
}


int main() {
    float *in1d,  *outd;
    cudaMalloc((void**)&in1d, SSZ*4);
    cudaMalloc((void**)&outd, SSZ*4);
    unsigned long long dt = dtime_usec(0);
    DoFAdd <<< dim3(BSZ), dim3(SSZ) >>>(in1d, outd);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    unsigned long long ops = ITER;
    ops *= BSZ;
    ops *= SSZ;
    float et = dt/(float)USECPSEC;
    unsigned long long Mops = ops/1000000;
    std::cout<<et<<"s for "<< Mops << " FAdds"<<std::endl;
    float tp = (Mops)/(et*1000000);
    std::cout << "Throughput: " << tp << " Tops/s" << std::endl;
}

What am I doing wrong?

Approaching the full published throughput requires using floating-point multiply-add operations.

A loop that just does add operations won't get you closer than about 50% of peak: FADD and FFMA issue at the same rate, but each FMA counts as two floating-point operations.

This exercise (especially if focused on real-world use, as opposed to a pure synthetic benchmark) generally takes considerable coding effort and skill. That effort and skill have already been packaged up for you in the cuBLAS SGEMM library call (as one example). I recommend that you experiment with that.
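
Something along these lines should get you most of the way there (a rough sketch, untested; the matrix size n = 8192 and the CUDA-event timing are my choices, not anything from this thread):

#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 8192;  // large enough to keep the GPU busy; ~800 MB total
    float *A, *B, *C;
    // matrices are left uninitialized; fine for a pure throughput test
    cudaMalloc((void**)&A, (size_t)n * n * sizeof(float));
    cudaMalloc((void**)&B, (size_t)n * n * sizeof(float));
    cudaMalloc((void**)&C, (size_t)n * n * sizeof(float));
    const float alpha = 1.0f, beta = 0.0f;
    cublasHandle_t handle;
    cublasCreate(&handle);
    // warm-up call so the timed run excludes library initialization
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // an n x n x n SGEMM performs ~2*n^3 floating-point operations
    double tflops = 2.0 * n * n * n / (ms * 1e-3) / 1e12;
    printf("SGEMM: %.3f ms, %.2f TFLOPS\n", ms, tflops);
    cublasDestroy(handle);
    return 0;
}

Compile with something like nvcc -o sgemm_bench sgemm_bench.cu -lcublas. Since a large SGEMM performs about 2*n^3 flops, a well-tuned SGEMM should land within striking distance of the FMA peak.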

If you want to learn the necessary techniques yourself, I would suggest perhaps starting with Scott Gray’s treatment here:

https://github.com/NervanaSystems/maxas/wiki/SGEMM

You have a low ratio of arithmetic operations to control-flow instructions in the for loop. Try an explicit #pragma unroll 1024

And performing arithmetic that maps to a hardware FMA instruction would also help a lot.
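
For illustration, here is roughly how the inner loop might be reshaped along those lines (just a sketch; in2d is a hypothetical second input array so the add becomes a multiply-add, and ITER is the define from the original post):

__global__ void DoFMAdd(const float *in1d, const float *in2d, float *outd) {
    int tx = threadIdx.x;
    float in1 = in1d[tx];
    float in2 = in2d[tx];          // hypothetical second operand
    float out = outd[tx];
#pragma unroll 1024                // amortize loop overhead over many iterations
    for (int i = 0; i < ITER; i++) {
        out = in1 * in2 + out;     // contracts to a single FFMA instruction
    }
    outd[tx] = out;
}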

Christian

Hmmm… Interesting. As a test I ran SHOC on the Titan X Pascal; it shows the same numbers as my benchmark for plain adds, and peak for FMAs.

@txbob thanks for the link. Can you explain why FMAs are used to measure throughput?

@cbuchner1 I tried the unroll and it made the performance number worse (just wanted you to know). I guess it's because of register/instruction pressure from the unrolled loop (guessing from what I picked up in Scott Gray's blog).

GFLOPS is about operations (the ‘O’ in GFLOPS). One FMA instruction comprises two floating-point operations (an addition plus a multiplication). That is why peak throughput numbers are based on executing FMAs, and as few other instructions as possible.

Since modern GPUs have multi-issue capabilities, you would also want to compute with at least two independent dependency chains per thread, each consisting entirely of FMAs (you can merge the two chains with the very last FMA in the sequence, just before you write out the final result).
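
In kernel form, that suggestion looks roughly like this (a bare-bones sketch; essentially the structure of the code further down this thread, with a and b as arbitrary loop-invariant coefficients):

__global__ void TwoChains(int iter, float a, float b, float x, float y, float *res) {
    for (int i = 0; i < iter; i++) {
        x = a * x + b;     // chain 1: one FFMA per iteration
        y = b * y + a;     // chain 2: independent of chain 1, can overlap
    }
    *res = a * x + y;      // merge the two chains with one final FMA
}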

Unless I have made a mistake, this useless code demonstrates 9.8 TFLOPS measured throughput against a theoretical peak of 11 TFLOPS:

$ cat t52.cu
#include <iostream>
#include <stdio.h>

#define ITER 1024
#define SSZ 512
#define BSZ 64


#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__global__ void DoFMA() {
    __shared__ float data1[SSZ];
    __shared__ float data2[SSZ];
    float out = threadIdx.x+1;
#pragma unroll 1024
    for (int i = 0; i < ITER; i++)
      out += data1[threadIdx.x]*data2[threadIdx.x] ;
    if (!out) printf("out = %f\n", out);  // keep out live so the loop isn't optimized away
}


int main() {
    DoFMA<<<BSZ, SSZ>>>();  // warm up
    unsigned long long dt = dtime_usec(0);
    DoFMA<<<BSZ*ITER, SSZ>>>();
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    unsigned long long ops = ITER;
    ops *= ITER;
    ops *= SSZ;
    ops *= BSZ;
    ops *= 2;
    float et = dt/(float)USECPSEC;
    std::cout << "throughput: " << ops/et << " ops/s" << std::endl;
}
$ nvcc -arch=sm_61 -o t52 t52.cu
$ ./t52
throughput: 9.84661e+12 ops/s
$

I have not looked at the generated code, but I think as written this might bottleneck on shared memory throughput. Try making one factor a constant memory reference, e.g. by using a kernel argument. Also, using two parallel dot products per thread should help.

I was thinking the compiler should optimize the shared memory reads into register reads.

I was actually thinking that a quickly whipped-up code demonstrating 90% of theoretical peak was sufficient to validate the truth claim.

If I can write trivial code that demonstrates a measured 9.8 TFLOPS, then the claim of an 11 TFLOPS theoretical peak does not seem outlandish (to me).

The program above doesn’t get close to 90% of peak on my Maxwell-based K2200. The code below gets me pretty close to peak, with measurement noise of less than 0.1%. No idea how it will fare on a Titan X.

#include <stdio.h>
#include <stdlib.h>

#define REPEAT    5
#define ITER      512
#define THREADS   128
#define BLOCKS  65520

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void DoFMA (float a, float b, float x, float y, float *res) {
#define OPS_PER_FMA    2
#define FMAS_PER_ITER  2
#define EXTRA_OPS      1
#pragma unroll ITER
    for (int i = 0; i < ITER; i++) {
        x = a * x + b;
        y = b * y + a;
    }
    *res = x + y;
}


int main() 
{
    float res, *res_d = 0;
    double start, stop, elapsed, min_elapsed = 1e308;
    cudaMalloc ((void **)&res_d, sizeof (*res_d));
    cudaDeviceSynchronize (); // ensure GPU is idle before timed portion
    for (int k = 0; k < REPEAT; k++) {
        start = second();
        DoFMA<<<BLOCKS, THREADS>>>(1.125f, 1.125f, 3.0f, 5.0f, res_d);
        cudaDeviceSynchronize (); // wait until kernel has finished
        stop = second ();
        elapsed = stop - start;
        if (elapsed < min_elapsed) min_elapsed = elapsed;
    }
    printf ("elapsed = %.6f seconds\n", min_elapsed);
    unsigned long long ops = ITER * FMAS_PER_ITER * OPS_PER_FMA + EXTRA_OPS;
    ops *= BLOCKS;
    ops *= THREADS;
    printf ("throughput = %10.2f GFLOPS\n", ops / min_elapsed /1e9);
    cudaMemcpy (&res, res_d, sizeof(res), cudaMemcpyDeviceToHost);
    cudaFree (res_d);
#if DEBUG
    printf ("res = %15.8e\n", res);
#endif
    return EXIT_SUCCESS;
}

Norbert, your code also gives full throughput on my GM206 GTX 960 in Linux.

Surprisingly, Linux nvcc 8.0.44 complains about the #pragma line with an error about ITER being undefined. That seems like an nvcc macro expansion bug! Substituting the explicit 512 fixes it. Note that the for loop itself uses ITER and works fine.

Funny, the #pragma works just fine for me with CUDA 7.5 on Windows. I am too lazy to dig out the C++ standard right now. There are multiple phases in the preprocessor, and I do not know off-hand what the required order of those phases is, that is, whether ITER must be expanded before the #pragma is consumed. [Later:] The ISO C++11 standard, section 16.6, states that #pragma is followed by pp-tokens, which I read to mean that macro expansion should take place. This is contradicted by some online sources that state that macro expansion does not take place in a #pragma directive.

The program reports 1377 GFLOPS on my Quadro K2200 GPU at the maximum core clock of 1124 MHz, which is about 96% of the theoretical maximum of 1440 GFLOPS. (Various online sources list this GPU at 1339 GFLOPS; not sure why I computed a different number. Maybe the maximum core clock is different on my model?)

Huh, you learn something new every day. Google found this post, which references section 6.10.9 of the C99 standard and section 16.9 of the C++0x standard; those say that #pragma is not macro-expanded, but that the equivalent _Pragma("argument") expression expands into such a #pragma. I had never even heard of the second form.

But if I try that in nvcc, using the ## stringify macro expansion and implicit string concatenation like:

_Pragma("unroll " ##ITER)

nvcc complains that _Pragma needs a string literal, so it looks as if the macro stringification was not evaluated before _Pragma.

Yes, this is a side topic to the FMA throughput question, so I'll leave off the discussion, but I wonder what the right behavior should be (is nvcc in error, or, more likely, is my understanding wrong?). And then, of course, what is the proper way to use #pragma unroll with a #defined unroll count?
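
One workaround I have seen for that last question uses two levels of macros, so the count is expanded before it is stringified (a sketch; I have not verified it against nvcc 8.0.44):

#define ITER 512

// two-level expansion: STR(unroll ITER) -> STR_IMPL(unroll 512) -> "unroll 512"
#define STR_IMPL(x) #x
#define STR(x) STR_IMPL(x)
#define PRAGMA_UNROLL(n) _Pragma(STR(unroll n))

__global__ void DoFMA (float a, float b, float x, float y, float *res) {
    PRAGMA_UNROLL(ITER)    // expands to _Pragma("unroll 512")
    for (int i = 0; i < ITER; i++) {
        x = a * x + b;
        y = b * y + a;
    }
    *res = x + y;
}

The key point is that STR's argument is macro-expanded before STR_IMPL's # operator turns it into the string literal that _Pragma requires.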

On my Titan X Pascal, I get 9989 GFLOPS with Norbert's code (vs. 9846 with mine).

What is the core frequency while the test is running? Since this is a short-running kernel with only few repetitions, the full boost clock may never kick in (I had to force it even on my Quadro). The Titan XP is listed as providing 10157 GFLOPS at nominal core clock, 10974 GFLOPS at maximum (NVIDIA-defined) boost clock. At least we are in the ballpark of that, as you point out.
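
For reference, those published numbers follow directly from core count × 2 operations per FMA × clock (assuming the commonly listed Titan X Pascal specs of 3584 CUDA cores, 1417 MHz base clock, and 1531 MHz boost clock):

3584 cores * 2 ops/FMA * 1.417 GHz ≈ 10157 GFLOPS (base)
3584 cores * 2 ops/FMA * 1.531 GHz ≈ 10974 GFLOPS (boost)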

I changed REPEAT from 5 to 500 and now I get 10278 GFLOPS.

If I change REPEAT to 500000 the GF number doesn’t really change but the clocks look like this:

Clocks
        Graphics                    : 1455 MHz
        SM                          : 1455 MHz
        Memory                      : 4513 MHz
        Video                       : 1303 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1911 MHz
        SM                          : 1911 MHz
        Memory                      : 5005 MHz
        Video                       : 1708 MHz

So undoubtedly there is still perf left on the table. I wasn’t really trying to go after max perf regardless of the time invested to do it. I was trying to suggest the plausibility of a particular claim.