Unable to reach full fp32 throughput on Titan X

Hi,
I have written this benchmark and couldn't get the full fp32 throughput mentioned here: GPU Archives | NVIDIA Blog

#include <iostream>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define USECPSEC 1000000ULL
#define ITER 1024*1024*512
#define SSZ 512
#define BSZ 7*16

unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__global__ void DoFAdd(float *in1d, float* outd) {
    int tx = threadIdx.x;
    float in1 = in1d[tx];
    float out = outd[tx];
    for (int i = 0; i < ITER; i++) {
      out += in1;
    }
    outd[tx] = out;
}


int main() {
    float *in1d,  *outd;
    cudaMalloc((void**)&in1d, SSZ*4);
    cudaMalloc((void**)&outd, SSZ*4);
    unsigned long long dt = dtime_usec(0);
    DoFAdd <<< dim3(BSZ), dim3(SSZ) >>>(in1d, outd);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    unsigned long long ops = ITER;
    ops *= BSZ;
    ops *= SSZ;
    float et = dt/(float)USECPSEC;
    unsigned long long Mops = ops/1000000;
    std::cout<<et<<"s for "<< Mops << " FAdds"<<std::endl;
    float tp = (Mops)/(et*1000000);
    std::cout << "Throughput: " << tp << " Tops/s" << std::endl;
}

What am I doing wrong?

Approaching the full published throughput requires using floating-point multiply-add operations.

A loop that just does add operations won't get you closer than about 50% of peak: FADD and FFMA issue at the same rate, but each FMA counts as two floating-point operations.

This exercise (especially if focused on real-world use, as opposed to a pure synthetic benchmark) generally takes considerable coding effort and skill. That effort and skill have already been packaged up for you in the cuBLAS SGEMM library call (as one example). I recommend that you experiment with that.
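
Something along these lines should get you most of the way there (a rough sketch, untested; the matrix size n = 8192 and the CUDA-event timing are my choices, not anything from this thread):

#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 8192;  // large enough to keep the GPU busy; ~800 MB total
    float *A, *B, *C;
    // matrices are left uninitialized; fine for a pure throughput test
    cudaMalloc((void**)&A, (size_t)n * n * sizeof(float));
    cudaMalloc((void**)&B, (size_t)n * n * sizeof(float));
    cudaMalloc((void**)&C, (size_t)n * n * sizeof(float));
    const float alpha = 1.0f, beta = 0.0f;
    cublasHandle_t handle;
    cublasCreate(&handle);
    // warm-up call so the timed run excludes library initialization
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // an n x n x n SGEMM performs ~2*n^3 floating-point operations
    double tflops = 2.0 * n * n * n / (ms * 1e-3) / 1e12;
    printf("SGEMM: %.3f ms, %.2f TFLOPS\n", ms, tflops);
    cublasDestroy(handle);
    return 0;
}

Compile with something like nvcc -o sgemm_bench sgemm_bench.cu -lcublas. Since a large SGEMM performs about 2*n^3 flops, a well-tuned SGEMM should land within striking distance of the FMA peak.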

If you want to learn the necessary techniques yourself, I would suggest perhaps starting with Scott Gray’s treatment here:

https://github.com/NervanaSystems/maxas/wiki/SGEMM

You have a low ratio of arithmetic operations to control-flow instructions in the for loop. Try an explicit #pragma unroll 1024

And performing arithmetic that maps to a hardware FMA instruction would also help a lot.
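
For illustration, here is roughly how the inner loop might be reshaped along those lines (just a sketch; in2d is a hypothetical second input array so the add becomes a multiply-add, and ITER is the define from the original post):

__global__ void DoFMAdd(const float *in1d, const float *in2d, float *outd) {
    int tx = threadIdx.x;
    float in1 = in1d[tx];
    float in2 = in2d[tx];          // hypothetical second operand
    float out = outd[tx];
#pragma unroll 1024                // amortize loop overhead over many iterations
    for (int i = 0; i < ITER; i++) {
        out = in1 * in2 + out;     // contracts to a single FFMA instruction
    }
    outd[tx] = out;
}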

Christian

Hmmm… Interesting. As a test I ran SHOC on the Titan X Pascal; it shows the same numbers as my benchmark for plain adds, and peak for FMAs.

@txbob thanks for the link. Can you explain why FMAs are used to measure throughput?

@cbuchner1 I tried the unroll and it made the performance number worse (just wanted you to know). I guess it's because of register/instruction pressure from the unrolled loop (guessing from what I picked up in Scott Gray's blog).

GFLOPS is about operations (the ‘O’ in GFLOPS). One FMA instruction comprises two floating-point operations (an addition plus a multiplication). That is why peak throughput numbers are based on executing FMAs, and as few other instructions as possible.

Since modern GPUs have multi-issue capabilities, you would also want to compute with at least two independent dependency chains per thread, each consisting entirely of FMAs (you can merge the two chains with the very last FMA in the sequence, just before you write out the final result).
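
In kernel form, that suggestion looks roughly like this (a bare-bones sketch; essentially the structure of the code further down this thread, with a and b as arbitrary loop-invariant coefficients):

__global__ void TwoChains(int iter, float a, float b, float x, float y, float *res) {
    for (int i = 0; i < iter; i++) {
        x = a * x + b;     // chain 1: one FFMA per iteration
        y = b * y + a;     // chain 2: independent of chain 1, can overlap
    }
    *res = a * x + y;      // merge the two chains with one final FMA
}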

Unless I have made a mistake, this useless code demonstrates 9.8 TFLOPS measured throughput against a theoretical peak of 11 TFLOPS:

$ cat t52.cu
#include <iostream>
#include <stdio.h>

#define ITER 1024
#define SSZ 512
#define BSZ 64


#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__global__ void DoFMA() {
    __shared__ float data1[SSZ];
    __shared__ float data2[SSZ];
    float out = threadIdx.x+1;
#pragma unroll 1024
    for (int i = 0; i < ITER; i++)
      out += data1[threadIdx.x]*data2[threadIdx.x] ;
    if (!out) printf("out = %f\n", out);  // keep out live so the loop isn't optimized away
}


int main() {
    DoFMA<<<BSZ, SSZ>>>();  // warm up
    unsigned long long dt = dtime_usec(0);
    DoFMA<<<BSZ*ITER, SSZ>>>();
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    unsigned long long ops = ITER;
    ops *= ITER;
    ops *= SSZ;
    ops *= BSZ;
    ops *= 2;
    float et = dt/(float)USECPSEC;
    std::cout << "throughput: " << ops/et << " ops/s" << std::endl;
}
$ nvcc -arch=sm_61 -o t52 t52.cu
$ ./t52
throughput: 9.84661e+12 ops/s
$

I have not looked at the generated code, but I think as written this might bottleneck on shared memory throughput. Try making one factor a constant memory reference, e.g. by using a kernel argument. Also, using two parallel dot products per thread should help.

I was thinking the compiler should optimize the shared memory reads into register reads.

I was actually thinking that a quickly whipped-up code demonstrating 90% of theoretical peak was sufficient to validate the truth claim.

If I can write trivial code that demonstrates a measured 9.8 TFLOPS, then the claim of an 11 TFLOPS theoretical peak does not seem outlandish (to me).

The program above doesn’t get close to 90% of peak on my Maxwell-based K2200. The code below gets me pretty close to peak, with measurement noise of less than 0.1%. No idea how it will fare on a Titan X.

#include <stdio.h>
#include <stdlib.h>

#define REPEAT    5
#define ITER      512
#define THREADS   128
#define BLOCKS  65520

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void DoFMA (float a, float b, float x, float y, float *res) {
#define OPS_PER_FMA    2
#define FMAS_PER_ITER  2
#define EXTRA_OPS      1
#pragma unroll ITER
    for (int i = 0; i < ITER; i++) {
        x = a * x + b;
        y = b * y + a;
    }
    *res = x + y;
}


int main() 
{
    float res, *res_d = 0;
    double start, stop, elapsed, min_elapsed = 1e308;
    cudaMalloc ((void **)&res_d, sizeof (*res_d));
    cudaDeviceSynchronize (); // ensure GPU is idle before timed portion
    for (int k = 0; k < REPEAT; k++) {
        start = second();
        DoFMA<<<BLOCKS, THREADS>>>(1.125f, 1.125f, 3.0f, 5.0f, res_d);
        cudaDeviceSynchronize (); // wait until kernel has finished
        stop = second ();
        elapsed = stop - start;
        if (elapsed < min_elapsed) min_elapsed = elapsed;
    }
    printf ("elapsed = %.6f seconds\n", min_elapsed);
    unsigned long long ops = ITER * FMAS_PER_ITER * OPS_PER_FMA + EXTRA_OPS;
    ops *= BLOCKS;
    ops *= THREADS;
    printf ("throughput = %10.2f GFLOPS\n", ops / min_elapsed /1e9);
    cudaMemcpy (&res, res_d, sizeof(res), cudaMemcpyDeviceToHost);
    cudaFree (res_d);
#if DEBUG
    printf ("res = %15.8e\n", res);
#endif
    return EXIT_SUCCESS;
}

Norbert, your code also gives full throughput on my GM206 GTX 960 in Linux.

Surprisingly, Linux nvcc 8.0.44 complains about the #pragma line with an error about ITER being undefined. That seems like an nvcc macro expansion bug! Substituting the explicit 512 fixes it. Note that the for loop itself uses ITER and works fine.

Funny, the #pragma works just fine for me with CUDA 7.5 on Windows. I am too lazy to dig out the C++ standard right now. There are multiple phases in the preprocessor, and I do not know off-hand what the required order of those phases is, that is, whether ITER must be expanded before the #pragma is consumed. [Later:] The ISO C++11 standard, section 16.6, states that #pragma is followed by pp-tokens, which I read to mean that macro expansion should take place. This is contradicted by some online sources that state that macro expansion does not take place in a #pragma directive.

The program reports 1377 GFLOPS on my Quadro K2200 GPU at the maximum core clock of 1124 MHz, which is about 96% of the theoretical maximum of 1440 GFLOPS. (Various online sources list this GPU at 1339 GFLOPS; not sure why I computed a different number. Maybe the maximum core clock is different on my model?)

Huh, you learn something new every day. Google found this post, which references section 6.10.9 of the C99 standard and section 16.9 of the C++0x standard; those say that #pragma is not macro-expanded, but that the equivalent _Pragma("argument") expression expands into such a #pragma. I had never even heard of the second form.

But if I try that in nvcc, using the ## stringify macro expansion and implicit string concatenation like:

_Pragma("unroll " ##ITER)

nvcc complains that _Pragma needs a string literal, so it looks as if the macro stringification was not evaluated before _Pragma.

Yes, this is a side topic to the FMA throughput question, so I'll leave off the discussion, but I wonder what the right behavior should be (is nvcc in error, or, more likely, is my understanding wrong?). And then, of course, what is the proper way to use #pragma unroll with a #defined unroll count?
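
One workaround I have seen for that last question uses two levels of macros, so the count is expanded before it is stringified (a sketch; I have not verified it against nvcc 8.0.44):

#define ITER 512

// two-level expansion: STR(unroll ITER) -> STR_IMPL(unroll 512) -> "unroll 512"
#define STR_IMPL(x) #x
#define STR(x) STR_IMPL(x)
#define PRAGMA_UNROLL(n) _Pragma(STR(unroll n))

__global__ void DoFMA (float a, float b, float x, float y, float *res) {
    PRAGMA_UNROLL(ITER)    // expands to _Pragma("unroll 512")
    for (int i = 0; i < ITER; i++) {
        x = a * x + b;
        y = b * y + a;
    }
    *res = x + y;
}

The key point is that STR's argument is macro-expanded before STR_IMPL's # operator turns it into the string literal that _Pragma requires.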

On my Titan X Pascal, I get 9989 GFLOPS with Norbert's code (vs. 9846 with mine).

What is the core frequency while the test is running? Since this is a short-running kernel with only few repetitions, the full boost clock may never kick in (I had to force it even on my Quadro). The Titan XP is listed as providing 10157 GFLOPS at nominal core clock, 10974 GFLOPS at maximum (NVIDIA-defined) boost clock. At least we are in the ballpark of that, as you point out.
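
For reference, those published numbers follow directly from core count × 2 operations per FMA × clock (assuming the commonly listed Titan X Pascal specs of 3584 CUDA cores, 1417 MHz base clock, and 1531 MHz boost clock):

3584 cores * 2 ops/FMA * 1.417 GHz ≈ 10157 GFLOPS (base)
3584 cores * 2 ops/FMA * 1.531 GHz ≈ 10974 GFLOPS (boost)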

I changed REPEAT from 5 to 500 and now I get 10278 GFLOPS.

If I change REPEAT to 500000 the GF number doesn’t really change but the clocks look like this:

Clocks
        Graphics                    : 1455 MHz
        SM                          : 1455 MHz
        Memory                      : 4513 MHz
        Video                       : 1303 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1911 MHz
        SM                          : 1911 MHz
        Memory                      : 5005 MHz
        Video                       : 1708 MHz

So undoubtedly there is still perf left on the table. I wasn’t really trying to go after max perf regardless of the time invested to do it. I was trying to suggest the plausibility of a particular claim.