Approaching the full published throughput requires using floating-point multiply-add operations.
A loop that does only add (or only multiply) operations won’t get you closer than 50%, since each such instruction counts as one floating-point operation while an FMA counts as two.
This exercise (especially if focused on real-world use, as opposed to a purely synthetic benchmark) generally takes considerable coding effort and skill. Such effort and skill have already been packaged up for you in the cuBLAS SGEMM library call (as one example). I recommend that you experiment with that.
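For reference, a minimal sketch of what such an experiment might look like: time a large SGEMM with cuBLAS and convert to GFLOPS. This is my own untested sketch, not code from this thread; the matrix size N is an arbitrary assumption, and error checking is omitted for brevity.

```cuda
// Sketch: measure achieved SGEMM throughput with cuBLAS (assumptions noted above)
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main (void)
{
    const int N = 4096;   // matrix dimension (assumption; larger is better for peak)
    float *A, *B, *C;
    cudaMalloc (&A, (size_t)N * N * sizeof(float));
    cudaMalloc (&B, (size_t)N * N * sizeof(float));
    cudaMalloc (&C, (size_t)N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate (&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate (&start);
    cudaEventCreate (&stop);

    /* warm-up call, then timed call */
    cublasSgemm (handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                 &alpha, A, N, B, N, &beta, C, N);
    cudaEventRecord (start);
    cublasSgemm (handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                 &alpha, A, N, B, N, &beta, C, N);
    cudaEventRecord (stop);
    cudaEventSynchronize (stop);

    float ms = 0.0f;
    cudaEventElapsedTime (&ms, start, stop);
    double flops = 2.0 * N * N * N;  /* one FMA (= 2 ops) per inner-product step */
    printf ("SGEMM: %.1f GFLOPS\n", flops / (ms * 1.0e-3) / 1.0e9);

    cublasDestroy (handle);
    cudaFree (A); cudaFree (B); cudaFree (C);
    return 0;
}
```

On recent GPUs a sufficiently large SGEMM from cuBLAS typically lands within striking distance of the published peak, which makes it a convenient upper-bound reference for hand-written kernels.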
If you want to learn the necessary techniques yourself, I would suggest perhaps starting with Scott Gray’s treatment here:
Hmmm… interesting. As a test I ran SHOC on the Titan XP; it shows the same numbers for plain adds, and peak throughput for FMADs.
@txbob thanks for the link. Can you explain why mads are used to measure throughput?
@cbuchner1 I tried to unroll and it made the performance numbers worse (just wanted you to know). I guess it’s because of register spilling (going by what I learned from Scott Gray’s blog).
GFLOPS is about operations (the ‘O’ in GFLOPS). One FMA instruction comprises two floating-point operations (an addition plus a multiplication). That is why peak throughput numbers are based on executing FMAs, and as few other instructions as possible.
Since modern GPUs have multi-issue capabilities, you would also want to compute with at least two independent dependency chains per thread, each consisting entirely of FMAs (you can merge the two chains with the very last FMA in the sequence, just before you write out the final result).
I have not looked at the generated code, but I think as written this might bottleneck on shared memory throughput. Try making one factor a constant memory reference, e.g. by using a kernel argument. Also, using two parallel dot products per thread should help.
The program above doesn’t get close to 90% of peak on my Maxwell-based K2200. The code below gets me pretty close to peak, with measurement noise of less than 0.1%. No idea how it will fare on a Titan X.
#include <stdio.h>
#include <stdlib.h>

#define REPEAT  5
#define ITER    512
#define THREADS 128
#define BLOCKS  65520

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;
    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void DoFMA (float a, float b, float x, float y, float *res) {
#define OPS_PER_FMA   2
#define FMAS_PER_ITER 2
#define EXTRA_OPS     1
#pragma unroll ITER
    for (int i = 0; i < ITER; i++) {
        x = a * x + b;
        y = b * y + a;
    }
    *res = x + y;
}

int main()
{
    float res, *res_d = 0;
    double start, stop, elapsed, min_elapsed = 1e308;
    cudaMalloc ((void **)&res_d, sizeof (*res_d));
    cudaDeviceSynchronize (); // ensure GPU is idle before timed portion
    for (int k = 0; k < REPEAT; k++) {
        start = second();
        DoFMA<<<BLOCKS, THREADS>>>(1.125f, 1.125f, 3.0f, 5.0f, res_d);
        cudaDeviceSynchronize (); // wait until kernel has finished
        stop = second ();
        elapsed = stop - start;
        if (elapsed < min_elapsed) min_elapsed = elapsed;
    }
    printf ("elapsed = %.6f seconds\n", min_elapsed);
    unsigned long long ops = ITER * FMAS_PER_ITER * OPS_PER_FMA + EXTRA_OPS;
    ops *= BLOCKS;
    ops *= THREADS;
    printf ("throughput = %10.2f GFLOPS\n", ops / min_elapsed / 1e9);
    cudaMemcpy (&res, res_d, sizeof(res), cudaMemcpyDeviceToHost);
    cudaFree (res_d);
#if DEBUG
    printf ("res = %15.8e\n", res);
#endif
    return EXIT_SUCCESS;
}
Norbert, your code also gives full throughput on my GM206 GTX 960 in Linux.
Surprisingly, Linux nvcc 8.0.44 complains about the #pragma line with an error that ITER is undefined. That seems like an nvcc macro-expansion bug! Substituting the explicit 512 fixes it. Note that the for loop also uses ITER, and that works fine.
Funny, the #pragma works just fine for me with CUDA 7.5 on Windows. I am too lazy to dig out the C++ standard right now. There are multiple phases in the preprocessor, and I do not know off-hand what the required order of these phases is, that is, must ITER be expanded before the #pragma is consumed? [Later:] The ISO C++11 standard, section 16.6, states that #pragma is followed by pp-tokens, which I read to mean that macro expansion should take place. This is contradicted by some online sources that state that macro expansion does not take place in a #pragma directive.
The program reports 1377 GFLOPS on my Quadro K2200 GPU at the maximum core clock of 1124 MHz, which is about 96% of the theoretical maximum of 1440 GFLOPS (various online sources list this GPU at 1339 GFLOPS; not sure why I computed a different number, maybe the maximum core clock is different on my model?)
Huh, you learn something new every day. Google found this post, which gives references to section 6.10.9 of the C99 standard and 16.9 of the C++0x standard, which say that #pragma is not macro-expanded, but that the equivalent _Pragma(" argument ") expression expands into such a #pragma. I had never even heard of the second form.
But if I try that in nvcc, using the ## stringify macro expansion and implicit string concatenation like:
_Pragma("unroll " ##ITER)
nvcc complains that _Pragma needs a string literal, so it looks as if the macro stringification was not evaluated before _Pragma.
Yes, this is a side topic to the FMA throughput question, so I’ll leave off the discussion, but I wonder what the right behavior should be (is nvcc in error, or, more likely, is my understanding wrong?). And then, of course, what is the proper way to use #pragma unroll with a #defined unroll count?
What is the core frequency while the test is running? Since this is a short-running kernel with only a few repetitions, the full boost clock may never kick in (I had to force it even on my Quadro). The Titan XP is listed as providing 10157 GFLOPS at the nominal core clock and 10974 GFLOPS at the maximum (NVIDIA-defined) boost clock. At least we are in the ballpark of that, as you point out.
So undoubtedly there is still performance left on the table. I wasn’t really trying to go after maximum performance regardless of the time invested; I was trying to establish the plausibility of a particular claim.