Benchmark numbers wrong for half math on 1060

Hi,
I ran benchmark below to check fp16 throughput on GTX 1060.
Here is the code:

#include <iostream>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <cuda.h>
#include "cuda_runtime.h"
#include "cuda_fp16.h"

#define USECPSEC 1000000ULL
#define ITER 1024*1024*1024
#define WI 512
#define WG 6*10
#define SIZE WI<<1

unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__global__ void DoHAdd(const half *in1d, half* outd) {
    int tx = threadIdx.x;
    half in1 = in1d[tx];
    half out = outd[tx];

    for (int i = 0; i < ITER; i++) {
      out = __hadd(in1, out);
    }
    outd[tx] = out;
}


int main() {
    half *in1d, *outd;
//    cudaSetDevice(2);
    int devID = 1;
    cudaDeviceProp devProp;
    cudaGetDeviceProperties(&devProp, devID);
    std::cout<<devProp.name<<std::endl;
    cudaMalloc((void**)&in1d, SIZE);
    cudaMalloc((void**)&outd, SIZE);
    unsigned long long dt = dtime_usec(0);
    DoHAdd <<< dim3(WG), dim3(WI) >>> (in1d, outd);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    unsigned long long ops = ITER;
    ops *= WG;
    ops *= WI;
    float et = dt/(float)USECPSEC;
    unsigned long long Mops = ops/1000000;
    std::cout<<et<<"s for "<< Mops << " HAdds"<<std::endl;
    float tp = (Mops)/(et*1000000);
    std::cout << "Throughput: " << tp << " Tops/s" << std::endl;
}

Compilation commands:

nvcc -gencode arch=compute_61,code=sm_61 hadd.cu

So what’s the output of the above program? How is it wrong?

I am getting different throughput for change in ITER value. Ranging from 2TOPs to 2341524TOPs

Try adding proper status checks to all CUDA operations. Are any errors reported? I am not sure of the robustness of the timing framework, the computation looks a bit complicated with at least three different data types involved: possible issues of granularity and overflow. The following simpler code should do the trick (accurate to microsecond resolution):

double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

double start, stop, elapsed;
cudaDeviceSynchronize()
start = second();
[ .... code under test .... ]
cudaDeviceSynchronize()
stop = second();
elapsed = stop - start;

Make sure you don’t get integer overflow when computing the number of operations. It is not clear whether the above is showing your maximum settings, but 102410241024 is already 2**30, so if you make those constants a bit larger there will be integer overflow in that product.

None of those throughput values sound correct to me. The Pascal family members all deliver some amount of FP32 throughput in the range of 1-12 TFlops, considering FMA. cc6.1 devices have a ratio of FP32 to FP16 throughput of 64:1, and if we further account for the 2:1 ratio when considering FMA vs. ordinary add as you are doing, we get a 128x reduction for FP16 adds. This means a Titan X Pascal would deliver a max of about 0.1 TFlops FP16 add (but see note below about half2 vector type).

First of all, with an ITER value of 102410241024 this code takes a long time for me to run even on a Titan X Pascal. If you are running this on windows, I would be concerned about windows WDDM TDR giving you variable, spurious numbers.

Second, I suspect the code is not running on the device you think it is. At the beginning of your main routine, you set devID to 1 and then query for properties of that device, and print out the device name. But if you never call cudaSetDevice, the subsequent code will all run on device 0.

When I run your code with ITER reduced to 1024*1024, I get a throughput of about 0.03TFlops on my Titan X Pascal. It’s not the 0.1 I previously claimed, but its in the ballpark and not variable. I believe there is a remaining factor of 2, because in order to get full throughput on FP16 operations, its necessary to do those operations on half2 vector type:

http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF2__ARITHMETIC.html#group__CUDA__MATH____HALF2__ARITHMETIC

I haven’t bothered with clock boosting my Titan X Pascal or considered other optimizations to your code, so 0.03TFlops vs. 0.05TFlops max theoretical is close enough for me - ie. as much effort as I would want to put into this.

@njuffa I think OP’s timing (which I think is mine) is OK. It computes the integer number of microseconds using all integer-type arithmetic, AFAICT.

It’s not obvious to me that it has “3 different types”.

All presented types are unsigned long long with the exception of the types inherent in struct timeval. According to this reference:

http://stackoverflow.com/questions/1469495/unix-programming-struct-timeval-how-to-print-it-c-programming

those are each underlying types of long int, which on a 64-bit platform is a 64 bit type, just as ULL is. At most I see 2 types (arguably, yours contains 2 types also). Even on a 32-bit platform, it may be a 64-bit type, and even if it is not, I’m pretty sure multiplication of a 32-bit type with a 64-bit type produces a 64-bit type, and addition of a 32-bit type with a 64-bit type produces a 64-bit type. The implicit casting of signed to unsigned is of no consequence, since the underlying system can only produce nonnegative integers.

With the exception of 64-bit overflow, I see no hazards. Since the values are all scaled to integer microseconds since the epoch, a 64 bit signed overflow could not occur within 100,000 years after the epoch.

So instead of the year 2038 problem, we have the “year 101,970 problem”.

I explicitly crafted it that way to avoid floating point arithmetic, which has it’s own hoops to jump through, although probably fewer, and also so that I could be assured of the precision of the result. Since your method relegates microseconds to the fractional component, I wasn’t sure of the quantization effect of power-of-2 representation on a fractional second, whereas mine uses all integer representations and presents calculations using the same granularity as presented by the underlying time system. I assume we are talking about only +/- 0.5ULP here, but I didn’t want to think about it or convince myself that there were no scaling or accumulation effects.

OK, two different types :-) I mistakenly thought ‘dt’ is ‘double’. I am in general terrible parsing other people’s code, as I think I have mentioned before. So if people post code for review, I just try to pick on things that look suspect to me on first glance (here: possible integer overflow, ‘float’ granularity, CUDA API or launch errors), up to them to follow up on these pointers or prove their code correct. In the case at hand, clearly the “measured” performance numbers are impossibly high, so something seems badly out of kilter.

Converting Unix-y ‘timeval’ structs into ‘double’ seconds provides sufficient granularity[ *] to provide microsecond resolution until the year 2038 rolls around (pun intended), and measurement noise generally makes it difficult to measure execution times any more accurate than microseconds in the first place. I have used that code for twenty years, and will happily continue to use it for the next twenty years.

Unix’s maximum count of seconds is 231-1, leaving 21 bits of ‘double’ for sub-second resolution, for worst case granularity of 2(-21) seconds = 4.8e-7 seconds, sufficient for accurate representation down to to microsecond level in January of 2038.

@txbob,

I tried to run on both Titan X (Pascal and Maxwell), I am able to reproduce results for Pascal.

Output:

TITAN X (Pascal)
0.957497s for 32212 HAdds
Throughput: 0.0336419 Tops/s

How should I compile the source on Maxwell? What CC and code changes should I make?

Maxwell cc 5.0 and 5.2 devices do not support FP16 operations.

Great! Thanks!