Performance gap between host-allocated halfs and kernel-cast halfs

Hey there,

I’m putting together a best-practices guide for a simple kernel. There is one last thing I need to fix, namely the kernel performance of the halfs.

My program can handle halfs in two different flavors:

First version: allocate floats on the host, transfer them to the device, and convert the floats to halfs in the kernel. Obvious disadvantage: you have to send floats, so double the number of bytes. However, this implementation works as expected: the kernel is twice as fast as the float version.

Second version: allocate halfs on the host, transfer them to the device, and do the computations with them directly. The transfer time is exactly half that of the floats, but the kernel time is much worse than in the version that casts to half inside the kernel.

Why is it faster to convert them in the kernel, instead of sending them directly as halfs to the kernel?

Thanks in advance and kind regards
Max

I wouldn’t be able to say much without a complete test case. However, getting high performance with half datatypes in kernel code usually means that the type you want to focus on is half2, not half.
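To illustrate what I mean (this is a generic sketch, not your code), a half2-based kernel packs two half values per element and uses the paired intrinsics from cuda_fp16.h, such as __hadd2 and __hmul2, so each instruction operates on two values at once:

```cuda
#include <cuda_fp16.h>

// Hypothetical sketch: N half values processed as N/2 packed __half2 values.
// Each thread performs one vectorized multiply-add on a pair of halfs.
__global__ void add_mul_half2(const __half2 *a, const __half2 *b, __half2 *c, int N2){
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if (idx < N2){
                __half2 va = a[idx];
                __half2 vb = b[idx];
                // __hadd2 and __hmul2 operate on both 16-bit lanes at once
                c[idx] = __hadd2(va, __hmul2(vb, vb));
        }
}
```

There are also paired math functions (h2sin, h2cos, h2exp) that match the scalar hsin/hcos/hexp used below.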

__global__ void add_sin_cos_half_host_pinned(__half* a, __half* b, __half* c, int N){
        int idx = blockIdx.x*blockDim.x+threadIdx.x;
        if(idx < N){
                for(int i=0; i<1000; ++i){
                        c[idx] += _hadd(a[idx], b[idx]) + hsin(a[idx])*hsin(a[idx]) + hcos(b[idx])*hcos(b[idx]) + hexp(b[idx]);
                }
        }
}
__global__ void add_sin_cos_half_pinned(float* a, float* b, float* c, int N){
        int idx = blockIdx.x*blockDim.x+threadIdx.x;
        if(idx < N){
                __half half_a = __float2half(a[idx]);
                __half half_b = __float2half(b[idx]);
                __half half_c = 0;
                for(int i=0; i<1000; ++i){
                        half_c += __hadd(half_a, half_b) + hsin(half_a)*hsin(half_a) + hcos(half_b)*hcos(half_b) + hexp(half_b);
                }
                c[idx] = __half2float(half_c);
        }
}

In general I would agree with you, but since the only difference between the programs is the kernel and I only compare kernel times, it’s a bit strange that the one with the conversion inside the kernel takes half the float time (so a speedup of 2), while the other one reaches only a speedup of about 1.2.

For a complete test case, I would like to see the entire code (all host code and device code needed to build a complete application, without my having to add or change anything: copy, paste, compile, run), along with the platform (OS, GPU) you are running on, as well as the complete compile command.

If any of that is missing, I’m less likely to spend any time on it. For example, I wouldn’t want to waste time trying to analyze code, only to discover that OP is compiling a debug instead of release project, and trying to do performance analysis on debug code.

At first glance, your codes look different to me because in one case (the fast case) you are loading the a and b quantities exactly once and storing c exactly once, while in the other case you are (in source code, at least) loading the a and b quantities multiple times and potentially storing c multiple times. To claim equivalence between the two presumes things about the compiler that I’m not sure are always true.

I also hate trying to analyze artificial code, like the loop of 1000. I don’t know what the compiler will do under the hood. For the best performance analysis, my suggestion would be to simply work on large data sets, rather than artificially multiplying the work by 1000. The compiler might discover things about your loop of 1000 when all the data is per-thread local data that it cannot or does not discover when some of the data is global.

I have rewritten your code to avoid some of the issues I mentioned. Here is a complete test case, on a Tesla V100, CentOS 7, CUDA 10.1.243, which suggests to me that there is no significant difference in kernel execution time between the two cases as I have written them:

$ cat t1641.cu
#include <iostream>
#include <cuda_fp16.h>

__device__ __half w(__half half_a, __half half_b){
          return  __hadd(half_a, half_b) + hsin(half_a)*hsin(half_a) + hcos(half_b)*hcos(half_b) + hexp(half_b);
}

__global__ void add_sin_cos_half_host_pinned(__half* a, __half* b, __half* c, int N){
        int idx = blockIdx.x*blockDim.x+threadIdx.x;
        if(idx < N){
                __half half_a = a[idx];
                __half half_b = b[idx];
                __half half_c = 0;
                for(int i=0; i<1000; ++i)
                  half_c += w(half_a, half_b);
                c[idx] = half_c;
        }
}
__global__ void add_sin_cos_half_pinned(float* a, float* b, float* c, int N){
        int idx = blockIdx.x*blockDim.x+threadIdx.x;
        if(idx < N){
                __half half_a = __float2half(a[idx]);
                __half half_b = __float2half(b[idx]);
                __half half_c = 0;
                for(int i=0; i<1000; ++i)
                  half_c += w(half_a, half_b);
                c[idx] = __half2float(half_c);
        }
}

const int ds = 32*1048576;

int main(){

  __half *d_ha, *d_hb, *d_hc;
  float *d_fa, *d_fb, *d_fc;
  const int hds = ds*sizeof(__half);
  const int fds = ds*sizeof(float);
  cudaMalloc(&d_ha, hds);
  cudaMalloc(&d_hb, hds);
  cudaMalloc(&d_hc, hds);
  cudaMalloc(&d_fa, fds);
  cudaMalloc(&d_fb, fds);
  cudaMalloc(&d_fc, fds);
  cudaMemset(d_ha, 0, hds);
  cudaMemset(d_hb, 0, hds);
  cudaMemset(d_fa, 0, fds);
  cudaMemset(d_fb, 0, fds);

  add_sin_cos_half_pinned<<<ds/1024, 1024>>>(d_fa, d_fb, d_fc, ds);
  add_sin_cos_half_host_pinned<<<ds/1024, 1024>>>(d_ha, d_hb, d_hc, ds);

  add_sin_cos_half_pinned<<<ds/1024, 1024>>>(d_fa, d_fb, d_fc, ds);
  add_sin_cos_half_host_pinned<<<ds/1024, 1024>>>(d_ha, d_hb, d_hc, ds);
  cudaDeviceSynchronize();
}

$ nvcc -arch=sm_70 -o t1641 t1641.cu
$ nvprof ./t1641
==12946== NVPROF is profiling process 12946, command: ./t1641
==12946== Profiling application: ./t1641
==12946== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   50.00%  192.42ms         2  96.209ms  96.086ms  96.331ms  add_sin_cos_half_pinned(float*, float*, float*, int)
                   49.99%  192.38ms         2  96.189ms  96.082ms  96.296ms  add_sin_cos_half_host_pinned(__half*, __half*, __half*, int)
                    0.00%  5.2800us         4  1.3200us  1.1840us  1.5680us  [CUDA memset]
      API calls:   54.80%  385.16ms         1  385.16ms  385.16ms  385.16ms  cudaDeviceSynchronize
                   43.64%  306.72ms         6  51.119ms  325.74us  304.83ms  cudaMalloc
                    0.77%  5.3857ms         4  1.3464ms  677.62us  3.3237ms  cuDeviceTotalMem
                    0.71%  4.9661ms       388  12.799us     341ns  525.10us  cuDeviceGetAttribute
                    0.06%  444.14us         4  111.03us  102.39us  126.06us  cuDeviceGetName
                    0.02%  119.44us         4  29.860us  12.701us  75.906us  cudaMemset
                    0.01%  62.672us         4  15.668us  10.173us  29.536us  cudaLaunchKernel
                    0.00%  26.885us         4  6.7210us  3.7720us  12.116us  cuDeviceGetPCIBusId
                    0.00%  8.1800us         8  1.0220us     480ns  1.5930us  cuDeviceGet
                    0.00%  6.3160us         3  2.1050us     336ns  4.0530us  cuDeviceGetCount
                    0.00%  2.4490us         4     612ns     530ns     713ns  cuDeviceGetUuid
$

Also, the codes as you have written them are really not identical. The half approach depends on the values in c. The float approach does not (c is purely overwritten).

Your posted code also has a typo in it. The first kernel uses _hadd, which doesn’t exist (the intrinsic is __hadd, with two underscores). So this evidently isn’t the code you are actually running.

Dear Robert,

don’t get me wrong, in general I agree with you 100%, but in this particular case only the kernel differs, and your answer just showed that this was sufficient to help me. I hadn’t paid attention to the fact that I load and store multiple times.

Again, I agree that requests like this are a nuisance in 99% of cases. However, the program is such an artificial problem and the implementation almost trivial, except that there is a lot of code for measuring batches of runs, writing the results out to a file, handling students’ command-line arguments, and so on. So my point is: my code is a trivial problem with a trivial implementation plus a large amount of user handling for students. That’s the only reason I didn’t post the whole thing.

The artificial loop is, by the way, just there to increase the workload, because I’m already scaling the data to the memory limits.

Thanks Robert for helping me out on this one!

After a quick analysis of the SASS code, it seems evident to me that one of the issues is that loading from global memory is occurring on each loop iteration with the slow case, but not with the fast case.
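If you want to check the SASS yourself, cuobjdump can disassemble the compiled binary (the commands below assume the t1641 executable built above):

```shell
# Disassemble the GPU machine code (SASS) from the compiled binary
cuobjdump -sass t1641
# Look for LDG (load-from-global) instructions inside the loop body;
# in the slow case they show up on every iteration
cuobjdump -sass t1641 | grep LDG
```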

Hey Robert,

I changed it in my code and I can confirm that the performance now matches.
Next time I’ll extract a mini program, even for artificial problems like this, promised!

Thanks again!