GPU/CPU precision comparison and kernel instruction limit question

Hello everybody!

I am trying to compare the speedup and the precision of CPU vs. GPU computation using a simple code.
I solve the 2D heat equation with an explicit finite-difference scheme. I run 1000 time iterations and compare the sum of the 2D array, normalized by the number of cells (that is my precision metric).
For the speedup, I simply measure the total execution time of the program (Linux 'time').

You'll find the .cpp and .cu files below. The compilation command is given as a comment on the first line of each file.
I use CUDA 8.0 for the .cu files and the Intel C++ Compiler v17 for the .cpp file.
I am on macOS with an Intel i7 at 3.3 GHz and an NVIDIA GT 650M.

main.cpp

// /opt/intel/bin/icpc -O2 -fp-model precise main.cpp && time ./a.out
#include <iostream>
#include <cmath>
#include <iomanip>
#include <ctime>    // for std::clock

int main()
{

  const int nx = 2800;
  const int ny = 2800;
  const float lx = 1.f;
  const float ly = 1.f;
  float dx = lx / (float)nx;
  float dy = ly / (float)ny;
  const float lambda = 0.001f;
  float dt = 0.01f/lambda/(1.0f/(dx*dx)/12.f+1.0f/(dy*dy)/12.f);
  float dtdx = dt*lambda/dx/dx/12.f;
  float dtdy = dt*lambda/dy/dy/12.f;

  int nxt = nx + 4;
  int nyt = ny + 4;

  float* T = new float[nxt*nyt];
  float* Told = new float[nxt*nyt];

  for (int i = 0; i < nxt; i++) {
    for (int j = 0; j < nyt; j++) {
      int ind = i*nyt + j;
      T[ind] = 1.0f + exp(-32.f*(pow((float)(j-nyt/2)/(float)nyt,2)+pow((float)(i-nxt/2)/(float)nxt,2)));
      Told[ind] = T[ind];
    }
  }

  double time = std::clock();

  for (int step=0; step<1000; step++) {

    for (int i = 2; i < nxt-2; i++) {
      for (int j = 2; j < nyt-2; j++) {
        int ind = i*nyt + j;
        T[ind] = Told[ind] + dtdx * (-Told[(i-2)*nyt + (j)]+16.f*Told[(i-1)*nyt + (j)]-30.f*Told[(i)*nyt + (j)]+16.f*Told[(i+1)*nyt + (j)]-Told[(i+2)*nyt + (j)])
        + dtdy * (-Told[(i)*nyt + (j-2)]+16.f*Told[(i)*nyt + (j-1)]-30.f*Told[(i)*nyt + (j)]+16.f*Told[(i)*nyt + (j+1)]-Told[(i)*nyt + (j+2)]);
      }
    }

    for (int i = 0; i < nxt*nyt; i++) {
        Told[i] = T[i];
    }

  }

  //std::cout << 1000.0*(std::clock()-time)/(double) CLOCKS_PER_SEC << " ms" << std::endl;

  float sum = 0.0f;
  for (int i = 0; i < nxt*nyt; i++) {
      sum += T[i];
  }

  std::cout << "Sum: " << std::setprecision(10) << sum/(float)(nxt*nyt) << std::endl;

  return 0;
}

main.cu

// /usr/local/cuda/bin/nvcc --gpu-architecture=sm_30 -O2 main.cu && time ./a.out
#include <cuda.h>
#include <iostream>
#include <cmath>
#include <iomanip>
#include <ctime>    // for std::clock

__global__ void solve(float* T, float dtdx, float dtdy, int N)
{
  int indX = blockIdx.x * (blockDim.x-4) + threadIdx.x;
  int indY = blockIdx.y * (blockDim.y-4) + threadIdx.y;
  __shared__ float temp[32*32*6];

  for (int step=0; step<1000; step++) {

    temp[threadIdx.x*32+threadIdx.y] = T[indX * N + indY];
    __syncthreads();

    if(threadIdx.x > 1 && threadIdx.x < (blockDim.x-2) && threadIdx.y > 1 && threadIdx.y < (blockDim.y-2))
      T[indX * N + indY] = temp[(threadIdx.x)*32+(threadIdx.y)] + dtdx * (-temp[(threadIdx.x-2)*32+(threadIdx.y)]+16.f*temp[(threadIdx.x-1)*32+(threadIdx.y)]-30.f*temp[(threadIdx.x)*32+(threadIdx.y)]+16.f*temp[(threadIdx.x+1)*32+(threadIdx.y)]-temp[(threadIdx.x+2)*32+(threadIdx.y)])
      + dtdy * (-temp[(threadIdx.x)*32+(threadIdx.y-2)]+16.f*temp[(threadIdx.x)*32+(threadIdx.y-1)]-30.f*temp[(threadIdx.x)*32+(threadIdx.y)]+16.f*temp[(threadIdx.x)*32+(threadIdx.y+1)]-temp[(threadIdx.x)*32+(threadIdx.y+2)]);
  }
}

int main()
{

  const int nx = 2800;
  const int ny = 2800;
  const float lx = 1.f;
  const float ly = 1.f;
  float dx = lx / (float)nx;
  float dy = ly / (float)ny;
  const float lambda = 0.001f;
  float dt = 0.01f/lambda/(1.0f/(dx*dx)/12.f+1.0f/(dy*dy)/12.f);
  float dtdx = dt*lambda/dx/dx/12.f;
  float dtdy = dt*lambda/dy/dy/12.f;

  int nxt = nx + 4;
  int nyt = ny + 4;

  float* T = new float[nxt*nyt];
  //float* Told = new float[nxt*nyt];

  for (int i = 0; i < nxt; i++) {
    for (int j = 0; j < nyt; j++) {
      int ind = i*nyt + j;
      T[ind] = 1.0f + exp(-32.f*(pow((float)(j-nyt/2)/(float)nyt,2)+pow((float)(i-nxt/2)/(float)nxt,2)));
      //Told[ind] = T[ind];
    }
  }

  float* Tk;
  cudaMalloc((void**)&Tk,nxt*nyt*sizeof(float));
  cudaMemcpy(Tk,T,nxt*nyt*sizeof(float),cudaMemcpyHostToDevice);
  dim3 block(100,100,1);
  dim3 thread(32,32,1);

  double time = std::clock();

  //for (int step=0; step<1000; step++) {

  solve<<<block, thread>>>(Tk, dtdx, dtdy, nyt);

  //}
  //std::cout << 1000.0*(std::clock()-time)/(double) CLOCKS_PER_SEC << " ms" << std::endl;

  //time = std::clock();
  cudaMemcpy(T,Tk,nxt*nyt*sizeof(float),cudaMemcpyDeviceToHost);
  cudaFree(Tk);
  //std::cout << "SendMem " << 1000.0*(std::clock()-time)/(double) CLOCKS_PER_SEC << " ms" << std::endl;

  float sum = 0.0f;
  for (int i = 0; i < nxt*nyt; i++) {
      sum += T[i];
  }

  std::cout << "Sum: " << std::setprecision(10) << sum/(float)(nxt*nyt) << std::endl;

  return 0;
}

1st question about precision:

The sequential code (main.cpp) gives a value that we can reasonably take as our 'true' value (in my case: 1.087123036).
If I run the CUDA code above (main.cu), I end up with a slightly different value (1.087129474). So the precision of the computation is not preserved between the plain C++ and the CUDA versions…

Does somebody know more about floating-point precision (float/double) in CUDA?
Does the compiler apply any optimizations that affect the floating-point computations?

2nd question about the limit of instructions inside the kernel:

If I run the CUDA code (main.cu) on my machine, the OS freezes for the duration of the computation and sometimes the system crashes…
If you look at my code, I have put the time loop inside the kernel… I know there is a limit on the number of instructions you can put in a CUDA kernel, but how can I check the instruction count of my kernel?

If I move the time loop outside of the kernel (as in main_2.cu below), the execution works fine, but the computation takes much longer, even (far) longer than the sequential computation (the .cpp file).
Yet in all the tutorials I have seen so far, the time loop is always outside the kernel, so the kernel is launched once per time step, whereas my first code (main.cu) launches a single kernel and does everything inside it.

With main_2.cu, the final value is correct (1.087123036), so what is the issue with main.cu?
Could this issue (the crash) be due to the limit on instructions inside the kernel?

If main_2.cu is the correct approach, why is the CUDA version slower than the classical sequential computation?
(In my case, main.cpp takes 13 s, main.cu takes 10 s (but only 0.1 s inside the kernel; the rest is memory transfer), and main_2.cu takes 44 s.)

main_2.cu

// /usr/local/cuda/bin/nvcc --gpu-architecture=sm_30 -O2 main_2.cu && time ./a.out
#include <cuda.h>
#include <iostream>
#include <cmath>
#include <iomanip>
#include <ctime>    // for std::clock

__global__ void solve(float* T, float dtdx, float dtdy, int N)
{
  int indX = blockIdx.x * (blockDim.x-4) + threadIdx.x;
  int indY = blockIdx.y * (blockDim.y-4) + threadIdx.y;
  __shared__ float temp[32*32*6];

  temp[threadIdx.x*32+threadIdx.y] = T[indX * N + indY];
  __syncthreads();

  if(threadIdx.x > 1 && threadIdx.x < (blockDim.x-2) && threadIdx.y > 1 && threadIdx.y < (blockDim.y-2))
    T[indX * N + indY] = temp[(threadIdx.x)*32+(threadIdx.y)] + dtdx * (-temp[(threadIdx.x-2)*32+(threadIdx.y)]+16.f*temp[(threadIdx.x-1)*32+(threadIdx.y)]-30.f*temp[(threadIdx.x)*32+(threadIdx.y)]+16.f*temp[(threadIdx.x+1)*32+(threadIdx.y)]-temp[(threadIdx.x+2)*32+(threadIdx.y)])
    + dtdy * (-temp[(threadIdx.x)*32+(threadIdx.y-2)]+16.f*temp[(threadIdx.x)*32+(threadIdx.y-1)]-30.f*temp[(threadIdx.x)*32+(threadIdx.y)]+16.f*temp[(threadIdx.x)*32+(threadIdx.y+1)]-temp[(threadIdx.x)*32+(threadIdx.y+2)]);
}

int main()
{

  const int nx = 2800;
  const int ny = 2800;
  const float lx = 1.f;
  const float ly = 1.f;
  float dx = lx / (float)nx;
  float dy = ly / (float)ny;
  const float lambda = 0.001f;
  float dt = 0.01f/lambda/(1.0f/(dx*dx)/12.f+1.0f/(dy*dy)/12.f);
  float dtdx = dt*lambda/dx/dx/12.f;
  float dtdy = dt*lambda/dy/dy/12.f;

  int nxt = nx + 4;
  int nyt = ny + 4;

  float* T = new float[nxt*nyt];
  //float* Told = new float[nxt*nyt];

  for (int i = 0; i < nxt; i++) {
    for (int j = 0; j < nyt; j++) {
      int ind = i*nyt + j;
      T[ind] = 1.0f + exp(-32.f*(pow((float)(j-nyt/2)/(float)nyt,2)+pow((float)(i-nxt/2)/(float)nxt,2)));
      //Told[ind] = T[ind];
    }
  }

  float* Tk;
  cudaMalloc((void**)&Tk,nxt*nyt*sizeof(float));
  cudaMemcpy(Tk,T,nxt*nyt*sizeof(float),cudaMemcpyHostToDevice);
  dim3 block(100,100,1);
  dim3 thread(32,32,1);

  double time = std::clock();

  for (int step=0; step<1000; step++) {

    solve<<<block, thread>>>(Tk, dtdx, dtdy, nyt);

  }
  //std::cout << 1000.0*(std::clock()-time)/(double) CLOCKS_PER_SEC << " ms" << std::endl;

  //time = std::clock();
  cudaMemcpy(T,Tk,nxt*nyt*sizeof(float),cudaMemcpyDeviceToHost);
  cudaFree(Tk);
  //std::cout << "SendMem " << 1000.0*(std::clock()-time)/(double) CLOCKS_PER_SEC << " ms" << std::endl;

  float sum = 0.0f;
  for (int i = 0; i < nxt*nyt; i++) {
      sum += T[i];
  }

  std::cout << "Sum: " << std::setprecision(10) << sum/(float)(nxt*nyt) << std::endl;

  return 0;
}

Thank you all!

I normally consider about 5-6 decimal mantissa digits to be at about the limit of a 32-bit float quantity. Usually people don’t look for exactly identical results in these situations, and NVIDIA has published a whitepaper explaining why not:

http://docs.nvidia.com/cuda/floating-point/index.html
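
As a side note, the comparison metric itself is somewhat fragile: summing roughly 7.8 million floats into a single float accumulator loses a few digits all by itself. If you accumulate the host-side check in double, the reference value becomes more trustworthy. A minimal sketch of a drop-in replacement for the final reduction (same variables as in your listings):

  // accumulate in double so the check itself does not dominate the rounding error
  double sum = 0.0;
  for (int i = 0; i < nxt*nyt; i++) {
      sum += (double)T[i];
  }
  std::cout << "Sum: " << std::setprecision(10) << sum/(double)(nxt*nyt) << std::endl;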

Interestingly, I see reports like this with some regularity, and in a number of cases it has been found that the GPU result is actually closer to the numerically perfect result. In this case, however, I would be suspicious of your use of __syncthreads(). Your calculation is essentially a Jacobi iteration, which requires the entire update for time step x to be complete before the update for step x+1 begins. __syncthreads() is a block-wide barrier, not a grid-wide barrier, so some blocks can race ahead of others. Your second implementation (where each kernel launch performs a single time-step update) guarantees a grid-wide barrier (the kernel launch enforces that), which may explain why the second CUDA implementation matches your "ground truth". I would say your first implementation is defective. Depending on how far the solution relaxes between the first and last time step, your first method could, in some situations, deviate widely from the expected result, even though it does not appear to do so here.

On your machine, the GPU is running the CUDA kernel and also driving the display (i.e., the UI). The GPU cannot do both simultaneously, so while the CUDA kernel is running, the display is "frozen". When UI responsiveness is frozen for a long period of time (say, 30 seconds), it's not uncommon for Unix-based OSes to crash in a variety of ways.

Your different implementations result in kernels that take longer or shorter to run, and so the display is frozen for longer or shorter periods. The best solution for CUDA work is to use a GPU that is not hosting a display.

I wouldn’t spend a lot of time worrying about kernel instruction limits. That is not the issue here.
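
To address the side question about inspecting a kernel: you are nowhere near any instruction-count limit here, but if you are curious, the toolkit can show you what was generated (exact output varies by CUDA version):

/usr/local/cuda/bin/nvcc --gpu-architecture=sm_30 -O2 --ptxas-options=-v main.cu
(prints register and shared-memory usage per kernel at compile time)

/usr/local/cuda/bin/cuobjdump -sass a.out
(disassembles the SASS machine code; the number of lines gives a rough instruction count)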

Taking the sequential CPU result as the "true" value is not justified, by the way. By applying numerical analysis or by redoing the computation at higher precision we could get a better idea of the "true" value, and from there characterize the numerical error of either computation (the serial CPU code or the parallel GPU code). Until we do that, any assumption about which code delivers more accurate results is purely speculative.

I grant you that for this particular code the likelihood that the CPU version is somewhat more accurate is reasonably high. I base this assessment on its relatively heavy use of standard math functions. Some CPU libraries compute single-precision math functions via wrappers around the double-precision functions, which makes the single-precision results extremely accurate. On the other hand, if you compile the CPU code with a highly optimizing compiler, it may substitute faster, less accurate single-precision math functions to boost performance.

It would be interesting to compile the CPU code with different compilers (e.g., gcc, Intel icc, Microsoft Visual C++) and different optimization settings and compare the results. You are likely to find some interesting numerical discrepancies. For example, here are results from the Intel compiler version 13 on 64-bit Windows: first with default optimization, then with maximum optimization, then with maximum optimization but unsafe floating-point transformations disallowed:

icl testcase.cpp
Sum: 1.095370889

icl /O3 /QxHOST testcase.cpp
Sum: 1.096786499

icl /O3 /QxHOST /fp:strict testcase.cpp
Sum: 1.087123036

You could probably get an even wider range of results by either allowing or disallowing the use of FMAs in the generated code.
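
For reference, a rough sketch of the relevant switches (the exact spelling varies by compiler and version, so treat these as pointers to the documentation rather than verified command lines):

/usr/local/cuda/bin/nvcc --gpu-architecture=sm_30 -O2 --fmad=false main_2.cu
(disables FMA contraction in the device code)

g++ -O2 -ffp-contract=off main.cpp
(disables floating-point contraction with gcc/clang; the Intel compilers have similar switches, -no-fma on Linux and /Qfma- on Windows, if I remember correctly)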

By the way, your main_2.cu takes about 8 seconds to run on a Tesla K20X according to my testing. So a 44 s run time is quite long and may be evidence of the relative slowness of your GT 650M, or possibly of other factors, such as the display interaction with CUDA processing.

When I ran it on a Tesla P100 it took 2.7 s (lol), and about the same on a GTX Titan X Pascal.

Of course, this still raises "fairness" questions, since your CPU implementation is far from optimal as well. But the Intel compiler may be doing a reasonably good job with things, and the usual multi-threaded approach might only yield a ~2x speedup, depending on the CPU memory bandwidth of that Mac.
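
One more timing note: kernel launches are asynchronous, so a host-side clock read right after the launch mostly measures launch overhead; the kernel actually finishes during the subsequent cudaMemcpy, which is probably why your main.cu appears to spend "only 0.1 s inside the kernel". A minimal sketch of timing the kernel loop of main_2.cu with CUDA events (same variables as in your listing):

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);

  for (int step = 0; step < 1000; step++) {
    solve<<<block, thread>>>(Tk, dtdx, dtdy, nyt);
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);              // wait until all launched kernels have finished
  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
  std::cout << "Kernel time: " << ms << " ms" << std::endl;
  cudaEventDestroy(start);
  cudaEventDestroy(stop);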

Thank you all. You answered all my questions!

So, is there a way to synchronize the blocks?
Something like __syncthreads() for threads, but across blocks?

If not, the best approach is to keep the time loop outside of the kernel, right?

Thanks again!

Currently (CUDA 8), the only grid-wide synchronization mechanism CUDA provides is the kernel launch itself.
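
To make that concrete, here is a minimal sketch of the usual pattern, mirroring the T/Told pair of your CPU code with two device buffers that are swapped after every launch. It assumes solve() is modified to read from one array and write to the other; the signature solve(Tnew, Told, dtdx, dtdy, N) is a placeholder for illustration, not your existing kernel:

  float *Tk_old, *Tk_new;
  cudaMalloc((void**)&Tk_old, nxt*nyt*sizeof(float));
  cudaMalloc((void**)&Tk_new, nxt*nyt*sizeof(float));
  // initialize both buffers so the never-written boundary cells are valid in either one
  cudaMemcpy(Tk_old, T, nxt*nyt*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(Tk_new, T, nxt*nyt*sizeof(float), cudaMemcpyHostToDevice);

  for (int step = 0; step < 1000; step++) {
    // read Tk_old, write Tk_new; the kernel launch boundary acts as the grid-wide barrier
    solve<<<block, thread>>>(Tk_new, Tk_old, dtdx, dtdy, nyt);
    float* tmp = Tk_old; Tk_old = Tk_new; Tk_new = tmp;   // ping-pong swap
  }

  // after the last swap, Tk_old points to the freshest result
  cudaMemcpy(T, Tk_old, nxt*nyt*sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(Tk_old);
  cudaFree(Tk_new);

This matches the update pattern of your CPU code, where each step reads only Told and writes only T.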