CUDA slower than CPU?

Hello, I’m new to CUDA and also new to C++. I’m having a weird issue with this sample code (from the NVIDIA website: An Even Easier Introduction to CUDA | NVIDIA Technical Blog): the GPU code is 3x slower than the CPU one. Here is the code.

CPU CODE :

#include <iostream>
#include <math.h>

using namespace std;


// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}



int main(void)
{
  int N = 1<<25; // ~33M elements

  std::cout << "Iterations : " << N << "\n";

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run the add function on ~33M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

GPU CODE :

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  // grid-stride loop: index threads across all blocks so each element is handled exactly once
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<25;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on ~33M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

GPU : RTX4800

compile commands :

GPU code :

 nvcc .\test2.cu -o test2 -arch=sm_90a

CPU code :

  nvcc .\test.cu -o test 

CPU RESULT :

PS C:\Users\llefe\CUDA> Measure-command {.\test.exe}


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 892
Ticks             : 8920678
TotalDays         : 1,03248587962963E-05
TotalHours        : 0,000247796611111111
TotalMinutes      : 0,0148677966666667
TotalSeconds      : 0,8920678
TotalMilliseconds : 892,0678

GPU RESULT :

PS C:\Users\llefe\CUDA> Measure-command {.\test2.exe}


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 2
Milliseconds      : 882
Ticks             : 28823809
TotalDays         : 3,33608900462963E-05
TotalHours        : 0,000800661361111111
TotalMinutes      : 0,0480396816666667
TotalSeconds      : 2,8823809
TotalMilliseconds : 2882,3809

Right, that code (vector add) is too trivial to be faster on the GPU. Its purpose is to teach general CUDA syntax and methodology. With a bit of searching on various forums, you can find examples of code that runs faster on the GPU than on the CPU.

Thank you Robert, but I insist: this sample code was taken from here (NVIDIA), and it explicitly compares performance between CPU and GPU (with greater performance for the latter), so I suspect this code is missing something.
Furthermore, changing add<<<numBlocks, blockSize>>>(N, x, y);
to add<<<1,1>>>(N, x, y); doesn’t affect the response time (though it should).

I would suggest that you measure the kernel time and the CPU function time directly in code. Otherwise, the full program execution time will include all the CUDA-related overhead such as context creation, memory deallocation, etc.
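
For example, here is a minimal sketch of timing just the kernel with CUDA events; it is meant to drop into the posted main() around the launch (the event calls are standard CUDA runtime API, everything else is assumed from the code above):

// sketch: measure only the kernel, not the whole application
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<numBlocks, blockSize>>>(N, x, y);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
std::cout << "Kernel time: " << ms << " ms\n";

cudaEventDestroy(start);
cudaEventDestroy(stop);

On the CPU side, wrapping the add() call with std::chrono::steady_clock::now() (with <chrono> included) gives the comparable number.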

where exactly is that performance comparison made in that blog?

From the blog: "That’s a big speedup (463ms down to 2.7ms), but not surprising since I went from 1 thread to 256 threads. The K80 is faster than my little Macbook Pro GPU (at 3.2ms). Let’s keep going to get even more performance."

Throughout the tutorial, several modifications are made to the program, and each of those modifications comes with a great improvement in performance.

On my side, I see no improvement when applying the modifications given in the tutorial.

@striker159 : I increased the N value to limit the effect of the initialization overhead and to maximize the compute time.

That is comparing two different versions of GPU code: one that uses only 1 CUDA thread, and one that uses many threads. That’s not a statement of CPU-only performance.

The 463 ms number corresponds to the first activity in the Profile It! section of the blog. That is profiling a GPU code (add_cuda). It’s not a statement of CPU performance. Likewise the 2.7ms number also corresponds to GPU code, later in the blog. You may wish to study the blog carefully.

I also acknowledge your statements about your performance expectations. As already indicated, for your case my expectation is that you will need to do kernel level profiling. It won’t be sufficient to do application time measurement.
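
If you prefer not to modify the code, Nsight Systems can also report the kernel time from the command line; assuming it is installed, something along these lines should work:

nsys profile --stats=true .\test2.exe

The per-kernel summary in that report separates the add kernel’s execution time from the CUDA API and memory-transfer overhead.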

FWIW, I’m also unfamiliar with this GPU ("RTX4800"). Maybe you meant an RTX 4080; in that case the correct architecture to use when compiling is -arch=sm_89, not -arch=sm_90a.
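e.g.

nvcc .\test2.cu -o test2 -arch=sm_89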

You don’t appear to be doing any proper CUDA error checking in your code. It’s possible that the kernel is not launching at all (none of your outputs show any indication of the max error printout, for example).
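
A minimal sketch of what that checking could look like around the launch in the posted main() (only standard runtime calls; adapt as needed):

add<<<numBlocks, blockSize>>>(N, x, y);

cudaError_t launchErr = cudaGetLastError();     // did the launch itself fail?
if (launchErr != cudaSuccess)
    std::cerr << "launch error: " << cudaGetErrorString(launchErr) << std::endl;

cudaError_t syncErr = cudaDeviceSynchronize();  // did the kernel run to completion?
if (syncErr != cudaSuccess)
    std::cerr << "kernel error: " << cudaGetErrorString(syncErr) << std::endl;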

OK, you got me, you’re right: there is no CPU/GPU comparison. I’ll re-run the benchmark by changing the <<<x,y>>> values when I get back home, but I’m pretty sure that changing <<<1,1>>> to <<<1,256>>> did not change anything. Let me clarify this later :)