CUDA slower than CPU?

Hello, I’m new to CUDA and also new to C++. I’m having a weird issue with this sample code (from the NVIDIA website: An Even Easier Introduction to CUDA | NVIDIA Technical Blog): the GPU code is 3x slower than the CPU one. Here is the code.

CPU CODE :

#include <iostream>
#include <math.h>

using namespace std;


// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}



int main(void)
{
  int N = 1<<25; // ~33M elements

  std::cout << "Iterations : " << N << "\n";

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run the add function on ~33M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

GPU CODE :

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  // grid-stride loop: index threads across all blocks so each element is handled exactly once
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<25;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on ~33M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

GPU : RTX4800

compile commands :

GPU code :

 nvcc .\test2.cu -o test2 -arch=sm_90a

CPU code :

  nvcc .\test.cu -o test 

CPU RESULT :

PS C:\Users\llefe\CUDA> Measure-command {.\test.exe}


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 892
Ticks             : 8920678
TotalDays         : 1,03248587962963E-05
TotalHours        : 0,000247796611111111
TotalMinutes      : 0,0148677966666667
TotalSeconds      : 0,8920678
TotalMilliseconds : 892,0678

GPU RESULT :

PS C:\Users\llefe\CUDA> Measure-command {.\test2.exe}


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 2
Milliseconds      : 882
Ticks             : 28823809
TotalDays         : 3,33608900462963E-05
TotalHours        : 0,000800661361111111
TotalMinutes      : 0,0480396816666667
TotalSeconds      : 2,8823809
TotalMilliseconds : 2882,3809

Right, that code (vector add) is too trivial to be faster on the GPU. Its purpose is to teach general CUDA syntax and methodology. With a bit of searching on various forums, you can find examples of code that runs faster on the GPU than on the CPU.

Thank you Robert, but I insist: this sample code was taken from here (NVIDIA), and it explicitly compares performance between CPU and GPU (with greater performance for the latter), so I suspect this code is missing something.
Furthermore, changing add<<<numBlocks, blockSize>>>(N, x, y);
to add<<<1,1>>>(N, x, y); doesn’t affect the response time (though it should).

I would suggest that you measure the kernel time and the CPU function time directly in code. Otherwise, the full program execution time will include all the CUDA-related overhead such as context creation, memory deallocation, etc.
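
For example, here is a minimal sketch of timing just the kernel with CUDA events; it is meant to drop into the posted main() around the launch (the event calls are standard CUDA runtime API, everything else is assumed from the code above):

// sketch: measure only the kernel, not the whole application
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<numBlocks, blockSize>>>(N, x, y);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
std::cout << "Kernel time: " << ms << " ms\n";

cudaEventDestroy(start);
cudaEventDestroy(stop);

On the CPU side, wrapping the add() call with std::chrono::steady_clock::now() (with <chrono> included) gives the comparable number.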

where exactly is that performance comparison made in that blog?

From the blog: "That’s a big speedup (463ms down to 2.7ms), but not surprising since I went from 1 thread to 256 threads. The K80 is faster than my little Macbook Pro GPU (at 3.2ms). Let’s keep going to get even more performance."

Throughout the tutorial, several modifications are made to the program, and each of those modifications comes with a great improvement in performance.

On my side, I see no improvement when applying the modifications given in the tutorial.

@striker159 : I increased the N value to limit the effect of the initialization overhead and to maximize the compute time.

That is comparing two different versions of GPU code: one that uses only 1 CUDA thread, and one that uses many threads. That’s not a statement of CPU-only performance.

The 463 ms number corresponds to the first activity in the Profile It! section of the blog. That is profiling a GPU code (add_cuda). It’s not a statement of CPU performance. Likewise the 2.7ms number also corresponds to GPU code, later in the blog. You may wish to study the blog carefully.

I also acknowledge your statements about your performance expectations. As already indicated, for your case my expectation is that you will need to do kernel level profiling. It won’t be sufficient to do application time measurement.
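
If you prefer not to modify the code, Nsight Systems can also report the kernel time from the command line; assuming it is installed, something along these lines should work:

nsys profile --stats=true .\test2.exe

The per-kernel summary in that report separates the add kernel’s execution time from the CUDA API and memory-transfer overhead.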

FWIW, I’m also unfamiliar with this GPU ("RTX4800"). Maybe you meant an RTX 4080; in that case the correct architecture to use when compiling is -arch=sm_89, not -arch=sm_90a.
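e.g.

nvcc .\test2.cu -o test2 -arch=sm_89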

You don’t appear to be doing any proper CUDA error checking in your code. It’s possible that the kernel is not launching at all (none of your outputs show any indication of the max error printout, for example).
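
A minimal sketch of what that checking could look like around the launch in the posted main() (only standard runtime calls; adapt as needed):

add<<<numBlocks, blockSize>>>(N, x, y);

cudaError_t launchErr = cudaGetLastError();     // did the launch itself fail?
if (launchErr != cudaSuccess)
    std::cerr << "launch error: " << cudaGetErrorString(launchErr) << std::endl;

cudaError_t syncErr = cudaDeviceSynchronize();  // did the kernel run to completion?
if (syncErr != cudaSuccess)
    std::cerr << "kernel error: " << cudaGetErrorString(syncErr) << std::endl;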

OK, you got me, you’re right: there is no CPU/GPU comparison. I’ll re-run the benchmark by changing the <<<x,y>>> values when I get back home, but I’m pretty sure that changing <<<1,1>>> to <<<1,256>>> did not change anything. Let me clarify this later :)