Simple CUDA example 4x slower on Xavier AGX GPU than CPU

Hi all. I am wondering if something is wrong (or misconfigured) with the GPU on this Jetson AGX. I am starting from the very simplest example on the NVIDIA “getting started with CUDA” page:

https://developer.nvidia.com/blog/even-easier-introduction-cuda/

The page demonstrates a program that adds one million pairs of floats, first on the CPU and then on the GPU, and you are expected to see a large speedup from the GPU version over the CPU version.
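
For reference, the add.cu from that page looks roughly like this (paraphrased from the tutorial, so details may differ slightly from the exact listing):

#include <iostream>
#include <math.h>

// GPU kernel: add the elements of two arrays using a grid-stride loop
__global__ void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;   // 1M elements
  float *x, *y;

  // Allocate unified memory, accessible from both host and device
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // Initialize x and y on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Launch enough 256-thread blocks to cover all N elements
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for the GPU to finish before accessing the result on the host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  cudaFree(x);
  cudaFree(y);
  return 0;
}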

I am following the instructions on that page verbatim. What I see is that the CUDA version runs about 4x slower than the CPU version on the exact same Jetson Xavier AGX:

timmyd@eris:~/cuda-intro$ g++ add.cpp -o add
timmyd@eris:~/cuda-intro$ nvcc add.cu -o add_cuda
timmyd@eris:~/cuda-intro$ time ./add
Max error: 0

real 0m0.032s
user 0m0.024s
sys 0m0.004s
timmyd@eris:~/cuda-intro$ time ./add_cuda
Max error: 0

real 0m0.139s
user 0m0.036s
sys 0m0.068s

Something is clearly broken. Is there a hardware problem with the GPU? Is it somehow being throttled? Is the OS configured to disable it? How would I begin to find out?

I have been doing all kinds of Google searches for “GPU slower than CPU”, etc., but am not finding anything useful.

What obvious dumb beginner mistake am I making?

After a little more investigation, here are some conclusions:

Most of the program’s time is spent allocating the CUDA memory arrays and initializing them. The inner add() loop is actually 10x faster in CUDA than the single-threaded CPU version. However, the cudaMallocManaged() allocation is half the speed of good old “new”, and that’s what dominates.
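
A minimal sketch of the kind of comparison I mean (the structure and sizes here are illustrative, not my exact test code):

#include <cstdio>
#include <time.h>
#include <cuda_runtime.h>

// Elapsed seconds between two CLOCK_MONOTONIC samples
static double elapsed ( const struct timespec &a, const struct timespec &b )
{
  return b.tv_sec - a.tv_sec + 1.0e-9 * ( b.tv_nsec - a.tv_nsec );
}

int main ( void )
{
  const int N = 1<<20;
  struct timespec t0, t1;

  // Time a plain host allocation
  clock_gettime ( CLOCK_MONOTONIC, &t0 );
  float *hx = new float[N];
  clock_gettime ( CLOCK_MONOTONIC, &t1 );
  printf ( "new[]             : %.6f sec\n", elapsed ( t0, t1 ) );

  // Time the unified-memory allocation the tutorial uses
  float *dx = nullptr;
  clock_gettime ( CLOCK_MONOTONIC, &t0 );
  cudaMallocManaged ( &dx, N * sizeof ( float ) );
  clock_gettime ( CLOCK_MONOTONIC, &t1 );
  printf ( "cudaMallocManaged : %.6f sec\n", elapsed ( t0, t1 ) );

  delete[] hx;
  cudaFree ( dx );
  return 0;
}

One caveat: the first CUDA call in a process also pays for driver/context initialization, so a one-shot measurement like this lumps that startup cost in with the allocation itself.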

If I change the test programs to use arrays of 1 billion floats instead of 1 million, and change the add() function to do something more computationally intensive, say…

y[i] = exp ( sin ( 3.0f * x[i] + 4.0f * y[i] + 5.0 ) );

then the results are a dramatic win in favor of CUDA:

timmyd@eris:~/cuda-intro$ time ./add
add() time = 18.845328 sec
real	0m20.940s
user	0m20.292s
sys	0m0.484s

timmyd@eris:~/cuda-intro$ time ./add_cuda
add() time = 1.996374 sec
real	0m9.716s
user	0m4.568s
sys	0m3.048s
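
For reference, the modified kernel looks roughly like this (assuming the grid-stride version from the tutorial; only the loop body changed):

// Heavier per-element work; N is now 1 billion instead of 1 million
__global__ void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = exp ( sin ( 3.0f * x[i] + 4.0f * y[i] + 5.0 ) );
}

(Note that the plain 5.0 literal promotes the expression to double precision, so exp and sin run in double here, exactly as written above.)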

Note: I removed the silly Max error check at the end and replaced it with timing calls around the add():

  struct timespec ts0 = { 0 }, ts1 = { 0 };
  clock_gettime ( CLOCK_MONOTONIC, &ts0 );
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
  clock_gettime ( CLOCK_MONOTONIC, &ts1 );
  double t = ts1.tv_sec - ts0.tv_sec + 1.0e-9 * ( ts1.tv_nsec - ts0.tv_nsec );
  printf ( "add() time = %.6f sec\n", t );

I’m getting a much better feel for what is, and is not, a bottleneck here. At any rate, I’m at least convinced that my Jetson isn’t broken :-)

Hi,

Yes. Using time ./xxx mixes all of the execution time together (allocation, initialization, and the kernel).
It is better to time the kernel separately and compare that between the CPU and the GPU.

Also, this example is pretty simple. You should see a more obvious difference once the task complexity increases.
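
For example, running the binary under a profiler (something like "nvprof ./add_cuda" on older JetPack releases, or Nsight Systems on newer ones) reports the kernel time separately from allocation and memory transfers.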

Thanks.