Simple CUDA example 4x slower on Xavier AGX GPU than CPU

Hi all. I am wondering if something is wrong (or misconfigured) with the GPU on this Jetson Xavier AGX. I am starting from the very simplest example on the NVIDIA “getting started with CUDA” page:

This demonstrates a program that adds one million pairs of floats, first on the CPU and then on the GPU. In this demo you are expected to see a huge performance increase on the GPU versus the CPU.
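For context, the GPU version from that tutorial looks roughly like the sketch below (reconstructed from memory of the tutorial, so treat the details as approximate):

```cuda
#include <iostream>
#include <math.h>

// GPU kernel: add the elements of two arrays, grid-stride loop
__global__ void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1 << 20;  // ~1M elements

  // Unified memory, accessible from both CPU and GPU
  float *x, *y;
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for the GPU to finish before accessing results on the host
  cudaDeviceSynchronize();

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  cudaFree(x);
  cudaFree(y);
  return 0;
}
```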

I followed the instructions on that page verbatim. What I see is that the CUDA version runs 4x slower than the CPU version on the exact same Jetson Xavier AGX:

timmyd@eris:~/cuda-intro$ g++ add.cpp -o add
timmyd@eris:~/cuda-intro$ nvcc add.cu -o add_cuda
timmyd@eris:~/cuda-intro$ time ./add
Max error: 0

real 0m0.032s
user 0m0.024s
sys 0m0.004s
timmyd@eris:~/cuda-intro$ time ./add_cuda
Max error: 0

real 0m0.139s
user 0m0.036s
sys 0m0.068s

Something is clearly broken. Is there a hardware problem with the GPU? Is it somehow being throttled? Is the OS configured to disable it? How would I begin to find out?

I have done all kinds of Google searches for “GPU slower than CPU”, etc., but have not found anything useful.

What obvious dumb beginner mistake am I making?

After a little more investigation, here are some conclusions:

Most of the program’s time is spent allocating and initializing the CUDA memory arrays. The inner add() loop is actually 10x faster in the CUDA version than in the single-threaded CPU version. However, cudaMallocManaged() runs at half the speed of good old “new”, and at this problem size that allocation cost is what dominates.
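One way to confirm that split is to put the same kind of clock_gettime() timer around the allocation/initialization phase (a sketch only, assuming the tutorial’s managed allocations; the helper function is hypothetical):

```cuda
#include <time.h>
#include <stdio.h>

// Hypothetical helper: seconds elapsed between two timespecs
static double seconds_between(const struct timespec *a, const struct timespec *b)
{
  return b->tv_sec - a->tv_sec + 1.0e-9 * (b->tv_nsec - a->tv_nsec);
}

// ... at the top of main(), with N already defined:
struct timespec ts0, ts1;
clock_gettime(CLOCK_MONOTONIC, &ts0);

float *x, *y;
cudaMallocManaged(&x, N * sizeof(float));
cudaMallocManaged(&y, N * sizeof(float));
for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

clock_gettime(CLOCK_MONOTONIC, &ts1);
printf("alloc+init time = %.6f sec\n", seconds_between(&ts0, &ts1));
```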

If I change the test programs to use vectors of 1 billion floats instead of 1 million, and change the add() function to do something more computationally intensive, say…

y[i] = exp ( sin ( 3.0f * x[i] + 4.0f * y[i] + 5.0f ) );

then the results are a dramatic win in favor of CUDA:

timmyd@eris:~/cuda-intro$ time ./add
add() time = 18.845328 sec
real	0m20.940s
user	0m20.292s
sys	0m0.484s

timmyd@eris:~/cuda-intro$ time ./add_cuda
add() time = 1.996374 sec
real	0m9.716s
user	0m4.568s
sys	0m3.048s

Note: I removed the Max-error check at the end and replaced it with timing calls around add():

  struct timespec ts0 = { 0 }, ts1 = { 0 };
  clock_gettime ( CLOCK_MONOTONIC, &ts0 );
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);
  // Wait for GPU to finish before accessing on host; without this sync,
  // the timer would only measure the asynchronous kernel launch
  cudaDeviceSynchronize ();
  clock_gettime ( CLOCK_MONOTONIC, &ts1 );
  double t = ts1.tv_sec - ts0.tv_sec + 1.0e-9 * ( ts1.tv_nsec - ts0.tv_nsec );
  printf ( "add() time = %.6f sec\n", t );

I’m getting a much better feel for what are, and are not, the bottlenecks here. At any rate, I’m at least convinced that my Jetson isn’t broken :-)


Yes. Using time ./xxx mixes all of the execution time together (allocation, initialization, and kernel). It is better to time the kernel separately and compare that between the CPU and the GPU.

Also, this example is pretty simple. You should see a more obvious difference as the task complexity increases.
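One common way to time just the kernel is with CUDA events rather than a wall-clock timer (a sketch; it assumes the tutorial’s add() kernel, launch parameters, and arrays x and y of length N):

```cuda
// Time only the kernel using CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<numBlocks, blockSize>>>(N, x, y);
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // wait until the stop event has occurred

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time = %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Unlike clock_gettime() around the launch, this excludes host-side overhead and does not require a full device synchronization in the timed region.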