CUDA is slower than expected. Is something missing?

Hi there! I’m new to GPU parallel programming and have been studying CUDA for a few weeks. To test the language and compare it against sequential programming on the CPU, I wrote two programs: one that computes a sequence of B convolutions of size N (naive algorithm, without Fourier) in C++, and its equivalent in CUDA (parallelized across T threads).
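Roughly, the CUDA kernel has this shape (just a sketch of the approach so the question is self-contained, not the exact code from the repository; the filter length K and the array names are placeholders):

    // Rough shape: each of the T threads computes B / T whole convolutions
    // on its own, independent slice of the data (naive direct convolution).
    __global__ void convSequence(const float* in, const float* filt, float* out,
                                 int B, int N, int K, int T)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int perThread = B / T;                      // convolutions per thread
        for (int b = t * perThread; b < (t + 1) * perThread && b < B; ++b)
            for (int i = 0; i < N - K + 1; ++i) {
                float acc = 0.0f;
                for (int k = 0; k < K; ++k)
                    acc += in[b * N + i + k] * filt[k];
                out[b * (N - K + 1) + i] = acc;
            }
    }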

Firstly, here is the configuration of my notebook:

  • CPU: Intel Core i7-1355U, 1700 MHz
  • GPU: NVIDIA GeForce MX550 (Clock: 1065 MHz, 1024 Cores)

According to my calculations (roughly the ratio of the clock frequencies, 1700 MHz vs. 1065 MHz ≈ 1.6), running the code on a single CUDA thread should be about 1.6x slower than the same code in C++ on my notebook. But in my tests I saw much bigger differences! Defining the variables like this:

B (convolutions): 256
N (size): 1024

The codes return this for me:

The CUDA code is ~4.3x slower than the C++ code (~2.7x more than expected)!

When I double the number of threads in the CUDA code, the scaling gets worse: at each step the time is a bit more than half of the previous time, probably because of the overhead of creating more threads on the GPU, I guess. The times are below:

  • Two threads on the GPU (128 convolutions per thread): 1.68445 s
  • Four threads on the GPU (64 convolutions per thread): 1.06738 s
  • Eight threads on the GPU (32 convolutions per thread): 0.88099 s

I’m compiling and running the CUDA code from the CMD with these commands:

> nvcc kernel2.cu -o kernel2
> kernel2

My code is available here, on my GitHub profile. I want to know whether this is what should be expected, or whether there is a way to make better use of the GPU’s capacity (my notebook arrived yesterday, so I assume the GPU is fine). Is there something missing in the code? I’ll be grateful to anyone who can shed some light on this!

You need more parallelism than that to use a GPU well. For a low-end GPU like the GeForce MX550 you would want at least on the order of 1K threads; for a high-end GPU, at least on the order of 10K threads.
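For example, instead of giving each thread a whole convolution, you could give each thread one output element; with B = 256 and N = 1024 that already yields a couple of hundred thousand threads. A minimal sketch, assuming a direct "valid" convolution with a filter of length K (the names and data layout are illustrative, not taken from your code):

    // One thread per output element, across all B convolutions.
    __global__ void conv1dBatched(const float* in,   // B x N input signals
                                  const float* filt, // filter of length K
                                  float* out,        // B x (N - K + 1) outputs
                                  int B, int N, int K)
    {
        int outLen = N - K + 1;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= B * outLen) return;

        int b = idx / outLen;   // which convolution
        int i = idx % outLen;   // which output sample

        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += in[b * N + i + k] * filt[k];
        out[b * outLen + i] = acc;
    }

    // Launch with enough threads to cover every output element:
    // int total = B * (N - K + 1);
    // conv1dBatched<<<(total + 255) / 256, 256>>>(d_in, d_filt, d_out, B, N, K);

As a side effect, adjacent threads then also work on adjacent data, which the GPU memory system likes.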

For serial or mildly parallel code, using a CPU is likely to provide the best performance. Note that your CPU can boost up to 5 GHz, and probably does so for a brief benchmark that completes in under a second. It's not clear how much memory is being used, but from the numbers shown the code probably runs completely within the CPU caches (= low latency, high bandwidth).


Wow! I really didn't test my CUDA code with many threads (at most I tested with about 300). I'll try running my code with more data and more threads. This code is only meant to exercise the GPU; my real goal is a “manual” implementation of DNNs in CUDA, so I'll have to use a lot of data lol. And I'll test the C++ code on the CPU too. I'll come back soon with more conclusions!

(1) If you have a task that exposes massive parallelism, a reasonable expectation for CPU → GPU speedup at application level is 2x to 10x, with an average across use cases of around 5x.

(2) These estimates assume that (a) the comparison does not purposefully low-ball the software on either side, e.g. by using single-threaded non-SIMD code on the CPU side, and (b) does not purposefully low-ball the hardware on either side of the comparison.

In this case, I would argue that use of a GeForce MX550 comes close to low-balling the hardware on the GPU side. That GPU has barely 2x the memory bandwidth of the dual-channel DDR4-3200 system memory of the i7-1355U, and many use cases these days are limited by memory bandwidth rather than by computational throughput. Large FFTs in particular are known to be limited by memory throughput.

To better understand GPU performance, you would want to familiarize yourself with the CUDA profiler.
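For example (assuming the Nsight tools that ship with the CUDA toolkit are installed and on your PATH), a starting point could be:

> nsys profile kernel2.exe
> ncu kernel2.exe

The first records a timeline of kernel launches and memory transfers; the second reports per-kernel metrics such as achieved occupancy and memory throughput.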


Are you asking about a specific result? E.g., why you get a ~4.3x slowdown instead of the ~1.6x slowdown you expected compared to the CPU?

Then you have to tell us how you arrived at those numbers. Otherwise it is hard to pinpoint the reason why you observe different results.

Or are you asking why your code does not scale linearly with the number of threads?

Then the results from Nsight Compute (the profiler that njuffa recommended) would be helpful.

CUDA likes neighboring threads of a warp (a group of 32 threads) to access adjacent memory locations; this is called ‘coalesced’ access. Optimally, the accessed memory is aligned to a 32-byte or, even better, a 128-byte boundary. In your case, the two or four threads access independent memory locations that are far apart from each other, which slows down the memory transactions by a lot. So the performance of the CUDA kernel code would improve if (at least) 32 threads cooperated more closely on the task, as in the sketch below.
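A minimal sketch of the difference (illustrative only; the kernel names, arrays, and sizes are placeholders, not taken from your code):

    // Uncoalesced: each thread walks its own private chunk, so at any given
    // moment the 32 threads of a warp touch addresses that are chunkSize apart.
    __global__ void copyChunked(const float* in, float* out, int n, int chunkSize)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        for (int i = t * chunkSize; i < (t + 1) * chunkSize && i < n; ++i)
            out[i] = in[i];
    }

    // Coalesced: consecutive threads touch consecutive addresses, so a warp's
    // 32 accesses can be served by one or two wide memory transactions.
    __global__ void copyCoalesced(const float* in, float* out, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)   // grid-stride loop
            out[i] = in[i];
    }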

One can say that each thread can do something different and the program would still give the correct output.

But for good performance, each group of 32 threads has to cooperate closely, by coordinating their memory accesses and by keeping the program flow non-divergent.
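As a small, hypothetical illustration of divergence (not from the posted code): when threads within one warp take different branches, the warp executes both paths one after the other, with part of the warp masked off each time.

    __global__ void divergentExample(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Odd and even lanes of the same warp take different branches,
        // so the warp runs both paths serially with half the lanes masked.
        if (i % 2 == 0)
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
        // A branch that is uniform within a warp (e.g. one depending only on
        // blockIdx.x) would not cause this serialization.
    }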