CUDA is slower than expected. Is something missing?

Hi there! I’m new to GPU parallel programming and have been studying CUDA for a few weeks. To test the language and compare it against sequential programming on the CPU, I wrote two programs: one that computes a sequence of B convolutions of size N (naive algorithm, without Fourier) in C++, and its equivalent in CUDA (parallelized across T threads).
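Roughly, the CUDA kernel has this shape (just a sketch of the approach so the question is self-contained, not the exact code from the repository; the filter length K and the array names are placeholders):

    // Rough shape: each of the T threads computes B / T whole convolutions
    // on its own, independent slice of the data (naive direct convolution).
    __global__ void convSequence(const float* in, const float* filt, float* out,
                                 int B, int N, int K, int T)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int perThread = B / T;                      // convolutions per thread
        for (int b = t * perThread; b < (t + 1) * perThread && b < B; ++b)
            for (int i = 0; i < N - K + 1; ++i) {
                float acc = 0.0f;
                for (int k = 0; k < K; ++k)
                    acc += in[b * N + i + k] * filt[k];
                out[b * (N - K + 1) + i] = acc;
            }
    }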

Firstly, here is the configuration of my notebook:

  • CPU: Intel Core i7-1355U, 1700 MHz
  • GPU: NVIDIA GeForce MX550 (Clock: 1065 MHz, 1024 Cores)

According to my calculations (roughly the ratio of the clock frequencies, 1700 MHz vs. 1065 MHz ≈ 1.6), running the code on a single CUDA thread should be about 1.6x slower than the same code in C++ on my notebook. But in my tests I saw much bigger differences! Defining the variables like this:

B (convolutions): 256
N (size): 1024

The codes return this for me:

The CUDA code is ~4.3x slower than the C++ code (~2.7x more than expected)!

When I double the number of threads in the CUDA code, the scaling gets worse: at each step the time is a bit more than half of the previous time, probably because of the overhead of creating more threads on the GPU, I guess. The times are below:

  • Two threads on the GPU (128 convolutions per thread): 1.68445 s
  • Four threads on the GPU (64 convolutions per thread): 1.06738 s
  • Eight threads on the GPU (32 convolutions per thread): 0.88099 s

I’m compiling and running the CUDA code from the CMD with these commands:

> nvcc kernel2.cu -o kernel2
> kernel2

My code is available here, on my GitHub profile. I want to know whether this is what should be expected, or whether there is a way to make better use of the GPU’s capacity (my notebook arrived yesterday, so I assume the GPU is fine). Is there something missing in the code? I’ll be grateful to anyone who can shed some light on this!

You need more parallelism than that to use a GPU well. For a low-end GPU like the GeForce MX550 you would want at least on the order of 1K threads; for a high-end GPU, at least on the order of 10K threads.
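For example, instead of giving each thread a whole convolution, you could give each thread one output element; with B = 256 and N = 1024 that already yields a couple of hundred thousand threads. A minimal sketch, assuming a direct "valid" convolution with a filter of length K (the names and data layout are illustrative, not taken from your code):

    // One thread per output element, across all B convolutions.
    __global__ void conv1dBatched(const float* in,   // B x N input signals
                                  const float* filt, // filter of length K
                                  float* out,        // B x (N - K + 1) outputs
                                  int B, int N, int K)
    {
        int outLen = N - K + 1;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= B * outLen) return;

        int b = idx / outLen;   // which convolution
        int i = idx % outLen;   // which output sample

        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += in[b * N + i + k] * filt[k];
        out[b * outLen + i] = acc;
    }

    // Launch with enough threads to cover every output element:
    // int total = B * (N - K + 1);
    // conv1dBatched<<<(total + 255) / 256, 256>>>(d_in, d_filt, d_out, B, N, K);

As a side effect, adjacent threads then also work on adjacent data, which the GPU memory system likes.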

For serial or mildly parallel code, using a CPU is likely to provide the best performance. Note that your CPU can boost up to 5 GHz, and probably does so for a brief benchmark that completes in under a second. It's not clear how much memory is being used, but from the numbers shown the code probably runs completely within the CPU caches (= low latency, high bandwidth).


Wow! I really didn't test my CUDA code with many threads (at most I tested with about 300). I'll try running my code with more data and more threads. This code is only meant to exercise the GPU; my real goal is a “manual” implementation of DNNs in CUDA, so I'll have to use a lot of data lol. And I'll test the C++ code on the CPU too. I'll come back soon with more conclusions!

(1) If you have a task that exposes massive parallelism, a reasonable expectation for CPU → GPU speedup at application level is 2x to 10x, with an average across use cases of around 5x.

(2) These estimates assume that (a) the comparison does not purposefully low-ball the software on either side, e.g. by using single-threaded non-SIMD code on the CPU side, and (b) does not purposefully low-ball the hardware on either side of the comparison.

In this case, I would argue that use of a GeForce MX550 comes close to low-balling the hardware on the GPU side. That GPU has barely 2x the memory bandwidth of the dual-channel DDR4-3200 system memory of the i7-1355U, and many use cases these days are limited by memory bandwidth rather than by computational throughput. Large FFTs in particular are known to be limited by memory throughput.

To better understand GPU performance, you would want to familiarize yourself with the CUDA profiler.
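For example (assuming the Nsight tools that ship with the CUDA toolkit are installed and on your PATH), a starting point could be:

> nsys profile kernel2.exe
> ncu kernel2.exe

The first records a timeline of kernel launches and memory transfers; the second reports per-kernel metrics such as achieved occupancy and memory throughput.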


Are you asking about a specific result? E.g., why you get a ~4.3x slowdown instead of the ~1.6x slowdown you expected compared to the CPU?

Then you have to tell us how you arrived at those numbers. Otherwise it is hard to pinpoint the reason why you observe different results.

Or are you asking why your code does not scale linearly with the number of threads?

Then the results from Nsight Compute (the profiler that njuffa recommended) would be helpful.

CUDA likes neighboring threads of a warp (a group of 32 threads) to access adjacent memory locations; this is called ‘coalesced’ access. Optimally, the accessed memory is aligned to a 32-byte or, even better, a 128-byte boundary. In your case, the two or four threads access independent memory locations that are far apart from each other, which slows down the memory transactions by a lot. So the performance of the CUDA kernel code would improve if (at least) 32 threads cooperated more closely on the task, as in the sketch below.
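A minimal sketch of the difference (illustrative only; the kernel names, arrays, and sizes are placeholders, not taken from your code):

    // Uncoalesced: each thread walks its own private chunk, so at any given
    // moment the 32 threads of a warp touch addresses that are chunkSize apart.
    __global__ void copyChunked(const float* in, float* out, int n, int chunkSize)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        for (int i = t * chunkSize; i < (t + 1) * chunkSize && i < n; ++i)
            out[i] = in[i];
    }

    // Coalesced: consecutive threads touch consecutive addresses, so a warp's
    // 32 accesses can be served by one or two wide memory transactions.
    __global__ void copyCoalesced(const float* in, float* out, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)   // grid-stride loop
            out[i] = in[i];
    }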

One can say that each thread can do something different and the program would still give the correct output.

But for good performance, each group of 32 threads has to cooperate closely, by coordinating their memory accesses and by keeping the program flow non-divergent.
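As a small, hypothetical illustration of divergence (not from the posted code): when threads within one warp take different branches, the warp executes both paths one after the other, with part of the warp masked off each time.

    __global__ void divergentExample(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Odd and even lanes of the same warp take different branches,
        // so the warp runs both paths serially with half the lanes masked.
        if (i % 2 == 0)
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
        // A branch that is uniform within a warp (e.g. one depending only on
        // blockIdx.x) would not cause this serialization.
    }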