CUDA -- a toy story of problems

I have a numerical modelling program that I wrote in Delphi some time ago. It performs a multidimensional minimisation where, at each point, the function to be minimised is calculated as a sum of several tens of convolutions. So all in all the program does a rather large number of calculations to get to the result; it takes about 4 hours to complete on a Core 2 Duo 2.13 GHz PC.

As an experiment I ported the program to C++ and then to CUDA. Porting numeric code from Delphi to C is fairly easy, and it is easy to integrate that C code with the CUDA programming environment using Visual Studio 2008.

Now the bad news:
a) the CUDA code runs SLOW SLOW SLOW !!! One loop of the calculation takes about 105 seconds. Perhaps this is because the data has to reside in global memory: there are about 80 dimensions, each of which requires calculating a convolution plus the functions that are being convolved. I don't know.

b) if one turns on the EmuRelease option, so that the code is in effect compiled to run only on the CPU, a single loop takes 40 seconds, about 2.5x faster than using CUDA. The clear signal here is that the CPU code runs MUCH MUCH faster than the GPU code. Even taking into account that the CPU has two cores working in parallel, the calculation on a single CPU core would still be faster than on the GPU.

c) the saddest conclusion is that my Delphi code calculates one loop in 14 seconds !!! So the single-threaded, classical Delphi code is WAY faster than any of the fancy C++ or CUDA gadgets. It would take a good C++ compiler (like Intel C++) and a really fast CUDA-enabled device to reach performance comparable to Delphi. (I was experimenting with a GeForce 8400 GS, but frankly I do not expect to see any significant difference if I changed to the new GTX … cards.)

d) the CUDA card is VERY limited as far as global memory is concerned. On my 256 MB card I can fit a model that uses 80 dimensions with 1024 FFT points each, but I need 80 dimensions with 8192 points to get numerically correct results. So the CUDA device fails me on this as well. I don't know what the graphics card is doing with the memory, and I found no way of checking the memory usage, but 256 MB is much too small to hold global vectors containing all in all 8192 x 80 x 7 x sizeof(float) of data. Too sad.
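
For what it's worth, the runtime API does expose free and total device memory; a minimal sketch, assuming a toolkit version that provides cudaMemGetInfo:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    // Ask the runtime how much device memory is currently free vs. total.
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    printf("free: %.1f MB / total: %.1f MB\n",
           freeBytes / 1048576.0, totalBytes / 1048576.0);
    return 0;
}

Running this before and after the model's allocations would show where the 256 MB goes.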

Conclusions:
a) if you have to do numeric calculations, use Delphi
b) Delphi is light-speed faster than anything from Visual Studio C or CUDA
c) the graphics card has a severe memory limit, which makes it unsuitable for larger modelling calculations.

A rather sad conclusion, as I had hoped to speed up my modelling software significantly, but this is the way it is.

Your 8400 GS has an effective peak speed of 28.8 GFLOP/s, while the Core 2 Duo reaches 34.1 GFLOP/s. So even with an ideal CUDA program you would not be able to reach the speed of your CPU.
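
(Those peak figures presumably come from 16 stream processors × 900 MHz × 2 flops/clock, one MAD, = 28.8 GFLOP/s for the 8400 GS, and 2 cores × 2.13 GHz × 8 single-precision flops/clock via SSE = 34.1 GFLOP/s for the Core 2 Duo. The 43 GFLOPS figure quoted later in the thread counts 3 flops/clock, i.e. the MAD plus the extra MUL.)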

To beat your CPU you’d need to upgrade to a decent GPU, and likely invest a lot into optimizing your CUDA code. The convolutions are probably bandwidth-limited, so make sure they operate in shared memory as much as possible.
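
For illustration, a minimal sketch of a tiled 1-D convolution that stages data through shared memory; the tile size, kernel radius, and all names here are illustrative, not taken from the original code:

#define TILE 256            // threads per block, illustrative
#define KRAD 16             // filter radius, illustrative

__constant__ float d_kernel[2 * KRAD + 1];   // small filter in constant memory

__global__ void conv1d_shared(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2 * KRAD];

    int gid = blockIdx.x * TILE + threadIdx.x;   // global element index
    int lid = threadIdx.x + KRAD;                // position inside the tile

    // Each block stages its TILE elements plus a KRAD halo on each side,
    // so the inner loop never touches global memory.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < KRAD) {
        int left  = gid - KRAD;
        int right = gid + TILE;
        tile[threadIdx.x] = (left >= 0) ? in[left]  : 0.0f;
        tile[lid + TILE]  = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (gid < n) {
        float acc = 0.0f;
        for (int k = -KRAD; k <= KRAD; ++k)
            acc += tile[lid + k] * d_kernel[k + KRAD];
        out[gid] = acc;
    }
}

Each input element is read from global memory once per block instead of once per filter tap, which is usually the difference between a bandwidth-bound and a compute-bound kernel.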

If the computation is dominated by the convolutions, using Fourier methods to speed them up might be an easier option. (EDIT: just realized you seem to be doing this already)
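
In case it helps anyway, a minimal sketch of FFT-based convolution with cuFFT; multiplySpectra is a hypothetical helper kernel (not part of cuFFT), the signals are assumed already zero-padded, and error checking is omitted:

#include <cufft.h>

// Pointwise multiply each signal's spectrum by the filter spectrum and
// scale by 1/n, so the inverse FFT yields the convolution directly.
__global__ void multiplySpectra(cufftComplex *data, const cufftComplex *filt,
                                int n, int batch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n * batch) {
        cufftComplex a = data[i], b = filt[i % n];
        cufftComplex r;
        r.x = (a.x * b.x - a.y * b.y) / n;
        r.y = (a.x * b.y + a.y * b.x) / n;
        data[i] = r;
    }
}

// Convolve 'batch' signals of length n in place. d_data and d_filterF are
// device pointers; d_filterF holds the filter's forward FFT.
void fftConvolve(cufftComplex *d_data, const cufftComplex *d_filterF,
                 int n, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    multiplySpectra<<<(n * batch + 255) / 256, 256>>>(d_data, d_filterF, n, batch);
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
    cufftDestroy(plan);
}

Batching all 80 dimensions into one plan keeps the GPU busy and amortizes the launch overhead.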

I’m not sure I understand your memory needs. 8192 × 80 × 7 × sizeof(float) would be a mere 17.5 MBytes. If on the other hand you really mean dimensions, 8192^80 × 7 × sizeof(float) would not fit into anything, neither a GPU nor any other computer at all.
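
(To spell it out: 8192 × 80 × 7 × 4 bytes = 18,350,080 bytes ≈ 17.5 MB, which would fit comfortably in a 256 MB card.)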

It’s hard to tell what went wrong from your description, but I think you have overgeneralized from your experience. My conclusion is that you get good results from the tool you understand (Delphi) and don’t get good results with unfamiliar tools. (I’m not trying to be insulting here, but it sounds like you are far more experienced with Delphi than with C or CUDA.)

As was pointed out, your card is at the very bottom end of cards supporting CUDA. Most people do CUDA work with cards that are at least 30x faster than the one you used, have 4x more memory, and have 15x the on-board memory bandwidth. Comparatively, you would be lucky if your 8400 GS came even close to the speed of your CPU.

However, if your CUDA code is much, much slower than the CPU, then I would guess one of the following things is true:

  • You have structured your CUDA code in a way which makes it incredibly inefficient, but could be fixed with more study. (Very common when you are starting out.)

  • You are limited by the PCI-Express bandwidth, in which case any speed benefit from the GPU is swamped by the communication overhead. (A quick way to check this is sketched after the list.)

  • There is no way to formulate your calculation such that a GPU can efficiently execute it. In this case, CUDA is no benefit.
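
To test the PCI-Express hypothesis, time a host-to-device copy with CUDA events; a rough sketch, with an arbitrary buffer size and error checking omitted:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;               // 64 MB test buffer
    float *h_buf = 0, *d_buf = 0;
    cudaMallocHost((void**)&h_buf, bytes);       // pinned host memory
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // bytes per millisecond divided by 1e6 gives GB/s.
    printf("host->device: %.2f ms, %.2f GB/s\n", ms, bytes / ms / 1e6);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}

If the data you move per loop divided by the measured bandwidth accounts for a large share of the 105 seconds, the bus rather than the GPU is the bottleneck.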

Oh, and one other thing you should be aware of: Emulation mode is only intended as a CUDA debugging tool. It is in fact very inefficient to run CUDA code on the CPU with emulation mode. If you find that EmuRelease is faster than running your code on the GPU, then you have probably structured the kernel very poorly. (Although I’ve never tried this with a GPU as slow as the 8400 GS. It’s possible a Core 2 Duo could outrun the 8400 GS, even in emulation mode…)

As you pointed out, the 8400 GS is the very slowest CUDA GPU NVIDIA ever made. (There was an 8300 GS, but that was not a retail release.)

It has 6.4 GB/sec bandwidth and 43 GFLOPS SP.

For comparison, the GTX 480 has 177 GB/sec bandwidth and 1345 GFLOPS.

So we all agree that the toy in this story is the graphics card he used. These low-end cards are good for development but not for much else.

Some hardware recommendations follow (none of these suggestions is the most expensive alternative of its kind, because I care a lot about the price/performance ratio):

- Go for an NVIDIA GT 240 with 1 GB of GDDR5 memory for inexpensive single-precision performance
(alternatively, the GTS 240 and GTS 250 are OK too in some cases, but their compute capability is only 1.1)

- Go for an NVIDIA GTX 260 for inexpensive double-precision performance

- Go for an NVIDIA GTX 470 for high-end performance
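
Whichever card you pick, its compute capability and memory are easy to verify from code; a minimal sketch using the runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Report name, compute capability, and total global memory.
        printf("device %d: %s, compute capability %d.%d, %.0f MB global memory\n",
               i, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / 1048576.0);
    }
    return 0;
}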

ICC is much faster than Delphi on any code, up to 10-20 times. This is a 1st of April joke and nothing more.

That would explain a lot.
