Hardware for CUDA development

Hello,

My name is Roberto Dohnert. Before I start I want to say this is NOT spam, nor is it advertising. I work with PC/OpenSystems LLC; we create a wide range of hardware for different uses, powered by our Linux distribution, Black Lab Linux. We were wondering: what do you look for when you acquire hardware for developing with CUDA? What specs matter to you? Is there a certain price point, etc.?

Roberto J. Dohnert
PC/OpenSystems LLC

  1. Number of CUDA cores; the higher the better.
  2. Number of streaming multiprocessors (SMs, or “SMX” on Kepler).
  3. Memory frequency.
  4. Memory bandwidth.
  5. Power draw (TDP). Could be expressed as CUDA cores per watt, or watts per CUDA core.
  6. How many slots it takes. Preferably just one, to leave room for venting/airflow.
  7. Passive cooling versus active cooling. No fan means less noise and fewer dust issues.
  8. And lately, most importantly: 64-bit floating-point operations. Any card that does not support 64-bit floating point is basically worthless for general-purpose processing.

Concerning 8… I don’t think NVIDIA has any consumer graphics cards yet that support 64-bit floating point. There is always some catch/snag.

Only the workstation/high-end stuff seems to have 64-bit floating point… from hearsay ;)

Every NVIDIA GPU of compute capability 1.3 and later supports double-precision floating point, so anything from the GTX 260 onward.

The difference is that the GeForce cards (except the Titan) have their double-precision throughput limited to somewhere between 1/4 and 1/16 the throughput of the Tesla cards. That’s a big restriction, but very different from “no support”.

As for an answer to the question posed:

When I was assembling CUDA workstations for our research group, I was most interested in systems that had a single GPU but room to add newer cards as they came out: a decent-sized case plus a power supply that could run 2 or 3 high-end cards. The motherboard should also have PCI-Express switches to support the extra x16 slots at full bandwidth, rather than cutting all the slots to x8 when extra cards are installed. Beyond that, I considered an SSD essential, but that’s not really CUDA specific.

Most pre-built systems (Dell, etc) do not have room for additional cards, so we always built our own systems from parts.

In theory, and according to the specification, perhaps. In practice it’s a different matter.

There are compiler issues and development-environment issues.

Without proper software support, even a Commodore C64 might offer better 64-bit floating-point support :)

Do you have any idea what you are talking about? Double precision works fine on GeForce cards. I have used it for accumulators as part of a larger calculation. The throughput handicap means that you are not setting any double precision LINPACK speed records with your GeForce card, but if double precision is a small part of your calculation, you should use it.

64-bit floating point did not work for my kernel on a GT 520.

The exact same kernel worked fine with 32-bit floating point.

I have also seen at least one website stating that certain Maxwell variants do not support 64-bit floating point, so beware.

You need to compile for at least compute capability 1.3.
The GT 520 supports double-precision computation, as does every card released in the past 4-5 years.

Please do not give wrong information.
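For what it’s worth, the architecture flag is what controls this: building for the old default target (sm_10) silently demotes `double` to `float`, with only a compiler warning. A minimal sketch to check whether doubles actually survived compilation (assumes nvcc is on the path and a CUDA-capable GPU is present):

```cuda
// double_check.cu -- verify double precision is actually compiled in.
// Build for a DP-capable architecture (compute capability >= 1.3):
//   nvcc -arch=sm_13 double_check.cu -o double_check
// Building for sm_10 instead demotes 'double' to 'float' and nvcc
// warns "Double is not supported. Demoting to float".
#include <cstdio>

__global__ void dp_kernel(double *out)
{
    // 1.0 + 1e-10 is representable in double but rounds to 1.0 in float,
    // so the printed value reveals whether doubles were demoted.
    *out = 1.0 + 1e-10;
}

int main()
{
    double *d_out, h_out = 0.0;
    cudaMalloc(&d_out, sizeof(double));
    dp_kernel<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("%.12f\n", h_out);  // 1.000000000100 if DP is real, 1.0 if demoted
    cudaFree(d_out);
    return 0;
}
```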

The kernel was compiled for compute capability 2.0.

It works for 32-bit floating point, not for 64-bit floating point.

Perhaps the array of 64-bit floats was too big to fit on the GPU for some reason, though that seems weird with 1 GB of RAM on the GPU. It was just a picture of reasonable dimensions.

The other problem I can think of is that the parameters passed to the kernel were wrong, but I don’t think so.

Most likely it’s an issue with the CUDA compiler itself or the hardware.

Perhaps the GT 520 has issues with compute 2.0 kernels.

Have you considered the possibility that there is a bug in the code? In my experience that is a much more likely scenario than a compiler bug.

Not knowing the code, the standard recommendations apply: make sure the status returns of all API calls and kernel launches are checked, and use cuda-memcheck to find out-of-bounds memory accesses and race conditions. Make sure the host code works correctly by using valgrind or a similar tool.
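As a minimal sketch of the kind of status checking meant here (runtime API shown; a driver-API program would check `CUresult` values the same way):

```cuda
// check.cu -- minimal error-checking pattern for CUDA runtime API calls
// and kernel launches. Build with: nvcc check.cu -o check
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err__ = (call);                               \
        if (err__ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err__));                   \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

__global__ void scale(float *x) { x[threadIdx.x] *= 2.0f; }

int main()
{
    float *d_x;
    CUDA_CHECK(cudaMalloc(&d_x, 32 * sizeof(float)));
    scale<<<1, 32>>>(d_x);
    CUDA_CHECK(cudaGetLastError());      // catches launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize()); // catches errors during execution
    CUDA_CHECK(cudaFree(d_x));
    return 0;
}
```

Running the resulting binary under `cuda-memcheck ./check` then reports any out-of-bounds or misaligned accesses with the offending kernel and thread.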

Why would there be a bug? It works fine in 32-bit floating point. That makes no sense.

My kernel is very simple, while the compiler is super complex.

Let’s assume for a moment that compiler was buggy.

Two situations can now exist:

  1. The bug has since been fixed.
  2. It’s still in the compiler.

To re-create this problem, try the following:

Create a large 1D array of particles with all kinds of properties/fields. (It represents an image’s pixels, which can then all move individually.)

Make most fields 32-bit floating points.

Test that the code works.

Then simply flip a type and make it a 64-bit floating point (double).

If that works, we’ll talk again.
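A minimal sketch of that repro recipe; the original code was not posted, so the struct layout and field names here are made up for illustration:

```cuda
// repro.cu -- particle array where one field is flipped from float to
// double, per the steps above. Build with: nvcc -arch=sm_20 repro.cu
#include <cstdio>

struct Particle {
    float  x, y;    // position (one particle per image pixel)
    float  vx, vy;  // velocity
    double weight;  // <- the field flipped from float to double
};

__global__ void move(Particle *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        p[i].x += p[i].vx;
        p[i].y += p[i].vy;
        p[i].weight *= 0.999;  // double arithmetic on the flipped field
    }
}

int main()
{
    const int n = 1024 * 768;  // an image of reasonable dimensions
    Particle *d_p;
    cudaMalloc(&d_p, n * sizeof(Particle));
    move<<<(n + 255) / 256, 256>>>(d_p, n);
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));
    cudaFree(d_p);
    return 0;
}
```

If this builds with `-arch=sm_20` and runs cleanly on a GT 520 (a compute capability 2.1 part), the double arithmetic itself is working there.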

I can’t remember a single instance from the forums (other than yours) where there was even a remote doubt on the double precision arithmetic capabilities of any CUDA device.

I do use beta versions and I have seen plenty of people report bugs in the compiler.

Plus I have found bugs in other compilers as well. No compiler is without its bugs.

Today I have some time.

So I will download the latest CUDA 6 release.

I’ll try to install it and then I will do a compile.

If it’s still not working I will upload the kernel to my webdrive.

And then you guys and girlies can try for yourself.

As far as I am concerned CUDA is total crap now.

My application uses the driver API version of CUDA.

And the application/the cuda driver won’t even load the module.

It complains of some kind of floating point error.

I will upload the video so you can see the crap in action for yourself.

And I will make my app distributable so you guys at NVIDIA can test and debug it for yourselves.

Perhaps my processor is not supported anymore by the driver API… perhaps it’s using some new floating-point operations inside Intel processors.

And I will upload my app to my web folder in a moment… I’ll just change some folders and so forth.

This problem was also present in cuda 5.5.

Links will follow in a moment, and then NVIDIA will look like shit… and so will I, but I don’t care about that last part. Too bad that it came to this.

Video created. I am now looking into this problem further.

It seems the problem is in the just-in-time compiler inside the driver. My app uses the driver API, which is of course much better than the runtime crap, because the driver API allows multithreading and multiple languages.

I will now make one last video comparing cuda toolkit compiler versions.

To see if older does or does not work.

OK, I will spare you the second video. The 64-bit floating-point version did work, but only with the CUDA Toolkit 4.2 compiler and only for debug code, and even that ran buggy… sometimes it wouldn’t run at all, which is new behaviour.

All other compiler versions and settings failed.

It’s completely obvious that CUDA 5 and 6 have turned into MAJOR CRAP.

Get kernel ptx and source and app here to try for yourself:

http://www.skybuck.org/CUDA/Cuda5And6HasTurnedIntoCrap.rar

Made this into a separate topic…

https://devtalk.nvidia.com/default/topic/734827/cuda-programming-and-performance/cuda-has-turned-into-crap/