Best (Quality/Price) GPU for scientific computing with OpenCL

Hello,

I need to buy a GPU that will be used for scientific computing with OpenCL.

Requirements:
Fast double precision performance
Memory: <1GB

Considering: GeForce GTX 590/Quadro 5000/Quadro 6000
CUDA Cores: 1024/352/448
Gigaflops (double precision): unlisted/359.04/515.2
Memory: 3GB(1.5GB per GPU)/2.5GB/6GB
Memory Bandwidth (GB/s): 327.7/120/144

One concern I have about the GTX 590 is that it houses two GPUs, which slows down performance if memory needs to be transferred between the GPUs (and also complicates porting the program to OpenCL).

Any advice or other suggestions would be appreciated.

Thanks,
Kristen

The GTX 590 will most likely underperform the Quadros in double precision, because GeForce cards are artificially held back on DP operations so that people buy the more expensive Quadro and Tesla cards. If you are looking solely for OpenCL and high double precision throughput, AMD cards would suit you better. Radeons are not held back in any artificial way.

Consider: Radeon HD 6990
Stream Cores: 768 (3072 Stream Processors)
Teraflops (double precision): 1.27/1.37 (1st/2nd BIOS setting)
Memory: 4GB(2GB per GPU)
Memory Bandwidth (GB/s): 320

The second BIOS setting increases power use, so only use it if you have a power supply that can handle it.

To be honest, if it is not very urgent, I would wait until November-January, as that is most likely when the new generation of cards will be released by both NV and AMD, both made on the new 28nm fabrication process. It will be a similarly big leap as the one from 65nm to 40nm. Card performance should increase by a factor of about 1.5.

I know it is bold to ‘advertise’ red cards here, but I believe it is a really bad habit of NV to hold back GeForce throughput for DP. If that is most important for you, AMD cards definitely have the better quality/price ratio.

Thank you for your input - I hadn’t considered AMD. The Radeon HD 6990 looks like an interesting option.

However, the Radeon HD 6990 also houses 2 GPUs, and I read in AMD’s June 2011 OpenCL Programming Guide that multiple GPU devices are currently not supported. If I understand correctly, this means that OpenCL could only use half of the Radeon HD 6990.

I’ve added two AMD GPUs to compare:

Considering: Radeon HD 6990/Radeon HD 6970/GeForce GTX 590/Quadro 5000/Quadro 6000
Stream Processors/CUDA Cores: 3072/1536/1024/352/448
Gigaflops (double precision): 1270/683/unlisted/359.04/515.2
Memory: 4GB(2GB per GPU)/2GB/3GB(1.5GB per GPU)/2.5GB/6GB
Memory Bandwidth (GB/s): 320 (160 per GPU)/176/327.7(~160 per GPU)/120/144

At this point I’m leaning towards the Radeon HD 6970 since, putting the Radeon HD 6990 aside (due to AMD’s lack of multiGPU support), it has the highest gigaflops (double precision) and memory bandwidth (per GPU).
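To make the per-GPU comparison explicit, here is a rough sketch that recomputes it from the figures in the table above. It assumes a dual-GPU board's listed peak and bandwidth divide evenly between its two GPUs, which is only an approximation; the dictionary layout and function names are just for illustration.

```python
# Figures copied from the comparison table above; None marks "unlisted".
# "gpus" is the number of GPUs on the board, used to derive per-GPU values.
cards = {
    "Radeon HD 6990": {"dp_gflops": 1270.0, "bandwidth_gbs": 320.0, "gpus": 2},
    "Radeon HD 6970": {"dp_gflops": 683.0, "bandwidth_gbs": 176.0, "gpus": 1},
    "GeForce GTX 590": {"dp_gflops": None, "bandwidth_gbs": 327.7, "gpus": 2},
    "Quadro 5000": {"dp_gflops": 359.04, "bandwidth_gbs": 120.0, "gpus": 1},
    "Quadro 6000": {"dp_gflops": 515.2, "bandwidth_gbs": 144.0, "gpus": 1},
}


def per_gpu(card):
    """Return (DP GFLOPS, GB/s) available to a single GPU on the board.

    Assumes the board totals split evenly between GPUs.
    """
    dp = card["dp_gflops"]
    dp_per_gpu = None if dp is None else dp / card["gpus"]
    return dp_per_gpu, card["bandwidth_gbs"] / card["gpus"]


def best_single_gpu(cards):
    """Name of the card with the highest listed per-GPU DP throughput."""
    listed = {name: per_gpu(c)[0] for name, c in cards.items()
              if c["dp_gflops"] is not None}
    return max(listed, key=listed.get)


for name, card in cards.items():
    dp, bw = per_gpu(card)
    dp_text = "unlisted" if dp is None else f"{dp:.1f}"
    print(f"{name:16s} DP/GPU: {dp_text:>8s} GFLOPS  BW/GPU: {bw:6.1f} GB/s")

print("Highest per-GPU DP:", best_single_gpu(cards))
```

These are peak numbers from spec sheets, so they only rank the cards on paper; real kernels will land somewhere below them.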

Dual-GPU support is unofficial, so to speak. That means that if something doesn’t work with multiple GPUs, you can report it, but you have no grounds to complain about it. That said, I have been using three 5970s in one machine (six GPUs in total), and they work fine; I have not encountered any problems with them. Something must not work completely, though, since the support is not official. AMD employees are not allowed to discuss the details of bugs or why they arise. All that can be said is that dual-GPU cards use an internal CrossFire connection, which is most useful for games, and we all know that both SLI and CrossFire have to be disabled to do GPGPU. There is some internal optimization that is hard to work around in software. If you read the AMD forums, MANY people have been complaining about dual-GPU support being unofficial, and it was planned to be resolved this March-April (SDK 2.4). However, the solution was not ready, so the SDK coming this August-September (2.5) will very likely include the support.

AMD APP SDKs have been released every six months (for the past two years now), and the plan is to keep to that schedule. Drivers are released monthly, and all SDK features and driver bug fixes are based on user/programmer demand. If you choose to buy the 6990, I would advise reading through the AMD forums a little.

Anyhow, the choice is yours. Single-GPU cards are sure to work (from both vendors, of course), while dual-GPU cards on the AMD side might take a little forum reading to get working. Driver fixes come frequently; SDK fixes (and new features) come every six months.
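For a sense of what using both GPUs of a dual-GPU card involves on the host side, here is a minimal sketch. It assumes each GPU shows up as its own OpenCL device, so the program has to split the work itself and give every device its own buffers and command queue. The function name and the problem size are made up for illustration, and no OpenCL calls are made here; in a real program each (offset, count) pair could feed the global_work_offset and global_work_size arguments of clEnqueueNDRangeKernel on that device's queue.

```python
def split_range(total_items, num_devices):
    """Split [0, total_items) into contiguous (offset, count) chunks, one per device.

    Any remainder is spread over the first devices so chunk sizes differ by at most 1.
    """
    if num_devices < 1:
        raise ValueError("need at least one device")
    base, extra = divmod(total_items, num_devices)
    chunks = []
    offset = 0
    for i in range(num_devices):
        count = base + (1 if i < extra else 0)
        chunks.append((offset, count))
        offset += count
    return chunks


# Hypothetical example: 1,000,001 grid cells across the two GPUs of a dual-GPU card.
for device_index, (offset, count) in enumerate(split_range(1_000_001, 2)):
    print(f"device {device_index}: offset={offset}, count={count}")
```

An even split like this only makes sense when the devices are identical, as they are on a dual-GPU card. Any data that one chunk needs from the other (halo cells, reductions) still has to be moved between the GPUs through the host, which is the transfer cost mentioned in the first post.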

Thanks for all of the information. I think we’re going to go with the Radeon HD 6970 for now.

Thank you again for your help!
Kristen

Just a few things to note that were a bit misrepresented regarding AMD:

  1. AMD cores do not equal NVIDIA cores. They have different clocks and performance, so just comparing core count will not give you the information regarding the difference in performance.

  2. AMD indeed does not throttle double precision performance, but its DP performance is 1/5 of single precision unless something changed recently. NVIDIA gives you 1/8 performance on consumer cards and 1/2 performance on server cards. I do believe that this is a hardware difference, though, and not a driver issue like it is with the dual DMA engine (changing the firmware on the GeForces to make them into Quadros “fixes” the dual DMA engine issue, but not the double precision performance). If you look at the numbers, the 6970 gives you only slightly higher DP performance at a slightly lower memory speed than the Quadro 6000 (the Quadro is much more expensive, though, admittedly).

  3. NVIDIA cards generally have a higher global memory throughput. As most scientific computations I know are bandwidth limited and not compute limited, this may be more of an issue than double precision speeds.

  4. AMD cards are harder to program in order to get maximum performance. The cores on the AMD card are VLIW vector cores (similar to how SSE works), whereas the cores on NVIDIA are scalar.

  5. The caches on AMD are read-only, and AMD cards are much more problematic than NVIDIA when writing 8 bits per access. This can affect performance, depending on your problem.

  6. One final thing which doesn’t sound like it affects you, but may be an issue as well, AMD doesn’t really have any meaningful server grade offerings.
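To put some numbers on the bandwidth-limited argument in point 3, here is a back-of-the-envelope roofline sketch. It uses the peak DP and bandwidth figures quoted earlier in this thread and an assumed kernel (a double precision y = a*x + y update); the function name and the kernel choice are illustrative assumptions, not something measured on these cards.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Simple roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Peak DP GFLOPS and memory bandwidth (GB/s) as quoted earlier in this thread.
cards = {
    "Radeon HD 6970": (683.0, 176.0),
    "Quadro 6000": (515.2, 144.0),
}

# Assumed kernel: y[i] = a * x[i] + y[i] in double precision.
# 2 flops per element; 3 doubles of memory traffic (read x, read y, write y).
flops_per_byte = 2 / (3 * 8)

for name, (peak, bw) in cards.items():
    bound = attainable_gflops(peak, bw, flops_per_byte)
    balance = peak / bw  # flops per byte needed before compute becomes the limit
    print(f"{name}: bound {bound:.1f} GFLOPS, balance point {balance:.2f} flops/byte")
```

For a kernel this light on arithmetic, the bound sits far below either card's DP peak, so memory bandwidth decides the outcome and the DP ratio barely matters. Only kernels doing several flops per byte moved would get close to the listed GFLOPS.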

  1. One Stream Core is one 4-way VLIW unit on 6970 cards. For marketing reasons they usually advertise Stream Processors, which is naturally this number multiplied by 4 (in reality the true processing units). I am fully aware of that, and if you noticed, I listed the Stream Cores accordingly, because that is closer to how people used to CUDA think. It gives a better view of strength. I cannot be more diplomatic than that.

  2. Yes, something changed. The shift from 5 → 4-way VLIW caused DP performance to be 1/4 of SP, not 1/5 as before. The question was how big the DP performance is, and I pointed out the official DP performance. In single precision the 6970 is far more powerful than the Quadros.

  3. That is true, however I do not feel that DP would affect this issue significantly. I give you that point.

  4. AMD cards are not harder to program. Compilers have become very adept at vectorizing scalar code, exploiting the potential of the VLIW architecture and the data dependencies between operations. If buffers are set correctly (READ/WRITE_ONLY, READ_WRITE) and memory namespaces are used correctly, ALU-heavy codes (fluid dynamics and the like) utilize 95% of the ALU performance, even with scalar code. (I checked this personally.)

  5. The cache hierarchy is more advanced on Fermi cards, yet another point given. However, there have been write caches from Cypress on (Cayman has them too).

  6. That is quite sad, and another point given. Hopefully server vendors like Supermicro and the like will realize that AMD is a possible and viable alternative.
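As a quick check on the DP:SP ratios discussed in point 2, the sketch below works backwards from the DP figures in the comparison table to the single precision peak they imply. The ratios are the ones claimed in the posts above (1/4 for the 4-way VLIW 6970, 1/2 for NVIDIA's professional cards); the function name is only for illustration, and the results are implied peaks, not measurements.

```python
from fractions import Fraction


def implied_sp_gflops(dp_gflops, dp_to_sp_ratio):
    """Single precision peak implied by a DP peak and a DP:SP throughput ratio."""
    return dp_gflops / float(dp_to_sp_ratio)


# DP figures from the comparison table; ratios as claimed in the posts above.
examples = [
    ("Radeon HD 6970 (4-way VLIW)", 683.0, Fraction(1, 4)),
    ("Quadro 6000", 515.2, Fraction(1, 2)),
]

for name, dp, ratio in examples:
    sp = implied_sp_gflops(dp, ratio)
    print(f"{name}: DP {dp} GFLOPS at {ratio} of SP -> SP about {sp:.1f} GFLOPS")
```

Under those ratios the 6970 comes out at roughly 2.7 TFLOPS single precision against roughly 1.0 TFLOPS for the Quadro 6000, which is the gap the reply in point 2 is referring to.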