Help choosing a CUDA adapter for research


I am going to start a study (master’s degree research) involving distributed parallel graph algorithms on CUDA GPUs, running on Linux, probably with Apache Spark.

To start my studies I need at least 2 computers with CUDA GPUs. To finish my work I will have to use a proper environment, like Amazon AWS or Grid5000, with at least 8 nodes.

But now I just need a test environment.

I would appreciate any help.

  • Which CUDA adapters should I choose? They must be Linux compatible!

  • Can I use consumer graphics cards instead of Tesla/HPC GPUs? What are the pros and cons?

  • What is the minimum Compute Capability? 2.0?

The Teslas I can afford (used ones) are:

Tesla C2075, Tesla C2070, Tesla M2090. They are compute capability 2.0, if I am not wrong.

But how about the Pascal GTX 1060 6 GB? It has 1280 CUDA cores, 6 GB of memory, and compute capability 6.1, and it is less expensive. It looks like a better option to me, isn’t it?

I won’t use it for games. I will use it in ordinary desktops: i7, 8 to 16 GB RAM.

Thanks a lot guys, I really appreciate your help.

Sorry about my English skill!

My exposure to graph algorithms is very limited, but I do know that they typically require a lot of memory bandwidth, so this is something you would want to pay attention to as you are selecting a GPU. You can find the (theoretical) bandwidth stated in NVIDIA’s specifications, e.g. those for the GTX 1060.

It is not clear what your actual financial budget is per GPU. You may want to clarify that to allow for more specific recommendations.

I have worked quite a bit on this topic, as have many others, most notably the Gunrock project.

As njuffa mentioned, you want a GPU with high memory bandwidth, and the Pascal generation is not better in this regard when compared to Maxwell.

Your best choice would be to find a used GTX 980 Ti, which has 336 GB/s of memory bandwidth. I think EVGA still sells them new for as low as about $400, though right now they are more expensive.

Your other best option would be the Pascal GTX 1070.
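
For comparing cards, the theoretical bandwidth figure comes from the memory bus width and the effective per-pin data rate. Here is a minimal Python sketch of that arithmetic. The bus widths and data rates in the table are my assumptions, taken from commonly published specifications, so check them against NVIDIA’s spec pages; the function name is made up for illustration.

```python
def theoretical_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gbps


# (bus width in bits, effective data rate in Gbit/s per pin)
# ASSUMED values from published specs; verify before relying on them.
CARDS = {
    "GTX 980 Ti": (384, 7.0),
    "GTX 1070": (256, 8.0),
    "GTX 1060 6GB": (192, 8.0),
    "Tesla M2090": (384, 3.7),
}

for name, (bus_width, data_rate) in CARDS.items():
    bandwidth = theoretical_bandwidth_gb_s(bus_width, data_rate)
    print(f"{name}: {bandwidth:.1f} GB/s")
```

With these inputs the GTX 980 Ti comes out at the 336 GB/s mentioned above. Real kernels will see less than the theoretical peak.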

I checked a few websites, and new EVGA GTX 980 Ti cards do indeed still seem to be available, but are priced in the $700+ range. I see used / refurbished ones offered for as low as $399.

Thank you very much guys!

Actually, I intend to use Gunrock in my studies. I am evaluating other GPU libraries too. I will have to integrate them and run them with Spark.

I live in Brazil. It’s difficult to import this stuff: lots of taxes and fees, and delivery is VERY slow.

These are the prices in Brazil, for used items.

GTX 980 Ti, GTX 1070 and Tesla C2075: similar prices, between $460 and $500 each

Tesla M2090: around $280 each

GTX 1060: $360 each

Well, after reading your advice I believe that I should buy the GTX 1070! Thanks again!!

Another question!

I use two identical desktops: Intel DX79TO motherboard, 32 GB RAM, i7-3820, 4 TB SATA3 HDD, 500 GB SATA2 HDD (boot only).

Do I need an additional display adapter? Can I use the GTX 1070 as the display adapter while running my programs?

Do you guys know if this hardware is compatible with my needs?

Thanks a lot

Generally speaking, you can use the same GPU to drive the display and to run compute kernels. However, when you do that, the operating system will impose a time limit on the run time of kernels, as a GPU cannot serve the GUI and run a kernel at the same time, and the OS must ensure that the GUI doesn’t “freeze” for extended periods of time. A typical “watchdog” timer limit is two seconds. There are OS-specific ways of increasing the time limit, but the usability of the machine may be diminished, as the machine will be unresponsive whenever a CUDA kernel is running.

One way around this is to use two GPUs: a powerful, expensive one for use with CUDA and a cheap (~ $50) one to drive the display, keeping the powerful GPU disconnected from the display. Maybe CudaaduC can tell us whether that kind of setup is really necessary when working with graph algorithms; I don’t have sufficient hands-on experience.

You can run CUDA kernels using the GPU which is connected to your display without major issue. If you disable the watchdog timer then you can run even long kernels, but the display may be frozen (at least partially) during that run. Using this display GPU may result in slightly decreased performance when compared to the same application being run on a GPU with no video out.

As njuffa mentioned, the best option is to have a cheap GPU for display and a “compute” GPU in another slot for CUDA applications. You need a motherboard which can support 2 GPUs at PCIe 3.0 x16, with a CPU which has 40 lanes. Then you stick the cheaper GPU in one slot and connect the display to it, and put the more powerful GPU in another slot. CUDA applications will usually default to the better GPU, but you can also directly choose a GPU via cudaSetDevice().

For graph algorithms you need high GPU device memory bandwidth, and the fastest possible host-device and device-host memory bandwidth. For the configuration listed above the host-device and device-host copy rates will be in the 11-13 GB/s range, with 2 copy engines for concurrent memory transfers in opposite directions. But you need a motherboard and CPU which can fully support that specific hardware configuration.
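
As a rough sanity check on those numbers, the theoretical PCIe peak per direction can be worked out from the transfer rate, lane count and line encoding. This is only a sketch: the function names are hypothetical, the 11-13 GB/s achieved range is the one quoted above, and the 2 GB graph size is an arbitrary example.

```python
def pcie_peak_gb_s(gt_per_s, lanes, payload_bits, encoded_bits):
    """Theoretical one-direction PCIe bandwidth in GB/s (decimal gigabytes)."""
    return gt_per_s * lanes * payload_bits / encoded_bits / 8


def transfer_seconds(num_bytes, gb_per_s):
    """Time to copy num_bytes at a sustained rate of gb_per_s."""
    return num_bytes / (gb_per_s * 1e9)


# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding.
gen3_x16 = pcie_peak_gb_s(8.0, 16, 128, 130)
# PCIe 2.0: 5 GT/s per lane with 8b/10b encoding.
gen2_x16 = pcie_peak_gb_s(5.0, 16, 8, 10)

print(f"PCIe 3.0 x16 peak: {gen3_x16:.2f} GB/s")
print(f"PCIe 2.0 x16 peak: {gen2_x16:.2f} GB/s")

graph_bytes = 2 * 10**9  # arbitrary example: a 2 GB graph
for achieved in (11.0, 13.0):  # achieved range quoted in this thread
    seconds = transfer_seconds(graph_bytes, achieved)
    print(f"2 GB host-to-device at {achieved:.0f} GB/s: {seconds:.3f} s")
```

The gap between the theoretical peak and the achieved range is protocol overhead, so a slot that only negotiates PCIe 2.0 would roughly halve the copy rate.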

Thank you njuffa, thank you CudaaduC

Well, unfortunately, I can not afford to buy new motherboards/processors at the moment.

CudaaduC, I don’t know if I fully understood what you said. I don’t know much about hardware. But before asking about hardware I will explain my needs in more depth.

I am going to use these 2 computers AT HOME, to study CUDA programming, get used to Apache Spark, test several algorithms and GPU graph libraries (Gunrock, Medusa, …), test the maximum graph size I can handle on my system, …

To do this I believe that I need 2 dedicated computers/GPUs. I have a third computer, a very good laptop. I will use it to program in C++ and Java, and eventually as a NameNode orchestrating the dedicated nodes.

Before publishing any results I will have to run the experiments in a real (and adequate) environment, on many nodes. I have access to such an environment at the moment. Or I could rent some AWS servers for a few hours, though.

The problem is: it’s VERY hard to schedule/hold several nodes, because there are too many researchers sharing the same resources. I am a Brazilian master’s student with a guest account, very low priority :-).

That’s why I need my own minimal environment. I need to accelerate my studies, otherwise my work would be called ‘Mastering waiting techniques for scheduling shared GPU infrastructures’, hehehehe.

Finally, let’s talk about HARDWARE.

I intend to buy the GTX 1070. But I don’t want to waste money. If my current motherboard/processor is too obsolete, maybe I should buy a cheaper GPU. Shouldn’t I?

The kind of graph processing I will handle involves mainly graph traversal searches (BFS) and graph diameter calculation, with distributed parallelization. Million-vertex graphs.

That’s why I believe that more RAM and GPU memory are a must.
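
To get a feel for how much GPU memory a million-vertex graph needs, here is a back-of-the-envelope sketch. It assumes a plain CSR layout with 32-bit indices plus one 32-bit label per vertex. That layout, the function name and the example graph sizes are my assumptions, and a real library such as Gunrock will likely need extra working memory on top (frontiers, temporary buffers).

```python
def csr_bytes(num_vertices, num_edges, index_bytes=4, label_bytes=4):
    """Estimated bytes for a CSR graph plus one label (e.g. BFS depth) per vertex."""
    row_offsets = (num_vertices + 1) * index_bytes  # one offset per vertex, plus one
    col_indices = num_edges * index_bytes           # one neighbor id per stored edge
    labels = num_vertices * label_bytes             # per-vertex result array
    return row_offsets + col_indices + labels


# Memory sizes mentioned in this thread for the two GTX candidates.
GPU_MEMORY_GB = {"GTX 1060": 6, "GTX 1070": 8}

# Hypothetical graph sizes: (vertices, stored edges).
EXAMPLES = [
    (1_000_000, 10_000_000),
    (10_000_000, 100_000_000),
    (100_000_000, 1_000_000_000),
]

for vertices, edges in EXAMPLES:
    gigabytes = csr_bytes(vertices, edges) / 1e9
    fits = [name for name, mem in GPU_MEMORY_GB.items() if gigabytes < mem]
    print(f"{vertices:>11,} vertices, {edges:>13,} edges: "
          f"{gigabytes:.2f} GB, raw data fits on: {fits}")
```

For an undirected graph stored in both directions, count each edge twice when filling in the edge number.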

But the Tesla M2090 is much cheaper than the 1070 and has good double-precision (64-bit) floating-point performance.

I really don’t know if double-precision is a priority. Gunrock implements diameter, BFS, SSSP, but does it need double precision to do these?
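
On the double-precision question, it may help to look at what BFS actually computes. Below is a minimal CPU-side sketch of level-synchronous BFS over a CSR graph in Python. It is only an illustration, not how Gunrock implements it, and the function name and the tiny example graph are made up. The point is that depths and frontiers are all integers, so an unweighted BFS or diameter computation does no floating-point work at all; weighted SSSP is different, since it depends on the weight type you choose.

```python
def bfs_levels(row_offsets, col_indices, source):
    """Level-synchronous BFS on a CSR graph; returns integer depth per vertex (-1 = unreached)."""
    num_vertices = len(row_offsets) - 1
    depth = [-1] * num_vertices
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:
            # Neighbors of u are stored contiguously in col_indices.
            for v in col_indices[row_offsets[u]:row_offsets[u + 1]]:
                if depth[v] == -1:
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return depth


# Hypothetical example: the path graph 0-1-2-3, stored undirected.
row_offsets = [0, 1, 3, 5, 6]
col_indices = [1, 0, 2, 1, 3, 2]

depths = bfs_levels(row_offsets, col_indices, source=0)
print("depths from vertex 0:", depths)

# Diameter of an unweighted graph: the largest BFS depth over all sources.
diameter = max(max(bfs_levels(row_offsets, col_indices, s))
               for s in range(len(row_offsets) - 1))
print("diameter:", diameter)
```

Running BFS from every vertex is only practical for tiny graphs like this one; it is here just to show that the whole computation stays in integers.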

Well, my question is:

Should I save some money and use the M2090 with double precision?

Should I buy a less expensive GTX like the 980 or 1060?

Should I buy the 1070 because it has 8 GB and is much faster?

Well, THANKS A LOT guys.

I REALLY REALLY appreciate your tips and advice.

Below are the hardware specs.

i7-3820 specs

Processor Number: i7-3820
# of Cores: 4
# of Threads: 8
Processor Base Frequency: 3.60 GHz
Max Turbo Frequency: 3.80 GHz
Bus Speed: 5 GT/s DMI2
Max Memory Size (dependent on memory type): 64.23 GB
Memory Types: DDR3 1066/1333/1600
Max # of Memory Channels: 4
Max Memory Bandwidth: 51.2 GB/s

DX79TO motherboard specs

Memory Types: Quad DDR3 2400
Max # of Memory Channels: 4
Max # of DIMMs: 8

Graphics Specifications
Discrete Graphics: 2 PCIe 3.0 x16

Expansion Options
PCI Support: 1
PCIe x16 Gen 3: 2
PCIe x1 Gen 2.x: 3

I/O Specifications
# of USB Ports: 16
USB 2.0 Configuration (External + Internal): 6,8
USB 3.0 Configuration (External + Internal): 2,0
Total # of SATA Ports: 4
Max # of SATA 6.0 Gb/s Ports: 2
# of eSATA Ports: 0
RAID Configuration: 0,1,5,10
Audio (back channel + front channel): 6,2
Integrated LAN: 10/100/1000
Firewire: 2

Two things to seriously consider about the Tesla:
First, that GPU is designed for server-type blade installation and requires an external fan for cooling.
Second, the Fermi generation is quite old and I believe they just dropped support for it in CUDA 8.0. Either that or they announced that they’d drop it in the next version (don’t remember). Either way…

Ok, just confirmed it. Fermi is deprecated and will be unsupported after 8.0. Those Tesla parts are all Fermi architecture, so yeah.