Convincing skeptical bigwigs on the future of CUDA

Hello everyone,

I am a research scientist who recently made the effort to learn CUDA (had 2 free weekends :)), so I went on to write a prototype program to accelerate a performance-critical part of our data analysis software. With just a couple of hours of work I got amazing results: about 175x acceleration on an 8800GT vs. a 3.2GHz Penryn Xeon.

The results seem very promising, but there is a question that a lot of decision-makers are asking: what is the long-term outlook for CUDA? Will it be supported on devices produced two or three years from now, or will this very specific code become useless because NVIDIA stops supporting/developing it on future hardware and moves on completely to, let’s say, OpenCL? Some are hesitant to go with CUDA so as not to “marry” the code to this specific platform and hardware: even though the immediate benefits are huge, the code written today should remain usable for large-scale data analysis (I mean hundreds of terabytes of data) for several years to come.

Thus, I was wondering if there is any public information on NVIDIA’s long-range plans for CUDA support and development. Any other talking points to make the push for CUDA? I would really appreciate any thoughts on this, as I am really enthusiastic about this project and would like to keep developing it into production software.

Best Regards,
Demq

I can’t give you any specific answers, but it seems to me that a large part of rewriting an application for CUDA is identifying good areas of your existing code to parallelize, then working out some parallel algorithms to do the job. I think that nVidia will be supporting CUDA for a long time, and I’ve also seen some example projects where PTX (the intermediate language that is compiled into the .cubin binary for the GPU) has been ported to some other systems (CELL, for example). Even if CUDA weren’t supported, I would think that your CUDA code could be moved over pretty easily to something like OpenCL or DirectX Compute Shaders, since they all generally work about the same (plus or minus a few abstractions). Also, from what I’ve read about OpenCL (the little that is available so far), it doesn’t really offer the same level of control (read: performance tuning) as CUDA does.

Since nVidia is also producing hardware specifically designed to go along with CUDA (the Tesla products), I imagine they have a lot invested in CUDA because it is much more costly to produce that kind of hardware than software. Not to mention, customers that are paying several thousand dollars for a top-of-the-line Tesla system are going to be majorly peeved if it’s not supported for a good long while.

Tmurray or some moderator of this forum has said that CUDA will continue to be supported by NVIDIA in spite of OpenCL. He also said more features will be added to CUDA (features that probably will NOT be available in OpenCL, as they may not be generic enough to cover other architectures).

You can use the following argument:

Mainstream companies have adopted CUDA.
For example:

  1. Mathematica has announced a CUDA-capable version of their player
  2. Adobe CS4 is going CUDA enabled.
  3. The FAH (Folding @ Home) project has reaped the benefits of GPGPU on both ATI and CUDA hardware
  4. Apple backing the technology.
  5. Many finance companies as well (Hanweck and the like) are going with CUDA
  6. Similar case with Seismic companies…
    and so on…

These are just indications that CUDA is being taken seriously and has some applications in different verticals.

  1. CUDA already has 2 years of history of great support and feature expansions. I would expect no less over the next several years. Keep in mind that the Tesla hardware comes with a guarantee that replacement parts and support will be available for a certain number of years after purchase (I don’t know what that number is off the top of my head…)

  2. Regarding “marrying to a specific architecture”… Well, this limits you slightly, but opens up a new world of optimization strategies that a more general purpose language cannot expose. I would argue that any CUDA implementation of an algorithm can usually be faster than one written in another more general purpose language.

  3. Even in the unlikely event that CUDA dies in 3+ years and OpenCL is king, all is not lost if you start with CUDA now. Take the time to read the OpenCL specification. Much of it appears copied and pasted from the CUDA programming guide with just a few name changes. CUDA kernels can be converted to OpenCL ones with a minimum amount of effort (unless you use templates, but that is a topic for another post). So, any initial development effort in CUDA immediately translates to OpenCL for possible future use on “the other guys’” GPUs or any other hardware that OpenCL targets (e.g. CELL).
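
To make the "few name changes" point a bit more concrete, here is a toy sketch in Python that rewrites the most common CUDA kernel built-ins into their OpenCL C counterparts. The renames in the table are the usual correspondences (`__global__` to `__kernel`, `threadIdx.x` to `get_local_id(0)`, and so on). The helper name `cuda_to_opencl` and the sample kernel are made up for illustration, and plain text substitution like this is not a real translator: it ignores host code, templates, and the `__global` address-space qualifiers that OpenCL requires on pointer arguments.

```python
# Toy illustration only: a handful of CUDA-to-OpenCL renames applied as text
# substitution. A real port also needs host-side changes and address-space
# qualifiers (e.g. __global) on pointer arguments.

CUDA_TO_OPENCL = [
    ("__global__", "__kernel"),
    ("__shared__", "__local"),
    ("__syncthreads()", "barrier(CLK_LOCAL_MEM_FENCE)"),
    ("threadIdx.x", "get_local_id(0)"),
    ("blockIdx.x", "get_group_id(0)"),
    ("blockDim.x", "get_local_size(0)"),
    ("gridDim.x", "get_num_groups(0)"),
]


def cuda_to_opencl(kernel_source: str) -> str:
    """Apply each rename in order (hypothetical helper, not a real tool)."""
    for cuda_name, opencl_name in CUDA_TO_OPENCL:
        kernel_source = kernel_source.replace(cuda_name, opencl_name)
    return kernel_source


# Hypothetical sample kernel: scale every element of an array.
SAMPLE_CUDA_KERNEL = """
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
"""

if __name__ == "__main__":
    print(cuda_to_opencl(SAMPLE_CUDA_KERNEL))
```

The body of the kernel (the index computation and the bounds check) survives unchanged, which is the part that took the thinking.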

I will confirm, though, that Sarnath is correct in relaying that tmurray from NVIDIA has said CUDA will be supported parallel to OpenCL and with additional features not possible within the limitations of OpenCL.

  4. As you found, with a minimum amount of development effort 100x speedups are possible. Many have had those 100x speedups for 2 years already and have reaped the benefits (e.g. I’m a grad student publishing a research paper now that would not have been possible without GPU computing to perform the massive number of simulations needed). Anyone who doesn’t jump on and start using CUDA now will be left in the dust…

I found that the hardest thing about CUDA was not vendor-specific: learning to see the potential for data parallelism in my existing code and isolating that piece so I could offload it to a GPU. Compilers are unlikely to become very good at parallelizing code automatically for a while, so this programmer assistance will continue to be necessary, regardless of the acceleration technology. Today I’m offloading that piece with CUDA, tomorrow it could be OpenCL, in 3 years maybe it will be something special for AVX, Larrabee, or AMD Fusion. Once you design a program with data parallelism in mind, you can switch libraries/languages/platforms with a lot less trouble.
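
One way to picture the "isolate the data-parallel piece" idea is a per-element function with no shared state, plus a thin dispatch layer that decides how to run it. The sketch below is plain Python using only the standard library. `calibrate_hit`, `run_serial`, and `run_pooled` are hypothetical names, and the thread pool merely stands in for whatever accelerator backend (CUDA, OpenCL, ...) would do the real work.

```python
from concurrent.futures import ThreadPoolExecutor


def calibrate_hit(raw: float) -> float:
    """Per-element 'kernel': depends only on its own input (hypothetical)."""
    return 2.0 * raw + 1.0


def run_serial(kernel, data):
    """Reference backend: a plain loop on the CPU."""
    return [kernel(x) for x in data]


def run_pooled(kernel, data, workers: int = 4):
    """Stand-in for an accelerator backend: same kernel, different executor."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, data))  # map preserves input order


if __name__ == "__main__":
    samples = [0.0, 1.5, 3.0, 4.5]
    # The kernel is independent per element, so both backends must agree.
    assert run_serial(calibrate_hit, samples) == run_pooled(calibrate_hit, samples)
    print(run_pooled(calibrate_hit, samples))
```

Swapping the backend touches only the dispatch function; the kernel and its callers stay as they are.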

You got 175x after two weekends’ work, and they’re concerned about long-term viability? :blink:

Of course long-term viability is an issue, but as other posts state above, that sort of acceleration transforms what you can do, and designing the parallel algorithm is the hard part, not implementing it. Beyond presenting the thorough arguments given above, you could point out that video cards aren’t fast because HPC users want them that way. Video cards are fast because teenaged boys like shooting each other - and the world contains many more teenaged boys than HPC users. You might also ask how well their existing codes work on a Death Station 9000 - that gives a useful measure of how important portability really is. You will, of course, want to do so in a tactful manner…

I am often grateful to the gaming community for funding the R&D which gives me access to 1+ TFLOPS for only $500. :)

Although, that does highlight another reason to trust in the longer-term stability of CUDA. NVIDIA is financially invested in it to help sustain the market for their cards. Demand for better graphics could easily slow down in the next few years, but CUDA expands the utility of these cards to physics simulation in games, video compression, etc. Moreover, it provides another reason to have two cards, much like SLI did. “Supercomputing for the masses” has the potential to provide much more market stability than the current multi-kilobuck niche HPC coprocessor flavor of the month.

The real danger with CUDA is the market being dominated by customers whose interests and requirements are not aligned with yours. That pressure could push CUDA development in a direction which does not address your needs in the future. A good example of this is the argument over how much NVIDIA should focus on double precision on GPUs. For a pure HPC market, one could argue that every streaming processor should be double precision. However, balancing 3d graphics with HPC has resulted in the current tradeoff of 8:1 single to double precision units. It is hard to predict how this ratio will change as NVIDIA moves beyond the 55 nm process and has more transistors to play with.

That said, for a 175x performance improvement, I don’t care if all my GPUs turn into pumpkins in 2011. The amount of money I save between now and then will more than make up for it. :)

From my perspective, I would emphasize that CUDA requires an extreme shift in programming paradigm more than some of the other posters.
For example, my finite element code already runs on massively parallel computers, but almost 0% of that parallelization is relevant to utilizing CUDA.
Going to CUDA is a total rewrite of the computational kernels (maybe 10% of the total code) and might require emphasizing different algorithms than would be used otherwise.
To me this is the biggest issue; without incredible gains the rewrite is questionable. Also, the rewrite requires some awkward posturing to get optimal performance…

It will be very interesting to see what the future holds for this technology…

Getting 175x acceleration out of only 112 processors, running at half the speed? … is pretty amazing :-D
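
For what it is worth, the back-of-envelope arithmetic behind that remark can be written out. This is a rough sketch under a loud assumption: each of the 112 processors does the same work per clock as the one CPU core used for the baseline, and "half the speed" means a clock ratio of 0.5. The function name `naive_speedup` is made up for illustration.

```python
def naive_speedup(gpu_processors: int, clock_ratio: float) -> float:
    """Naive estimate: every processor does one CPU core's work per clock."""
    return gpu_processors * clock_ratio


if __name__ == "__main__":
    expected = naive_speedup(gpu_processors=112, clock_ratio=0.5)
    observed = 175.0  # figure reported in the original post
    print(f"naive estimate: {expected:.0f}x")
    print(f"observed / naive: {observed / expected:.2f}")
```

A ratio above 1 hints that something other than raw clock rate is at play, for example a CPU baseline that is single-threaded and not vectorized, or a workload that benefits from the GPU's memory bandwidth. The thread does not say which.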

This is a good point, though it depends very strongly on the types of problems you work on. My usage has gravitated towards spotting high-CPU, ridiculously parallel portions of my code and moving those pieces to CUDA only when it is easy to do so. That limits CUDA’s applicability to my work (after 2 years of developing with CUDA, I’ve only written or modified 3 programs to use it) but maximizes my reward/effort ratio.

It also helps that I tend to be CPU-starved, and usually only have access to O(10) computers at a time anyway. Enabling my two GPU workstations to finish a task 50 times faster makes a big difference to my overall productivity. (Like this weekend: I took a task that used to require our 24 slot batch queue and sped it up to run comfortably on my MacBook Pro w/ 8600M. Huge win for development!)

If I were computing on a much larger scale (hundreds or thousands of nodes), CUDA would not be very helpful unless the cluster I had time on also had a large number of CUDA devices installed. A factor of 50 doesn’t help if you also need to purchase and maintain a rack of Teslas, and rewrite much of your code just to equal the performance you already get on a large cluster. I think this is why, of all groups, CUDA will have the biggest impact on people who haven’t traditionally considered themselves “HPC users”.

This is also the kind of question I was asking myself: whether it wouldn’t be easier and more universal to keep everything on the CPU platform, since the availability of computational resources expands continuously. In the particular application of experimental data analysis there are a lot of SPMD problems and also very large computational resources available (large clusters in various national labs/universities), but at the end of the day these resources cannot keep up with the volume of data that needs to be processed and the CPU power required for that, especially in larger facilities where several major experiments are in the analysis stage simultaneously. Thus it might be worth offloading at least parts of these computations to several powerful GPU-equipped workstations, which could be acquired at a relatively acceptable price. I suppose that buying several workstations would be justified if it saves a month of analysis time for a large collaboration. This is the biggest point, I suppose: the problem of making the decision on the hardware acquisition. I think one large problem with the adoption of GPGPU in broader scientific circles is the very large inertia a lot of scientists have regarding the adoption of new, unproven methods.

I think in general the scaling of the number of CPU and GPU nodes within a cluster is very different: one presumably would need only a small fraction of nodes with powerful GPUs to make a very effective hybrid CPU-GPU system. This is the key advantage of GPGPU in my opinion, in either large or small systems.

It would be also interesting to hear what people from nVidia can add to this discussion.

Why do you say a factor of 50 wouldn’t help too much?

We have a 1000 CPU core cluster in production, and we’re now getting a boost of somewhere around x40-x60, i.e. one GPU (btw a GTX and not a Tesla) is equal to 40-60 cores. A machine with four GTX280s (or even C1060/S1070) can replace 160-240 cores, hence in the worst case I’ll need 6-8 such machines. The price is at least x10 lower, money spent on power will be reduced, space in the server room can be saved, and cooling is less of an issue.
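
The sizing arithmetic above can be sketched in a few lines of Python. The function name `machines_needed` is made up for illustration, and the calculation assumes the per-GPU speedup holds uniformly across the whole workload, which the post does not claim.

```python
import math


def machines_needed(cpu_cores: int, cores_per_gpu: int, gpus_per_machine: int) -> int:
    """Whole machines required to match a CPU cluster of the given size."""
    cores_per_machine = cores_per_gpu * gpus_per_machine
    return math.ceil(cpu_cores / cores_per_machine)


if __name__ == "__main__":
    for cores_per_gpu in (40, 60):
        n = machines_needed(cpu_cores=1000,
                            cores_per_gpu=cores_per_gpu,
                            gpus_per_machine=4)
        print(f"1 GPU ~ {cores_per_gpu} cores -> {n} four-GPU machines")
```

With these inputs the strict arithmetic gives 7 machines at the x40 end and 5 at the x60 end, so the 6-8 quoted above looks like the same estimate with a little headroom.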

Not to mention the ability to compute work that needs over a month on CPUs in less than a day with such a GPU cluster.

BTW another issue, at least in our company, is that projects usually take a lot of time and a lot of CPU power to compute. Therefore QA and R&D have a lot of trouble checking and debugging a new version. We usually end up uploading a partially QAed version to production and continuing the QA there on the 1K CPU cores.

With a x50 factor, I can give QA one computer with 4 GTXs for example, and voila a QA env :)

Am I missing something?

If the GPU cluster already exists, a factor of 50 is great! However, many HPC users get time on existing large clusters, and are not in a position to build something new just to run their code. Building one or two workstations is no problem, but if they need to purchase and maintain several new racks of computers just to run their code, that seems a lot more daunting. Not all groups are able to do that. There’s a lot of inertia when something is already built and in front of you.

I expect that we will see CUDA trickle up the HPC food chain (like most things) and not down. A number of places are installing tens of CUDA devices, and I believe the largest CUDA cluster will have a few hundred devices in Japan soon. In a few years, more and bigger clusters will possibly have CUDA devices, at which point it will be worthwhile for people working at that computing scale to switch.

Do you mean something like this? Getting small allocations on TeraGrid machines is usually just a matter of asking nicely…

Wow, not bad. So if I do the math right, that’s 384 (96 * 4) CUDA devices available in one cluster. The future is on its way. :)

Another angle on this subject is to consider those groups out there that consider themselves HPC users, but aren’t big enough to fund their own monster clusters or have the grants to buy 100’s of thousands of CPU-hours to do their research. My current research group (just me and my adviser) falls into this category. Even if most of the simulations in our current work could be done on a 1000-node cluster with existing CPU software, we just don’t have access to one. So GPUs are very helpful in enabling more HPC type computations to be done by smaller research groups. Plus it was a fun distraction to work on code for a while, developing the tools to do so :)

In that light, if I was at LANL (or Oak Ridge, etc…), say, and got to play on the big computers I may have never gotten as motivated as I did about CUDA in the first place. So I’m not arguing against anything that has been said, just putting my perspective on it.

Not to mention, if someone came along that did have CUDA experience, and used their big $$$ grants to fund a large CUDA cluster, they could easily (and relatively cheaply) build a cluster that would put them up near the top of the top 500 list (source). 100 1U Tesla servers could probably get you into (or very near) the top 10.

Yeah, I don’t get to play with Roadrunner. :)

Not directly CUDA related, but… as I said, getting onto TeraGrid machines is generally very easy (certainly if you’re at a university). For a 30,000 hour development allocation (up to 200,000 hours on some machines), you basically just have to write a paragraph, you can submit that at any time, and you’ll likely get an allocation within two weeks. For a research allocation, you can request as much as you like, but you have to write a full proposal and applications are only accepted four times a year. All the gory details are on the TeraGrid website. I’m not sure if the GPU cluster at NCSA is part of this (I’m still trying to get an account on it… will see if that happens before our cluster arrives here), but if you want computer time, there are a lot of options.

Houston, we have lift-off!!
Thanks everyone for the useful discussion! I got the initial commitment on both the software side and the hardware acquisition! I think the best arguments are the pure numbers of GPU x es :))

Cheers.