Cross-vendor GPU development strategy

What general approaches could one consider for deploying parallelized applications across hardware from several vendors? For example:

  1. CUDA for NVIDIA’s high-end products,
  2. OpenCL for AMD’s high-end products, and
  3. Microsoft Accelerator 2.0 for NVIDIA T2, Intel, and anything else? The idea here would be to support the same application across just about all new devices, although of course the performance would vary considerably.

Or could one write mainly in OpenCL and use a CUDA interface? Is there some kind of common language and IDE that one could use for development, and then target CUDA, OpenGL or Accelerator 2.0 from there?

Our application is deep semantic extraction, not primarily pixel-pushing; the core calculations are basically large matrices involving SVD, etc.

OpenCL was made for this.

It’s a common standard that’s supported by NVIDIA (works like CUDA for their GPUs), AMD (both for their GPUs and CPUs, also works on Intel CPUs), IBM (beta, works on CELL) and others.
It can replace CUDA, OpenMP, Intel TBB or other parallel programming technologies.
OpenCL implementations are still young and may have bugs, but both NVIDIA’s and AMD’s are now considered non-beta (production level).

Like Big_Mac said, I’d probably go with OpenCL. If you need to extract every last bit of performance though, then I’d write separate, vendor-specific implementations (e.g. CUDA, Stream, etc.) for each platform.

I would hesitate to say that OpenCL can replace OpenMP or Intel TBB. Those technologies also support task parallelism on multicore CPUs, which offers more flexibility than the data-parallel model that GPUs are optimized for. (Although OpenMP and TBB can obviously implement data-parallel algorithms.)
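To make the task-parallel versus data-parallel distinction concrete, here is a minimal sketch in Python, chosen only for brevity. It is not OpenMP, TBB, or OpenCL code, and the function names (`parse_documents`, `column_sums`, `scale`) are hypothetical. The first half runs two unrelated tasks concurrently; the second applies one operation to every element of an array.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_documents(docs):
    # Hypothetical task 1: split each document into lowercase tokens.
    return [doc.lower().split() for doc in docs]

def column_sums(matrix):
    # Hypothetical task 2: unrelated numeric work on a small matrix.
    return [sum(col) for col in zip(*matrix)]

def scale(x):
    # The single operation applied to every element in the data-parallel case.
    return 2.0 * x

with ThreadPoolExecutor(max_workers=2) as pool:
    # Task parallelism: two *different* functions run concurrently.
    tokens_future = pool.submit(parse_documents, ["Deep semantic", "Extraction"])
    sums_future = pool.submit(column_sums, [[1, 2], [3, 4]])
    tokens = tokens_future.result()
    sums = sums_future.result()

    # Data parallelism: the *same* function applied to every element.
    scaled = list(pool.map(scale, [1.0, 2.0, 3.0]))
```

GPUs are built for the second pattern; the first is where OpenMP and TBB have more to offer.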

OpenCL is completely analogous to CUDA, though, and I agree will be your best bet for cross-platform parallel programming that can run on CPUs and GPUs from many vendors.

Thanks for this initial set of responses. They confirmed the general impression we’ve obtained. One suspects that there are a lot of “details” when one gets down to coding. On the other hand, we appreciate that this environment is relatively new, and such things go with the territory.

How about IDE? We’d prefer to use Visual Studio 2010, for the usual reasons.

Now, in our case, generality is more important than absolute peak performance. Much of the processing we do is not really synced to real-time events, at least not to the degree that a 2x difference in performance would be meaningful.

We have been interested in Microsoft Research’s Accelerator 2, which is a task parallelizer that they believe can be set up without too much trouble to target just about anything. Thus, it’s similar (please correct me if my understanding is faulty) to OpenMP or Intel TBB.

If we go OpenCL, does it do multicore parallelism? After reading the comparison between OpenCL, OpenMP and Intel TBB, it’s not clear. Does “also” mean OpenMP and TBB support multicore parallelism, as an example of a feature that OpenCL does not? Or do all three of them do this?

The CUDA/OpenCL model is very different from traditional task-parallel APIs. Being optimized for GPUs, they use a “data parallel” or “bulk synchronous parallel” model (as I’ve heard some people call it). Rather than the coarse-grained parallelism of OpenMP-style programs, where you are trying to scale an algorithm to maybe dozens of threads, each running on a CPU core, CUDA/OpenCL is designed to scale to thousands of very lightweight “threads” being multiplexed rapidly over devices which could be thought of as massive multicore vector machines.

For example, the NVIDIA GTX 285 could be (though it usually isn’t) described as a hyperthreaded 30-core device, where each core can complete an 8-wide vector instruction per 1.476 GHz clock cycle. To fully hide memory and pipeline latency, the cores are designed to multiplex the vector unit over many threads simultaneously with no context-switching overhead. This free context switching is achieved by statically allocating to each thread all necessary register and shared memory resources for the entire thread lifetime. This is why each of the 30 cores has 16,384 registers available.
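As a rough back-of-the-envelope check on those figures, the arithmetic can be written out as below. This is only a sketch using the numbers quoted in the post; it assumes every vector lane is busy on every cycle and ignores anything the post does not mention, so it should be read as an upper bound on lane-operations, not as a measured throughput.

```python
cores = 30                # cores on the GTX 285, as described in the post
lanes_per_core = 8        # width of the vector instruction each core completes
clock_hz = 1.476e9        # clock rate quoted in the post

# Peak vector-lane operations per second if no lane ever idles.
lane_ops_per_second = cores * lanes_per_core * clock_hz

# Total register file across the chip, from 16,384 registers per core.
registers_total = cores * 16_384
```

That works out to roughly 354 billion lane-operations per second and 491,520 registers chip-wide, which is why so many threads can stay resident at once.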

The idea in CUDA (I assume this all applies to OpenCL from what I’ve read) is to instead break down the algorithm into very tiny units, which all simultaneously execute the same instruction on different data elements. This model is more constrained than the task parallel systems you mention, but allows for very efficient execution on specialized hardware (like the GPUs from NVIDIA and ATI).
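The “tiny units” idea can be sketched as follows. This is a sequential Python emulation, not real CUDA or OpenCL: `saxpy_kernel` and `launch` are hypothetical names, and `gid` stands in for the global thread index that the real APIs provide. The point is only that the kernel body describes the work for one data element, and the launch runs that body once per element.

```python
def saxpy_kernel(gid, a, x, y, out):
    # Body executed by one work-item: it touches only element `gid`.
    out[gid] = a * x[gid] + y[gid]

def launch(kernel, global_size, *args):
    # Sequential stand-in for a device launch. Real hardware would run
    # these work-items concurrently and in no guaranteed order.
    for gid in range(global_size):
        kernel(gid, *args)

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 2.0, x, y, out)
```

Because no work-item reads or writes another work-item’s element, the order of execution does not matter, which is what lets the hardware spread them over many cores.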

For best results, you want algorithms that are almost completely lock-free, requiring minimal or no synchronization between threads. For parallel programmers used to a toolbox filled with mutexes, semaphores, queues and critical sections, this can be very frustrating, but it is the price of scalability to chips with hundreds (or thousands) of processing elements.
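One common way to avoid locks is a pairwise (tree) reduction, sketched below in plain Python. This is my own illustration of the general technique, not code from the CUDA manual, and `tree_reduce_sum` is a hypothetical name. Within each step every active index writes only to its own slot, so the only synchronization a GPU version would need is a barrier between steps.

```python
def tree_reduce_sum(values):
    # Sum a sequence by repeatedly combining pairs of slots.
    data = list(values)
    n = len(data)
    stride = 1
    while stride < n:
        # One parallel "step": index i writes data[i] and reads data[i + stride].
        # Active indices are multiples of 2 * stride, so no two of them
        # share a slot and no locks are required inside the step.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        # A GPU version would place a barrier here before the next step.
        stride *= 2
    return data[0] if data else 0
```

For example, `tree_reduce_sum([1, 2, 3, 4, 5])` combines the values in three steps instead of taking a lock around a shared accumulator.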

I don’t know what the state of OpenCL documentation is, but certainly browsing through the CUDA manual will help explain more about this unusual approach to parallelism.

@seibert:
You’re right of course.
In this particular case, there’ll be a lot of data-parallel computations I presume (SVD and such).

@DigitalDirect
OpenCL should be IDE agnostic. CUDA, OpenMP and Intel TBB are not (or at least not entirely). CUDA doesn’t officially support VS2010 yet AFAIK, OpenMP requires at least the “Professional” version of VS (for 2008) but should work in 2010. I’m not sure about Intel TBB but it may require Intel’s compiler for some things (someone correct me on this).

OpenCL, OpenMP and TBB all support multicore parallelism. Additionally, OpenCL has a nice universal API for vector instructions that will use SSE/MMX (or SPE vectors on CELL, appropriate vector instructions on the GPU, etc.). None of these technologies supports parallelism in the sense of distributed computing (like MPI). All work with the notion that the basic architecture of the underlying hardware is a multi/manycore machine with a uniform main memory and perhaps a core-local cache-like memory. OpenCL also recognizes that there may be a further layer of compute units within a “core” (like SIMD vector FPUs on each core of a CPU). OpenMP and TBB don’t expose that functionality, and to use SSE/MMX with them you either need a good parallelizing compiler or have to embed non-portable assembly code.

I’ve no idea how Microsoft Accelerator works. I can’t even find the second version on the Internet. MS Accelerator 1 is built over DirectX 9 and will most definitely be slower (due to lack of local memory), less portable (won’t work on CPUs) and less general purpose (won’t have random memory access for example). It’s a framework released in 2006, before the new wave of GPUs went out, before CUDA, long before OpenCL.

Edit:
I thought I’d mention this: it’s generally easier to effectively execute code that was written with the GPU in mind on the CPU (assuming the language is convertible, of course) than the other way around. That’s why you have OpenCL: it’s biased towards the GPU-like mode of operation with fine-grained parallelism and tens of thousands of logical “threads”, yet it can get very good performance on a multicore CPU as well (rivalling manually threaded and vectorized code, as an AMD employee told me once). This is mainly because the model of GPU parallelism is more constrained and geared towards high-throughput computation. You’d have much more work to get a less-constrained, more general-purpose CPU-centred parallelization technology like Intel TBB to work well on a GPU.

Ah, clearly what I wrote above over-emphasizes the limitations on getting good performance on GPUs. OpenCL appears to have broader scope than I realized. :)

Accelerator v2 has been out for a few weeks; I can’t remember where I got it though (from some Microsoft developer forum or somesuch). Anyway, DigitalDirect, it’s not what you think it is. Accelerator is a neat tool, but it is not nearly as flexible as CUDA, OpenCL, or any of the other tools that have been mentioned; it’s basically just a wrapper for HLSL that can be called from .NET, which means you don’t get access to a lot of the features you’ll want (control over memory transfers, memory layout, certain device operations (e.g. atomics), or double-precision). It’s also not task-parallel; it’s data-parallel (like any other GPU code).

Read my post above…if you end up writing vendor-specific libraries, let me know…I’ve had to do this a bunch of times before and there’s a few .NET tricks to make it a bit easier than you think.

EDIT: Also, don’t forget about Nexus. It’ll be a lot easier to do GPU development on CUDA or OpenCL once it’s available.

This discussion is highly relevant to what we’re doing, and I’ll follow-up via PM as appropriate. The orientation and advice being provided is very helpful and most appreciated.

One quick question: how does DirectCompute fit in to the mix?

BTW, I’ve been seeing mention that “Microsoft and Intel will support CUDA” and was wondering what they had in mind with this comment.

DirectX 11 introduces “compute shaders” - general purpose shader programs - to complement pixel and geometry shaders. They should work on Dx11 AMD and NVIDIA GPUs.

The bad news is this will only work under Windows Vista/7 and, of course, only on GPUs. Compute shaders are only part of DirectX 11 which is mainly a graphics API (but also handles keyboard/joystick input, sound and some other features for games).

The programming model is similar to CUDA and OpenCL. The API is, IMHO, much less elegant and more cluttered.

Basically, DirectCompute (compute shaders of DirectX) is the effect of game programmers and graphics people noticing that GPGPU can be useful in real-time graphics, especially in games. Microsoft then added a new type of shader, striving to maintain close coupling with other parts of the API.

You can do pretty much everything with OpenCL that you could with DirectCompute and more. The only thing that comes to mind OpenCL does not do is integration with the rest of DirectX 11 - this might be a selling point for people doing real-time graphics and/or games, because they can insert compute shaders in between pixel and geometry shaders in their rendering pipeline. OpenCL can integrate with OpenGL to achieve similar functionality.

Personally I don’t expect to hear much about DirectCompute from people outside the gamedev community. If you’re not pressed to integrate with the rest of DirectX 11 (which you may well be if you design games) there’s probably no compelling reason to use it.

No idea what that would be about. CUDA is NVIDIA’s proprietary technology, Intel has no business implementing support for it on their CPUs and Microsoft already has DirectCompute.

Now, on the runtime environment, what are the considerations?

OpenCL is part of OS X, which is very convenient, but what does one do on Windows? That .NET is preinstalled has many advantages, such as tiny runtimes (we like to use Silverlight for our client code), which makes the initial install experience and upgrades very smooth.

Will users have to change the GPU driver and so forth? Is a library download necessary? What are the weak points of an OpenGL dev environment and subsequent implementation on target devices?

On Windows, the OpenCL runtime is included in the newest video drivers (both NVIDIA’s and AMD’s). Those are standard drivers, so you can expect that most users who keep their drivers up to date will have the component installed. There’s no need for the client to download additional software (unless their drivers are old).

If a user doesn’t have a GPU and wishes to run the computations on the CPU, he still needs an OpenCL runtime but he doesn’t have video drivers. In that case, you can either redistribute OpenCL.dll (with AMD’s runtime) along with the app or ask the user to install AMD’s compute toolkit. More options for this special case may appear in the future (like a standardized CPU runtime bundled with the system or as a small downloadable redist). Generally not a big deal in any case - not bigger than installing the .NET framework IMHO. The goal of OpenCL is to make the runtime ubiquitous and common and the standard implemented on as many different platforms as possible.

Try not to confuse OpenCL with OpenGL. CL is “computing language”, GL is “graphics library”. The latter is an API to render 3D graphics. It can be used in conjunction with OpenCL (e.g. to visualise results) but is a different thing. The OpenGL runtime also comes with the video drivers, by the way.

This is very helpful.

That the OpenCL runtime goes out with driver updates from NVIDIA and AMD is a positive step. Does it have to be installed in some way, or does that happen whenever the driver update occurs?

A catch is that for many laptops, it seems you can’t just go ahead and use the drivers provided directly by NVIDIA. It turns out the driver actually used on a specific laptop has additional features, such as what to do on lid-up, lid-down and so forth.

One generally has to wait weeks to months, or forever, before the laptop-specific video drivers are available from the vendor (e.g., hp) download site, and who knows if they will include OpenCL? What a pain! This is the kind of consumer nightmare we have to watch out for, right from the start.

More questions :)

  1. How well does NVIDIA support OpenCL on non-CUDA GPUs, such as T2, and legacy devices?
    1a. If OpenCL can replace CUDA entirely, does anyone make OpenCL equivalents of CULAPACK and similar libraries? Some of these are also available in .NET wrappers…
    1b. Which community (CUDA or OpenCL) is largest? Which is most robust? It looks like OpenCL is quite competitive now with CUDA in terms of performance, perhaps even ahead for some calculations.
  2. How’s Intel’s support for OpenCL?

“Everyone” has signed on for OpenCL, but some vendors may be more equal than others in terms of their commitment.

We definitely understand that OpenCL and OpenGL operate in two separate areas, but they’re related a bit for our purposes. Right now, we’re just working through all the OpenCL issues. I mentioned the Silverlight runtime as an example of seamless and effortless install, made possible by .NET being included in Windows these days, combined with SL now being an opt-out automatic install through Windows Update.

In the non-GPU case, although downloading the OpenCL.dll is probably not a problem (does this require a reboot?), the AMD runtime probably is. This is the kind of UX obstacle we have to avoid: we are dealing with mass-market consumers, not enthusiasts. The competition, so to speak, is the browser-centric services that don’t require any “system engineering” by the consumer.

It should happen whenever a driver update occurs.

The OpenCL runtime should become (or already is) an integral part of the driver (like the OpenGL runtime). I’m not sure what the current status of laptop drivers is concerning the inclusion of the OpenCL runtime, but I’m pretty sure that vendor-supplied drivers will or already do include the runtime. The vendor drivers aren’t usually different anyway. I’ve never met anyone who couldn’t install a driver directly from NVIDIA on their theoretically vendor-supported laptop by modifying a single .inf file so that the installer recognizes the card. I’m not saying your customers will have to hack driver installers; I’m just stating the fact that vendor-supplied drivers are practically repacked original drivers that differ by this .inf file, and that means they aren’t likely to cut OpenCL. I don’t think they’d even be allowed to.

They don’t at all. NVIDIA’s OpenCL supports exactly the same set of devices as CUDA, no more, no less.

From the app you can query available devices and fall back to CPU if suitable GPUs aren’t found.
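The fallback logic might look something like the sketch below. It deliberately avoids any real OpenCL binding: `Device` and `pick_device` are hypothetical names, and the device list is assumed to come from whatever query the application performs (in the OpenCL C API that would be `clGetPlatformIDs` and `clGetDeviceIDs` with `CL_DEVICE_TYPE_GPU` or `CL_DEVICE_TYPE_CPU`). Ranking GPUs by compute-unit count is just one possible policy.

```python
from typing import NamedTuple, Optional, Sequence

class Device(NamedTuple):
    # Hypothetical stand-in for what a device query would report.
    name: str
    kind: str            # "GPU" or "CPU"
    compute_units: int

def pick_device(devices: Sequence[Device]) -> Optional[Device]:
    # Prefer a GPU; if none is present, fall back to a CPU device.
    # Within a kind, take the device with the most compute units.
    for kind in ("GPU", "CPU"):
        candidates = [d for d in devices if d.kind == kind]
        if candidates:
            return max(candidates, key=lambda d: d.compute_units)
    return None  # no usable OpenCL device at all
```

An application would call this once at startup and handle the `None` case, for instance by using a non-OpenCL code path.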

Not yet, at least I haven’t heard of such libraries yet. Production level runtimes have only been out for a few weeks (AMD) or months (NVIDIA).

CUDA definitely has a stronger community and the technology is more mature. It’s been around for over 3 years. If not for lack of portability between platforms, I’d generally recommend CUDA for GPGPU. But limiting the userbase to those with specific NVIDIA cards is unacceptable for some consumer application developers.

That’s a good question. They don’t have an OpenCL implementation and aren’t telling if they’re even working on one. I have a theory why this may be so, but I’ll spare the speculation.

AMD’s implementation works on Intel processors, that’s a substitute for now.

The .dll thing doesn’t require a reboot. Noticing that there are no available OpenCL implementations in the system and switching to such backup .dll can be made entirely transparent and managed by the app at runtime if necessary. As I said, this issue will likely disappear in the future anyway.
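A minimal sketch of that runtime switch, written in Python with `ctypes` purely for illustration: the helper name `load_first_available` and the candidate paths in the usage note are assumptions, not part of any vendor SDK. It tries each library in order and reports which one loaded.

```python
import ctypes

def load_first_available(candidates):
    # Try each library name or path in order. Return (name, handle) for the
    # first one that loads, or None if none of them can be loaded.
    for name in candidates:
        try:
            return name, ctypes.CDLL(name)
        except OSError:
            # Not installed or not loadable; try the next candidate.
            continue
    return None
```

An app could call something like `load_first_available(["OpenCL.dll", r"redist\OpenCL.dll"])`, where the second entry is a hypothetical copy shipped alongside the application, and only report an error if the result is `None`.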

Has OpenCL caught up to CUDA in performance? I haven’t been keeping a close watch on the progress of OpenCL drivers, and I know several months ago there were complaints in NVIDIA OpenCL forums of poor performance compared to CUDA. If you have links to head-to-head benchmarks, I’d like to check them out.

(Eventually, I want to migrate some CUDA code to OpenCL to improve portability. With code for myself, I enjoy using CUDA, but I have one project which is used by people at several universities, and the CUDA restriction means that everyone except me uses the slow, handwritten CPU code path. Keeping the CUDA and CPU implementations in sync is annoying, and I’d like to go all OpenCL for this project when the performance and deployment issues are resolved.)

These responses are all incredibly helpful, and anyone else wanting to compare OpenCL and CUDA will have a great primer in the form of this thread!

  1. Consumer laptop drivers: Keep in mind that consumer personal computing means laptops these days. The ratio of laptops to desktops on display at retailers is probably 5:1 laptops, with a large touchscreen as well.
    Anyway, if you go to the NVIDIA site to get a driver, and if (and only if) you elect to do an autodetect of your installed driver, you’ll get a message along the lines of the following:

The manufacturer of this system requires that you download the driver for your GPU from their support site.

The GeForce M series and GeForce Go series notebook GPUs use drivers that have been customized by the notebook manufacturers to support hot key functions, power management functions, lid close and suspend/resume behavior. NVIDIA has worked with some notebook manufacturers to provide notebook-specific driver updates, however, most notebook driver updates must come from the notebook manufacturer. Additionally, the desktop GeForce graphics drivers will not install on Geforce M series and Quadro M series notebook GPU’s.

1a. If you simply go and brute-force download the driver, which is about 144 megabytes, and go through the install procedure, it will eventually fail, with an error message similar to the above.

  2. OpenCL and CUDA performance:

It looks like OpenCL is highly competitive in several areas and, more importantly, is showing major improvement compared to a year ago. SiSoftware has released some benchmark utilities and initial results on a few devices:

http://www.sisoftware.net/index.html?dir=qa&location=gpu_opencl&langx=en&a=
http://www.sisoftware.co.uk/index.html?dir=news&location=opencl_release&langx=en&a=

  3. Re Intel support for OpenCL, which is a bit of a mystery right now, what does “AMD’s implementation (of what?) works on Intel processors” mean more specifically? Intel CPUs or GPUs?

  4. Finally, how does NVIDIA Nexus fit into the picture?

The first OpenCL SDK that AMD released actually ran on the CPU rather than “their” (in the sense that AMD owns ATI now) GPUs. The SDK was obviously tuned and optimized for the AMD implementation of x86 and x86_64, but still runs on Intel’s CPUs as well.

I’m not aware of any OpenCL implementation for Intel GPUs, but they are generally so slow that it wouldn’t be very beneficial. Presumably, in whatever year Intel finally releases their Larrabee GPU, they will also release OpenCL drivers for it.

Wow, that is encouraging! I would consider the +/-5% differences to be in the noise, so it looks like on lower-end hardware OpenCL and CUDA have reached parity. It’s too bad their benchmark code isn’t public, and I wish they had also tried a GTX 285. :)

My 2 cents

  1. Note that “OpenCL” does NOT really support heterogeneous computation on a single system. So you can use either CUDA devices OR AMD devices (I’ve heard the AMD implementation supports both AMD CPUs and AMD GPUs simultaneously). The “platform” layer is NOT spelled out by the spec. I heard that the OpenCL committee is examining this problem, but there has been no further news from their side.

  2. A CUDA-to-OpenCL translator is NOT impossible, because OpenCL is nothing but “CUDA Driver API + some spices”. I am surprised there is no work toward this; maybe there is and I am just not aware of it. If someone could do that, it would solve all your problems: you could code in CUDA and convert to OpenCL when required.