does ATI sleep?


I searched for something similar to CUDA. I didn’t find anything, either because their website is that confusing or because there is nothing. I heard of ‘Close to Metal’ being something similar to CUDA, but I couldn’t find Close to Metal anywhere on their site.
Do these ATI people sleep?
I sent them an email asking about Close to Metal, but they just didn’t answer.
Who will still use ATI in the future when you have such great possibilities with nVidia and CUDA?


PS: this is not meant to be advertisement for nVidia, but when it comes to user-friendliness and software innovation, I find nVidia much better!

ATI has, or had, Close to Metal indeed. They also have stream computing. They recently put out press releases stating that stream computing would be possible on their latest cards (previously it was only available on expensive high-memory cards). They also announced that Tesla-like 4U rack servers would become available. My guess is they do not want to be forgotten before they support OpenCL (which is a lot like the CUDA driver API).

It certainly looks like they have been surprised by the success of CUDA.

I’ve been trying for some time to figure out how GPGPU was supposed to work on ATI cards (aside from using the OpenGL or DirectX interfaces), so I could compare it to CUDA. It was frustrating for a while, since it looked like ATI was not sure what form they wanted the software toolkit to take. First there was CTM, but programming it looked like as much fun as coding cubins by hand.

Now it looks like they’ve dropped CTM and are pushing Brook+ and CAL. The low-level interface is CAL, which sounds a lot like the PTX language. Brook+ is the high-level interface (comparable to the CUDA extensions to C), based on the Stanford implementation of the Brook language for GPUs. After browsing through their SDK documentation, it appears that Brook+ is much more limited than CUDA, but the limitations mean that things like memory coalescing can be handled automatically by the compiler. It did not seem like my code would map well to it. I’ve decided that it would be better to wait for OpenCL to be released and implemented for the ATI cards, and in the meantime stick with CUDA.

If you want to learn more about ATI’s stream processing SDK, look here:

ATI sometimes shakes around a bit to remind people it’s not dead, but the poor thing has chosen to conserve resources and not invest in a proprietary vendor-specific toolkit.

That’s the smart strategy, really. Proprietary solutions will fail. (And it’s not clear what returns NVIDIA will reap from CUDA, aside from a temporarily better stock price; that’s probably why there isn’t a wholesale push for it and the CUDA development team isn’t huge.)

Vendor-neutral approaches like OpenCL and DirectX 11 compute shaders will be the direction that GPGPU will soon take.

In the end, NVIDIA and ATI chips are almost identical in function (conforming to DirectX 10 and nothing more), and whatever tools emerge will work a lot like CUDA, so all your knowledge will carry over smoothly. (Although I worry that the host-side API won’t be anywhere near as elegant as the runtime API.)

Well, one benefit for NVIDIA (although it will ultimately benefit ATI as well when their tools catch up) is growing the market for their products. Three years ago, I never would have imagined paying more than $150 for a graphics card. Now, given what I can do with it, I consider $400 for a GTX 280 to be incredibly cheap. CUDA is a demonstration that massively parallel SIMT coprocessors are very useful, and more than an HPC niche product (which is all Cell-chips-on-a-card is aspiring to be).

As you mention, when OpenCL becomes readily available, the CUDA experience will ensure there is a sizable development community out there who knows how to make use of this programming model. This will help both NVIDIA and ATI, though given their several-year experience with CUDA, I would not be surprised if NVIDIA cards do better on OpenCL benchmarks early on.

One major functional difference in ATI and NVIDIA cards is ATI’s continued emphasis on vector over scalar operations. The matrix multiplication example in the ATI Stream SDK manual used float4 operations for maximum performance. From a GPGPU coding perspective, this is awkward for many kinds of algorithms, and I would not be surprised to see ATI migrate to a scalar architecture like NVIDIA in the next generation of chips.

That probably hasn’t been updated since the x1900 cards. Their DX10 cards are all scalar, like NVIDIA’s.

I also reached the same conclusion about the current ATI SDK, and from what I know, OpenCL should be the answer. I personally prefer working with the driver API over the runtime; the only problem is that the emulator and the debugger only work with the runtime. I think it’s a bit silly to limit the driver API like this, as it is supposed to be the “real thing” that companies use to make software products, but as someone mentioned, the CUDA team is small. The ATI vector advantage makes even less sense when you look at the I/O-to-compute ratio: most problems I came across were, in the end, I/O bound. NVIDIA hardware currently has a much better I/O-to-compute ratio than ATI, so the small vector units in ATI’s cores don’t really help much.

Why do you say that? No need to get macho. The runtime API is much more convenient and needs about 10x less code, while the driver API has no advantages (despite what one might suppose from it being “low-level”). Companies typically use the most efficient tools, so they would use the runtime API.

The only purpose of the driver API is compatibility, since it’s an unadorned C DLL.

It has two API advantages right now:

  • thread migration

  • ability to specify the scheduling mode for contexts

Obviously that’s not very awesome, and we are aware of it.

The ability to yield is important, but it is available from the runtime API. (You have to call the driver API to enable it, but you’re allowed to code everything else in the runtime API, so I count it as “available in the runtime API.”)

Migrating contexts between threads might be useful to some people, but it seems pretty niche. (P.S. Is it also usable from the runtime?)

Anyway, I think it’s great that there’s no reason to do things the hard way and that the runtime API is so successful at getting everything done. What I think you guys should do is communicate better in the Programming Guide that people shouldn’t feel that “real programmers” use the driver API. It’s very important to have a standards-compatible C DLL, and you should explain that the driver API is just that.

It’s not as niche as you’d think, depending on how your app is set up. It’s not usable from runtime, but we’re talking about how to do this right.

And yeah, obviously I know about the cuCtxCreate hack, but it’s a hack and isn’t “runtime-y.” But yeah, overall the runtime API works just fine (and handles a lot of annoying things for you, like alignment of function arguments).

Hey, as I (still) have ATI cards in both of my computers, I had a look at the ATI website to see whether they have anything similar to CUDA.
And yes, I found something: ATI Stream (as seibert told me).

Ok, I downloaded the stuff and installed it. It does not support the ATI Radeon 9250 PCI I have here (which I’m using until my GeForce 8400 GS PCI arrives).
I worked out the differences and commonalities between CUDA and ATI Stream:

-both are said to do GPU computing
-both don’t support junk cards (neither Radeon 9250 nor GeForce 7x and below)

-ATI Stream is not debugged yet (see here: …)
-ATI Stream seems to have a lot of internal hacks (see here: …)
-ATI Stream’s compiler seems to be a rip-off of some Stanford University compiler I found rotting on a cheap internet site back in 2004.

If I am wrong about anything (I don’t have the complete picture yet ;) ), please correct me! (Don’t take this post too seriously.)

Yes, this is the Stanford implementation of Brook, a C-language extension for data-parallel calculations. The original Stanford compiler (called BrookGPU) generated code which used OpenGL/DirectX calls and shader languages to control the GPU, rather than a more direct layer like PTX or CAL. BrookGPU was a great environment, but the implementation was limited by the capabilities of graphics cards at the time. There was nothing like shared memory, scatter/gather operations were very awkward, and you had to use vector operations to get maximum performance.

If you are stuck with some older cards (not sure if the Radeon 9 is too old) and feel like experimenting, you might want to play with the original BrookGPU to get an appreciation for how revolutionary the GeForce 8 and CUDA were. (or not, you could just take our word for it…) I programmed in BrookGPU for a few weeks, then the 8800 GTX and CUDA came out, and I never looked back. :)

Yeah, I saw the webpage of the original Brook. It looked like some kids from my old school had designed it (if you could call that page ‘designed’ at all).

The last Brook version is from 2004.

No, thanks, I’ll buy my 8400 GS and then maybe a GTX260 and use CUDA.

I hung around a bit in the ATI Stream forum. I asked ‘will ATI Stream work at least on my X1650 Pro?’

Someone answered ‘yes, maybe, but you won’t get hardware acceleration’. No idea what this means in detail but it sounds a bit like

‘Can I install a navigator in this car?’ - ‘yes, but the car can’t drive!’ (maybe it’s one of those wood-gas cars from the 1940s).

Can I buy this computer? - yes, but it doesn’t calculate.

Can I get on this plane? - yes, but it doesn’t fly.

Do these ATI guys want to fool me?..

I have rather limited experience with AMD Stream, but I’m pretty sure it will not work on the X1650. Very initial support for CAL (Compute Abstraction Layer, that’s ‘CUDA from ATI’) was added in the HD2000 series, but it was so basic that I cannot even compile my kernel for it :)
Next, I would say that the Brook+ programming model is more restrictive than CUDA, so the chances that you will need to write your program in IL (ATI’s name for PTX) are greater…
And the CAL compiler itself is not very solid (e.g. there’s a bug in expression evaluation which has been there since 1.1).
In terms of performance, I can’t confirm that ATI cards are much better. I have some algorithms which translate to compute-bound kernels on both NVIDIA and ATI. The GTX280 gives about 11,500 performance points while the 4870 gets about 7,200. I suspect the difference is not only in the hardware but also in the compiler, which, as I said, is not very good for ATI.

Hey Andrej, you are the guy that gave the interview to ‘PC Pr@xis’! :)
I read your interview in this magazine, it was about password recovery.
That’s how I came to GPU computing, through this article!

Wow, nice to meet you here! :)

I’m really glad such articles attract people’s attention to GPU computing and some of them choose to try it on their own :)

Things are not that terrible for AMD here :-)

AMD Stream supports cards starting with the HD2xxx series, and the software backend in Brook+ is available on any hardware. The software backend is compiled with your favorite C++ compiler, so you may get nice performance using vector data types and the Intel C++ compiler with SIMD instructions enabled.

Personally, I’ve implemented my program (BarsWF, an MD5 bruteforcer) for SSE2, CUDA, and Brook+ (the Brook version is in beta now, here:

Performance is like this:

Core2Quad 3GHz, SSE2 core: ~200 Mhash/sec (a non-SSE2 core shows around 8-10 Mhash/sec per core)

GTX280, CUDA: ~720 Mhash/sec (this is not a CUDA implementation pitfall; other competitors are even slower)

4870, Brook+: ~1266 Mhash/sec.

But I must say that development for CUDA was easier than for Brook+ and SSE2 :-)

So tell me, can I use SSE2 with Visual Studio 2005 (Windows XP 32-bit) together with CLR support? (CLR, the Common Language Runtime, enables you to use .NET, which makes window programming etc. much easier.)

I tried that, but my app always crashed, and someone told me it would not be possible to use SSE2 and CLR together?!?


I don’t know much about the CLR, but I thought it was like Java, where the program is compiled to bytecode. In that case, SSE would not work unless the bytecode compiler / virtual machine was capable of handling it.