AMD and OpenCL

If I weren’t too paranoid, I’d think AMD had deserted their GPU:
“AMD does reverse GPGPU, announces OpenCL SDK for x86”…sdk-for-x86.ars


Yes, it makes no sense at this point: the main problem will be targeting the GPU, and that is a matter of algorithms more than of code optimization, so developing on a CPU without any access to a real GPU is pointless.

You may develop on a CPU (as you can for CUDA), but you should test, compare and challenge your code on a GPU at some point.

I hope AMD will have a good OpenCL GPU implementation as soon as possible; we all need nVidia to be challenged :-)

Actually, having a CPU code-path for OpenCL is very smart. The most annoying thing about using CUDA in my current work is having to provide separate CPU code paths for systems which do not have a GPU. I am not very experienced in writing multicore SSE code, but am quite happy writing CUDA code. I have long wished for a compiler option to nvcc which could convert my CUDA code to an optimized CPU implementation. (There was a flurry of work on this a year ago, including some academic papers and rumors of nvcc being updated to do this, then silence. I assume that means it turned out to be harder than it sounds.)

Once the CPU-only implementations of OpenCL get good, there will be even greater incentive for people to learn the language, because it will benefit all of their users, rather than just those with supported GPUs. And you will only have to implement your algorithm once, not twice. (Or three times. AMD’s Bulldozer CPUs should be a very interesting fusion of CPU and GPU design.)

The fact that they shipped a CPU version first is what I find weird… as if their GPU version (OpenCL or not) were not good enough.

In any case, I think this dual approach will become unnecessary. You currently need both CPU and GPU versions because GPUs are not mainstream.

Once it’s mainstream and your GPU code runs 50x faster, why would you want CPUs? They will only make your code slower…

The scenario is here today: if you have X tasks running on the GPU and they take time, would you offload, say, 5-20% of the work to your two quad-core CPUs? You’d need a really good reason to do so; it might only set you back, performance-wise…
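To put rough numbers on that, here is a back-of-the-envelope sketch in C. It is only a toy model under my own assumptions: CPU and GPU run concurrently, the work splits perfectly, and transfer and synchronization costs are ignored (they would only make the split look worse). The names `split_time` and `balanced_cpu_share` are made up for this illustration.

```c
#include <assert.h>

/* Time to finish the whole job when a fraction `cpu_share` of the work goes
 * to the CPU and the rest to the GPU, both running at the same time.
 * Unit of time: the whole job on the CPU alone (so CPU speed = 1).
 * Assumption: no transfer or synchronization overhead. */
double split_time(double gpu_speedup, double cpu_share)
{
    assert(gpu_speedup > 0.0);
    assert(cpu_share >= 0.0 && cpu_share <= 1.0);

    double cpu_time = cpu_share;                        /* CPU part */
    double gpu_time = (1.0 - cpu_share) / gpu_speedup;  /* GPU part */

    /* The job is done when the slower side finishes. */
    return cpu_time > gpu_time ? cpu_time : gpu_time;
}

/* CPU share at which both sides finish together:
 * s = (1 - s) / k  =>  s = 1 / (1 + k). */
double balanced_cpu_share(double gpu_speedup)
{
    assert(gpu_speedup > 0.0);
    return 1.0 / (1.0 + gpu_speedup);
}
```

With a 50x GPU, `split_time(50.0, 0.05)` is 0.05 against 0.02 for the GPU alone, so handing 5% to the CPU makes the job 2.5 times longer. In this model the break-even share is 1/51, a bit under 2%, and even a perfect split gains only about 2%.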

my 1 cent :)


If you need double precision in your kernel, then using both CPU cores and GPU makes a lot of sense.


The GPU path and the CPU path are totally different.

Good CPU-optimized code (with intensive SSE use) may be 4X faster than “basic” C CPU code, so if you have a quad-core and run “basic” C code on all four cores, you end up at the same level of performance as a single-threaded SSE application. That makes absolutely no sense to me, especially for compute-intensive applications.
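For readers who have not written SSE by hand, here is a minimal sketch of the difference, assuming a plain float sum as the workload (my choice, not something from this thread). The SSE version handles four floats per instruction, which is where the rough 4X figure comes from; real gains depend on memory bandwidth and the compiler. The function names are hypothetical, and the code falls back to the scalar loop when `__SSE__` is not defined.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics */
#endif

/* "Basic" C: one float per loop iteration. */
float sum_scalar(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* SSE version: four floats per loop iteration.
 * Falls back to the scalar loop on targets without SSE. */
float sum_sse(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
#if defined(__SSE__)
    __m128 acc4 = _mm_setzero_ps();             /* four partial sums */
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc4 = _mm_add_ps(acc4, _mm_loadu_ps(x + i));  /* unaligned load */

    float lanes[4];
    _mm_storeu_ps(lanes, acc4);                 /* spill the four lanes */
    float acc = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    for (; i < n; ++i)                          /* leftover elements */
        acc += x[i];
    return acc;
#else
    return sum_scalar(x, n);
#endif
}
```

Note that the two versions add the elements in a different order, so for general float data the results can differ in the last bits.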

So your CPU path will be totally CPU-optimized, and may even be optimized for Intel’s SSE implementation rather than AMD’s for maximum performance (Core microarchitecture optimizations, and trying to obtain 2 SSE operations per cycle per core).

Good GPU-optimized code won’t be the same, as there’s no SSE (naturally), no cache, a different memory access path, and it usually doesn’t even use the same algorithms.

So producing CPU-oriented code today, whether “basic” C or hand-optimized SSE, won’t give you any insight into GPU-oriented code, and for my 2 cents, I won’t go and write OpenCL code just because I am not able to understand or use libpthread :-)

…Of course, unless AMD’s OpenCL SDK actually has some optimizations for specific platforms.

By the way, is this OpenCL release really just for CPUs? I thought they’re supporting both CPUs and GPUs.

You could optimize code generation for an architecture, say Intel Core 2, AMD Phenom or nVidia’s GT200 GPU, and there’s no doubt that nVidia, Intel and AMD/ATI will do their best to deliver the best performance on any algorithm that we may throw at their CPU or GPU.

But when you develop for nVidia’s GPU, you don’t use the same algorithm as the CPU-optimized code, because a CPU-optimized algorithm may be 10X slower on a GPU!

If you throw a CPU-optimized algorithm at the GPU, then whatever your best efforts at optimizing by hand, you will end up running slower than generic C CUDA code with a GPU-optimized algorithm.
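A commonly cited illustration of that difference (my example, not one from this thread) is summing an array. On a CPU a single running total is natural; on a GPU that chain of dependent adds leaves almost all threads idle, so a tree reduction is usually preferred. The sketch below emulates the tree version serially in plain C so the shape of the algorithm is visible; the function names are hypothetical, and the array length is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* CPU-style: one running total, each add depends on the previous one. */
int sum_sequential(const int *x, size_t n)
{
    assert(x != NULL || n == 0);
    int acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* GPU-style tree reduction, emulated serially.
 * Each pass of the outer loop is one parallel step; on a GPU every
 * iteration of the inner loop would be done by a separate thread.
 * Overwrites buf. Assumption: n is a power of two. */
int sum_tree(int *buf, size_t n)
{
    assert(buf != NULL);
    assert(n > 0 && (n & (n - 1)) == 0);   /* power of two */
    for (size_t stride = n / 2; stride > 0; stride /= 2)
        for (size_t i = 0; i < stride; ++i)
            buf[i] += buf[i + stride];     /* independent within a step */
    return buf[0];
}
```

The tree version needs log2(n) parallel steps instead of n dependent adds, but run on a single CPU core it does the same number of additions with a less cache-friendly access pattern, which is the asymmetry the post describes.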

You may even use different algorithms for nVidia’s GPU and ATI’s GPU, because of their architectural differences!

For me, that’s a problem with OpenCL: the C source code may be generic, but algorithms must be fine-tuned for each platform (CPU, ATI GPU or nVidia GPU), with a hint of inline SSE code (or macros) for CPUs. There’s nothing generic that exploits the whole potential of a modern computing platform!