Bootstrapping with OpenCL! Please advise

Hi All,

I have been very vocal in these forums against OpenCL, especially in calling it a “dead technology”…
But now, I am interested in doing some “post-mortem” work.
I would be glad if you could throw some light on the following…

I have reasonably good experience with CUDA. Right now, we are on CUDA 2.3 in a Linux/Windows setup.

We want to play around with OpenCL. I am looking for any useful pieces of advice…

A few questions:

  1. Will CUDA 2.3 support OpenCL out of the box?
  2. If not, can CUDA 2.3 and OpenCL co-exist on the same machine?
  3. Is there a way to just upgrade CUDA 2.3 to support both CUDA and OpenCL?

I would be happy if you guys could share your experiences as well.

A few things that I already know about OpenCL:

  1. Functional Portability != Performance Portability (from GTC slides)
  2. It mimics the CUDA Driver API
  3. It does “not” offer true heterogeneous computation (every vendor provides their own version, and I can work with only one kind of device at a time).
    Not sure if something changed with OpenCL 1.1.
  4. Stuff like “arrays of pointers” is not possible with OpenCL. Memory is opaque.
  5. Lastly…Err…mmm… OpenCL is a dead technology…lol…

Best Regards,

I’m not sure what you mean here, the OpenCL driver/compiler dll is included in our standard driver install these days, so it should work regardless of what version of the CUDA toolkit you have installed.

OpenCL now supports an ICD (installable client driver) model like OpenGL, which allows multiple OpenCL implementations to co-exist on the same machine and applications to select between them at runtime.

Nothing could be farther from the truth. In my opinion, your comments are totally off-base and demonstrate a lack of experience in the graphics industry.

Understand that when vendors work together (in this case through Khronos), things take time. Fortunately, we’ve seen them pick up the pace in the past 8 months, with GL 3, 4, and CL 1.1. There have been many exciting features added across the board, and vendors have been quick to provide support in their drivers. Nvidia provided support for OpenGL 4 and OpenCL 1.1 the day they were announced.

Anyone who has experience with GPUs will tell you not to pigeon-hole yourself into vendor-specific APIs; you are throwing away market share by doing so. Why do you think rendering engines are so massive and expensive? Because they support many different APIs, and many different subsets of functionality within those APIs, in order to run on as much hardware as possible.

It is such a pity that GPU vendors are squandering their opportunities by making a mess of things.

+1 Lev

Hi Simon,

So, it’s just a driver update? Will CUDA 2.3 continue to work properly if I upgrade the driver alone to the latest version? Thanks for any help.

But still… one vendor at a time… I can’t use all the available power for my application (as was promised by the OpenCL standard). Has that changed with 1.1, by any chance?

Hi gogogpu,

Thanks for voicing an alternate opinion. Nice to hear from the other side as well.

What has OpenCL got to do with graphics?

BTW, what is the point of having a standard if I have to rewrite kernels for each and every architecture for performance? This is the single most irritating thing about OpenCL.

Anyway, looks like the market is picking up… So, we will also sing da chorus, I guess…

Lev, iAPX,

I didn’t quite get it. I have little graphics experience… Can you elaborate a bit? Thanks.

Best Regards,


It’s been multiple vendors for a while now with 1.0. Here’s what AMD’s CLinfo.exe (SDK sample) shows on my machine:

C:\ATI Stream\samples\opencl\bin\x86_64>CLInfo.exe
Number of platforms:                          2
  Plaform Profile:                            FULL_PROFILE
  Plaform Version:                            OpenCL 1.0 CUDA 3.0.1
  Plaform Name:                               NVIDIA CUDA
  Plaform Vendor:                             NVIDIA Corporation
  Plaform Extensions:                         cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
  Plaform Profile:                            FULL_PROFILE
  Plaform Version:                            OpenCL 1.0 ATI-Stream-v2.0.1
  Plaform Name:                               ATI Stream
  Plaform Vendor:                             Advanced Micro Devices, Inc.
  Plaform Extensions:                         cl_khr_icd

Plaform Name:                                 NVIDIA CUDA
Number of devices:                            1
  Device Type:                                CL_DEVICE_TYPE_GPU
  Device ID:                                  4318
  Max compute units:                          16
  Max work items dimensions:                  3
    Max work items[0]:                        512
    Max work items[1]:                        512
    Max work items[2]:                        64
  Max work group size:                        512
  Preferred vector width char:                1
  Preferred vector width short:               1
  Preferred vector width int:                 1
  Preferred vector width long:                1
  Preferred vector width float:               1
  Preferred vector width double:              0
  Max clock frequency:                        1625Mhz
  Address bits:                               32
  Max memeory allocation:                     134217728
  Image support:                              Yes
  Max number of images read arguments:        128
  Max number of images write arguments:       8
  Max image 2D width:                         8192
  Max image 2D height:                        8192
  Max image 3D width:                         2048
  Max image 3D height:                        2048
  Max image 3D depth:                         2048
  Max samplers within kernel:                 16
  Max size of kernel argument:                4352
  Alignment (bits) of base address:           256
  Minimum alignment (bytes) for any datatype: 16
  Single precision floating point capability
    Denorms:                                  No
    Quiet NaNs:                               Yes
    Round to nearest even:                    Yes
    Round to zero:                            Yes
    Round to +ve and infinity:                Yes
    IEEE754-2008 fused multiply-add:          Yes
  Cache type:                                 None
  Cache line size:                            0
  Cache size:                                 0
  Global memory size:                         519634944
  Constant buffer size:                       65536
  Max number of constant args:                9
  Local memory type:                          Scratchpad
  Local memory size:                          16384
  Profiling timer resolution:                 1000
  Device endianess:                           Little
  Available:                                  Yes
  Compiler available:                         Yes
  Execution capabilities:
    Execute OpenCL kernels:                   Yes
    Execute native function:                  No
  Queue properties:
    Out-of-Order:                             Yes
    Profiling :                               Yes
  Platform ID:                                0000000002635740
  Name:                                       GeForce 8800 GTS 512
  Vendor:                                     NVIDIA Corporation
  Driver version:                             197.13
  Profile:                                    FULL_PROFILE
  Version:                                    OpenCL 1.0 CUDA
  Extensions:                                 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics

Plaform Name:                                 ATI Stream
Number of devices:                            1
  Device Type:                                CL_DEVICE_TYPE_CPU
  Device ID:                                  4098
  Max compute units:                          2
  Max work items dimensions:                  3
    Max work items[0]:                        1024
    Max work items[1]:                        1024
    Max work items[2]:                        1024
  Max work group size:                        1024
  Preferred vector width char:                16
  Preferred vector width short:               8
  Preferred vector width int:                 4
  Preferred vector width long:                2
  Preferred vector width float:               4
  Preferred vector width double:              0
  Max clock frequency:                        2500Mhz
  Address bits:                               64
  Max memeory allocation:                     1073741824
  Image support:                              No
  Max size of kernel argument:                4096
  Alignment (bits) of base address:           32768
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms:                                  Yes
    Quiet NaNs:                               Yes
    Round to nearest even:                    Yes
    Round to zero:                            No
    Round to +ve and infinity:                No
    IEEE754-2008 fused multiply-add:          No
  Cache type:                                 Read/Write
  Cache line size:                            64
  Cache size:                                 65536
  Global memory size:                         3221225472
  Constant buffer size:                       65536
  Max number of constant args:                8
  Local memory type:                          Global
  Local memory size:                          32768
  Profiling timer resolution:                 1
  Device endianess:                           Little
  Available:                                  Yes
  Compiler available:                         Yes
  Execution capabilities:
    Execute OpenCL kernels:                   Yes
    Execute native function:                  No
  Queue properties:
    Out-of-Order:                             No
    Profiling :                               Yes
  Platform ID:                                000000000373F598
  Name:                                       Pentium(R) Dual-Core CPU E5200 @ 2.50GHz
  Vendor:                                     GenuineIntel
  Driver version:                             1.0
  Profile:                                    FULL_PROFILE
  Version:                                    OpenCL 1.0 ATI-Stream-v2.0.1
  Extensions:                                 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store

To elaborate slightly…

NVIDIA only wants OpenCL kernels running on its GPUs, with no interoperability of any kind, not even with CPU kernels.
ATI only wants… same blah blah.
AMD wants kernels running on its CPUs, as it offers up to 48-core systems at this time, but not on GPUs.
Intel doesn’t want to hear about OpenCL at all; they want to push their CPU solutions and don’t want interoperability with GPUs.

OpenCL promised to be able to run kernels written in C (at least) on CPUs and GPUs, to be multi-vendor, and all of that at the same time. None of the actors wants it to happen.

Isn’t that a problem for everyone though, even from a competitive business perspective?

You need to give developers a reason to write in OpenCL, which needs to be a combination of better performance than existing languages and a user base large enough that there will be a large market for the product. From a hardware vendor’s perspective, you want to bias the market such that the majority of people end up running existing applications on your hardware, but only AFTER the market is big enough that your majority share generates a significant volume of sales. In other words, it doesn’t matter if you own 95% of a 10,000-unit market.

All major graphics players (AMD, Intel, NVIDIA, PowerVR) and even parallel processor vendors (IBM, ARM, etc) should be pushing cross platform compatibility to drive GPU/parallel computing towards a tipping point such that the market volume becomes comparable to say graphics, or general purpose compute, or embedded compute. At that point you want to start edging out your competition by providing significantly better marketing, performance, value, support, etc.

I personally think that, in order to move parallel compute from a “value added” feature used to entice consumers in a particular market (graphics) away from the competition, into a market in and of itself, these tools should live in the public domain and be maintained by companies that know they have a stake in their existence (along the lines of gcc).

Thanks for your time. Two OpenCL implementations can simultaneously reside on the same system. I don’t contest that fact.
However, can an OpenCL kernel run on both of them simultaneously?? That was my question.
Your app can link against only one of the OpenCL implementations – meaning your app can launch OpenCL kernels either on NVIDIA or on ATI Stream.
Not on both, right? The spec does NOT spell it out at all, although it is the primary requirement of the spec. :(
That’s why I have the opinion that OpenCL is a hurriedly developed spec… It’s like an itch… Coupled with the performance-portability issues… mm… I don’t like it at all.
I raised this issue with OpenCL a long time back and they did acknowledge this fact. Not sure if 1.1 changed something. …f=28&t=1895

iAPX, Thanks for clarifying.

Greg, always nice to hear from you. But that’s exactly what consumer graphics cards have done, isn’t it? They have derived HPC out of off-the-shelf, cheap GPUs… So I guess the market is increasing. I see a lot of people gearing up to this phenomenon… and in a couple of years, it must grow really big…
In any case, my gripe has always been that OpenCL does not provide true heterogeneity (I can’t have my application launch kernels on all available HPC devices at a time).

ATI is AMD. AMD’s SDK supports both GPUs (Radeons etc) and CPUs (Athlons but also Intels - any x86 with SSE3). Plus you can mix vendors.

So it goes like this:

  • You have an NVIDIA GPU, install NVIDIA’s OpenCL ICD.

  • You have an AMD GPU, install AMD’s OpenCL ICD.

  • You have a CPU (either AMD or Intel), install AMD’s OpenCL ICD.

  • You have a CELL processor: there’s IBM’s beta ICD (I haven’t checked personally).

  • You have a mix of those devices, install all ICDs.

Everything is covered.

Sarnath - you can run the same kernel on AMD’s implementation (say, on a CPU or a Radeon) and on NVIDIA’s concurrently.

In code, you query available platforms - say you get two: AMD Stream and NVIDIA CUDA. You then create contexts for each of the platforms, read in the kernel code and compile it for all the devices you want to use (you need separate Program and Kernel objects, they can’t be shared between contexts but can be created from the same source code). Then you simply launch those kernels. If you want, you can put this stuff in threads - OpenCL 1.0 is partially thread safe, 1.1 should be almost completely thread safe.

So yes, you can write a single kernel and run it both on an NVIDIA GPU and an Intel CPU at the same time, in a single application.

It was once the case that you could only have one vendor/platform. This was due to bugs in the ICD implementations (on both AMD’s and NVIDIA’s part). Once they sorted it out (AMD around SDK 2.0.1, NVIDIA around 197.13 IIRC), things work as advertised. I can’t say anything about IBM’s implementation.


This is not how it works. On Windows, your app links to OpenCL.lib/.dll, which isn’t really an implementation. It’s a bridge to the vendor implementations that get installed with the ICDs. Once an ICD is installed, it registers itself in the system (in the Windows registry) so that OpenCL.dll can find it at runtime. It should work similarly on Linux, but I don’t know the mechanisms.

Big Mac, thanks a lot for opening my eyes… I had been out of touch for some time… A lot of things have changed, I guess… It’s not really dead as I thought… Hmmm…

Many Thanks to all of you guys for responding,

Best Regards,

I agree with you _Big_Mac. Depending on the hardware present, you install one driver or the other, or eventually two at once, or… and it’s a mess to handle for the end user, and even more so for developers, who may have several drivers installed at once – and there is no unification, as promised by the standard, to be able to launch the SAME KERNEL, from a SINGLE APPLICATION, SIMULTANEOUSLY on a multi-core CPU, an NVIDIA GPU, an ATI/AMD GPU, or a Cell!

This only exists on Mac OS X, thanks to Apple, where you can launch the kernel on your available hardware and even mix devices! No end-user mess, no developer mess!

Yeah, actually I didn’t think that HPC was comparable to graphics from a market-size perspective, but I may be wrong about that. It looks like the total market size for supercomputers in 2009 was around $8.6 billion. This is for the entire system, not just the GPUs, but it is much bigger than I thought. In 2009, the combined integrated and discrete GPU market was right around $10 billion, so they could actually be comparable.

So it looks like I was wrong, just goes to show that I should actually look up the numbers before making assumptions :) .

Maybe it actually does make sense to try to nudge out the competition rather than grow the market with cross platform tools…

Edit: Oops, I meant GPU not GPU/CPU. The CPU market is significantly larger.

THIS IS POSSIBLE. Geez. You can have ONE APP run the SAME KERNEL on devices from DIFFERENT VENDORS, CONCURRENTLY. With current hardware and software, and without any dll-juggling tricks. You might want to optimize the kernel differently for specific devices, which is exactly what shader programmers do, or what you need to do in, say, CUDA (GTX 275 vs Fermi), but the principle holds.

Naturally, if you have an NVIDIA video card, you need NVIDIA drivers (OpenCL compliant); same with AMD. If you have both, you need both drivers. That’s by design. I don’t see how you could expect to run a kernel on wildly varying devices, with completely different hardware architectures, without drivers that do the actual code generation at some point. You need something that actually does the work of implementing the spec, and that’s the drivers.

There’s no mess at all really, save for the CPU. If you have a GPU installed, you’re bound to already have drivers installed, and those drivers will have OpenCL support built in (if they’re fresh enough).

The CPU is different because you don’t usually install anything like CPU drivers after buying a new processor. That’s why, for multi-core CPUs, unless an OpenCL x86 implementation comes with the OS, you need to install an extra package (like AMD’s SDK). Mac OS happens to ship with an x86 OpenCL runtime (I believe); Windows and Linux do not. They also don’t come with an OpenGL implementation, and yet you usually don’t have any problems running OpenGL applications, do you? That’s because somewhere along the way (with the drivers, for example) the implementation finds its way onto your computer. This is exactly how OpenCL works (and it does in fact also come with the drivers).

So for the time being, the only “end-user madness” is needing a CPU implementation installed in the absence of a GPU. You can just treat it as a software requirement, sort of like a game wanting DirectX 9.0c.

What you meant to say is that it doesn’t exist on OS X, because Apple doesn’t support the Khronos-standardized ICD mechanism at all. They have their own vendor-specific version implemented in the OS X framework, which won’t work on any other OpenCL platform. So, lots of developer mess if you want to port OpenCL code to/from OS X.

Actually, he meant to say exactly that. You are correct that there is no ICD – it’s better. Apple’s Intel device even has image support, unlike ATI’s CPU device. Old legacy hardware aside, they have a finite number of display adapters. If they support OpenCL on a device, it just works.

I agree in spirit, but in practice “SAME KERNEL” is factually incorrect, even if the devices are in the same platform and context. I realize this is just a thread talking about theoretical capabilities, but the correct statement would replace it with “COPIES of the SAME KERNEL”. Mentioning DIFFERENT VENDORS implies multiple contexts, so it could be argued that copies were implied, but I will elaborate on why this is required even if only one context is used.

The assumption is that a kernel usually has at least one argument set or changed each time it is enqueued. A command queue (read: device) is specified when a kernel is enqueued, and if there are multiple command queues in the context, any one of them could be used. The arguments, however, are each set in a separate API call beforehand, and no command queue can be specified when doing this.

Combine this with the fact that, in order to be truly concurrent, some way around blocking needs to be employed. The best way is using a separate host thread for each device. With two or more threads setting args, sometimes overlapping, this IS NOT going to work. Creating a copy of the kernel for each device solves the issue.

Good to see the numbers out there… Two years back, we were searching for CUDA jobs in India… There were 25 resumes in the job portal… Now it is more than 300… I can see a lot of college-goers and small companies (startups…) turning their attention towards CUDA… So I just guessed so… But yeah, numbers always speak for themselves. Thanks for the link!

Oh… looks like the debate is on. This is exactly what I thought as well… But how do you link to two different vendor libraries which export the same symbols… Won’t you have linking problems? I think I need to read up on the ICD mechanism…

Big Mac, would like to hear your viewpoints as well.

Copies of the kernel - yes. You obviously can’t use a kernel object that’s a handle to binary code compiled for x86 and try to run it on a Tesla. You also shouldn’t use the same instance of a kernel and run it concurrently on two devices or from two threads (race). But that’s a code technicality, IMO. The important part is that you create all those kernel objects in the same way, from the same source code. All the copies represent the same function with the same arguments and same behavior.

So, conceptually:

foreach (Platform p in available platforms)
    #pragma omp parallel for
    foreach (Device d in p.devices) {
        Kernel k = d->context->createKernel("vectorAdd", "");
        k->launch(a, b, c, n); // this executes concurrently on all devices, assuming a, b, c, n are local
    }

Naturally there’s much more boilerplate code and I’m simplifying things.

BTW, you should actually be able to code your apps with a high-level API like this in the near future…