BLAS on OpenCL?

Sorry if this is the wrong sub-forum to post in. I've heard that there is no BLAS library as such for OpenCL yet, but does NVIDIA have any plans to bring one out? Also, if they did release one, would it be compatible with ATI hardware (since OpenCL works with ATI as well), or would it be vendor-specific (NVIDIA only)?

Also, just out of curiosity (I asked this in a thread a long time ago): does NVIDIA actually intend to bring out a LAPACK, or is it a lost cause?


Presumably, a library written in OpenCL should compile and link for any OpenCL arch. I don’t know enough about OpenCL to know if the binaries will also work cross-hardware. The OpenCL library calls themselves are the same, but does the binary format of the kernels change? I don’t know. At this point one really cannot test it, as NVIDIA is the only one with an available OpenCL toolchain :)
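For what it's worth, the usual OpenCL model is that kernels ship as plain source strings and the vendor's driver compiles them at runtime, so source-level portability is the norm, while precompiled binaries are device-specific. A rough host-side sketch (my own illustration, not compilable as-is — it assumes `ctx`, `dev`, and `err` were already set up via the usual `clGetPlatformIDs` / `clGetDeviceIDs` / `clCreateContext` calls, and omits error checking):

```c
#include <CL/cl.h>

/* Kernels travel as source; the driver compiles them for whatever
   device is present. This is what makes the same code portable
   across NVIDIA, AMD, CPU, etc. implementations. */
const char *src =
    "__kernel void scale(__global float *v, float a) {"
    "    v[get_global_id(0)] *= a;"
    "}";

cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);

/* By contrast, clCreateProgramWithBinary takes a device-specific
   binary that is NOT portable across vendors (or even necessarily
   across GPU generations from the same vendor). */
```

So the binary format of the kernels does change per device; cross-hardware reuse is meant to happen at the source level.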

You missed the news then. NVIDIA partnered with another company to provide LAPACK:

Actually, ATI has made a beta of their OpenCL implementation available, but it’s currently CPU only:…es/default.aspx

It seems like you might also be able to get access to Apple’s OpenCL implementation:

It’s a funny thing with OpenCL libs. I don’t know who would be “responsible” for supplying them.

If NVIDIA released their BLAS for OpenCL, it would (should ;)) compile and work on AMD cards and CPUs (assuming they didn’t use NVIDIA-specific extensions), but it sure as hell wouldn’t be optimized for those architectures. For example, AMD cards love vector intrinsics, while CUDA hardware uses scalar processors (and vector instructions get serialized). AMD’s local memory works very differently: on 48xx hardware they implement random-access local memory as a block of global(!) memory, because their shared registers are read-all, write-owner-only and thus not entirely up to spec. Optimal workgroup sizes and other heuristics (occupancy, register usage) differ too, as does the SIMD group size (“warp” or “wavefront”), and there are probably a whole lot of other details. And then there are CPUs, there’s Cell, DSPs…
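To make the scalar-vs-vector point concrete, here is a SAXPY kernel sketched both ways in OpenCL C (device code only, so it needs a host program to actually run — this is my own illustration, not anyone's BLAS code):

```c
/* Scalar SAXPY: one work-item per element. This maps naturally
   onto NVIDIA's scalar cores. */
__kernel void saxpy_scalar(float a,
                           __global const float *x,
                           __global float *y)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

/* Vectorized SAXPY: one work-item per float4. The wide loads and
   component-wise multiply-add keep AMD's vector units busy; on
   scalar hardware the compiler just serializes the float4 ops. */
__kernel void saxpy_vec4(float a,
                         __global const float4 *x,
                         __global float4 *y)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
```

Both are valid OpenCL C and compute the same thing; which one is fast depends entirely on which architecture you run it on, which is exactly the tuning problem described above.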

BLAS is a library that usually gets optimized furiously, and that’s harder to do when the hardware comes from many vendors. I don’t see how any single vendor could write a great portable BLAS library, so either each will supply their own (with some common interface, maybe?) or a third party will dedicate itself to making adaptive libraries that switch implementations depending on the platform.

I somehow think OpenCL will die soon.

I think you’re being a bit naive then.

We had some demos at PPAM running the exact same code on a (then nonexistent) AMD GPU and an Intel i7. Sadly, we didn’t incorporate NVIDIA’s toolchain…

The thing about OpenCL is that it is explicitly designed to be low-level, and middleware vendors/developers will provide the infrastructure on top. If you write your own BLAS in CL, you will get good scaling across CPU cores, or from CPU to GPU, with out-of-the-box code. You will be slightly slower than vendor-tuned codes. But so what? OpenCL is a great leap forward. Currently, top-of-the-chart CPU performance requires hacking in assembly or SSE, and you have to change your “optimal” parameters when switching from an i7 to a Core 2 to a Santa Rosa to whatever the chip is called, just because the L1 and L2 sizes differ. Try understanding the build process of GotoBLAS, the best-performing CPU BLAS I am aware of. OpenCL exposes L1 essentially as shared memory (in CUDA speak), for starters… On the CPU we just yell at the compiler folks; on the GPU, we hand-tune ourselves. Feels weird.
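As a small illustration of that local-memory point (again device code only, my own sketch): a per-work-group sum reduction staged through __local memory, which is CUDA’s shared memory and which a CPU implementation can back with L1 cache. It assumes the work-group size is a power of two.

```c
/* Each work-group reduces its chunk of `in` to one partial sum.
   `scratch` is __local: shared memory in CUDA terms, L1-resident
   on a CPU implementation. */
__kernel void wg_sum(__global const float *in,
                     __global float *partial,
                     __local float *scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction; assumes get_local_size(0) is a power of two. */
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}
```

The same code path runs on GPU and CPU; only the choice of work-group size needs per-architecture tuning.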

The real question here is about the future programming model, and about persuading the single thread folks to pick up parallel computing. In this respect, CUDA is OpenCL. Migrating kernels from CUDA to CL is smart copy and paste. Like any open industry standard, CL has its drawbacks and the extension model will be a mood-killer in practice (I remember incompatibilities between AMD and NVIDIA back in the OpenGL days), but so what?
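A trivial example of how mechanical that copy-and-paste is (my own illustration, using a vector-add kernel): the mapping is essentially `__global__` → `__kernel` plus the block/thread index arithmetic collapsing into `get_global_id`.

```c
/* CUDA version. */
__global__ void vadd(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

/* OpenCL version -- the body is a copy-paste with the index
   expression swapped for get_global_id(0) and address-space
   qualifiers added to the pointer arguments. */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```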

That being said, I am still into CUDA and will continue to be, but vendor lock-in is something to consider. It was fun to do SSE for a while, and it was really cool to realize that the tiling I came up with just needs smart copy and paste to run on GPU and CPU. If you design your algorithms to be reasonably blocked or tile-based, you will be happy on any arch!
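To make the tiling idea concrete, here is a sketch of the same blocking pattern in plain C (my own example, not the poster's code): the outer loops walk tiles — which is exactly what you would map onto the work-group grid on a GPU — while the inner loops stay inside a tile sized to fit in L1 (or __local memory).

```c
#include <stddef.h>

#define TILE 32  /* tile edge; the one knob you retune per architecture */

/* C = A * B for n x n row-major matrices; n need not be a multiple
   of TILE. The ii/kk/jj loops walk tiles (work-groups on a GPU);
   the i/k/j loops work inside one cache-resident tile. */
void matmul_tiled(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Porting this to a GPU is mostly renaming: the tile loops become the launch grid, the per-tile loops become work-items, and the tile of A/B gets staged through __local memory instead of relying on the cache.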