Drop-in Acceleration of GNU Octave

Originally published at: https://developer.nvidia.com/blog/drop-in-acceleration-gnu-octave/

cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs. However, cuBLAS cannot be used as a direct BLAS replacement for applications originally intended to run on the CPU. In order to use the cuBLAS API, a CUDA context first needs to be created, a cuBLAS handle needs…

Typo: "terraflops" → "teraflops"

I didn't understand your first paragraph. What's the point of a drop-in library if refactoring existing code and recompiling is required? Maybe I misunderstood what the cuBLAS drop-in feature is.

cuBLAS is NVIDIA's BLAS implementation for NVIDIA GPUs. It is not the drop-in library, and while its interface is close to BLAS, it is not exactly the same. nvBLAS, on the other hand, is the drop-in library: you can see it as a wrapper around cuBLAS that can be used in place of any other BLAS.
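For reference, the drop-in usage typically amounts to writing a small `nvblas.conf` and preloading the nvBLAS shared library, with no recompilation. A minimal sketch of the config file (the CPU BLAS path is an assumption for illustration; point it at your own build, and see the nvBLAS documentation for the full set of keywords):

```
# nvblas.conf -- minimal example
NVBLAS_LOGFILE nvblas.log
# Fallback CPU BLAS for routines nvBLAS does not intercept
# (path is an assumption; point it at your OpenBLAS install)
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
# Use all visible GPUs
NVBLAS_GPU_LIST ALL
```

With that file in place, an unmodified dynamically linked application such as Octave can be launched with the library preloaded, e.g. `LD_PRELOAD=libnvblas.so octave`.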

Hi, sorry to come late to this discussion; I hope to get an answer...
I was looking at the test on matrix-matrix multiplication.

As far as I can see, you get a speed-up of 306x using OpenBLAS alone (765 / 2.5).

I'm also using OpenBLAS, and I have a similar architecture (Intel Xeon CPU E5-2620 v4 @ 2.10GHz, 16 cores). However, I get a speed-up of "only" 90x, measured exactly the same way you did, using your Octave script above.

I was wondering whether this may be due to the slightly different CPU (I used an E5-2620, you used an E5-2690) or to particular switches that must be used when compiling OpenBLAS in order to get a fully optimized library.
Did you compile OpenBLAS with the "standard" command line or with particular options?
Any other suggestions?

Any help would be more than appreciated
Kind regards


The library was built using gfortran from GCC 4.7.3, following the installation guide on the OpenBLAS webpage. We also verified that the Makefile log mentions the library was tuned for Sandy Bridge cores.

I can see one discrepancy in the post: a wrong link to the Intel website. It should be the E5-2690 v2 (25M Cache, 3.00 GHz), two Ivy Bridge sockets totaling 20 cores. We will fix this, though it does not completely explain the difference you are seeing. Could you please check the absolute FLOPs you are measuring? A more recent Octave might just do a better job with SGEMMs using its built-in BLAS.
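To compare absolute throughput rather than speed-up ratios, the measured wall time for an N×N matrix multiply can be converted into GFLOPS, since a GEMM performs roughly 2·N³ floating-point operations. A minimal sketch in Python (using NumPy's BLAS backend as a stand-in for the Octave timing; the matrix size is an arbitrary choice):

```python
import time
import numpy as np

def gemm_gflops(n, elapsed_seconds):
    """Approximate GFLOPS for an n x n matrix multiply:
    a GEMM performs about 2 * n^3 floating-point operations."""
    return 2.0 * n**3 / elapsed_seconds / 1e9

# Time a single-precision matrix multiply.
n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)
start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start
print(f"{gemm_gflops(n, elapsed):.1f} GFLOPS")
```

Reporting this number (rather than a ratio against a possibly slow baseline) makes results comparable across machines.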

Thanks for the answer. In the meantime I took a look with the system administrator in my lab, and I discovered that OpenBLAS had not been compiled at all: it had been installed directly as a Debian package, and that is the reason for the difference in the benchmark. We re-compiled OpenBLAS on our architecture and now get impressive performance, as you did: 2.65 GFLOPS with the standard BLAS, 238 GFLOPS with OpenBLAS from the Debian package, and 874 GFLOPS with the re-compiled OpenBLAS, all in single precision. In double precision the speed-up is smaller but still impressive (112 GFLOPS vs. 355 GFLOPS). I am using 16 cores each time, not 20, which is why I find this impressive.

So, are there step-by-step installation instructions for dummies? You know, not every scientist/engineer in this world is a programmer and system administrator. Is there a way to use it with Octave for Win64?

Nice. I'm getting 25 GFLOPS with the default install on an i7-5930K, 395 GFLOPS using 11 threads with OpenBLAS, and at least 2.7 TFLOPS using two GTX 1070s.

("At least" because I was in the middle of training a wide ResNet on both GPUs when the test was run, so I suspect my GPUs' performance would otherwise be even higher than 2.7 TFLOPS.)

I'll take a 100x speedup any day of the week.

If a function consumes 80% of the calculation time, then GPU acceleration of that function (assuming, to simplify, that its calculation time drops to zero) would give an overall speedup of only 5x. And if one is processing big data in a mainly I/O-bound manner, better acceleration might come from buying an external RAID controller with SSDs in RAID 0.
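The 5x figure follows from Amdahl's law: if a fraction p of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 − p) + p/s). A small sketch in Python:

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime
    is accelerated by a factor s (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / s)

# 80% of the runtime accelerated infinitely still caps
# the overall speedup at 5x, as the comment above notes.
print(amdahl_speedup(0.8, float("inf")))

# A more realistic 50x GPU speedup on that same 80% fraction:
print(amdahl_speedup(0.8, 50.0))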

Cool results - consider shared memory technique in LINUX: https://stackoverflow.com/q...
and RAMDISK. If you are looking for max performance you should keep data in GPU and use CPU RAM as ordinary programmable disk. Did anyone seen sth. comparable to Octave enviroment for CUDA C ( interactive console mode on gpu-on-air data )?