Drop-in Acceleration of GNU Octave

jwitsoe · June 5, 2014, 4:20pm

Originally published at: https://developer.nvidia.com/blog/drop-in-acceleration-gnu-octave/

cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs. However, cuBLAS cannot be used as a direct BLAS replacement for applications originally intended to run on the CPU. In order to use the cuBLAS API: a CUDA context first needs to be created, a cuBLAS handle needs…

anon7688106 · June 5, 2014, 5:32pm

Typo: "terraflops" = "teraflops"

anon42998046 · June 6, 2014, 10:55am

I didn't understand your first paragraph. What's the point of a drop-in library if refactoring existing code and recompiling is required? Maybe I misunderstood what the cuBLAS drop-in feature is.

anon18555434 · June 7, 2014, 5:40pm

cuBLAS is NVIDIA's BLAS implementation for NVIDIA GPUs. It is not the drop-in library: although its interface is close to BLAS, it is not exactly the same. nvBLAS, on the other hand, is the drop-in library. You can see it as a wrapper around cuBLAS which can be used in place of any other BLAS.

anon35711774 · October 23, 2016, 12:34pm

Hi, sorry to come late to this discussion; I hope to get an answer.
I was looking at the test on matrix-matrix multiplication.

As far as I can see, you get a speed-up of 306 using only OpenBLAS (765 / 2.5).

I'm also using OpenBLAS, and I have a similar architecture (Intel Xeon CPU E5-2620 v4 @ 2.10GHz, 16 cores). However, I get a speed-up of "only" 90, measured in exactly the same way you did, and using your Octave script above.

I was wondering if this may be due to the slightly different CPU (I have used an E5-2620, you have used an E5-2690) or to particular switches that must be used when compiling OpenBLAS in order to get a fully optimized library.
Did you compile OpenBLAS with the "standard" command line or with particular options?
Any other suggestions?

Any help would be more than appreciated
Kind regards

Marco

anon85101101 · October 25, 2016, 7:21pm

The library was built using gfortran from GCC 4.7.3, following the installation guide from the OpenBLAS webpage. We also verified that the Makefile log mentions the library was tuned for Sandy Bridge cores.

I can see one discrepancy in the post: a wrong link to the Intel website. It should be E5-2690 v2 (25M Cache, 3.00 GHz), i.e. 2x Ivy Bridge sockets totaling 20 cores. We will fix this, though it does not completely explain the difference you are seeing. Could you please check the absolute FLOPS you are measuring? A more recent Octave might simply do a better job on SGEMMs with its built-in BLAS.
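One way to check the absolute figure is to convert a measured matrix-multiply time into GFLOPS using the conventional 2·N³ operation count for an N×N product. The following is a minimal sketch in Python; the function name `gemm_gflops` and the example size and timing are hypothetical and are not taken from the blog post or its Octave script.

```python
def gemm_gflops(n, seconds):
    """Return the GFLOPS rate of an (n x n) * (n x n) matrix product.

    Uses the conventional 2 * n**3 operation count (one multiply and one
    add per inner-product term). This is the usual approximation, not a
    measured operation count.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * n ** 3 / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical measurement: an 8192 x 8192 SGEMM that took 1.5 s.
    print(f"{gemm_gflops(8192, 1.5):.1f} GFLOPS")
```

Comparing absolute GFLOPS rather than speed-up ratios makes it easier to tell whether a difference comes from a slow baseline or from a slow optimized library.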

anon35711774 · October 26, 2016, 12:53pm

Thanks for the answer. In the meantime I took a look with the system administrator working in my lab, and I discovered that OpenBLAS was not compiled at all: it was installed directly as a Debian package, and that's the reason for the difference in the benchmark. We re-compiled OpenBLAS on our architecture and we get impressive performance, as you do: 2.65 Gflops with standard BLAS, 238 Gflops with OpenBLAS as a Debian package, and 874 Gflops with re-compiled OpenBLAS, all in single precision. With double precision the speed-up is smaller but still impressive (112 Gflops vs. 355 Gflops). I'm using 16 cores every time, not 20, which is why I think this is impressive.
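The single-precision figures above can be turned into speed-up factors with a few lines of Python. This is only a sketch of the arithmetic: the numbers are the ones reported in this post, while the dictionary layout and the helper names `speedup` and `speedups_vs` are made up for illustration. The Debian-package ratio comes out close to the speed-up of 90 reported earlier in the thread.

```python
# Single-precision throughput reported in the post above, in GFLOPS.
single_precision = {
    "standard BLAS": 2.65,
    "OpenBLAS (Debian package)": 238.0,
    "OpenBLAS (recompiled)": 874.0,
}


def speedup(new, baseline):
    """Ratio of two throughput figures given in the same unit."""
    return new / baseline


def speedups_vs(results, baseline_name):
    """Speed-up of every entry in `results` relative to one baseline entry."""
    base = results[baseline_name]
    return {name: speedup(value, base) for name, value in results.items()}


if __name__ == "__main__":
    for name, factor in speedups_vs(single_precision, "standard BLAS").items():
        print(f"{name}: {factor:.1f}x")
```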

anon21344523 · February 3, 2017, 2:59am

So, are there step-by-step installation instructions for dummies? You know, not every scientist/engineer in this world is a programmer and system administrator. Is there a way to use it with Octave for Win64?

anon19982686 · March 29, 2017, 3:57am

Nice. I'm getting 25 GFLOPs on my default install on an i7-5930K, 395 GFLOPs using 11 threads on OpenBLAS, and at least 2.7 TFLOPs using two GTX-1070s.

("At least" because I was in the middle of training a wide-resnet on both GPUs at the time the test was run, so I suspect my GPUs' performance would otherwise be even higher than 2.7 TF.)

I'll take a 100x speedup any day of the week.

anon53073872 · August 18, 2017, 10:56am

If a function consumes 80% of the calculation time, GPU acceleration of that function would speed up the whole program by at most 5 times (assuming, for simplicity, that the function's calculation time drops to zero). If one is processing big data in a mainly I/O-bound manner, better acceleration would come from buying an external RAID controller and running SSDs in RAID 0.
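The factor-of-five limit above is an instance of Amdahl's law. A minimal sketch in Python, assuming the accelerated part takes a fixed fraction of the run time and the rest is unchanged; the function name `overall_speedup` and its parameters are hypothetical:

```python
import math


def overall_speedup(fraction, factor=math.inf):
    """Amdahl's law: overall speed-up when `fraction` of the run time
    is accelerated by `factor` and the remainder is left unchanged.

    The default factor of infinity models the simplification used above,
    where the accelerated function's run time drops to zero.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    remaining = (1.0 - fraction) + fraction / factor
    return math.inf if remaining == 0.0 else 1.0 / remaining


if __name__ == "__main__":
    print(f"80% accelerated, infinite factor: {overall_speedup(0.8):.2f}x")
    print(f"80% accelerated, 100x factor:     {overall_speedup(0.8, 100.0):.2f}x")
```

With a finite acceleration factor the overall gain is always below this limit, so the remaining 20% of the run time dominates once the accelerated part is fast enough.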

anon53073872 · August 18, 2017, 11:01am

Cool results. Consider the shared memory technique in Linux (https://stackoverflow.com/q...) and a RAM disk. If you are looking for maximum performance, you should keep the data on the GPU and use CPU RAM as an ordinary programmable disk. Has anyone seen something comparable to the Octave environment for CUDA C (an interactive console mode working on data held on the GPU)?
