Jetson TK1 / Octave / CUDA Toolkit / OpenBLAS

Hi, I’m hoping someone with better knowledge of the Jetson TK1, CUDA Toolkit and OpenBLAS can help.

My goal is to get Octave running on a Jetson TK1, with cuBLAS GPU acceleration of course.

I’m referencing a devblog covering Octave / NVIDIA at:

http://devblogs.nvidia.com/parallelforall/drop-in-acceleration-gnu-octave/

So far I’ve managed to get Octave compiled and running, and the CUDA Toolkit and OpenBLAS installed.

Versions are:

Octave: 3.8.0 (built from source)
CUDA Toolkit: 6.5 (installed via apt-get)
OpenBLAS: r0.2.14 (built from source)

The devblog page contains a simple Octave script called sgemm.m for exercising BLAS routines.

When I execute ‘octave ./sgemm.m’ on the TK1, I get this output:

1076.7
1.0212

According to the article, this represents around 1 GFLOPS performance (i.e. no GPU acceleration is occurring).
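As a sanity check on that figure, the arithmetic can be reproduced from the elapsed time. This is a minimal sketch that assumes sgemm.m multiplies two square N x N single-precision matrices and counts 2*N^3 floating-point operations. N = 8192 is an assumption: the matrix size is not stated in this post, but that value is consistent with the output above. The helper name gflops is made up for illustration.

```shell
#!/bin/sh
# gflops N SECONDS: GFLOPS for one N x N matrix multiply, counted as 2*N^3 operations.
# N = 8192 below is an assumption; the post does not state the matrix size.
gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.4f\n", 2 * n * n * n / (t * 1e9) }'
}

gflops 8192 1076.7   # elapsed time (seconds) from the run above
```

The first output line of the script is the elapsed time in seconds and the second is the derived GFLOPS figure, so only the first number is actually measured.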

When I attempt to run using OpenBLAS, I get the following output:

250.89
4.3824

I’m using LD_PRELOAD on the command line to run with OpenBLAS:

OMP_NUM_THREADS=20 LD_PRELOAD=/home/ubuntu/OpenBLAS/lib/libopenblas_armv7p-r0.2.14.so octave ./sgemm.m

I believe from the article that using OpenBLAS should result in Octave utilising cuBLAS, and therefore a far higher performance gain than the roughly 4x improvement I’m seeing.
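One way to test that belief is to look at what the preloaded library itself depends on. The sketch below filters ldd output for a cuBLAS dependency. The helper name mentions_lib is hypothetical, and the check only shows what the shared object is linked against, not what it might load at run time.

```shell
#!/bin/sh
# mentions_lib NAME: succeed if ldd-style text on stdin names a library containing NAME.
mentions_lib() {
    grep -qi -- "$1"
}

# Library path taken from the LD_PRELOAD command above.
lib=/home/ubuntu/OpenBLAS/lib/libopenblas_armv7p-r0.2.14.so
if [ ! -e "$lib" ]; then
    echo "library not found: $lib"
elif ldd "$lib" | mentions_lib cublas; then
    echo "this OpenBLAS build links against cuBLAS"
else
    echo "no cuBLAS dependency listed for this OpenBLAS build"
fi
```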

It seems to me that a 4x performance improvement could be explained by all 4 ARM cores being utilised (I’m imagining that Octave’s BLAS routines might only be using a single core).
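The ratio of the two elapsed times can be checked directly and compared with the number of CPU cores. This is only a rough sketch: a ratio close to the core count is consistent with multi-threaded CPU BLAS, but it does not prove where the work ran. The helper name speedup is made up for illustration.

```shell
#!/bin/sh
# speedup BASELINE_SECONDS NEW_SECONDS: how many times faster the second run was.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 1076.7 250.89        # default BLAS run vs. OpenBLAS run
getconf _NPROCESSORS_ONLN    # CPU cores currently online, for comparison
```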

The devblog article also suggests the use of nvBLAS. It seems that nvBLAS is not available under CUDA Toolkit on the TK1 - I imagine because nvBLAS requires a 64-bit platform, and the TK1 is 32-bit.

(Hence the attempt to use OpenBLAS - which the devblog article seems to indicate should be able to access cuBLAS under the CUDA Toolkit).

Any help would be much appreciated!

Thanks a lot - Evan

First, what version of L4T are you running?
Second, how did you install CUDA toolkit? Did you use JetPack for installation, or did you download the .deb file from the NVIDIA site?

Hi,
Apologies for the delay - I can see ‘R19’ is contained in the first line of /etc/nv_tegra_release.
I installed the CUDA toolkit from cuda-repo-l4t-r21.2-6-5-prod_6.5-34_armhf.deb, downloaded from the NVIDIA site.
It seems from the .deb file name that the version of L4T (installed via the CUDA Toolkit) should be at R21?
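The two release numbers can be compared mechanically. The sketch below assumes the release tag in /etc/nv_tegra_release appears as "R" followed by digits (as in the "R19" seen above) and that the package name encodes its target release as "l4t-r<major>". Both helper names are hypothetical.

```shell
#!/bin/sh
# installed_l4t FILE: major L4T release (e.g. 19) from the first line of a release file.
# Assumes the tag appears as "R" followed by digits.
installed_l4t() {
    head -n 1 "$1" | grep -o 'R[0-9][0-9]*' | head -n 1 | tr -d 'R'
}

# package_l4t NAME: major L4T release (e.g. 21) from a name containing "l4t-r<major>".
package_l4t() {
    printf '%s\n' "$1" | sed -n 's/.*l4t-r\([0-9][0-9]*\).*/\1/p'
}

deb=cuda-repo-l4t-r21.2-6-5-prod_6.5-34_armhf.deb
want=$(package_l4t "$deb")
if [ -r /etc/nv_tegra_release ]; then
    have=$(installed_l4t /etc/nv_tegra_release)
    if [ "$have" = "$want" ]; then
        echo "L4T R$have matches the package"
    else
        echo "L4T R$have installed, package targets R$want"
    fi
fi
```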
I don’t recall seeing anything else needed, e.g. perhaps to update export definitions?
Thanks for your help!

Hi, just a further update: I realised that I would need to re-flash the TK1 to get L4T up to r21 in order to use CUDA 6.5. I don’t want to do that unless necessary, as it took a fair amount of work to get Octave built successfully.
Instead I’ve removed CUDA Toolkit 6.5 and installed 6.0.
Following this I was able to build and run the CUDA samples OK.
Unfortunately I’m still not seeing any change in performance when I run Octave with OpenBLAS (as described above).
Thanks

Looking at the “Drop-in Acceleration of GNU Octave” article on the NVIDIA Technical Blog, I’d like to try nvblas. Is it available for the Jetson TK1 with toolkit 6.5? I see a /usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/include/nvblas.h, but no libnvblas.so.

Thanks

The NVBLAS Library is part of the CUDA Toolkit 6.5, however, it is only available on 64-bit operating systems.
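To confirm this on a particular board, the word size of the installed system and the presence of the library can both be checked. This is a minimal sketch: the toolkit path is the one from the question above, and the helper name find_lib is made up for illustration.

```shell
#!/bin/sh
# find_lib DIR NAME: list files under DIR whose names start with NAME (nothing if none).
find_lib() {
    find "$1" -name "$2*" 2>/dev/null
}

echo "userland word size: $(getconf LONG_BIT)-bit"
cuda=/usr/local/cuda-6.5     # toolkit path from the question above
if [ -n "$(find_lib "$cuda" libnvblas.so)" ]; then
    echo "libnvblas found under $cuda"
else
    echo "libnvblas not found under $cuda"
fi
```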