Hi, I’m hoping someone with better knowledge of the Jetson TK1, CUDA Toolkit and OpenBLAS can help.
My goal is to get Octave running on a Jetson TK1, with cuBLAS GPU acceleration of course.
I’m referencing a devblog covering Octave / NVIDIA at:
http://devblogs.nvidia.com/parallelforall/drop-in-acceleration-gnu-octave/
So far I’ve managed to get Octave compiled and running, and I’ve also installed the CUDA Toolkit and OpenBLAS.
Versions are:
Octave: 3.8.0 (built from source)
CUDA Toolkit: 6.5 (installed via apt-get)
OpenBLAS: r0.2.14 (built from source)
The devblog page contains a simple Octave script called sgemm.m for exercising BLAS routines.
When I execute ‘octave ./sgemm.m’ on the TK1, I get this output:
1076.7
1.0212
According to the article, this represents around 1 GFLOPS performance (i.e. no GPU acceleration is occurring).
When I attempt to run using OpenBLAS, I get the following output:
250.89
4.3824
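As a sanity check on the two sets of figures above, here is a minimal shell sketch that recomputes the second number from the first. It assumes the first line of output is the elapsed time in seconds, the second is GFLOPS, and that sgemm.m multiplies two N x N single-precision matrices with N = 8192 (my assumption, inferred from the numbers; such a multiply costs about 2*N^3 floating-point operations). The function name gflops is only for illustration.

```shell
# Recompute GFLOPS from elapsed seconds for an N x N matrix multiply.
# Assumption: N = 8192 and a cost of 2*N^3 floating-point operations.
gflops() {
    # $1 = matrix dimension N, $2 = elapsed time in seconds
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.4f\n", 2 * n * n * n / (t * 1e9) }'
}

gflops 8192 1076.7   # default BLAS run
gflops 8192 250.89   # OpenBLAS run
```

Under those assumptions both runs reproduce the printed figures (1.0212 and 4.3824), so the two lines of output are consistent with each other.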
I’m using LD_PRELOAD on the command line to run with OpenBLAS:
OMP_NUM_THREADS=20 LD_PRELOAD=/home/ubuntu/OpenBLAS/lib/libopenblas_armv7p-r0.2.14.so octave ./sgemm.m
I believe from the article that using OpenBLAS should result in Octave utilising cuBLAS, and should therefore give a far larger performance gain than the roughly 4x improvement I’m seeing.
It seems to me that a 4x improvement could be explained by all 4 ARM cores being utilised (I’m imagining that Octave’s default BLAS routines might only be using a single core, though I’m not sure of that).
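For reference, the "roughly 4x" figure comes from the ratio of the two elapsed times. A minimal sketch of that arithmetic, assuming the first number printed by each run is the elapsed time in seconds (the function name speedup is only for illustration):

```shell
# Ratio of the two elapsed times reported above (default BLAS vs OpenBLAS).
speedup() {
    # $1 = baseline seconds, $2 = faster run's seconds
    awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", base / fast }'
}

speedup 1076.7 250.89
```

This gives a ratio of about 4.29, which is in the range I would expect from spreading the work over 4 CPU cores rather than from GPU offload.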
The devblog article also suggests the use of NVBLAS. It seems that NVBLAS is not available in the CUDA Toolkit on the TK1. I imagine this is because NVBLAS requires a 64-bit platform, and the TK1 is 32-bit.
(Hence the attempt to use OpenBLAS, which the devblog article seems to indicate should be able to access cuBLAS under the CUDA Toolkit.)
Any help would be much appreciated!
Thanks a lot - Evan