Deeplearning4j and CUDA

I am trying to get dl4j running on a computer that has NVIDIA GPUs.
dl4j requires CUDA toolkit v8. Our ops team installed several CUDA versions under /usr/local/cuda-xx/,
where xx is 5.5, 7.5, 8, etc., up to v9.

Now when I run a Java application that uses the dl4j library to train a model, it prints that it uses JCublasBackend, then freezes for more than 24 hours, and finally throws an exception saying no CUDA devices were found.

I talked to the dl4j developers; they said that having multiple CUDA versions on that host could be the problem, because dl4j requires exactly v8, and if v5.5 is already loaded, v8 won’t be picked up even when I add it to the system PATH.

My question is: how can I check which version of CUDA is loaded? Also, is there a way to make my application load v8?

Thank you

Set your LD_LIBRARY_PATH to point to the v8 CUDA libraries you want to use.

Instructions are contained in the CUDA Linux install guide.
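Concretely, a sketch of how to see what the dynamic linker will pick up and how to put CUDA 8 first (paths assume the usual /usr/local/cuda-X.Y layout on your host):

```shell
# List the directories the loader searches first, then the libcudart
# copies it knows about:
echo "$LD_LIBRARY_PATH" | tr ':' '\n'
ldconfig -p | grep libcudart || true

# Prepend (not append) the CUDA 8 directories so they take priority
# over any older version already on the path:
export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```

Order matters here: the loader takes the first match, so appending CUDA 8 after an older version leaves the old one winning.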

Thank you for your reply.

I did that:

setenv CUDA /usr/local/cuda-8.0/bin
setenv PATH ${PATH}:${CUDA}
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/local/cuda-8.0/lib64

When I run cat /proc/driver/nvidia/version, it shows the following:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.81 Sat Sep 2 02:43:11 PDT 2017
GCC version: gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)

which is the same information as before setting LD_LIBRARY_PATH to CUDA 8. Any help?

You’re displaying the driver version. That driver (384.81) will work with any CUDA version of 9.0 or less. Setting PATH and LD_LIBRARY_PATH does not change your installed driver version, and for this situation there is no need to do so. GPU drivers are backward compatible with previous CUDA versions. 384.81 is compatible with CUDA 9, CUDA 8, CUDA 7.5, CUDA 7, etc.

If you want to see the CUDA version currently selected for compilation, try:

nvcc --version

I printed nvcc --version before running my dl4j example:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44

When I run my java example I get this error:
CUDA error at /home/jenkins/workspace/dl4j/all-multiplatform@2_linux-x86_64/stream1/libnd4j/blas/cuda/NativeOps.cu:4729 code=46(cudaErrorDevicesUnavailable) “result”
Exception in thread “main” java.lang.ExceptionInInitializerError
at org.deeplearning4j.nn.conf.NeuralNetConfiguration$Builder.seed(NeuralNetConfiguration.java:777)
at org.deeplearning4j.examples.feedforward.mnist.MLPMnistSingleLayerExample.main(MLPMnistSingleLayerExample.java:64)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:6212)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:6087)
at org.nd4j.linalg.factory.Nd4j.&lt;clinit&gt;(Nd4j.java:201)
… 2 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:6188)
… 4 more
Caused by: java.lang.ExceptionInInitializerError
at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.&lt;clinit&gt;(JCublasNDArrayFactory.java:86)
… 9 more
Caused by: java.lang.NullPointerException
at org.nd4j.jita.allocator.pointers.CudaPointer.&lt;init&gt;(CudaPointer.java:22)
at org.nd4j.jita.allocator.pointers.cuda.cudaStream_t.&lt;init&gt;(cudaStream_t.java:17)
at org.nd4j.linalg.jcublas.context.CudaContext.initOldStream(CudaContext.java:161)
at org.nd4j.jita.allocator.context.impl.BasicContextPool.createNewStream(BasicContextPool.java:175)
at org.nd4j.jita.allocator.context.impl.LimitedContextPool.fillPoolWithResources(LimitedContextPool.java:93)
at org.nd4j.jita.allocator.context.impl.LimitedContextPool.&lt;init&gt;(LimitedContextPool.java:56)
at org.nd4j.jita.handler.impl.CudaZeroHandler.&lt;init&gt;(CudaZeroHandler.java:131)
at org.nd4j.jita.allocator.impl.AtomicAllocator.&lt;init&gt;(AtomicAllocator.java:128)
at org.nd4j.jita.allocator.impl.AtomicAllocator.&lt;clinit&gt;(AtomicAllocator.java:74)

This looks like a machine configuration problem:

code=46(cudaErrorDevicesUnavailable)

I would make sure that ordinary CUDA programs can run in your setup. Also make sure your GPUs are in Default compute mode (use nvidia-smi).
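A sketch of that check, assuming a reasonably recent nvidia-smi (switching the mode back requires root, so the admins would have to do that part):

```shell
# Show the compute mode of every GPU; "Default" allows multiple host
# processes to share a device, "Exclusive Process" restricts it to one.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,name,compute_mode --format=csv
else
  echo "nvidia-smi not found in PATH"
fi
# To switch back (root required):  sudo nvidia-smi -c DEFAULT
```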

Also, for CUDA 8, I would not recommend CUDA 8.0.44 but instead CUDA 8.0.61

Thank you again.

I took dl4j out of the picture. Now I have only JCuda, Java, and the CUDA software.

I tried the example here: jcuda.org - Tutorial

It failed because it was looking for glibc 2.14 and the installed version was 2.12.

I installed 2.14 in a different location and added that location to LD_LIBRARY_PATH. Now the error disappears, but the application freezes: no output, no error, no exit, it just runs forever.
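One way to confirm the mismatch before working around it (the JCuda native library filename below is a guess; point objdump at whatever .so the tutorial shipped):

```shell
# System glibc version:
ldd --version | head -n1
# glibc symbol versions the native library actually requires:
objdump -T libJCudaDriver-linux-x86_64.so 2>/dev/null \
  | grep -o 'GLIBC_[0-9.]*' | sort -u || true
```

Note that pointing LD_LIBRARY_PATH at a second glibc can be fragile: libc.so.6 generally has to match the ld-linux loader it was built with, and a silent hang is a plausible symptom of mixing them.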

After installing CUDA, and before using it for anything else, I recommend performing the verification steps in the appropriate (i.e. Linux, in this case) install guide. For that matter, I would suggest making sure you followed all of the instructions in the Linux install guide.

I can’t find the samples folder or the sample installation script in my cuda-8.0 folder. Is there a way to verify the installation without having to install the samples?

Is there any way to get deviceQuery without installing the samples?

If not, is there any way to install the samples on top of the existing installation via wget? I can’t run sudo commands.
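As an aside, the CUDA .run installer can be unpacked without root, which is one way to get the samples into a home directory. A sketch, where the runfile name is illustrative (use whichever CUDA 8 runfile you download):

```shell
runfile=cuda_8.0.61_375.26_linux.run   # illustrative name; use your download
if [ -f "$runfile" ]; then
  # --extract unpacks the driver, toolkit, and samples installers
  # without installing anything system-wide:
  sh "$runfile" --extract="$HOME/cuda8-parts"
  ls "$HOME/cuda8-parts"
fi
```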

I found deviceQuery installed under /usr/local/cuda-8.0/extras/demo_suite/.

When I run it, I get the following output:

[a@g demo_suite]$ ./deviceQuery
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Then it hangs. Nothing more is printed and it doesn’t exit.
We have several CUDA versions installed. I also tried setting PATH and LD_LIBRARY_PATH:
export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64

any ideas?

It’s evident that your cluster install is not working correctly.

You should probably raise this issue with the cluster admins, who have sudo access.

Before talking to the admins: could it be that the GPU is so busy that it does not respond to deviceQuery?

no, not in my experience

So, I found a way to log in to the computer with the GPUs. I ran deviceQuery and it printed the following:
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: “Tesla K10.G1.8GB”
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3527 MBytes (3698524160 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 38 / 0
Compute Mode:
< Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

Device 1: “Tesla K10.G1.8GB”
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3527 MBytes (3698524160 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 39 / 0
Compute Mode:
< Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

Device 2: “Tesla K10.G1.8GB”
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3527 MBytes (3698524160 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 42 / 0
Compute Mode:
< Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

Device 3: “Tesla K10.G1.8GB”
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3527 MBytes (3698524160 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 43 / 0
Compute Mode:
< Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

Peer access from Tesla K10.G1.8GB (GPU0) → Tesla K10.G1.8GB (GPU1) : Yes
Peer access from Tesla K10.G1.8GB (GPU0) → Tesla K10.G1.8GB (GPU2) : Yes
Peer access from Tesla K10.G1.8GB (GPU0) → Tesla K10.G1.8GB (GPU3) : Yes
Peer access from Tesla K10.G1.8GB (GPU1) → Tesla K10.G1.8GB (GPU0) : Yes
Peer access from Tesla K10.G1.8GB (GPU1) → Tesla K10.G1.8GB (GPU2) : Yes
Peer access from Tesla K10.G1.8GB (GPU1) → Tesla K10.G1.8GB (GPU3) : Yes
Peer access from Tesla K10.G1.8GB (GPU2) → Tesla K10.G1.8GB (GPU0) : Yes
Peer access from Tesla K10.G1.8GB (GPU2) → Tesla K10.G1.8GB (GPU1) : Yes
Peer access from Tesla K10.G1.8GB (GPU2) → Tesla K10.G1.8GB (GPU3) : Yes
Peer access from Tesla K10.G1.8GB (GPU3) → Tesla K10.G1.8GB (GPU0) : Yes
Peer access from Tesla K10.G1.8GB (GPU3) → Tesla K10.G1.8GB (GPU1) : Yes
Peer access from Tesla K10.G1.8GB (GPU3) → Tesla K10.G1.8GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla K10.G1.8GB, Device1 = Tesla K10.G1.8GB, Device2 = Tesla K10.G1.8GB, Device3 = Tesla K10.G1.8GB
Result = PASS

Previously, deviceQuery was freezing when I submitted a job from a computer outside the grid to the machine above, which has the GPUs.

Now the problem is that I still can’t get the basic JCuda example to run. The program failed because glibc 2.14 was not there; I installed it alongside the existing 2.12 and added it to LD_LIBRARY_PATH. Now the program does not crash, but it freezes again.
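When a JVM wedges like this, a thread dump usually shows exactly which native call it is stuck in. A sketch using the standard JDK tools, where the `jcuda` grep pattern is a guess at the example's main-class name:

```shell
# Find the stuck JVM and dump its thread stacks; look for threads in
# state RUNNABLE sitting inside JCuda/JNI calls.
pid=$(jps -l | grep -i jcuda | awk '{print $1}')
if [ -n "$pid" ]; then
  jstack "$pid" > threaddump.txt
  grep -A3 'RUNNABLE' threaddump.txt | head
fi
```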