CUDA error when running matrixMulCUBLAS sample - Ubuntu 16.04

Hello, I am trying to install TensorFlow, but I am getting an error when I run a basic CUDA example.

./matrixMulCUBLAS
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GeForce GTX 1080 Ti" with compute capability 6.1

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
CUDA error at matrixMulCUBLAS.cpp:277 code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)"

I have spent two days trying to fix this. It also prevents me from running a basic TensorFlow demo, which fails with the following error:

2017-05-03 12:26:24.945147: E tensorflow/stream_executor/cuda/cuda_blas.cc:365] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

Thanks.

Your CUDA install may be broken.

Follow the instructions in the Linux installation guide, including the verification steps.

I followed that guide (the CUDA Installation Guide for Linux) and both tests passed for me.

./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1080 Ti"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11171 MBytes (11713708032 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080 Ti
Result = PASS

./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on…

Device 0: GeForce GTX 1080 Ti
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11321.7

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12880.9

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 345021.6

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

$PATH
bash: /usr/local/cuda-8.0/bin:/home/fernando/bin:/home/fernando/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/lib/jvm/java-8-oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/java-8-oracle/jre/bin:

$CUDA_HOME
bash: /usr/local/cuda: Is a directory

$LD_LIBRARY_PATH
bash: /usr/local/cuda/lib64: Is a directory
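For reference, these come from my ~/.bashrc; a sketch of the lines I added (assuming the default CUDA 8.0 install paths -- adjust if yours differ):

```shell
# ~/.bashrc additions for a default CUDA 8.0 install
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda-8.0
```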

All the CUDA examples work EXCEPT the ones that use cuBLAS, for some reason.

Is /usr/local/cuda symlinked to /usr/local/cuda-8.0?

Alternatively, what is the result of:

ls /usr/local/cuda/lib64

?
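For reference, a quick way to check the symlink target (sketched here on a temporary path so it runs anywhere; on your system the path would be /usr/local/cuda):

```shell
# Demonstration on a throwaway path; substitute /usr/local/cuda on a real system.
mkdir -p /tmp/cuda-8.0
ln -sfn /tmp/cuda-8.0 /tmp/cuda   # mimic the symlink the CUDA installer creates
readlink -f /tmp/cuda             # prints the symlink's target
# Real check:
#   readlink -f /usr/local/cuda   # should print /usr/local/cuda-8.0
```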

Yes, it's symlinked. I actually just changed my PATH to use the symlink instead.

$PATH
bash: /usr/local/cuda/bin:/home/fernando/bin:/home/fernando/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/lib/jvm/java-8-oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/java-8-oracle/jre/bin: No such file or directory

Here is the output of the command you wanted:

ls /usr/local/cuda/lib64

libcublas_device.a
libcusparse.so
libnppist.so.8.0
libcublas.so
libcusparse.so.8.0
libnppist.so.8.0.61
libcublas.so.8.0
libcusparse.so.8.0.61
libnppisu.so
libcublas.so.8.0.61
libcusparse_static.a
libnppisu.so.8.0
libcublas_static.a
libnppc.so
libnppisu.so.8.0.61
libcudadevrt.a
libnppc.so.8.0
libnppitc.so
libcudart.so
libnppc.so.8.0.61
libnppitc.so.8.0
libcudart.so.8.0
libnppc_static.a
libnppitc.so.8.0.61
libcudart.so.8.0.61
libnppial.so
libnpps.so
libcudart_static.a
libnppial.so.8.0
libnpps.so.8.0
libcudnn.so
libnppial.so.8.0.61
libnpps.so.8.0.61
libcudnn.so.5
libnppicc.so
libnpps_static.a
libcudnn.so.5.1.10
libnppicc.so.8.0
libnvblas.so
libcudnn_static.a
libnppicc.so.8.0.61
libnvblas.so.8.0
libcufft.so
libnppicom.so
libnvblas.so.8.0.61
libcufft.so.8.0
libnppicom.so.8.0
libnvgraph.so
libcufft.so.8.0.61
libnppicom.so.8.0.61
libnvgraph.so.8.0
libcufft_static.a
libnppidei.so
libnvgraph.so.8.0.61
libcufftw.so
libnppidei.so.8.0
libnvgraph_static.a
libcufftw.so.8.0
libnppidei.so.8.0.61
libnvrtc-builtins.so
libcufftw.so.8.0.61
libnppif.so
libnvrtc-builtins.so.8.0
libcufftw_static.a
libnppif.so.8.0
libnvrtc-builtins.so.8.0.61
libcuinj64.so
libnppif.so.8.0.61
libnvrtc.so
libcuinj64.so.8.0
libnppig.so
libnvrtc.so.8.0
libcuinj64.so.8.0.61
libnppig.so.8.0
libnvrtc.so.8.0.61
libculibos.a
libnppig.so.8.0.61
libnvToolsExt.so
libcurand.so
libnppim.so
libnvToolsExt.so.1
libcurand.so.8.0
libnppim.so.8.0
libnvToolsExt.so.1.0.0
libcurand.so.8.0.61
libnppim.so.8.0.61
libOpenCL.so
libcurand_static.a
libnppi.so
libOpenCL.so.1
libcusolver.so
libnppi.so.8.0
libOpenCL.so.1.0
libcusolver.so.8.0
libnppi.so.8.0.61
libOpenCL.so.1.0.0
libcusolver.so.8.0.61
libnppi_static.a
stubs
libcusolver_static.a
libnppist.so

Well, I'm running out of ideas.

What GPU driver do you have installed? Stated another way, what is the output of

nvidia-smi

on your machine?

nvidia-smi
Wed May 3 13:39:46 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.09                 Driver Version: 381.09                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:02:00.0      On |                  N/A |
| 23%   33C    P5    15W / 250W |    367MiB / 11171MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1073    G   /usr/lib/xorg/Xorg                             251MiB |
|    0      1769    G   compiz                                         110MiB |
|    0      2012    G   /usr/lib/firefox/firefox                         2MiB |
+-----------------------------------------------------------------------------+

That shows driver 381, as I should have for the 1080 Ti.

I am running into the same problem here. I am also using a GTX 1080 Ti. My driver version is slightly different from yours (Driver Version: 381.22).

Did you manage to solve this issue?

I managed to solve the problem.

I realized that there was an error with my CUDA installation, specifically with the cuBLAS library. You can check if yours has the same problem by running the sample program simpleCUBLAS:

cd /usr/local/cuda/samples/7_CUDALibraries/simpleCUBLAS # check if your samples are in the same directory
make
./simpleCUBLAS

I was getting an error when I tried to run it, so I reinstalled CUDA 8.0 and it solved the issue.

Update:

I ran into the same issue again after a while, and this time I solved it by simply erasing the cache in the ~/.nv directory:

sudo rm -rf ~/.nv/

I hope it helps you.

How can your pinned-memory host<->device transfers be so fast? Are you using DDR4 as host memory? Mine only reaches about 6 GB/s.

6 GB/s is consistent with a PCIe Gen2 link; 12 GB/s is consistent with a PCIe Gen3 link. So the difference is due to the types of systems being compared here.
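For what it's worth, the arithmetic behind those figures, as a sketch (the encoding overheads come from the PCIe spec; the observed numbers are typical rather than guaranteed):

```shell
# Theoretical x16 payload bandwidth in GB/s, before protocol overhead.
# Gen2: 5 GT/s per lane with 8b/10b encoding; Gen3: 8 GT/s with 128b/130b.
gen2=$(awk 'BEGIN { printf "%.1f", 5 * 16 * (8/10) / 8 }')
gen3=$(awk 'BEGIN { printf "%.2f", 8 * 16 * (128/130) / 8 }')
echo "PCIe Gen2 x16: ${gen2} GB/s peak"   # real pinned transfers: ~6 GB/s
echo "PCIe Gen3 x16: ${gen3} GB/s peak"   # real pinned transfers: ~12 GB/s
# To see what link your card actually negotiated:
#   nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
```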

Did you try running the cuBLAS examples as the super user? And did you try removing the directory ~/.nv before running the cuBLAS example?

Yes, but I can see that a PCIe x16 Gen3 slot is in use. My workstation is an HP Z820 with E5-2687W CPUs, so it is unreasonable for the card to be running in Gen2 mode; it's confusing. Is there anything I can do to overcome this?

I suppose this is a driver problem. On Windows 8.1, when I hit the same phenomenon, I was able to edit the registry to force my card to run at PCIe 3.0, by setting RMPcieLinkSpeed to 4 and restarting the system. However, it seems I have no way to do this on Linux.

What's your system and driver? Thanks in advance.

Hi taironemagalhaes, your solution works perfectly on my computer. Do you have any idea why the solution works? Thank you very much.

I have the same error, but it goes away when I run using sudo.

e.g.:

python filename.py - has error
2017-12-30 21:17:38.603832: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

sudo python filename.py - no error!


/usr/local/cuda/samples/7_CUDALibraries/simpleCUBLAS$ ./simpleCUBLAS
GPU Device 0: "GeForce GTX 1050 Ti" with compute capability 6.1

simpleCUBLAS test running...
!!! CUBLAS initialization error

/usr/local/cuda/samples/7_CUDALibraries/simpleCUBLAS$ sudo ./simpleCUBLAS
GPU Device 0: "GeForce GTX 1050 Ti" with compute capability 6.1

simpleCUBLAS test running...
simpleCUBLAS test passed.

why is this?

Just updating...
This thread helped:

https://stackoverflow.com/questions/42488615/failed-to-create-cublas-handle-tensorflow-interaction-with-opencv

I removed this directory:

sudo rm -rf ~/.nv/
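My guess at why this works (an assumption on my part, not something I have confirmed): CUDA keeps a per-user compute cache under ~/.nv, and if that directory is ever created or touched by a root-owned run, your own user can fail to initialize cuBLAS until it is removed. A sketch of the ownership check, on a stand-in path so it runs anywhere:

```shell
# Stand-in for ~/.nv so the sketch runs anywhere; use ~/.nv on a real system.
mkdir -p /tmp/fake-nv
stat -c 'owner: %U' /tmp/fake-nv   # shows who owns the cache directory
# Real check:
#   stat -c '%U' ~/.nv    # if it prints "root", the cache came from a sudo run
#   sudo rm -rf ~/.nv     # remove it; it gets recreated with your ownership
```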

I am trying to compile Caffe:

https://github.com/CMU-Perceptual-Computing-Lab/caffe_train

I successfully run:

make all
make test

But when I run

make runtest

I get the following error; please note lines 9-10 and 90 onward:

Cuda number of devices: 1
Setting to use device 1
Current device id: 0
Current device name: GeForce GTX 950M
Note: Randomizing tests' orders with a seed of 48454 .
[==========] Running 2081 tests from 277 test cases.
[----------] Global test environment set-up.
[----------] 10 tests from PowerLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] PowerLayerTest/0.TestPowerTwo
E0504 19:57:11.898780 14435 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
[       OK ] PowerLayerTest/0.TestPowerTwo (550 ms)
[ RUN      ] PowerLayerTest/0.TestPowerOne
[       OK ] PowerLayerTest/0.TestPowerOne (0 ms)
[ RUN      ] PowerLayerTest/0.TestPowerOneGradient
[       OK ] PowerLayerTest/0.TestPowerOneGradient (1 ms)
[ RUN      ] PowerLayerTest/0.TestPower
[       OK ] PowerLayerTest/0.TestPower (0 ms)
[ RUN      ] PowerLayerTest/0.TestPowerGradient
[       OK ] PowerLayerTest/0.TestPowerGradient (3 ms)
[ RUN      ] PowerLayerTest/0.TestPowerGradientShiftZero
[       OK ] PowerLayerTest/0.TestPowerGradientShiftZero (5 ms)
[ RUN      ] PowerLayerTest/0.TestPowerTwoGradient
[       OK ] PowerLayerTest/0.TestPowerTwoGradient (1 ms)
[ RUN      ] PowerLayerTest/0.TestPowerTwoScaleHalfGradient
[       OK ] PowerLayerTest/0.TestPowerTwoScaleHalfGradient (2 ms)
[ RUN      ] PowerLayerTest/0.TestPowerZero
[       OK ] PowerLayerTest/0.TestPowerZero (0 ms)
[ RUN      ] PowerLayerTest/0.TestPowerZeroGradient
[       OK ] PowerLayerTest/0.TestPowerZeroGradient (1 ms)
[----------] 10 tests from PowerLayerTest/0 (563 ms total)

[----------] 3 tests from SplitLayerTest/1, where TypeParam = caffe::CPUDevice<double>
[ RUN      ] SplitLayerTest/1.Test
[       OK ] SplitLayerTest/1.Test (0 ms)
[ RUN      ] SplitLayerTest/1.TestGradient
[       OK ] SplitLayerTest/1.TestGradient (3 ms)
[ RUN      ] SplitLayerTest/1.TestSetup
[       OK ] SplitLayerTest/1.TestSetup (0 ms)
[----------] 3 tests from SplitLayerTest/1 (3 ms total)

[----------] 2 tests from EuclideanLossLayerTest/1, where TypeParam = caffe::CPUDevice<double>
[ RUN      ] EuclideanLossLayerTest/1.TestGradient
[       OK ] EuclideanLossLayerTest/1.TestGradient (1 ms)
[ RUN      ] EuclideanLossLayerTest/1.TestForward
[       OK ] EuclideanLossLayerTest/1.TestForward (0 ms)
[----------] 2 tests from EuclideanLossLayerTest/1 (1 ms total)

[----------] 8 tests from SliceLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] SliceLayerTest/3.TestSetupChannels
[       OK ] SliceLayerTest/3.TestSetupChannels (9 ms)
[ RUN      ] SliceLayerTest/3.TestSliceAcrossNum
[       OK ] SliceLayerTest/3.TestSliceAcrossNum (1 ms)
[ RUN      ] SliceLayerTest/3.TestTrivialSlice
[       OK ] SliceLayerTest/3.TestTrivialSlice (3 ms)
[ RUN      ] SliceLayerTest/3.TestSetupNum
[       OK ] SliceLayerTest/3.TestSetupNum (2 ms)
[ RUN      ] SliceLayerTest/3.TestGradientAcrossNum
[       OK ] SliceLayerTest/3.TestGradientAcrossNum (411 ms)
[ RUN      ] SliceLayerTest/3.TestGradientAcrossChannels
[       OK ] SliceLayerTest/3.TestGradientAcrossChannels (414 ms)
[ RUN      ] SliceLayerTest/3.TestGradientTrivial
[       OK ] SliceLayerTest/3.TestGradientTrivial (18 ms)
[ RUN      ] SliceLayerTest/3.TestSliceAcrossChannels
[       OK ] SliceLayerTest/3.TestSliceAcrossChannels (2 ms)
[----------] 8 tests from SliceLayerTest/3 (860 ms total)

[----------] 8 tests from LRNLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] LRNLayerTest/0.TestForwardAcrossChannelsLargeRegion
[       OK ] LRNLayerTest/0.TestForwardAcrossChannelsLargeRegion (0 ms)
[ RUN      ] LRNLayerTest/0.TestSetupWithinChannel
[       OK ] LRNLayerTest/0.TestSetupWithinChannel (0 ms)
[ RUN      ] LRNLayerTest/0.TestSetupAcrossChannels
[       OK ] LRNLayerTest/0.TestSetupAcrossChannels (0 ms)
[ RUN      ] LRNLayerTest/0.TestGradientAcrossChannelsLargeRegion
[       OK ] LRNLayerTest/0.TestGradientAcrossChannelsLargeRegion (533 ms)
[ RUN      ] LRNLayerTest/0.TestForwardWithinChannel
[       OK ] LRNLayerTest/0.TestForwardWithinChannel (0 ms)
[ RUN      ] LRNLayerTest/0.TestForwardAcrossChannels
[       OK ] LRNLayerTest/0.TestForwardAcrossChannels (0 ms)
[ RUN      ] LRNLayerTest/0.TestGradientAcrossChannels
[       OK ] LRNLayerTest/0.TestGradientAcrossChannels (483 ms)
[ RUN      ] LRNLayerTest/0.TestGradientWithinChannel
[       OK ] LRNLayerTest/0.TestGradientWithinChannel (438 ms)
[----------] 8 tests from LRNLayerTest/0 (1454 ms total)

[----------] 50 tests from NeuronLayerTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN      ] NeuronLayerTest/2.TestLogGradient
[       OK ] NeuronLayerTest/2.TestLogGradient (15 ms)
[ RUN      ] NeuronLayerTest/2.TestLogLayerBase2Shift1Scale3
F0504 19:57:14.253590 14435 math_functions.cu:85] Check failed: status == CUBLAS_STATUS_SUCCESS (1 vs. 0)  CUBLAS_STATUS_NOT_INITIALIZED
*** Check failure stack trace: ***
    @     0x2b111fffadaa  (unknown)
    @     0x2b111ffface4  (unknown)
    @     0x2b111fffa6e6  (unknown)
    @     0x2b111fffd687  (unknown)
    @     0x2b1122183d17  caffe::caffe_gpu_scal<>()
    @     0x2b1122176279  caffe::LogLayer<>::Forward_gpu()
    @           0x477e46  caffe::Layer<>::Forward()
    @           0x548d90  caffe::NeuronLayerTest<>::TestLogForward()
    @           0x8fca63  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @           0x8f3747  testing::Test::Run()
    @           0x8f37ee  testing::TestInfo::Run()
    @           0x8f38f5  testing::TestCase::Run()
    @           0x8f6c38  testing::internal::UnitTestImpl::RunAllTests()
    @           0x8f6ec7  testing::UnitTest::Run()
    @           0x46cbbf  main
    @     0x2b1122ff1f45  (unknown)
    @           0x474819  (unknown)
    @              (nil)  (unknown)
make: *** [runtest] Aborted (core dumped)

A thread with my problem points to this discussion to solve the problem: