Cuda cublasSsyrk error

Hi all,

I am new to CUDA and only use it because a program I rely on requires it. I am on a fresh install of Ubuntu 18.04 (kernel 5.4) with driver 455 (Quadro T1000) and CUDA 9.1. After trying many other methods, I installed CUDA following the steps in this link: https://gist.github.com/DaneGardner/accd6fd330348543167719002a661bd5. The installation completed without errors.

The program I use is FSL (for medical imaging computations). I contacted the developers of this tool because I could not find the cause of the error, but they say it is coming from CUDA (error message below).

The only possible cause I could find is that CUDA 9.1 is not compatible with gcc/g++ versions newer than 6. I installed version 6, but version 7 also seems to be installed. I tried the following commands to force the use of version 6:

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-6 6
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-6 6
sudo update-alternatives --install /usr/bin/cc gcc /usr/bin/gcc-6 6
sudo update-alternatives --install /usr/bin/cc g++ /usr/bin/g++-6 6

sudo ln -s /usr/bin/gcc-6 /usr/local/cuda/bin/gcc
sudo ln -s /usr/bin/g++-6 /usr/local/cuda/bin/g++

without success…
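
To see which compiler version is actually picked up after commands like these, something along the following lines could be used. This is only a sketch: `compiler_major` is a helper name made up for illustration (it is not part of gcc or CUDA), and it assumes the compilers accept `-dumpversion`, as gcc and g++ do.

```shell
# Print the major version of a compiler, or "missing" if it is not on PATH.
# compiler_major is a hypothetical helper, not part of gcc or CUDA.
compiler_major() {
    if command -v "$1" >/dev/null 2>&1; then
        "$1" -dumpversion | cut -d. -f1
    else
        echo "missing"
    fi
}

# Report what gcc, g++ and cc currently resolve to.
for c in gcc g++ cc; do
    echo "$c -> $(compiler_major "$c")"
done

# On Debian/Ubuntu, also list the alternatives registered under the name gcc.
if command -v update-alternatives >/dev/null 2>&1; then
    update-alternatives --display gcc || true
fi
```

If gcc and g++ report 6 here, the alternatives are set as intended, and the compiler selection is less likely to be the cause.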

Does anyone know what the problem could be?
Please let me know if more (system) information is needed.

Many thanks,
Anouk

Error message:
EDDY::: EddyInternalGpuUtils::make_XtX_cuBLAS: cublasSsyrk error
EDDY::: cuda/EddyInternalGpuUtils.cu::: static NEWMAT::Matrix EDDY::EddyInternalGpuUtils::make_XtX_cuBLAS(const EDDY::CudaVolume4D&): Exception thrown
EDDY::: cuda/EddyInternalGpuUtils.cu::: static double EDDY::EddyInternalGpuUtils::param_update(const NEWIMAGE::volume&, std::shared_ptr<const NEWIMAGE::volume >, std::shared_ptr<const NEWIMAGE::volume >, const NEWIMAGE::volume&, EDDY::Parameters, bool, float, const EDDY::PolationPara&, unsigned int, unsigned int, unsigned int, EDDY::ECScan&, NEWMAT::ColumnVector*): Exception thrown
EDDY::: cuda/EddyGpuUtils.cu::: static double EDDY::EddyGpuUtils::MovAndECParamUpdate(const NEWIMAGE::volume&, std::shared_ptr<const NEWIMAGE::volume >, std::shared_ptr<const NEWIMAGE::volume >, const NEWIMAGE::volume&, bool, float, const EDDY::PolationPara&, EDDY::ECScan&): Exception thrown
EDDY::: eddy.cpp::: EDDY::ReplacementManager* EDDY::Register(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, unsigned int, const std::vector<float, std::allocator >&, EDDY::SecondLevelECModel, bool, EDDY::ECScanManager&, EDDY::ReplacementManager*, NEWMAT::Matrix&, NEWMAT::Matrix&): Exception thrown
EDDY::: Eddy failed with message EDDY::: eddy.cpp::: EDDY::ReplacementManager* EDDY::DoSliceToVolumeRegistration(const EDDY::EddyCommandLineOptions&, unsigned int, bool, EDDY::ECScanManager&, EDDY::ReplacementManager*): Exception thrown

I am unable to edit the previous post, so I am adding to it with a reply.

Based on the error message above, it seems that cublasSsyrk returns something other than CUBLAS_STATUS_SUCCESS, but I cannot find a method to test this. I am not familiar with CUDA coding myself, so I have no idea how to verify the installation.
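
For reference, a check of this kind could be sketched as follows. The script writes a small program that calls cublasSsyrk once on a 4 x 8 matrix of ones and prints the returned status. It assumes nvcc is on the PATH and that the cuBLAS library of the same toolkit is installed; the file names syrk_check.cu and syrk_check are arbitrary, and this is not how FSL itself calls cuBLAS.

```shell
# Write a tiny cuBLAS test program (file name is arbitrary).
cat > syrk_check.cu <<'EOF'
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 4, k = 8;              /* C (n x n) = A (n x k) * A^T */
    float hA[4 * 8], hC0 = -1.0f;
    float *dA = NULL, *dC = NULL;
    const float alpha = 1.0f, beta = 0.0f;
    cublasHandle_t handle;
    cublasStatus_t st;
    int i;

    for (i = 0; i < n * k; ++i) hA[i] = 1.0f;   /* A is all ones */

    if (cudaMalloc((void **)&dA, sizeof(hA)) != cudaSuccess ||
        cudaMalloc((void **)&dC, n * n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

    st = cublasCreate(&handle);
    printf("cublasCreate status: %d\n", (int)st);
    if (st != CUBLAS_STATUS_SUCCESS) return 1;

    /* Column-major A with leading dimension n; beta = 0, so C is output only. */
    st = cublasSsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                     n, k, &alpha, dA, n, &beta, dC, n);
    printf("cublasSsyrk status: %d (0 is CUBLAS_STATUS_SUCCESS)\n", (int)st);

    printf("sync: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaMemcpy(&hC0, dC, sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %f (8 if the call worked)\n", hC0);

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dC);
    return st == CUBLAS_STATUS_SUCCESS ? 0 : 1;
}
EOF

# Compile and run only if nvcc is available.
if command -v nvcc >/dev/null 2>&1; then
    nvcc syrk_check.cu -o syrk_check -lcublas && ./syrk_check \
        || echo "build or run failed"
else
    echo "nvcc not found on PATH"
fi
```

A non-zero cublasSsyrk status, or an error string from the synchronize call, would point at the CUDA/cuBLAS installation rather than at FSL.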

I ran a few of the CUDA samples, and these are the outputs:

$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro T1000 with Max-Q Design"
  CUDA Driver Version / Runtime Version          11.1 / 9.1
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 3912 MBytes (4101898240 bytes)
MapSMtoCores for SM 7.5 is undefined.  Default to use 64 Cores/SM
MapSMtoCores for SM 7.5 is undefined.  Default to use 64 Cores/SM
  (14) Multiprocessors, ( 64) CUDA Cores/MP:     896 CUDA Cores
  GPU Max Clock rate:                            1350 MHz (1.35 GHz)
  Memory Clock rate:                             5001 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.1, CUDA Runtime Version = 9.1, NumDevs = 1
Result = PASS

bandwidthTest

$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro T1000 with Max-Q Design
 Quick Mode

  Host to Device Bandwidth, 1 Device(s)
  PINNED Memory Transfers
    Transfer Size (Bytes)	Bandwidth(MB/s)
    33554432			12615.7

  Device to Host Bandwidth, 1 Device(s)
  PINNED Memory Transfers
    Transfer Size (Bytes)	Bandwidth(MB/s)
    33554432			12519.9

  Device to Device Bandwidth, 1 Device(s)
  PINNED Memory Transfers
    Transfer Size (Bytes)	Bandwidth(MB/s)
    33554432			138322.9

 Result = PASS

 NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Cublas tests:

$ ./simpleCUBLAS
MapSMtoCores for SM 7.5 is undefined. Default to use 64 Cores/SM
GPU Device 0: "Quadro T1000 with Max-Q Design" with compute capability 7.5
simpleCUBLAS test running..
simpleCUBLAS test passed.

$ ./simpleCUBLASXT
MapSMtoCores for SM 7.5 is undefined. Default to use 64 Cores/SM
GPU Device 0: "Quadro T1000 with Max-Q Design" with compute capability 7.5
simpleCUBLASXT test running..
simpleCUBLASXT test passed.

$ ./simpleDevLibCUBLAS
simpleDevLibCUBLAS test running...
MapSMtoCores for SM 7.5 is undefined. Default to use 64 Cores/SM
GPU Device 0: "Quadro T1000 with Max-Q Design" with compute capability 7.5

Host and device APIs will be tested.
!!!! device to host memory copy error

The only sample that fails is simpleDevLibCUBLAS, but I have no idea what this error means exactly or where to look for a solution.

Does anyone know what the problem is or how to find the source of the errors?

Many thanks!