cublasSgemm results in null matrix

Hi all, I have a problem with cublasSgemm. I have an NVIDIA GeForce 610M, a fresh installation of Ubuntu 16.04 LTS, and CUDA toolkit 9.1 with driver version 390.87.

I copied and pasted the example on page 120 of https://developer.nvidia.com/sites/default/files/akamai/cuda/files/Misc/mygpu.pdf, compiled it with nvcc filename.c -lcublas (i.e., with the command given on that same page), and ran it, but the resulting matrix has all entries exactly 0, as witnessed here:

a:
   11   17   23   29   35
   12   18   24   30   36
   13   19   25   31   37
   14   20   26   32   38
   15   21   27   33   39
   16   22   28   34   40
b:
   11   16   21   26
   12   17   22   27
   13   18   23   28
   14   19   24   29
   15   20   25   30
c after  Sgemm :
      0      0      0      0
      0      0      0      0
      0      0      0      0
      0      0      0      0
      0      0      0      0
      0      0      0      0
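
For reference, the core of the code I compiled is roughly the following (a sketch from memory of the page-120 example, column-major with m=6, n=4, k=5; variable names may differ slightly from the PDF, and this includes my own change of initializing “c” to 0):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define IDX2C(i,j,ld) (((j)*(ld))+(i))   // column-major indexing
#define m 6
#define n 4
#define k 5

int main(void) {
    cublasHandle_t handle;
    float *a = (float *)malloc(m * k * sizeof(float));   // a: m x k
    float *b = (float *)malloc(k * n * sizeof(float));   // b: k x n
    float *c = (float *)malloc(m * n * sizeof(float));   // c: m x n
    int ind = 11;                                        // fill a column by column: 11,12,...
    for (int j = 0; j < k; j++)
        for (int i = 0; i < m; i++) a[IDX2C(i, j, m)] = (float)ind++;
    ind = 11;                                            // fill b the same way
    for (int j = 0; j < n; j++)
        for (int i = 0; i < k; i++) b[IDX2C(i, j, k)] = (float)ind++;
    for (int i = 0; i < m * n; i++) c[i] = 0.0f;         // my change: c starts at 0

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, m * k * sizeof(float));
    cudaMalloc((void **)&d_b, k * n * sizeof(float));
    cudaMalloc((void **)&d_c, m * n * sizeof(float));
    cublasCreate(&handle);
    cublasSetMatrix(m, k, sizeof(float), a, m, d_a, m);
    cublasSetMatrix(k, n, sizeof(float), b, k, d_b, k);
    cublasSetMatrix(m, n, sizeof(float), c, m, d_c, m);

    float al = 1.0f, bet = 1.0f;
    // c = al * a * b + bet * c  (no transposes, column-major, leading dims m, k, m)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &al, d_a, m, d_b, k, &bet, d_c, m);
    cublasGetMatrix(m, n, sizeof(float), d_c, m, c, m);
    /* ... print c, cudaFree the device buffers, free the host ones,
       cublasDestroy(handle) ... */
    return 0;
}
```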

Here are some outputs after the installation of the drivers and toolkit 9.1

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.87  Tue Aug 21 12:33:05 PDT 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
nvidia-smi 
Fri May 24 11:47:09 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce 610M        Off  | 00000000:01:00.0 N/A |                  N/A |
| N/A   50C    P8    N/A /  N/A |    232MiB /   964MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+
~/Desktop/CUDA_SAMPLES/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 610M"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 964 MBytes (1011351552 bytes)
MapSMtoCores for SM 2.1 is undefined.  Default to use 64 Cores/SM
MapSMtoCores for SM 2.1 is undefined.  Default to use 64 Cores/SM
  ( 1) Multiprocessors, ( 64) CUDA Cores/MP:     64 CUDA Cores
  GPU Max Clock rate:                            1480 MHz (1.48 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 65536 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 1
Result = PASS
~/Desktop/CUDA_SAMPLES/NVIDIA_CUDA-9.1_Samples/1_Utilities/bandwidthTest$ ./bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce 610M
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6408.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6434.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11234.4

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

A strange thing, at least to me: after installing the samples I cannot run “deviceQuery” or “bandwidthTest” directly from the terminal; I have to cd into the folder containing the executable. However, I guess it is just a matter of exporting the path.
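
For the record, something like this seems to do it (the path below is my local samples location, so it is an assumption — adjust it to your own layout):

```shell
# Prepend the directory containing the sample binary to PATH for the current shell.
# The samples path is from my machine -- change it to wherever you installed them.
export PATH="$HOME/Desktop/CUDA_SAMPLES/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery:$PATH"
# Now the binary can be run from any directory:
# deviceQuery
```

Putting the export line in ~/.bashrc makes it survive new terminal sessions.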

I forgot to mention that I slightly changed the code from https://developer.nvidia.com/sites/default/files/akamai/cuda/files/Misc/mygpu.pdf: I initialize the matrix “c” to 0, so the values already in the input/output matrix “c” do not affect the result.

A GPU of compute capability 2.1 is not supported by CUDA toolkit 9.1

If you do proper error checking in your code, you will discover this.

Switch to CUDA toolkit 8.0

Super, thanks. Driver version 390.87 should be fine with toolkit 8.0.

Can you please point me to a document from which I can learn how to do proper error checking? The frustrating thing about the code I ran is that everything seemed fine: no error messages, etc.
Best,
N

google “proper CUDA error checking”

take the first hit

read it and apply it to your code.

Not trying to be snarky here. If I give you a link you will never remember that exact link a month from now, and you may not even remember how to get back to this thread.

If I teach you how to find your own answers, you may remember that without having to come back here.

Generally speaking, proper error checking for the CUDA runtime API (e.g. cudaMalloc, etc.) means testing the return value of each call. The possible return values are documented in the CUDA runtime API docs. All CUDA docs are at docs.nvidia.com (usually people can remember that).
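
A minimal sketch of the pattern (the macro name is my own choice, not something from the toolkit):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap every runtime API call; print the human-readable error string and abort on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(void) {
    float *d_a = NULL;
    // On an unsupported setup this fails loudly instead of silently returning garbage.
    CUDA_CHECK(cudaMalloc((void **)&d_a, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_a));
    return 0;
}
```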

CUBLAS calls also return an error status. You test it in a similar fashion. Where are the CUBLAS docs? At docs.nvidia.com
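
Same idea, sketched for CUBLAS (again, the macro name is mine; I just print the numeric cublasStatus_t value, since the older toolkits don't ship a status-to-string helper):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include "cublas_v2.h"

// Every CUBLAS call returns a cublasStatus_t; anything but CUBLAS_STATUS_SUCCESS is an error.
#define CUBLAS_CHECK(call)                                            \
    do {                                                              \
        cublasStatus_t st = (call);                                   \
        if (st != CUBLAS_STATUS_SUCCESS) {                            \
            fprintf(stderr, "CUBLAS error %d at %s:%d\n",             \
                    (int)st, __FILE__, __LINE__);                     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: wrap every CUBLAS call, including the Sgemm itself:
//   CUBLAS_CHECK(cublasCreate(&handle));
//   CUBLAS_CHECK(cublasSgemm(handle, /* ... */));
```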

If you write your own CUDA kernels, there are additional details associated with error checking. For that case I go back to the first recommendation I made: the google search.
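
The usual pattern there (standard runtime API calls, sketched in my own words) is to check for launch errors immediately after the <<<...>>> launch, and then again after a synchronize for errors that only surface while the kernel runs:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void myKernel(float *x) { x[threadIdx.x] *= 2.0f; }

void launchChecked(float *d_x) {
    myKernel<<<1, 32>>>(d_x);
    // 1) catch launch errors (bad configuration, no binary for this GPU, ...)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));
    // 2) catch errors that occur during kernel execution
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));
}
```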

I agree that it does look like the deviceQuery ran successfully and that is confusing. It is a peculiar nuance of CUDA in this particular case (Fermi GPU + R390 driver + CUDA 9.x). However even there, there was some indication that things were not right:

MapSMtoCores for SM 2.1 is undefined.  Default to use 64 Cores/SM
MapSMtoCores for SM 2.1 is undefined.  Default to use 64 Cores/SM

And if you run a code like vectorAdd in your current setup, you will get a very evident failure.

Generally speaking, the CUDA sample codes also demonstrate proper CUDA error checking. So if you run the vectorAdd code you will see an error. You can then study the error checking in that code, to learn where the error originated, and how it was tested for.

Thanks, now everything is working properly. I don't think your reply was snarky at all; I understand the forum is full of trivial questions like mine. I think the problem is that NVIDIA's CUDA documentation is vast and not well organized. For instance, I learnt to program in C and to call BLAS/LAPACK routines from C quite easily, because the BLAS/LAPACK docs are concise and “focused”: they give you just what is necessary.

You spotted from the output of my deviceQuery that something was wrong, but for a beginner that output is hard to read; the bottom line is that you end up reassured by the final line, “Result = PASS”. Plus, the only other deviceQuery output I had the opportunity to look at was the one included in the “Installation Guide Linux”: just a screenshot.

I do not want to seem argumentative; I just want to explain that even if you are willing to learn, there is a non-negligible chance you will end up reading useless information: it took me three days to set up my machine, and I succeeded only because of you.