Hi all, I have a problem with cublasSgemm. I have a NVIDIA GEFORCE 610M, a fresh installation of Ubuntu 16.04LTS and NVIDIA toolkit 9.1 + driver version 390.87.
I have copied and pasted the example at page 120 in https://developer.nvidia.com/sites/default/files/akamai/cuda/files/Misc/mygpu.pdf, compiled the code, executed, but the resulting matrix is in fact a matrix whose entries are exactly 0, as withnessed here (code compiled with nvcc filename.c -lcublas, i.e., according to the commands given at page 120 of the referenced document)
a:
11 17 23 29 35
12 18 24 30 36
13 19 25 31 37
14 20 26 32 38
15 21 27 33 39
16 22 28 34 40
b:
11 16 21 26
12 17 22 27
13 18 23 28
14 19 24 29
15 20 25 30
c after Sgemm :
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
Here are some outputs after the installation of the drivers and toolkit 9.1
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 390.87 Tue Aug 21 12:33:05 PDT 2018
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
nvidia-smi
Fri May 24 11:47:09 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87 Driver Version: 390.87 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce 610M Off | 00000000:01:00.0 N/A | N/A |
| N/A 50C P8 N/A / N/A | 232MiB / 964MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
~/Desktop/CUDA_SAMPLES/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 610M"
CUDA Driver Version / Runtime Version 9.1 / 9.1
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 964 MBytes (1011351552 bytes)
MapSMtoCores for SM 2.1 is undefined. Default to use 64 Cores/SM
MapSMtoCores for SM 2.1 is undefined. Default to use 64 Cores/SM
( 1) Multiprocessors, ( 64) CUDA Cores/MP: 64 CUDA Cores
GPU Max Clock rate: 1480 MHz (1.48 GHz)
Memory Clock rate: 800 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 65536 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 1
Result = PASS
~/Desktop/CUDA_SAMPLES/NVIDIA_CUDA-9.1_Samples/1_Utilities/bandwidthTest$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce 610M
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6408.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6434.8
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11234.4
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Strange thing, at least to me, is that after installing the samples, I cannot issue either “deviceQuery” or “bandwidthTest” directly from terminal, but I need to go to the folder containing the executable. However, I guess it is just a question of exporting the path.