cublasZgemm fails on FERMI but not on TESLA CUBLAS_STATUS_NOT_INITIALIZED even if 'cublasInit()&

Dear all,

I am trying to solve an issue I have when running my program on a Tesla C2050 card (Fermi architecture). The ‘cublasZgemm()’ call fails with error CUBLAS_STATUS_NOT_INITIALIZED, even though cublasInit() returns CUBLAS_STATUS_SUCCESS. I am using proper data structures, and the parameters are transa: ‘n’, transb: ‘n’, m: 512, n: 544, k: 880, lda: 512, ldb: 880, ldc: 512.

If I run the same program using the same parameters, the same CUDA toolkit (3.2), the same driver (260.19.26) and the same compile flags on another workstation equipped with a Tesla C1060 card… it works without problems.

If I replace ‘cublasZgemm()’ with ‘cublasDgemm()’ or ‘cublasSgemm()’ (and also change the type of the matrices accordingly), no problems appear on either GPU card.

So, I am stuck. The error message seems clear, but for some unknown reason there is a mismatch between the status returned by ‘cublasInit()’ and the one reported after ‘cublasZgemm()’. I am looking for a way to debug/understand what is going on inside the cublas library (or, better, a way to solve my problem). What can I do to solve this problem on my C2050 card?

Any help is really appreciated!

Regards,

What kind of OS (Linux/Windows) and compiler are you using for the host part?

Would you mind posting a simple repro case?

The Fermi machine is a dual Intel® Xeon® CPU X5650 @ 2.67GHz (12 cores in total, no hyper-threading) with 24 GByte of memory on the host and 2 C2050 cards (3 GByte of device memory each). It has the Intel compilers, version 11.1, installed. There is also GCC 4.4.5-8. The operating system is Debian Squeeze 6.0. The kernel is 2.6.32-5-amd64 x86_64.

The Tesla machine is a single Intel® Core™ i7 CPU 975 @ 3.33GHz (4 cores in total, no hyper-threading) with 12 GByte of memory on the host and 3 C1060 cards (4 GByte of device memory each). It has the Intel compilers, version 12.0.2(*), installed. There is also GCC 4.3.4. The operating system is Ubuntu Karmic. The kernel is 2.6.31-14-generic x86_64.

On both machines, X and other graphical services such as GDM are disabled and off by default. See (**) for the output of ‘deviceQuery’.

My program is “big”, so I cannot post all the source code here. What I can do is provide a small example as a separate source file. Anyway, I am sure that I used cuBLAS in the right way, otherwise cublasDgemm and cublasSgemm would fail too…

(*): Intel 12.0.2 is not officially compatible with CUDA but, by changing one line in a header file, it is possible to compile every example in the SDK, and all the tests work perfectly!

(**): ‘deviceQuery’ outputs

fspiga@tesla:~$ deviceQuery

deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 4 devices supporting CUDA

Device 0: "Tesla C1060"

  CUDA Driver Version:               			3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    1.3

  Total amount of global memory:     			4294770688 bytes

  Multiprocessors x Cores/MP = Cores:            30 (MP) x 8 (Cores/MP) = 240 (Cores)

  Total amount of constant memory:   			65536 bytes

  Total amount of shared memory per block:   	16384 bytes

  Total number of registers available per block: 16384

  Warp size:                         			32

  Maximum number of threads per block:   		512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid: 	65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                 			256 bytes

  Clock rate:                                    1.30 GHz

  Concurrent copy and execution:     			Yes

  Run time limit on kernels:         			No

  Integrated:                                    No

  Support host page-locked memory mapping:   	Yes

  Compute mode:                                  Exclusive (only one host thread at a time can use this device)

  Concurrent kernel execution:       			No

  Device has ECC support enabled:                No

  Device is using TCC driver mode:   			No

Device 1: "Tesla C1060"

  CUDA Driver Version:               			3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    1.3

  Total amount of global memory:     			4294770688 bytes

  Multiprocessors x Cores/MP = Cores:            30 (MP) x 8 (Cores/MP) = 240 (Cores)

  Total amount of constant memory:   			65536 bytes

  Total amount of shared memory per block:   	16384 bytes

  Total number of registers available per block: 16384

  Warp size:                         			32

  Maximum number of threads per block:   		512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid: 	65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                 			256 bytes

  Clock rate:                                    1.30 GHz

  Concurrent copy and execution:     			Yes

  Run time limit on kernels:         			No

  Integrated:                                    No

  Support host page-locked memory mapping:   	Yes

  Compute mode:                                  Exclusive (only one host thread at a time can use this device)

  Concurrent kernel execution:       			No

  Device has ECC support enabled:                No

  Device is using TCC driver mode:   			No

Device 2: "Tesla C1060"

  CUDA Driver Version:               			3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    1.3

  Total amount of global memory:     			4294770688 bytes

  Multiprocessors x Cores/MP = Cores:            30 (MP) x 8 (Cores/MP) = 240 (Cores)

  Total amount of constant memory:   			65536 bytes

  Total amount of shared memory per block:   	16384 bytes

  Total number of registers available per block: 16384

  Warp size:                         			32

  Maximum number of threads per block:   		512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid: 	65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                 			256 bytes

  Clock rate:                                    1.30 GHz

  Concurrent copy and execution:     			Yes

  Run time limit on kernels:         			No

  Integrated:                                    No

  Support host page-locked memory mapping:   	Yes

  Compute mode:                                  Exclusive (only one host thread at a time can use this device)

  Concurrent kernel execution:       			No

  Device has ECC support enabled:                No

  Device is using TCC driver mode:   			No

Device 3: "GeForce 9500 GT"

  CUDA Driver Version:               			3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    1.1

  Total amount of global memory:     			1073020928 bytes

  Multiprocessors x Cores/MP = Cores:            4 (MP) x 8 (Cores/MP) = 32 (Cores)

  Total amount of constant memory:   			65536 bytes

  Total amount of shared memory per block:   	16384 bytes

  Total number of registers available per block: 8192

  Warp size:                         			32

  Maximum number of threads per block:   		512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid: 	65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                 			256 bytes

  Clock rate:                                    1.62 GHz

  Concurrent copy and execution:     			Yes

  Run time limit on kernels:         			No

  Integrated:                                    No

  Support host page-locked memory mapping:   	Yes

  Compute mode:                                  Prohibited (no host thread can use this device)

  Concurrent kernel execution:       			No

  Device has ECC support enabled:                No

  Device is using TCC driver mode:   			No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 4, Device = Tesla C1060, Device = Tesla C1060

PASSED

Press <Enter> to Quit...

-----------------------------------------------------------
fspiga@fermi:~$ deviceQuery

deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 2 devices supporting CUDA

Device 0: "Tesla C2050"

  CUDA Driver Version:                   		3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    2.0

  Total amount of global memory:         		2817720320 bytes

  Multiprocessors x Cores/MP = Cores:            14 (MP) x 32 (Cores/MP) = 448 (Cores)

  Total amount of constant memory:       		65536 bytes

  Total amount of shared memory per block:   	49152 bytes

  Total number of registers available per block: 32768

  Warp size:                             		32

  Maximum number of threads per block:   		1024

  Maximum sizes of each dimension of a block:    1024 x 1024 x 64

  Maximum sizes of each dimension of a grid: 	65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                     		512 bytes

  Clock rate:                                    1.15 GHz

  Concurrent copy and execution:         		Yes

  Run time limit on kernels:             		No

  Integrated:                                    No

  Support host page-locked memory mapping:   	Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

  Concurrent kernel execution:           		Yes

  Device has ECC support enabled:                Yes

  Device is using TCC driver mode:       		No

Device 1: "Tesla C2050"

  CUDA Driver Version:                   		3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    2.0

  Total amount of global memory:         		2817982464 bytes

  Multiprocessors x Cores/MP = Cores:            14 (MP) x 32 (Cores/MP) = 448 (Cores)

  Total amount of constant memory:       		65536 bytes

  Total amount of shared memory per block:   	49152 bytes

  Total number of registers available per block: 32768

  Warp size:                             		32

  Maximum number of threads per block:   		1024

  Maximum sizes of each dimension of a block:    1024 x 1024 x 64

  Maximum sizes of each dimension of a grid: 	65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                     		512 bytes

  Clock rate:                                    1.15 GHz

  Concurrent copy and execution:         		Yes

  Run time limit on kernels:             		No

  Integrated:                                    No

  Support host page-locked memory mapping:   	Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

  Concurrent kernel execution:           		Yes

  Device has ECC support enabled:                Yes

  Device is using TCC driver mode:       		No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 2, Device = Tesla C2050, Device = Tesla C2050

PASSED

Press <Enter> to Quit...

-----------------------------------------------------------