The Fermi machine has dual Intel® Xeon® X5650 CPUs @ 2.67GHz (12 cores in total, no hyper-threading) with 24 GByte of host memory and 2 C2050 cards (3 GByte of device memory each). The Intel compilers, version 11.1, are installed, along with GCC 4.4.5-8. The operating system is Debian Squeeze 6.0. The kernel is 2.6.32-5-amd64 x86_64.
The Tesla machine has a single Intel® Core™ i7 975 CPU @ 3.33GHz (4 cores in total, no hyper-threading) with 12 GByte of host memory and 3 C1060 cards (4 GByte of device memory each). The Intel compilers, version 12.0.2(*), are installed, along with GCC 4.3.4. The operating system is Ubuntu Karmic. The kernel is 2.6.31-14-generic x86_64.
On both machines, X and other graphical services such as GDM are disabled and off by default. The output of ‘deviceQuery’ is given below (**).
My program is “big”, so I cannot post all the source code here. What I can do is provide a small example as a separate source file. Anyway, I am sure that I am using cuBLAS correctly, otherwise cublasDgemm or cublasSgemm would fail too…
(*): Intel 12.0.2 is not officially compatible with CUDA but, by changing one line in a header file, it is possible to compile every example in the SDK, and all the tests work perfectly!
(**): ‘deviceQuery’ output for both machines:
fspiga@tesla:~$ deviceQuery
deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 4 devices supporting CUDA
Device 0: "Tesla C1060"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4294770688 bytes
Multiprocessors x Cores/MP = Cores: 30 (MP) x 8 (Cores/MP) = 240 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive (only one host thread at a time can use this device)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device 1: "Tesla C1060"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4294770688 bytes
Multiprocessors x Cores/MP = Cores: 30 (MP) x 8 (Cores/MP) = 240 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive (only one host thread at a time can use this device)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device 2: "Tesla C1060"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4294770688 bytes
Multiprocessors x Cores/MP = Cores: 30 (MP) x 8 (Cores/MP) = 240 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive (only one host thread at a time can use this device)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device 3: "GeForce 9500 GT"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 1073020928 bytes
Multiprocessors x Cores/MP = Cores: 4 (MP) x 8 (Cores/MP) = 32 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.62 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Prohibited (no host thread can use this device)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 4, Device = Tesla C1060, Device = Tesla C1060
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
fspiga@fermi:~$ deviceQuery
deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 2 devices supporting CUDA
Device 0: "Tesla C2050"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2817720320 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device 1: "Tesla C2050"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2817982464 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 2, Device = Tesla C2050, Device = Tesla C2050
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
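As a quick cross-check of the memory sizes quoted in the machine descriptions against the ‘deviceQuery’ output, here is a small Python sketch that converts the reported byte counts to GiB. The byte values are copied from the output above. The helper names (`to_gib`, `ecc_usable_bytes`) are made up for this illustration, and the 12.5% ECC reservation is an assumption on my part (it would explain why the C2050 cards report about 2.6 GiB rather than 3), not something the logs state.

```python
# Global memory sizes in bytes, copied from the deviceQuery output above.
GLOBAL_MEM_BYTES = {
    "Tesla C1060": 4294770688,
    "Tesla C2050 (device 0)": 2817720320,
    "Tesla C2050 (device 1)": 2817982464,
    "GeForce 9500 GT": 1073020928,
}

GIB = 1024 ** 3  # bytes in one GiB


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def ecc_usable_bytes(nominal_gib, reserved_fraction=0.125):
    """Estimate usable memory when ECC is enabled.

    ASSUMPTION: ECC reserves a fixed fraction (12.5% here) of the
    nominal memory. This is a guess used to explain the numbers,
    not a value taken from the deviceQuery output.
    """
    return int(nominal_gib * GIB * (1 - reserved_fraction))


if __name__ == "__main__":
    for name, size in GLOBAL_MEM_BYTES.items():
        print(f"{name}: {to_gib(size):.2f} GiB")
    # Compare the ECC estimate for a nominal 3 GiB card with the C2050 values.
    print(f"Estimated ECC-usable for 3 GiB: {ecc_usable_bytes(3)} bytes")
```

Under that assumption the estimate for a nominal 3 GiB card lands within about 1 MiB of what both C2050 devices report, while the C1060 cards (ECC not enabled) report essentially their full 4 GiB.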