Hi,
I would like to use TensorFlow on Windows with a Tesla K40c, a setup that requires CUDA 8.
Following the CUDA Toolkit documentation, I downloaded and installed CUDA 8.0.44 and tried to verify the installation using ‘deviceQuery’ and ‘bandwidthTest’. While deviceQuery recognizes the Tesla K40c, bandwidthTest hangs and never completes (no error messages, the system keeps running, one CPU is fully used). However, when I uninstall CUDA 8 and install CUDA 7.5 instead, deviceQuery and bandwidthTest both run smoothly. For the command line outputs, see below.
I double-checked by repeatedly uninstalling and installing CUDA 7.5 and 8: 7.5 consistently runs fine, while 8 hangs (also with other test programs, e.g. matrixMul, and with TensorFlow).
The symptoms are the same as in
https://devtalk.nvidia.com/default/topic/516727/devicequery-ok-everything-else-hangs-cuda-sdk-4-1-examples-simply-hang-no-errors-no-warnings/
so I checked that the IOMMU feature (called VT-d on Intel systems) is disabled in the BIOS. I also tried with the feature enabled: same problem.
I updated the NVIDIA graphics driver from 369.30 to 376.33: same problem.
Any help would be highly appreciated.
Thanks a lot,
Robert
Operating system:
Microsoft Windows 10 Education, version 10.0.10240
Motherboard:
Manufacturer: ASUSTeK COMPUTER INC.
Product: H97-PRO GAMER
Version: Rev X.0x
Outputs of deviceQuery and bandwidthTest:
For CUDA 7.5:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting…
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: “Tesla K40c”
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11520 MBytes (12079398912 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
CUDA Device Driver Mode (TCC or WDDM): TCC (Tesla Compute Cluster Driver)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla K40c
Result = PASS
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release>bandwidthTest.exe
[CUDA Bandwidth Test] - Starting…
Running on…
Device 0: Tesla K40c
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1478.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1625.8
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 182837.6
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
For CUDA 8:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting…
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: “Tesla K40c”
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11446 MBytes (12001869824 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
CUDA Device Driver Mode (TCC or WDDM): TCC (Tesla Compute Cluster Driver)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K40c
Result = PASS
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\bin\win64\Release>bandwidthTest.exe
[CUDA Bandwidth Test] - Starting…
Running on…
Device 0: Tesla K40c
Quick Mode
(no further output; bandwidthTest hangs at this point)
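To check the hang without watching the console, the samples could be started from a small wrapper with a timeout. The following is only a minimal sketch, assuming Python 3 is installed. The function name run_with_timeout and the 60-second limit are my own choices and not part of the CUDA samples.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=60):
    """Run cmd and classify the outcome.

    Returns (status, output), where status is:
      "ok"     - the process exited with return code 0
      "failed" - the process exited with a non-zero return code
      "hang"   - the process did not finish within timeout_s seconds
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            timeout=timeout_s,         # child is killed when this expires
            universal_newlines=True,   # decode output as text
        )
    except subprocess.TimeoutExpired as exc:
        # Output captured before the timeout may be None or bytes,
        # depending on platform and Python version.
        out = exc.output or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return "hang", out
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout
```

Calling run_with_timeout with the full path to bandwidthTest.exe from the v7.5 and the v8.0 sample directories would then allow the two installations to be compared from a script. The status "hang" only means that the time limit was exceeded, so the limit has to be chosen generously.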