K80 crashes or gives wrong computation results

I compile the code the same way for the GTX 680 and the K80 in MATLAB via

!"%VS110COMNTOOLS%vsvars32.bat" & nvcc -c -m64 -gencode=arch=compute_20,code=\"sm_20,compute_20\" -gencode=arch=compute_30,code=\"sm_30,compute_30\" -gencode=arch=compute_35,code=\"sm_35,compute_35\" -gencode=arch=compute_37,code=\"sm_37,compute_37\" -gencode=arch=compute_50,code=\"sm_50,compute_50\" -gencode=arch=compute_52,code=\"sm_52,compute_52\" Ax_tomo_gpu_ray.cu

and then mex,

but I get correct results on the GTX 680, while the K80 either crashes (maybe a memory violation) or returns wrong results.

I also get this error when I run gpuDevice:
Error using gpuDevice (line 26)
An unexpected error occurred trying to retrieve CUDA device properties. The CUDA error was:
CUDA_ERROR_UNKNOWN

Has anyone else experienced this issue?

Failure to retrieve the device properties likely means CUDA did not initialize at all. The K80 is a much newer device than the GTX 680. Do you have a recent driver package installed? Likewise, are you using a recent CUDA version with support for sm_37? What is the operating system? When you run nvidia-smi, does it show the K80?

Windows 7 with the newest driver version; I tried both CUDA 7.0 and CUDA 7.5.

Run the deviceQuery sample from either CUDA 7.0 or CUDA 7.5 and paste the results here.

Are you using this driver? http://www.nvidia.com/download/driverResults.aspx/91482/en-us
Is nvidia-smi able to see the K80?

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 3 CUDA Capable device(s)

Device 0: “Tesla K80”
CUDA Driver Version / Runtime Version 7.5 / 7.0
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11520 MBytes (12079398912 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
CUDA Device Driver Mode (TCC or WDDM): TCC (Tesla Compute Cluster Driver)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: “GeForce GT 610”
CUDA Driver Version / Runtime Version 7.5 / 7.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 1) Multiprocessors, ( 48) CUDA Cores/MP: 48 CUDA Cores
GPU Max Clock rate: 1620 MHz (1.62 GHz)
Memory Clock rate: 667 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 65536 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: “Tesla K80”
CUDA Driver Version / Runtime Version 7.5 / 7.0
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11520 MBytes (12079398912 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
CUDA Device Driver Mode (TCC or WDDM): TCC (Tesla Compute Cluster Driver)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 7 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Peer access from Tesla K80 (GPU0) → Tesla K80 (GPU2) : Yes
Peer access from Tesla K80 (GPU2) → Tesla K80 (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.0, NumDevs = 3, Device0 = Tesla K80, Device1 = GeForce GT 610, Device2 = Tesla K80
Result = PASS

Yes, I tried this driver and many other recent versions, even on Windows 10.
How do I use nvidia-smi on Windows?

Thanks.

nvidia-smi works the same on all operating systems, as far as I know. If the question is, “where can I find the executable?” check Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe . To get some basic data about the device, run

> nvidia-smi -q

==============NVSMI LOG==============

Timestamp                           : Sat Sep 19 20:59:40 2015
Driver Version                      : 353.82

Attached GPUs                       : 1
GPU 0000:01:00.0
    Product Name                    : Quadro K2200
    Product Brand                   : Quadro
    Display Mode                    : Enabled
    Display Active                  : Enabled
    Persistence Mode                : N/A
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    [...]

For what it’s worth, your output from the device query application shows that CUDA is initialized and can access the K80 successfully. I suspect the issue is elsewhere in your software stack, not in the CUDA portion of it. Since I don’t know what that code looks like, I can only suggest double checking the code that interfaces with CUDA. Maybe it makes some assumptions that are no longer valid when CUDA is running on a K80 (such as device memory size <= 4 GB), or maybe it doesn’t properly check all status returns from CUDA.
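To make the 4 GB assumption concrete: the device query output above reports 12079398912 bytes of global memory per K80 GPU, which does not fit in a 32-bit integer. The sketch below only illustrates what would happen if interface code stored such a size in a 32-bit C type; whether the code in question does this is an assumption, and the helper names are made up for the example. Python's ctypes is used to mimic the C integer types.

```python
import ctypes

# Global memory sizes as reported by deviceQuery in this thread.
K80_BYTES = 12079398912
GT610_BYTES = 1073741824

def as_uint32(n):
    """Value a 32-bit C `unsigned int` would hold after storing n."""
    return ctypes.c_uint32(n).value

def as_int32(n):
    """Value a 32-bit C `int` would hold after storing n."""
    return ctypes.c_int32(n).value

def fits_in_32_bits(n):
    """True if n survives a round trip through a 32-bit unsigned integer."""
    return as_uint32(n) == n

for name, size in (("GT 610", GT610_BYTES), ("K80", K80_BYTES)):
    print(name, size, "fits in 32 bits:", fits_in_32_bits(size))
    print("  as unsigned int:", as_uint32(size), "| as int:", as_int32(size))
```

For the K80 the size silently wraps to 3489464320 as an unsigned 32-bit value and to a negative number as a signed one, so any allocation size or index derived from it would be wrong without an explicit error being raised.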

Thanks for your analysis; I really appreciate your time. I get correct results for roughly the first five minutes of computation, but wrong results from then on. Do you think there is a problem with how I compile for this dual-GPU card? Is there anything extra I need to do for a dual-GPU card compared with a single-GPU card?
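On the dual-GPU question, one thing visible in the deviceQuery output earlier in the thread is that the K80 enumerates as two separate CUDA devices (0 and 2), with the GT 610 between them as device 1. Code that hard-codes a device index could therefore land on the wrong GPU; whether that happens here is not known. A small sketch, with hypothetical helper names, that reads the index-to-name mapping from deviceQuery's final summary line:

```python
import re

def devices_from_summary(summary_line):
    """Map CUDA device index -> device name from deviceQuery's summary line."""
    pairs = re.findall(r"Device(\d+) = ([^,]+)", summary_line)
    return {int(index): name.strip() for index, name in pairs}

def indices_named(devices, name):
    """Sorted CUDA device indices whose name matches exactly."""
    return sorted(i for i, n in devices.items() if n == name)

summary = ("deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, "
           "CUDA Runtime Version = 7.0, NumDevs = 3, Device0 = Tesla K80, "
           "Device1 = GeForce GT 610, Device2 = Tesla K80")
devices = devices_from_summary(summary)
print(devices)
print("Tesla K80 indices:", indices_named(devices, "Tesla K80"))
```

Note that these are zero-based CUDA runtime indices, while MATLAB's gpuDevice takes one-based indices, and nvidia-smi lists the GPUs in a different order again (by PCI bus ID, with the GT 610 first).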

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi -q

==============NVSMI LOG==============

Timestamp                           : Sun Sep 20 13:00:13 2015
Driver Version                      : 353.82

Attached GPUs                       : 3
GPU 0000:01:00.0
    Product Name                    : GeForce GT 610
    Product Brand                   : GeForce
    Display Mode                    : N/A
    Display Active                  : N/A
    Persistence Mode                : N/A
    Accounting Mode                 : N/A
    Accounting Mode Buffer Size     : N/A
    Driver Model
        Current                     : WDDM
        Pending                     : WDDM
    Serial Number                   : N/A
    GPU UUID                        : GPU-9118e757-b350-4ca0-4a15-ad54cb4d2e41
    Minor Number                    : N/A
    VBIOS Version                   : 75.19.56.00.04
    MultiGPU Board                  : N/A
    Board ID                        : N/A
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x104A10DE
        Bus Id                      : 0000:01:00.0
        Sub System Id               : 0x36111458
        GPU Link Info
            PCIe Generation
                Max                 : N/A
                Current             : N/A
            Link Width
                Max                 : N/A
                Current             : N/A
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : N/A
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : 45 %
    Performance State               : P8
    Clocks Throttle Reasons         : N/A
    FB Memory Usage
        Total                       : 1024 MiB
        Used                        : 132 MiB
        Free                        : 892 MiB
    BAR1 Memory Usage
        Total                       : N/A
        Used                        : N/A
        Free                        : N/A
    Compute Mode                    : Default
    Utilization
        Gpu                         : N/A
        Memory                      : N/A
        Encoder                     : N/A
        Decoder                     : N/A
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 48 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
    Power Readings
        Power Management            : N/A
        Power Draw                  : N/A
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : N/A

GPU 0000:06:00.0
    Product Name                    : Tesla K80
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : N/A
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : TCC
        Pending                     : TCC
    Serial Number                   : 0322515108462
    GPU UUID                        : GPU-0e2e56a9-4080-d1e5-160c-eeb34cda5a0d
    Minor Number                    : N/A
    VBIOS Version                   : 80.21.1B.00.01
    MultiGPU Board                  : Yes
    Board ID                        : 0x400
    Inforom Version
        Image Version               : 2080.0200.00.04
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x06
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102D10DE
        Bus Id                      : 0000:06:00.0
        Sub System Id               : 0x106C10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : PLX
            Firmware                : 0xF0472900
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 11519 MiB
        Used                        : 23 MiB
        Free                        : 11496 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 20 MiB
        Free                        : 16364 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 51 C
        GPU Shutdown Temp           : 93 C
        GPU Slowdown Temp           : 88 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 28.37 W
        Power Limit                 : 149.00 W
        Default Power Limit         : 149.00 W
        Enforced Power Limit        : 149.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 175.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
    Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Default Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 2505 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

GPU 0000:07:00.0
    Product Name                    : Tesla K80
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : N/A
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : TCC
        Pending                     : TCC
    Serial Number                   : 0322515108462
    GPU UUID                        : GPU-d6fd176d-4e0a-5e6d-ab17-36d9f42328fc
    Minor Number                    : N/A
    VBIOS Version                   : 80.21.1B.00.02
    MultiGPU Board                  : Yes
    Board ID                        : 0x400
    Inforom Version
        Image Version               : 2080.0200.00.04
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x07
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102D10DE
        Bus Id                      : 0000:07:00.0
        Sub System Id               : 0x106C10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : PLX
            Firmware                : 0xF0472900
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 11519 MiB
        Used                        : 23 MiB
        Free                        : 11496 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 20 MiB
        Free                        : 16364 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 72 C
        GPU Shutdown Temp           : 93 C
        GPU Slowdown Temp           : 88 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 34.73 W
        Power Limit                 : 149.00 W
        Default Power Limit         : 149.00 W
        Enforced Power Limit        : 149.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 175.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
    Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Default Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 2505 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

After about five minutes of computation, the GPU is lost. I don’t know why.

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi -q

==============NVSMI LOG==============

Timestamp                           : Sun Sep 20 13:10:27 2015
Driver Version                      : 353.82

Attached GPUs                       : 3
GPU 0000:01:00.0
    Product Name                    : GeForce GT 610
    Product Brand                   : GeForce
    Display Mode                    : N/A
    Display Active                  : N/A
    Persistence Mode                : N/A
    Accounting Mode                 : N/A
    Accounting Mode Buffer Size     : N/A
    Driver Model
        Current                     : WDDM
        Pending                     : WDDM
    Serial Number                   : N/A
    GPU UUID                        : GPU-9118e757-b350-4ca0-4a15-ad54cb4d2e41
    Minor Number                    : N/A
    VBIOS Version                   : 75.19.56.00.04
    MultiGPU Board                  : N/A
    Board ID                        : N/A
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x104A10DE
        Bus Id                      : 0000:01:00.0
        Sub System Id               : 0x36111458
        GPU Link Info
            PCIe Generation
                Max                 : N/A
                Current             : N/A
            Link Width
                Max                 : N/A
                Current             : N/A
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : N/A
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : 45 %
    Performance State               : P8
    Clocks Throttle Reasons         : N/A
    FB Memory Usage
        Total                       : 1024 MiB
        Used                        : 207 MiB
        Free                        : 817 MiB
    BAR1 Memory Usage
        Total                       : N/A
        Used                        : N/A
        Free                        : N/A
    Compute Mode                    : Default
    Utilization
        Gpu                         : N/A
        Memory                      : N/A
        Encoder                     : N/A
        Decoder                     : N/A
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 49 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
    Power Readings
        Power Management            : N/A
        Power Draw                  : N/A
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : N/A

Unable to determine the PCI bus id for the target device: GPU is lost

Unable to determine the PCI bus id for the target device: GPU is lost

I am not an expert on “GPU falling off the PCIe bus” issues; I just know they do occur. Make sure your system has the latest system BIOS version installed. What kind of system is this?

A few other things to look into: Check the power supply. Is the K80 properly seated in the PCIe slot? Are the PCIe power cables plugged in properly (I have not used a K80, but it probably has one 6-pin and one 8-pin connector based on its wattage)? What’s the rating of your power supply unit (PSU)? With two GPUs in the system, you would probably want a 1000W power supply. I usually suggest 80plus gold rated PSUs. Is the cooling adequate? Does nvidia-smi give any indications of overheating (monitor “GPU Current Temp”) ?
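To watch the temperature while the job runs, rather than only before and after, something like the following could log it every few seconds. This is a minimal sketch: it assumes nvidia-smi is on the PATH (on Windows it lives under Program Files\NVIDIA Corporation\NVSMI, as mentioned above) and that the installed version supports the --query-gpu CSV output; the function names are made up for the example.

```python
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,name,temperature.gpu",
         "--format=csv,noheader,nounits"]

def parse_rows(csv_text):
    """Turn 'index, name, temp' lines into (index, name, temp_or_None) tuples."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank lines and error messages such as "GPU is lost"
        index, name, temp = parts
        rows.append((int(index), name, int(temp) if temp.isdigit() else None))
    return rows

def poll(interval_s=5.0, samples=None):
    """Print one timestamped line per GPU every interval_s seconds.

    Runs until interrupted, or for `samples` iterations if given.
    """
    taken = 0
    while samples is None or taken < samples:
        result = subprocess.run(QUERY, capture_output=True, text=True)
        stamp = time.strftime("%H:%M:%S")
        for index, name, temp in parse_rows(result.stdout):
            print(stamp, index, name, temp)
        taken += 1
        time.sleep(interval_s)
```

Calling poll() in a second console while the computation runs gives a record of how the temperature develops in the minutes before the failure. A GPU that no longer reports a numeric temperature shows up as None, and a GPU that has dropped off the bus simply stops appearing.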

Thank you for your advice. I will double-check these. My system uses an ASUS P9X79 WS motherboard.

Judging by the pictures I found on the ASUS website, that motherboard doesn’t look like a server-class motherboard; it looks like something that goes into an ordinary PC enclosure. That raises the question: how are you cooling the K80?

To my knowledge, there are no actively-cooled K80s (i.e. with a built-in fan), only passively cooled ones that rely for cooling on the massive airflow provided by the fans in a server enclosure. If you install a K80 into an ordinary PC case, it will very likely overheat and switch itself off to prevent permanent damage to the board. Which jibes with your description of the K80 “disappearing” after a few minutes.

[Later:] Checking the K80 specs (http://images.nvidia.com/content/pdf/kepler/Tesla-K80-BoardSpec-07317-001-v05.pdf), I noticed that this GPU does not use traditional PCIe power connectors, and that interfacing with legacy PCIe power connectors requires special care (emphasis mine):

“The board provides a single EPS12V CPU 8-pin power connector on the “east” edge of the board. The Tesla K80 no longer uses the PCI Express auxiliary connectors. For backward compatibility with existing systems, NVIDIA will provide a power dongle that converts the CPU 8-pin to two PCI Express 8-pin connectors. The two PCI Express cables must be from a common rail on the system power supply and together must be able to supply sufficient power as specified in Table 3.”

Since it is not clear whether your system is a “home-brew” system or a vendor-configured system from one of NVIDIA’s integration partners, you may want to double check on the power connection if it is the former.

I think you are right. I just found out that the temperature was about 90 C when CUDA crashed.
The K80 shuts down when the temperature goes above 93 C.
Thank you so much for your analysis and your solution. I will give it a try.