Why is Nvidia K10 identifying as Quadro on CentOS 7 w/ CUDA 7.5-8

Hello,

First, I would like to thank anyone who might lend a hand. I’m attempting to run IBM Intelligent Video Analytics 2.0.0 on a CentOS 7 x64 machine. The software requires a cluster of machines, and the one in question is what we refer to as the ‘Deep Learning Engine’ (DLE). The DLE makes use of CUDA 7.5. We installed the Nvidia driver found here: http://us.download.nvidia.com/XFree86/Linux-x86_64/352.99/NVIDIA-Linux-x86_64-352.99.run (I have also tested with various other drivers). We are getting the following error during processing:

Image Link: (imgbb.com) [url]https://ibb.co/fVvJra[/url]

When we run lspci, we see that the card is identifying as a Quadro?

[root@nxch101 ~]# lspci -nn | grep '\[03'
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107GL [GRID K1] [10de:0ff2] (rev a1)
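For anyone comparing several hosts, here is a minimal Python sketch that splits one line of `lspci -nn` output into its parts (slot, class code, vendor ID, device ID). The regular expression and the function name `parse_lspci_nn` are my own illustration and assume the single-line layout shown in the output above; they are not part of any NVIDIA or pciutils tooling.

```python
import re

# Assumed layout of one `lspci -nn` line:
#   <slot> <class name> [<class code>]: <device name> [<vendor>:<device>] (rev ..)
LSPCI_NN = re.compile(
    r"^(?P<slot>\S+)\s+(?P<class_name>.+?)\s+\[(?P<class_code>[0-9a-f]{4})\]:\s+"
    r"(?P<name>.+?)\s+\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def parse_lspci_nn(line):
    """Return a dict of fields for one `lspci -nn` line, or None if it does not match."""
    match = LSPCI_NN.match(line.strip())
    return match.groupdict() if match else None


# The line quoted in this post.
line = ("08:00.0 VGA compatible controller [0300]: "
        "NVIDIA Corporation GK107GL [GRID K1] [10de:0ff2] (rev a1)")
info = parse_lspci_nn(line)
print(info["slot"], info["class_code"], info["vendor"], info["device"])
```

The numeric IDs (`10de:0ff2`) come from the hardware itself, while the text name is looked up by lspci, so the IDs are the more reliable thing to compare.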

Image Link: (imgbb.com) [url]https://ibb.co/jKEdra[/url]

With either CUDA 8 or 7.5 we still get this response from both lspci and the DLE software.

Image Link: (imgbb.com) [url]https://ibb.co/h31PWa[/url]

The Nouveau driver has been blacklisted as shown below:

Image Link: (imgbb.com) [url]https://ibb.co/c9AOPv[/url]

Is it possible that the driver is reporting a generic model name, which the software then reads and concludes is not the correct architecture? There are three video cards in this server; however, none of them meet the specs listed above.

This is a bare metal machine.

-----------------------------+

Thank you in advance. Any suggestions are welcome, and I can provide anything that would assist on request.

… Just to clarify (I am helping this customer).

It is RHEL 7.3 (not CentOS so as to avoid that escape hatch by support :)
[root@nxch101 ~]# uname -a
Linux nxch101.ibm.aessatl.arrow.com 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

The setup was very simple:
Hardware: 5465AC1 Lenovo dx360m5, used for GPU/VDI POCs (Lenovo Press: Lenovo NeXtScale nx360 M5 (E5-2600 v4) Product Guide, withdrawn product)

[root@nxch101 nVidiaDownload_temp]# cd /etc/default/
[root@nxch101 default]# vi grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
#GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap nomodeset rhgb quiet"
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap nomodeset rhgb quiet rdblacklist=nouveau nouveau.modset=0"
GRUB_DISABLE_RECOVERY="true"
[root@nxch101 default]# grub2-mkconfig -o /boot/grub2/grub.conf
Generating grub configuration file …
Found linux image: /boot/vmlinuz-3.10.0-514.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-514.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-8149e5c7c2e747179f991e25f64208b5
done
[root@nxch101 default]#
[root@nxch101 grub2]# cd /etc/modprobe.d/
[root@nxch101 modprobe.d]# ls
blacklist.conf lockd.conf mlx4.conf nvidia-installer-disable-nouveau.conf tuned.conf
[root@nxch101 modprobe.d]# cat nvidia-installer-disable-nouveau.conf

# generated by nvidia-installer

blacklist nouveau
options nouveau modeset=0
[root@nxch101 modprobe.d]# systemctl set-default multi-user.target

The nVidia setup seems to be done. The nouveau driver is not loaded and the nVidia vendor driver is.
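As a cross-check on the boot options, here is a small Python sketch that looks for nouveau-related tokens in a kernel command line string (for example the contents of /proc/cmdline). The function name and the list of blacklist spellings are my own assumptions. It only tests for the presence of tokens and does not verify whether the running initramfs honors each spelling. Note that nouveau's module parameter is spelled `modeset`, so a token spelled `nouveau.modset=0`, as in the GRUB_CMDLINE_LINUX line quoted above, would not match. The `options nouveau modeset=0` line in modprobe.d uses the expected spelling.

```python
# Spellings checked here are assumptions chosen for illustration.
BLACKLIST_TOKENS = (
    "rdblacklist=nouveau",
    "rd.driver.blacklist=nouveau",
    "modprobe.blacklist=nouveau",
)


def nouveau_cmdline_flags(cmdline):
    """Report which nouveau-related tokens appear in a kernel command line."""
    tokens = cmdline.split()
    return {
        "blacklist_token": any(tok in BLACKLIST_TOKENS for tok in tokens),
        "modeset_disabled": "nouveau.modeset=0" in tokens,
    }


# Command line as quoted in the grub file in this post.
thread_cmdline = (
    "crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap "
    "nomodeset rhgb quiet rdblacklist=nouveau nouveau.modset=0"
)
flags = nouveau_cmdline_flags(thread_cmdline)
print(flags)
```

On a live system the same function could be fed `open("/proc/cmdline").read()` instead of the quoted string.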

##########
[root@nxch101 ~]# lsmod |grep nvid
nvidia_uvm 76758 2
nvidia 8541136 33 nvidia_uvm
drm 372540 4 ttm,drm_kms_helper,nvidia
i2c_core 40756 6 drm,i2c_i801,ipmi_ssif,drm_kms_helper,i2c_algo_bit,nvidia
[root@nxch101 ~]# lsmod |grep nouveau
[root@nxch101 ~]# cd ~
[root@nxch101 ~]# mkdir nVidiaDownload
[root@nxch101 ~]# wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/patches/2/cuda-repo-rhel7-8-0-local-cublas-performance-update-8.0.61-1.x86_64-rpm
[root@nxch101 ~]# wget http://us.download.nvidia.com/tesla/375.66/nvidia-diag-driver-local-repo-rhel7-375.66-1.x86_64.rpm
[root@nxch101 nVidiaDownload_temp]# rpm -ivh cuda-repo-rhel7-8-0-local-ga2-8.0.61-1.x86_64-rpm
warning: cuda-repo-rhel7-8-0-local-ga2-8.0.61-1.x86_64-rpm: Header V3 RSA/SHA512 Signature, key ID 7fa2af80: NOKEY
Preparing… ################################# [100%]
Updating / installing…
1:cuda-repo-rhel7-8-0-local-ga2-8.0################################# [100%]
[root@nxch101 nVidiaDownload_temp]# rpm -ivh cuda-repo-rhel7-8-0-local-cublas-performance-update-8.0.61-1.x86_64-rpm
warning: cuda-repo-rhel7-8-0-local-cublas-performance-update-8.0.61-1.x86_64-rpm: Header V3 RSA/SHA512 Signature, key ID 7fa2af80: NOKEY
Preparing… ################################# [100%]
package cuda-repo-rhel7-8-0-local-cublas-performance-update-8.0.61-1.x86_64 is already installed
[root@nxch101 nVidiaDownload_temp]#

################
List cards, noting that PCI-E slot 0b:00 is the Tesla.
################

[root@nxch101 nVidiaDownload_temp]# lspci -vvv

08:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [GRID K1] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1012
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 115
NUMA node: 0
Region 0: Memory at 91000000 (32-bit, non-prefetchable)
Region 1: Memory at 38008000000 (64-bit, prefetchable)
Region 3: Memory at 38006000000 (64-bit, prefetchable)
Region 5: I/O ports at 3f80
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee004d8 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia

09:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [GRID K1] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1012
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 116
NUMA node: 0
Region 0: Memory at 92000000 (32-bit, non-prefetchable)
Region 1: Memory at 38018000000 (64-bit, prefetchable)
Region 3: Memory at 38016000000 (64-bit, prefetchable)
Region 5: I/O ports at 4f80
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee004f8 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #9, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia

0a:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [GRID K1] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1012
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 117
NUMA node: 0
Region 0: Memory at 93000000 (32-bit, non-prefetchable)
Region 1: Memory at 38028000000 (64-bit, prefetchable)
Region 3: Memory at 38026000000 (64-bit, prefetchable)
Region 5: I/O ports at 5f80
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00718 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #16, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia

0b:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [GRID K1] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1012
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 118
NUMA node: 0
Region 0: Memory at 94000000 (32-bit, non-prefetchable)
Region 1: Memory at 38038000000 (64-bit, prefetchable)
Region 3: Memory at 38036000000 (64-bit, prefetchable)
Region 5: I/O ports at 6f80
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00738 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #17, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest+
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia
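Since the full `lspci -vvv` dump is long, here is a short Python sketch that reduces it to one "Kernel driver in use" entry per device. The function name `drivers_in_use` and the slot pattern are my own assumptions. It expects slots in the short `bus:device.function` form shown above (no PCI domain prefix) and is only a convenience for reading output like this, not a diagnostic tool.

```python
import re

# Device header lines start with a slot such as "08:00.0" (assumed: no domain prefix).
SLOT = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")


def drivers_in_use(lspci_text):
    """Map each PCI slot in `lspci -vvv` output to its 'Kernel driver in use' value."""
    result = {}
    slot = None
    for raw in lspci_text.splitlines():
        line = raw.strip()
        header = SLOT.match(line)
        if header:
            slot = header.group(1)
            result[slot] = None  # stays None if no driver line follows
        elif slot and line.startswith("Kernel driver in use:"):
            result[slot] = line.split(":", 1)[1].strip()
    return result


# Abbreviated excerpt of the output posted above.
sample = """\
08:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [GRID K1] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1012
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia

0b:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [GRID K1] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1012
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia
"""
summary = drivers_in_use(sample)
print(summary)
```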

#######################

What we need is:

  1. Confirmation that the Tesla K10 is showing up to the system correctly by PCI device ID (I don’t have a second card to compare against).
  2. Confirmation that the device drivers are loaded and configured properly, per best practices.
  3. Confirmation that the CUDA driver for the K10 is also loaded per best practice and is correct.

That would leave a configuration issue with the IBM DLE software as the remaining cause, which may be something someone else has run into and which we can check. Please share experience or recommendations.

Thanks,

What lspci displays comes from the PCI ID database:
[url]http://pciids.sourceforge.net/v2.2/pci.ids[/url]
So it doesn’t mean anything if the chip name displayed is wrong. In this case it also (correctly) says GRID K1.
If you have any problems with your setup, please run nvidia-bug-report.sh and attach the output file.
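To illustrate how the name is derived, here is a minimal Python sketch of a lookup in the pci.ids text format: vendor lines start at column 0, device lines are indented by one tab, and subsystem lines by two tabs. The `lookup` function and the three-line excerpt are my own illustration. The excerpt only mirrors what lspci printed in this thread for 10de:0ff2 and is not a copy of the real database file.

```python
# Tiny excerpt in pci.ids layout, reconstructed from the lspci output in this thread.
SAMPLE_PCI_IDS = (
    "# excerpt in pci.ids format\n"
    "10de  NVIDIA Corporation\n"
    "\t0ff2  GK107GL [GRID K1]\n"
)


def lookup(pci_ids_text, vendor_id, device_id):
    """Return (vendor_name, device_name); either may be None if not found."""
    vendor_name = None
    in_vendor = False
    for line in pci_ids_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("\t\t"):
            continue  # subsystem entry, ignored in this sketch
        if line.startswith("\t"):
            if in_vendor:
                dev, _, name = line.strip().partition("  ")
                if dev == device_id:
                    return vendor_name, name.strip()
            continue
        # Top-level line: a vendor entry (or another section such as classes).
        ven, _, name = line.partition("  ")
        in_vendor = ven == vendor_id
        if in_vendor:
            vendor_name = name.strip()
    return vendor_name, None


result = lookup(SAMPLE_PCI_IDS, "10de", "0ff2")
print(result)
```

The point is that the displayed text is a table lookup keyed on the numeric IDs, so an outdated or generic table entry changes the label but not the hardware.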

Regarding your problem with CUDA: your GPU is switched to graphics mode.

08:00.0 VGA compatible controller [0300]

0300 means graphics mode; in compute mode it would show 0302 (3D controller).
Use

gpumodeswitch --gpumode compute

to switch.
Further information:
http://images.nvidia.com/content/pdf/grid/guides/GRID-gpumodeswitch-UserGuide.pdf

I attempted to run gpumodeswitch as suggested. Here is the response. What stands out to me is the “Unconfigured display adapter found” note. I will also run nvidia-bug-report.sh and post the result.

NVIDIA GPU Mode Switch Utility Version 1.23.0
Copyright (C) 2015, NVIDIA Corporation. All Rights Reserved.

Update GPU Mode of all adapters to “compute”?
Press ‘y’ to confirm or ‘n’ to choose adapters or any other key to abort:
y

Updating GPU Mode of all eligible adapters to “compute”

NOTE: Unconfigured display adapter found, device not accessible:
PLX (8747h) (10B5,8747,10B5,8747) H:--:NRM S:00,B:06,PCI,D:00,F:00

GRID K1 (10DE,0FF2,10DE,1012) H:07:SP8 S:00,B:08,PCI,D:00,F:00
Adapter: GRID K1 (10DE,0FF2,10DE,1012) H:07:SP8 S:00,B:08,PCI,D:00,F:00

Identifying EEPROM…
EEPROM ID (C8,4012) : GD GD25Q20 2.7-3.6V 2048Kx1S, page
Cannot set GPU mode for this adapter

GRID K1 (10DE,0FF2,10DE,1012) H:07:SP9 S:00,B:09,PCI,D:00,F:00
Adapter: GRID K1 (10DE,0FF2,10DE,1012) H:07:SP9 S:00,B:09,PCI,D:00,F:00

Identifying EEPROM…
EEPROM ID (C8,4012) : GD GD25Q20 2.7-3.6V 2048Kx1S, page
Cannot set GPU mode for this adapter

GRID K1 (10DE,0FF2,10DE,1012) H:07:SP16 S:00,B:0A,PCI,D:00,F:00
Adapter: GRID K1 (10DE,0FF2,10DE,1012) H:07:SP16 S:00,B:0A,PCI,D:00,F:00

Identifying EEPROM…
EEPROM ID (C8,4012) : GD GD25Q20 2.7-3.6V 2048Kx1S, page
Cannot set GPU mode for this adapter

GRID K1 (10DE,0FF2,10DE,1012) H:07:SP17 S:00,B:0B,PCI,D:00,F:00
Adapter: GRID K1 (10DE,0FF2,10DE,1012) H:07:SP17 S:00,B:0B,PCI,D:00,F:00

Identifying EEPROM…
EEPROM ID (C8,4012) : GD GD25Q20 2.7-3.6V 2048Kx1S, page
Cannot set GPU mode for this adapter

Bug report can be found here: https://ufile.io/ceida

(If there’s a better way to upload/share, let me know.)

Sorry, I was mistaken there: gpumodeswitch only applies to Maxwell, not to Kepler. Forget about it.
The ‘unconfigured adapter’ is just the onboard server video, a Matrox G200; you can ignore that.
Does Linux run on bare metal, or do you use virtualization?
You can attach files to existing posts: hovering above your post will reveal the options, next to edit.

Sorry, just found in your wall of text that’s bare metal.

GRID K1 has compute capability 3.0, I think; the DLE requirements say Tesla K80, which has compute capability 3.7. Did you ask them whether your K1 is supported or needs some tweaks at setup?
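To make the comparison concrete, here is a tiny Python sketch that treats compute capability as a (major, minor) pair and checks a card against a minimum. The two values are the ones quoted in this post. Treating the K80's 3.7 as the DLE minimum is purely an assumption for illustration, since the thread does not state what IBM actually requires.

```python
# Compute capabilities as quoted in this thread, stored as (major, minor).
COMPUTE_CAPABILITY = {
    "GRID K1": (3, 0),
    "Tesla K80": (3, 7),
}

# Assumption for illustration only: the K80 listed in the DLE requirements
# marks the minimum. IBM's real requirement is not confirmed in this thread.
ASSUMED_MINIMUM = COMPUTE_CAPABILITY["Tesla K80"]


def meets_minimum(card, minimum=ASSUMED_MINIMUM):
    """True if the card's (major, minor) capability is at least `minimum`."""
    return COMPUTE_CAPABILITY[card] >= minimum


for card in COMPUTE_CAPABILITY:
    print(card, COMPUTE_CAPABILITY[card], meets_minimum(card))
```

Tuples are compared element by element, which avoids the pitfalls of comparing version numbers as floats.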

Sorry for the delayed response. Let me verify that.

We have tested a GTX 1080; however, its compute capability is much higher (6.1), per NVIDIA’s “CUDA GPUs - Compute Capability” developer page. With that being said, let me check with IBM on their support for the mentioned K10 card. I’m almost positive we asked in the past, but I don’t want to give you incorrect information, so I will reach out to them again just to be sure.

Thanks for your patience and help!