Titan V TensorFlow performance

I just got TensorFlow working with my new Titan V, but I’ve run into some unexpected performance issues. I have both a GeForce GTX 1060 6GB and a Titan V installed in the same system. Neither card is used for display. After running some tutorial jobs, it appears that my 1060 is more than 4 times faster than the Titan V. I also removed the 1060 from the system to verify that jobs still run slowly on the Titan V by itself.

To reproduce, assume a working TensorFlow installation (following the directions at https://www.tensorflow.org/install/install_linux ) and use the examples from https://github.com/tensorflow/models/tree/master/tutorials/image/mnist

I’m running Ubuntu 17.04 with version 387.34 of the NVIDIA driver. Here are some gory system details, followed by timings from running on the Titan V and then on the 1060.

lspci | grep NVIDIA

0a:00.0 VGA compatible controller: NVIDIA Corporation GV100 (rev a1)
0a:00.1 Audio device: NVIDIA Corporation Device 10f2 (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
41:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)

nvidia-smi

Tue Dec 19 18:00:44 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Graphics Device     Off  | 00000000:0A:00.0 Off |                  N/A |
| 28%   42C    P0    35W / 250W |      0MiB / 12058MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106…    Off  | 00000000:41:00.0 Off |                  N/A |
|  0%   52C    P5    14W / 150W |      0MiB /  6070MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

pip list | grep tensorflow

tensorflow-gpu (1.4.1)

sudo dmidecode

dmidecode 3.1

Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.
Table at 0x000EA800.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: 0804
Release Date: 11/30/2017
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 16 MB
Characteristics:
PCI is supported
APM is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 kB floppy services are supported (int 13h)
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 5.13

Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: System manufacturer
Product Name: System Product Name
Version: System Version
Serial Number: System Serial Number
UUID: 3E098800-8A17-11E7-A55C-107B44927248
Wake-up Type: Power Switch
SKU Number: SKU
Family: To be filled by O.E.M.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: ROG ZENITH EXTREME
Version: Rev 1.xx
Serial Number: 170808564400306
Asset Tag: Default string
Features:
Board is a hosting board
Board is removable
Board is replaceable
Location In Chassis: Default string
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0

Handle 0x0003, DMI type 3, 22 bytes
Chassis Information
Manufacturer: Default string
Type: Desktop
Lock: Not Present
Version: Default string
Serial Number: Default string
Asset Tag: Default string
Boot-up State: Safe
Power Supply State: Safe
Thermal State: Safe
Security Status: None
OEM Information: 0x00000000
Height: Unspecified
Number Of Power Cords: 1
Contained Elements: 0
SKU Number: Default string

Handle 0x0004, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: M.2(WIFI)
Internal Connector Type: None
External Reference Designator: M.2(WIFI)
External Connector Type: Other
Port Type: Network Port

Handle 0x0005, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: U31G1_5678
Internal Connector Type: None
External Reference Designator: U31G1_5678
External Connector Type: Access Bus (USB)
Port Type: USB

Handle 0x0006, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: U31G1_34
Internal Connector Type: None
External Reference Designator: U31G1_34
External Connector Type: Access Bus (USB)
Port Type: USB

Handle 0x0007, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: U31G1_12
Internal Connector Type: None
External Reference Designator: U31G1_12
External Connector Type: Access Bus (USB)
Port Type: USB

Handle 0x0008, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: U31G2_EC1
Internal Connector Type: None
External Reference Designator: U31G2_EC1
External Connector Type: Access Bus (USB)
Port Type: USB

Handle 0x0009, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: U31G2_E2
Internal Connector Type: None
External Reference Designator: U31G2_E2
External Connector Type: Access Bus (USB)
Port Type: USB

Handle 0x000A, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: LAN
Internal Connector Type: None
External Reference Designator: LAN
External Connector Type: RJ-45
Port Type: Network Port

Handle 0x000B, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: AUDIO
Internal Connector Type: None
External Reference Designator: AUDIO
External Connector Type: Other
Port Type: Audio Port

Handle 0x000C, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: CPU_FAN
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x000D, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: CPU_OPT
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x000E, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: H_AMP_FAN
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x000F, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: CHA_FAN1
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0010, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: CHA_FAN2
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0011, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: W_PUMP+
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0012, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: W_FLOW
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0013, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: W_IN
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0014, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: W_OUT
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0015, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: WB_SENSOR
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0016, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: COV_FAN
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0017, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: EXT_FAN
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0018, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: T_SENSOR1
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0019, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: T_SENSOR2
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x001A, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: SATA6G_12
Internal Connector Type: SAS/SATA Plug Receptacle
External Reference Designator: Not Specified
External Connector Type: None
Port Type: SATA

Handle 0x001B, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: SATA6G_34
Internal Connector Type: SAS/SATA Plug Receptacle
External Reference Designator: Not Specified
External Connector Type: None
Port Type: SATA

Handle 0x001C, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: SATA6G_56
Internal Connector Type: SAS/SATA Plug Receptacle
External Reference Designator: Not Specified
External Connector Type: None
Port Type: SATA

Handle 0x001D, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: M.2_1(SOCKET3)
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x001E, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: U.2
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x001F, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: U31G2_1
Internal Connector Type: Access Bus (USB)
External Reference Designator: Not Specified
External Connector Type: None
Port Type: USB

Handle 0x0020, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: U31G1_910
Internal Connector Type: Access Bus (USB)
External Reference Designator: Not Specified
External Connector Type: None
Port Type: USB

Handle 0x0021, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: U31G1_1112
Internal Connector Type: Access Bus (USB)
External Reference Designator: Not Specified
External Connector Type: None
Port Type: USB

Handle 0x0022, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: U31G1_1314
Internal Connector Type: Access Bus (USB)
External Reference Designator: Not Specified
External Connector Type: None
Port Type: USB

Handle 0x0023, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: RGE_HEADER1
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0024, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: RGE_HEADER2
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0025, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: LED_CON1
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0026, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: LED_CON2
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0027, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: LED_CON3
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0028, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: OLED_HEADER
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x0029, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: ADD_HEADER
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x002A, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: AAFP
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x002B, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: TMP
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x002C, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: F_PANEL
Internal Connector Type: Other
External Reference Designator: Not Specified
External Connector Type: None
Port Type: Other

Handle 0x002D, DMI type 9, 17 bytes
System Slot Information
Designation: PCIEX16_1
Type: x16 PCI Express
Current Usage: In Use
Length: Long
ID: 0
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:00:00.0

Handle 0x002E, DMI type 9, 17 bytes
System Slot Information
Designation: PCIEX8_2
Type: x8 PCI Express
Current Usage: In Use
Length: Long
ID: 1
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:00:00.0

Handle 0x002F, DMI type 9, 17 bytes
System Slot Information
Designation: PCIEX4
Type: x4 PCI Express
Current Usage: In Use
Length: Short
ID: 2
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:00:00.0

Handle 0x0030, DMI type 9, 17 bytes
System Slot Information
Designation: PCIEX16_3
Type: x16 PCI Express
Current Usage: In Use
Length: Long
ID: 3
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:00:00.0

Handle 0x0031, DMI type 9, 17 bytes
System Slot Information
Designation: PCIEX1
Type: x1 PCI Express
Current Usage: In Use
Length: Short
ID: 4
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:00:00.0

Handle 0x0032, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE_X8/X4_4
Type: x8 PCI Express
Current Usage: In Use
Length: Long
ID: 5
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:00:00.0

Handle 0x0033, DMI type 10, 6 bytes
On Board Device Information
Type: Video
Status: Enabled
Description: To Be Filled By O.E.M.

Handle 0x0034, DMI type 11, 5 bytes
OEM Strings
String 1: Default string
String 2: Default string
String 3: PIPEWORKS
String 4: Default string

Handle 0x0035, DMI type 12, 5 bytes
System Configuration Options
Option 1: Default string
Option 2: Default string
Option 3: Default string
Option 4: Default string

Handle 0x0036, DMI type 32, 20 bytes
System Boot Information
Status: No errors detected

Handle 0x0037, DMI type 40, 32 bytes
Additional Information 1
Referenced Handle: 0x0033
Referenced Offset: 0x01
String: To Be Filled By O.E.M. 1
Value: 0x00000000
Additional Information 2
Referenced Handle: 0x0001
Referenced Offset: 0x0f
String: Not Specified
Value: 0x0000001e
Additional Information 3
Referenced Handle: 0x003f
Referenced Offset: 0x01
String: Mordor
Value: 0x00000000

Handle 0x0038, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard IGD
Type: Video
Status: Enabled
Type Instance: 1
Bus Address: 0000:00:02.0

Handle 0x0039, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard LAN
Type: Ethernet
Status: Enabled
Type Instance: 1
Bus Address: 0000:00:19.0

Handle 0x003A, DMI type 41, 11 bytes
Onboard Device
Reference Designation: Onboard 1394
Type: Other
Status: Enabled
Type Instance: 1
Bus Address: 0000:03:1c.2

Handle 0x003B, DMI type 18, 23 bytes
32-bit Memory Error Information
Type: OK
Granularity: Unknown
Operation: Unknown
Vendor Syndrome: Unknown
Memory Array Address: Unknown
Device Address: Unknown
Resolution: Unknown

Handle 0x003C, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 512 GB
Error Information Handle: 0x003B
Number Of Devices: 8

Handle 0x003D, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Array Handle: 0x003C
Partition Width: 8

Handle 0x003E, DMI type 7, 19 bytes
Cache Information
Socket Designation: L1 - Cache
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Back
Location: Internal
Installed Size: 1536 kB
Maximum Size: 1536 kB
Supported SRAM Types:
Pipeline Burst
Installed SRAM Type: Pipeline Burst
Speed: 1 ns
Error Correction Type: Multi-bit ECC
System Type: Unified
Associativity: 8-way Set-associative

Handle 0x003F, DMI type 7, 19 bytes
Cache Information
Socket Designation: L2 - Cache
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Write Back
Location: Internal
Installed Size: 8192 kB
Maximum Size: 8192 kB
Supported SRAM Types:
Pipeline Burst
Installed SRAM Type: Pipeline Burst
Speed: 1 ns
Error Correction Type: Multi-bit ECC
System Type: Unified
Associativity: 8-way Set-associative

Handle 0x0040, DMI type 7, 19 bytes
Cache Information
Socket Designation: L3 - Cache
Configuration: Enabled, Not Socketed, Level 3
Operational Mode: Write Back
Location: Internal
Installed Size: 32768 kB
Maximum Size: 32768 kB
Supported SRAM Types:
Pipeline Burst
Installed SRAM Type: Pipeline Burst
Speed: 1 ns
Error Correction Type: Multi-bit ECC
System Type: Unified
Associativity: 32-way Set-associative

Handle 0x0041, DMI type 4, 48 bytes
Processor Information
Socket Designation: SP3r2
Type: Central Processor
Family: Zen
Manufacturer: Advanced Micro Devices, Inc.
ID: 11 0F 80 00 FF FB 8B 17
Signature: Family 23, Model 1, Stepping 1
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
MMX (MMX technology supported)
FXSR (FXSAVE and FXSTOR instructions supported)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
HTT (Multi-threading)
Version: AMD Ryzen Threadripper 1950X 16-Core Processor
Voltage: 1.1 V
External Clock: 100 MHz
Max Speed: 4200 MHz
Current Speed: 3400 MHz
Status: Populated, Enabled
Upgrade: Socket SP3r2
L1 Cache Handle: 0x003E
L2 Cache Handle: 0x003F
L3 Cache Handle: 0x0040
Serial Number: Unknown
Asset Tag: Unknown
Part Number: Unknown
Core Count: 16
Core Enabled: 16
Thread Count: 32
Characteristics:
64-bit capable
Multi-Core
Hardware Thread
Execute Protection
Enhanced Virtualization
Power/Performance Control

Handle 0x0042, DMI type 18, 23 bytes
32-bit Memory Error Information
Type: OK
Granularity: Unknown
Operation: Unknown
Vendor Syndrome: Unknown
Memory Array Address: Unknown
Device Address: Unknown
Resolution: Unknown

Handle 0x0043, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003C
Error Information Handle: 0x0042
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: F4-3000C14-16GVRD
Rank: 2
Configured Clock Speed: 1067 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

Handle 0x0044, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x0043
Memory Array Mapped Address Handle: 0x003D
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown

Handle 0x0045, DMI type 18, 23 bytes
32-bit Memory Error Information
Type: OK
Granularity: Unknown
Operation: Unknown
Vendor Syndrome: Unknown
Memory Array Address: Unknown
Device Address: Unknown
Resolution: Unknown

Handle 0x0046, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003C
Error Information Handle: 0x0045
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: F4-3000C14-16GVRD
Rank: 2
Configured Clock Speed: 1067 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

Handle 0x0047, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x0046
Memory Array Mapped Address Handle: 0x003D
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown

Handle 0x0048, DMI type 18, 23 bytes
32-bit Memory Error Information
Type: OK
Granularity: Unknown
Operation: Unknown
Vendor Syndrome: Unknown
Memory Array Address: Unknown
Device Address: Unknown
Resolution: Unknown

Handle 0x0049, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003C
Error Information Handle: 0x0048
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL B
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: F4-3000C14-16GVRD
Rank: 2
Configured Clock Speed: 1067 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

Handle 0x004A, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x0049
Memory Array Mapped Address Handle: 0x003D
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown

Handle 0x004B, DMI type 18, 23 bytes
32-bit Memory Error Information
Type: OK
Granularity: Unknown
Operation: Unknown
Vendor Syndrome: Unknown
Memory Array Address: Unknown
Device Address: Unknown
Resolution: Unknown

Handle 0x004C, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003C
Error Information Handle: 0x004B
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL B
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: F4-3000C14-16GVRD
Rank: 2
Configured Clock Speed: 1067 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

Handle 0x004D, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x004C
Memory Array Mapped Address Handle: 0x003D
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown

Handle 0x004E, DMI type 18, 23 bytes
32-bit Memory Error Information
Type: OK
Granularity: Unknown
Operation: Unknown
Vendor Syndrome: Unknown
Memory Array Address: Unknown
Device Address: Unknown
Resolution: Unknown

Handle 0x004F, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003C
Error Information Handle: 0x004E
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL C
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: F4-3000C14-16GVRD
Rank: 2
Configured Clock Speed: 1067 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

Handle 0x0050, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x004F
Memory Array Mapped Address Handle: 0x003D
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown

Handle 0x0051, DMI type 18, 23 bytes
32-bit Memory Error Information
Type: OK
Granularity: Unknown
Operation: Unknown
Vendor Syndrome: Unknown
Memory Array Address: Unknown
Device Address: Unknown
Resolution: Unknown

Handle 0x0052, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003C
Error Information Handle: 0x0051
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL C
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: F4-3000C14-16GVRD
Rank: 2
Configured Clock Speed: 1067 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

Handle 0x0053, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x0052
Memory Array Mapped Address Handle: 0x003D
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown

Handle 0x0054, DMI type 18, 23 bytes
32-bit Memory Error Information
Type: OK
Granularity: Unknown
Operation: Unknown
Vendor Syndrome: Unknown
Memory Array Address: Unknown
Device Address: Unknown
Resolution: Unknown

Handle 0x0055, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003C
Error Information Handle: 0x0054
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL D
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: F4-3000C14-16GVRD
Rank: 2
Configured Clock Speed: 1067 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

Handle 0x0056, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x0055
Memory Array Mapped Address Handle: 0x003D
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown

Handle 0x0057, DMI type 18, 23 bytes
32-bit Memory Error Information
Type: OK
Granularity: Unknown
Operation: Unknown
Vendor Syndrome: Unknown
Memory Array Address: Unknown
Device Address: Unknown
Resolution: Unknown

Handle 0x0058, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003C
Error Information Handle: 0x0057
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL D
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: F4-3000C14-16GVRD
Rank: 2
Configured Clock Speed: 1067 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V

Handle 0x0059, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x0058
Memory Array Mapped Address Handle: 0x003D
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown

Handle 0x005A, DMI type 127, 4 bytes
End Of Table

Job 1, using the Titan V:

set | grep CUDA

CUDA_HOME=/usr/local/cuda
CUDA_VISIBLE_DEVICES=GPU-32219c7a-dcc6-7405-4c0a-a0ae7bada1cc

(job output truncated)

time python convolutional.py

Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
2017-12-19 17:46:59.801747: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-12-19 17:47:02.117832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Graphics Device major: 7 minor: 0 memoryClockRate(GHz): 1.455
pciBusID: 0000:0a:00.0
totalMemory: 11.78GiB freeMemory: 11.36GiB
2017-12-19 17:47:02.117867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:0a:00.0, compute capability: 7.0)
Initialized!
Step 0 (epoch 0.00), 2262.5 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.5%
Step 100 (epoch 0.12), 4.0 ms
Minibatch loss: 3.226, learning rate: 0.010000
Minibatch error: 4.7%
Validation error: 8.3%

Step 8500 (epoch 9.89), 4.0 ms
Minibatch loss: 1.608, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.9%
Test error: 0.8%

real 4m22.197s
user 4m23.014s
sys 0m10.937s

Now I switch to the 1060 and run the same job:

nvidia-smi -L ; export CUDA_VISIBLE_DEVICES=

GPU 0: Graphics Device (UUID: GPU-32219c7a-dcc6-7405-4c0a-a0ae7bada1cc)
GPU 1: GeForce GTX 1060 6GB (UUID: GPU-54d97f12-a6ea-dcc9-2924-d292e86d3573)

export CUDA_VISIBLE_DEVICES=GPU-54d97f12-a6ea-dcc9-2924-d292e86d3573

time python convolutional.py

Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
2017-12-19 17:55:19.284550: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-12-19 17:55:21.185571: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-12-19 17:55:21.185887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.835
pciBusID: 0000:41:00.0
totalMemory: 5.93GiB freeMemory: 5.85GiB
2017-12-19 17:55:21.185904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:41:00.0, compute capability: 6.1)
Initialized!
Step 0 (epoch 0.00), 36.2 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.6%

Step 8500 (epoch 9.89), 6.1 ms
Minibatch loss: 1.604, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 1.0%
Test error: 0.8%

real 0m58.582s
user 0m51.250s
sys 0m16.414s

The job using the Titan V took 4 minutes 22 seconds, and the job using the 1060 took 58 seconds. My understanding is that the Titan V is supposed to be significantly faster than the 1060 for machine learning applications. Can someone help me understand what I’ve done wrong?

Thanks,

It appears that I need to figure out how to get tensorflow to work with later versions of cuda and cuDNN.

I am not going to dig through the posted wall of text. The Titan V is a very new GPU, and as such will need to be run with the latest:

(1) CUDA version
(2) CUDNN version
(3) driver package

for best results. What software is installed? Based on other posts in these forums: the Tensorflow binaries being distributed right now don’t support the latest GPU software, instead you need to build Tensorflow yourself. Have you done that?

It appears that actually it is training faster on the Titan V:

Step 8500 (epoch 9.89), 4.0 ms
                        ^^^^^^

as compared to GTX 1060:

Step 8500 (epoch 9.89), 6.1 ms
                        ^^^^^^

However, some things are taking a lot longer on Titan V:

Step 0 (epoch 0.00), 2262.5 ms

vs GTX 1060:

Step 0 (epoch 0.00), 36.2 ms

Since we’re only talking about 1 minute or 4 minutes here, you might want to pay close attention to the TF output, since it includes timing, to see where the missing 3 minutes is. It doesn’t appear to be incurred during the training steps themselves (right?). For example, is the Titan V run spending several minutes getting to the point where it prints out the time for the first step? If so, that is a clue.
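One rough way to see where the missing time goes is to multiply the steady-state per-step time by the number of steps and compare the result with the wall-clock `real` time. This is only a sketch: the figures are copied from the two runs posted in this thread, the function name is made up for illustration, and treating the last reported per-step time as representative of all ~8500 steps is an assumption.

```python
def unaccounted_seconds(wall_seconds, steps, ms_per_step):
    """Wall-clock time that steady-state training steps do not explain."""
    return wall_seconds - steps * ms_per_step / 1000.0

# Figures from the runs posted above ("Step 8500" line and total run time).
# Assumption: the last reported per-step time holds for every step.
titan_v = unaccounted_seconds(4 * 60 + 22, 8500, 4.0)
gtx_1060 = unaccounted_seconds(58.582, 8500, 6.1)

print("Titan V : %.1f s outside the training steps" % titan_v)
print("GTX 1060: %.1f s outside the training steps" % gtx_1060)
```

On these figures the training steps account for only about 34 s of the Titan V run, leaving roughly 228 s elsewhere, against about 7 s for the GTX 1060. That is consistent with a large one-time start-up cost rather than slow training.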

  • Is your tensorflow compiled against CUDA 8 or CUDA 9?

If it is compiled against CUDA 8 (you would either have to compile from sources or pull a special wheel: https://github.com/tensorflow/tensorflow/issues/14244 to get TF 1.4 compiled against CUDA 9) then you may be experiencing a substantial JIT-compilation delay. It’s not unheard of for JIT to add minutes of start-up time to an application that uses many libraries (like TF). The best practice is to build TF from sources and make sure to specify the compute capability of the GPUs you are going to use:

https://www.tensorflow.org/performance/performance_guide#building_and_installing_from_source

You would need CUDA 9 and build with compute capability 7.0 for the Titan V (plus 6.1 for your GTX 1060).

  • I notice your motherboard has a number of slots. You’d want to make sure the Titan V is plugged into a x16 slot.

  • I don’t really expect MNIST type work to be a great differentiator between GPUs. That’s really small input data (usually 28x28) and the tutorial type models are usually also small. A bigger training workload will help show the difference, but on a training step basis, your Titan V already appears to be noticeably faster, to me. Many folks when benchmarking simply discard everything up to the first few training steps as “setup time” and are focused on the per-step performance. In that case, Titan V appears to be faster.
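The "discard the setup time" approach to benchmarking mentioned above can be sketched as a small timing helper. This is an illustrative sketch for Python 3, not part of the MNIST tutorial; `mean_step_ms` and `fake_step` are hypothetical names, and the warm-up and step counts are arbitrary defaults.

```python
import time

def mean_step_ms(step_fn, warmup_steps=10, timed_steps=100):
    """Average time per call of step_fn in ms, ignoring the first warmup_steps calls."""
    for _ in range(warmup_steps):
        step_fn()  # one-time costs (setup, JIT compilation) land here
    start = time.perf_counter()
    for _ in range(timed_steps):
        step_fn()
    return 1000.0 * (time.perf_counter() - start) / timed_steps

# Hypothetical stand-in for one training step. With TF 1.x it would be
# replaced by a callable that runs the training op once in a session.
def fake_step():
    sum(i * i for i in range(10000))

print("%.3f ms per step" % mean_step_ms(fake_step))
```

Comparing GPUs on this per-step figure separates steady-state throughput from start-up overhead, which is the distinction that matters in this thread.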

Finally, you could try NGC. It has optimized containers with TF included, set up for cc 7.0 and cc 6.1 devices:

http://docs.nvidia.com/ngc/index.html

http://docs.nvidia.com/ngc/ngc-titan-setup-guide/index.html

Based on txbob’s analysis, I’d say the most likely scenario is use of Tensorflow built against CUDA 8 (which has no support for the Volta architecture used by the Titan V), leading to substantial overhead for JIT compilation.

Yes, as I mentioned in my immediate follow up, I started building tensorflow with current libs.

I expect things to work once built with cuda 9.1 and cuDNN 7.

Thanks for all the pointers, will follow up with results shortly.

decided to install ubuntu 16.04, then ran into this:

which depends on this:

so I guess I’ll keep a watch for tensorflow commits.

Thanks for the pointers so far!

So the workaround for the file location issue given at the GitHub link doesn’t work?

My tired eyes missed the workaround altogether - thank you so much for pointing that out. That was the missing key.

Tensorflow has now compiled with CUDA 9.1 and cuDNN 7.0.5. What was taking ~4.5 minutes now takes ~0.5 minutes (given the cost of the card, the work involved, and the marketing claims, I’m still quite disappointed). Hopefully I’ll find use cases where this card blows away the 1060. The card is installed in an x16 PCIe slot.

mike@einstein:~/models/tutorials/image/mnist$ time python convolutional.py
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
2017-12-20 07:59:31.760083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1202] Found device 0 with properties:
name: Graphics Device major: 7 minor: 0 memoryClockRate(GHz): 1.455
pciBusID: 0000:0a:00.0
totalMemory: 11.78GiB freeMemory: 11.36GiB
2017-12-20 07:59:31.760115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1296] Adding visible gpu device 0
2017-12-20 07:59:31.936593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10984 MB memory) -> physical GPU (device: 0, name: Graphics Device, pci bus id: 0000:0a:00.0, compute capability: 7.0)
Initialized!
Step 0 (epoch 0.00), 32.4 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.5%

Step 8500 (epoch 9.89), 3.2 ms
Minibatch loss: 1.607, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.8%
Test error: 0.8%

real 0m32.305s
user 0m31.320s
sys 0m9.484s

As txbob pointed out, the data set you are working with is small, and likely too small to effectively utilize a giant GPU like the Titan V.

It is also quite possible that the deep-learning software is not fully optimized for the Volta architecture yet (e.g. use of FP16), and not all types of machine learning may benefit equally from Volta’s special instructions targeted at deep learning (-> “Tensor Cores”) in which case Volta effectively turns into a super-charged Pascal.

If the marketing folks weren’t highlighting the best-case performance, they wouldn’t be doing their job. CUDA users should look carefully before leaping onto the currently expensive Volta-based products.

[Later:]
https://www.xcelerit.com/computing-benchmarks/insights/benchmarks-deep-learning-nvidia-p100-vs-v100-gpu/

nvidia: our new cards are between 5x and 9x faster than our old cards! (https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/ “Every industry needs AI, and with this massive leap forward in speed, AI can now be applied to every industry. Equipped with 640 Tensor Cores, Volta delivers over 100 Teraflops per second (TFLOPS) of deep learning performance, over a 5X increase compared to prior generation NVIDIA Pascal™ architecture.” – https://nvidianews.nvidia.com/news/nvidia-titan-v-transforms-the-pc-into-ai-supercomputer “New Tensor Cores designed specifically for deep learning deliver up to 9x higher peak teraflops. …”)

customer: uhh, no they aren’t.

nvidia: marketing, lol!

customer: stupid me. :( Yes, customers should think carefully before believing marketing. In any case, everything is up and running, so I’m going to go play with some more benchmarks.

The operative phrase in marketing materials is typically: up to … times faster. And since there are cases where that holds true, they are not lying. Just recently we had someone post here that they got 90% of theoretical peak out of the tensor cores (that’s from memory, but I don’t think I am far off with the 90%). My understanding is that the Tensor Cores in V100 are most effective when doing deep learning with large-ish images. Presumably this is where most of the money in that market is (think Baidu, Facebook, Google, …) so it would make sense for NVIDIA to invest in features for that market.

This is the problem with all highly-specialized ISA features (Intel has a bunch of these as well): If they are a perfect fit for a given use case, and their use dominates the application performance, you get great speedup. For all other cases, the cousin of Amdahl’s Law raises its ugly head: If you speed up 50% of app time by 10,000x with dedicated hardware, you are still only getting a 2x speedup at app level.
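The arithmetic in that example can be checked with the usual Amdahl-style formula. A minimal sketch, with an illustrative function name:

```python
def app_speedup(accelerated_fraction, factor):
    """Amdahl's Law: overall speedup when only part of the run time is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# The example from the post: half of the app time sped up 10,000x.
print(round(app_speedup(0.5, 10000), 4))   # → 1.9998
# Even an infinitely fast unit cannot beat 1 / (1 - fraction).
print(app_speedup(0.5, float("inf")))      # → 2.0
```

The untouched half of the run time sets the ceiling, no matter how fast the dedicated hardware is.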

The effective end of Moore’s Law combined with the physical limit on reticle size (which Volta V100 has reached, for example) means that performance increases from general purpose hardware will be small going forward. This is why I expect to see more and more specialized ISA features in the next few years, striving to at least give meaningful boosts to some applications.

understood! I’m way more excited about what I’m going to get done with the card than how much faster it is than some other card - sorry for taking the conversation in a weird direction!

Thank you so much for your time, I appreciate the help!

Frankly, I would be most excited about the memory bandwidth due to the use of HBM2. As FLOPS have become “too cheap to meter” in recent years, many applications have become (partially) memory bound.

What some people do not appreciate sufficiently is that a high-end GPU calls for a high-end host system for best results: high single-thread performance to minimize host-side software overhead, large high-throughput system memory, quite possibly NVMe mass storage. NVIDIA’s DGX systems are good examples of how to build well-balanced platforms for high-end GPUs.

Did anybody experience long (3-4 minutes) delay when starting a task with TensorFlow?

I’m trying to evaluate Titan V with the great tensorflow benchmarks repo (https://github.com/tensorflow/benchmarks/), but during initialization, after this line:

2017-12-22 10:39:03.469975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10744 MB memory) -> physical GPU (device: 0, name: Graphics Device, pci bus id: 0000:01:00.0, compute capability: 7.0)

and after

Running warm-up

, execution seems to stop for 3 minutes.

May be related: when using FP16, which is expected to work well, I get a lot of:

2017-12-22 10:34:58.755828: E tensorflow/core/grappler/optimizers/constant_folding.cc:1272] Unexpected type half

When I perform the same test on the same computer on 1080 Ti, there’s no delay!

Used:
Titan V, TensorFlow master, TensorFlow benchmarks master, CUDA 9.0, Driver 387.34, Ubuntu 16.04.
TensorFlow’s Compute capability is hopefully 6.1 & 7.0, but I don’t know any way to check it. Also, I wanted to attach logs (nvidia-bug-report.log.gz), but I couldn’t find a way.

Thanks!

Sounds like the

If it isn’t, then the JIT compiling already referred to in this thread may be the issue.

I had to install cuda-9.1 and cuDNN-7.0.5, then recompile tensorflow 1.4.1 from source to fix the initialization issues. If you install from source, make sure to use bazel 0.8.1 (or earlier, 0.9 breaks) and make sure to ln the math_functions.hpp file as noted earlier in this thread.

Thanks! I had cuDNN-7.0.3, and that caused the delay. After compiling TF with cuDNN-7.0.5 it’s gone! Great!

We’re still seeing JIT compilation happen on our Titan V, with CUDA 9.1 and cuDNN 7.0.5 installed. We use the TF C++ API in our stack. Is anyone else still seeing it happen with this configuration? There is about 60-80 seconds of delay before our inferencer starts working. We have 2 GPUs but are using CUDA_VISIBLE_DEVICES to force our Titan V to be used.