Hello.
Take a look at dmesg. There’s something wrong with your RAID.
Yes, you are right: I use software RAID (mdadm) for the root filesystem, but I don’t see any RAID/disk errors. The problem with nvidia-persistenced occurred yesterday, Apr 5, as you can see above. Here is the dmesg output for the last two weeks:
# dmesg -T | tail -20
[Thu Mar 23 03:51:59 2017] Process accounting resumed
[Fri Mar 24 03:52:38 2017] Process accounting resumed
[Sat Mar 25 03:53:19 2017] Process accounting resumed
[Sun Mar 26 03:53:58 2017] Process accounting resumed
[Mon Mar 27 03:54:38 2017] Process accounting resumed
[Tue Mar 28 03:55:19 2017] Process accounting resumed
[Tue Mar 28 05:20:37 2017] perf interrupt took too long (12135 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
[Tue Mar 28 14:07:38 2017] tee (102551): drop_caches: 1
[Wed Mar 29 03:56:00 2017] traps: atop[92487] trap divide error ip:4073e6 sp:7ffcd099f2a0 error:0 in atop[400000+26000]
[Wed Mar 29 04:43:01 2017] tee (127348): drop_caches: 1
[Thu Mar 30 03:56:39 2017] Process accounting resumed
[Fri Mar 31 03:57:20 2017] Process accounting resumed
[Fri Mar 31 12:59:37 2017] perf interrupt took too long (6110 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[Fri Mar 31 15:59:34 2017] perf interrupt took too long (5861 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[Sat Apr 1 03:57:59 2017] Process accounting resumed
[Sun Apr 2 03:58:41 2017] Process accounting resumed
[Mon Apr 3 03:59:20 2017] Process accounting resumed
[Tue Apr 4 04:00:01 2017] Process accounting resumed
[Wed Apr 5 04:00:41 2017] Process accounting resumed
[Thu Apr 6 04:01:21 2017] Process accounting resumed
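Since `tail -20` only shows the most recent lines, filtering the whole kernel log for storage-related messages is a more thorough check. Below is a minimal Python sketch of such a filter. The regular expression is my own assumption about what md, libata and SCSI disk messages usually look like, not an exhaustive list, and the helper name `storage_lines` is made up for illustration. The sample text reuses a few lines from the output above.

```python
import re

# Assumed patterns for storage-related kernel messages (md arrays, ATA ports,
# SCSI disk names, generic I/O errors). Extend this for your own hardware.
STORAGE_PATTERN = re.compile(
    r"\bmd\d*:|\bmd/raid|\bata\d+|\bsd[a-z]\d*\b|I/O error",
    re.IGNORECASE,
)


def storage_lines(dmesg_text):
    """Return the kernel log lines that look storage-related."""
    return [line for line in dmesg_text.splitlines() if STORAGE_PATTERN.search(line)]


# A few lines copied from the dmesg output in this post.
SAMPLE = """\
[Tue Mar 28 14:07:38 2017] tee (102551): drop_caches: 1
[Wed Mar 29 03:56:00 2017] traps: atop[92487] trap divide error ip:4073e6 sp:7ffcd099f2a0 error:0 in atop[400000+26000]
[Fri Mar 31 12:59:37 2017] perf interrupt took too long (6110 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[Thu Apr 6 04:01:21 2017] Process accounting resumed
"""

# Illustrative line only, NOT taken from this server: the kind of message
# md prints when it kicks a member out of a RAID1 array.
EXAMPLE_FAILURE = "md/raid1:md1: Disk failure on sdb2, disabling device."

for line in storage_lines(SAMPLE + EXAMPLE_FAILURE + "\n"):
    print(line)
```

To run it against the live log, pass the text of `dmesg -T` to `storage_lines` instead of `SAMPLE`, for example by saving the output to a file first.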
There are no messages about RAID/disk devices. As far as I can see via mdadm and smartctl, the devices are OK:
# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/md1 916G 22G 848G 3% /
# mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Wed Sep 23 08:53:37 2015
Raid Level : raid1
Array Size : 975630144 (930.43 GiB 999.05 GB)
Used Dev Size : 975630144 (930.43 GiB 999.05 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Thu Apr 6 09:04:42 2017
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : rs-20:1
UUID : ddbdba12:40a33e72:4c129bb2:a1b9234a
Events : 328
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
# smartctl -a /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.0-30-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Constellation ES.3
Device Model: ST1000NM0033-9ZM173
Serial Number: Z1W314K4
LU WWN Device Id: 5 000c50 079d9207c
Firmware Version: SN04
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Apr 6 09:06:40 2017 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 609) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 119) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 082 063 044 Pre-fail Always - 184059370
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 14
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 091 060 030 Pre-fail Always - 1469655243
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 16486
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 14
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 073 064 045 Old_age Always - 27 (Min/Max 23/35)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 8
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 697
194 Temperature_Celsius 0x0022 027 040 000 Old_age Always - 27 (0 21 0 0 0)
195 Hardware_ECC_Recovered 0x001a 022 008 000 Old_age Always - 184059370
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
# smartctl -a /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.0-30-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Constellation ES.3
Device Model: ST1000NM0033-9ZM173
Serial Number: Z1W3147V
LU WWN Device Id: 5 000c50 079d9280f
Firmware Version: SN04
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Apr 6 09:07:05 2017 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 592) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 122) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 083 063 044 Pre-fail Always - 227693476
3 Spin_Up_Time 0x0003 097 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 14
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 091 060 030 Pre-fail Always - 1468863861
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 16484
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 14
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 075 063 045 Old_age Always - 25 (Min/Max 23/35)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 8
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 696
194 Temperature_Celsius 0x0022 025 040 000 Old_age Always - 25 (0 21 0 0 0)
195 Hardware_ECC_Recovered 0x001a 023 005 000 Old_age Always - 227693476
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay
The same situation occurred on this server in December 2016, and at that time there were no RAID/disk errors either.
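For anyone who wants to repeat the SMART check without reading the tables by eye, here is a minimal Python sketch that parses the attribute table printed by `smartctl -a`. It is a sketch under stated assumptions rather than a definitive health test. The set of watched attributes is my own choice, and the helper names `parse_attributes` and `findings` are made up for illustration. Raw values such as Raw_Read_Error_Rate and Seek_Error_Rate are deliberately compared only through their normalized VALUE/THRESH columns, because on these drives the raw numbers are large and do not look like simple error counts.

```python
# Attributes whose raw value is normally a plain event counter, so a
# non-zero value deserves a closer look. This selection is an assumption.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def parse_attributes(smartctl_text):
    """Map attribute name -> (value, worst, thresh, raw) from smartctl output."""
    attrs = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID, then name, then a hex flag.
        if len(fields) < 10 or not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        try:
            value, worst, thresh = (int(f) for f in fields[3:6])
            raw = int(fields[9])
        except ValueError:
            continue  # skip rows with non-numeric columns
        attrs[fields[1]] = (value, worst, thresh, raw)
    return attrs


def findings(attrs):
    """Return human-readable notes for attributes that look suspicious."""
    problems = []
    for name, (value, worst, thresh, raw) in sorted(attrs.items()):
        if thresh > 0 and value <= thresh:
            problems.append(f"{name}: normalized value {value} at or below threshold {thresh}")
        if name in WATCHED and raw != 0:
            problems.append(f"{name}: raw count {raw}")
    return problems


# A subset of the /dev/sda rows from this post.
SDA_SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 082 063 044 Pre-fail Always - 184059370
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 091 060 030 Pre-fail Always - 1469655243
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 073 064 045 Old_age Always - 27 (Min/Max 23/35)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

print(findings(parse_attributes(SDA_SAMPLE)) or "no findings")
```

An empty result only means that none of these particular counters moved. It does not replace a self-test, and the self-test log above shows that none has been run on either disk yet.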