depart | Continuation of the previous message: Help, please...
Last night one of my KS servers sort of crashed (no more web services).
I connected over ssh, which worked, and found nothing unusual. There is free space on the partitions.
Restarting apache/php did nothing (they refused to start).
Debugging that didn't tell me much, so I figured I'd run an apt update and reboot... but even apt refused (from memory, it said the system was read-only). Fine, I thought, a good reboot should sort it out.
And that's where it all went wrong: nothing at all.
The server answers ping, but there is no ssh.
A hard reboot didn't help.
I'm now in rescue mode, but I don't really see what I can do.
I mounted the system partition (sda1 on /mnt) and went through the logs. Everything stops at 23:59 (the web side of the server crashed just after midnight, but it's surprising to find nothing, since I did manage to log in over ssh this morning).
- in fstab everything is normal (strictly identical to my other KS servers)
- I tried disabling ufw (although it looks normal) and rebooting in normal mode -> no better
In the other logs (dmesg, kern, ...) I see nothing unusual. Above all, there is nothing from my later reboot attempts (the hard reboot ends with the server answering ping, so the OS does start at least partly... yet nothing in the logs). There are no log entries from July 11, for example.
What should I do? Any suggestions?
And of course, of the 3 servers I have, it's the most important one, with plenty of clients relying on it...
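Side note on the read-only error mentioned above: on Linux a filesystem can end up remounted read-only after I/O errors (for ext4 this is what the `errors=remount-ro` mount option does). Whether that is what happened here is only an assumption on my part. A minimal Python sketch to list read-only mounts from the contents of /proc/mounts; the function name and the sample text are made up for illustration:

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    mounts_text is expected in /proc/mounts format: one mount per line,
    with fields: device, mount point, fs type, options, dump, pass.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        options = fields[3].split(",")
        # Exact match on 'ro': 'errors=remount-ro' alone does not count.
        if "ro" in options:
            result.append(fields[1])
    return result


# Hypothetical sample, not taken from the server in question.
sample = (
    "/dev/sda1 / ext4 ro,relatime,errors=remount-ro 0 0\n"
    "/dev/sda2 /home ext4 rw,relatime 0 0\n"
)
print(read_only_mounts(sample))
```

On a live system the input would be `open("/proc/mounts").read()` instead of the sample string.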
Smartctl output for the disk:
Spoiler:
root@rescue-customer-eu (ns396063.ip-176-31-121.eu) /mnt/var/log # smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.10.18-mod-std] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: HGST Ultrastar 7K6000
Device Model: HGST HUS726020AAA610
Serial Number: N4G3ZCHUT1AYT32
LU WWN Device Id: 5 000cca 245c2afd3
Firmware Version: A5GNT920
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Mon Jul 11 09:20:23 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 113) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 288) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 135 135 054 Pre-fail Offline - 112
3 Spin_Up_Time 0x0007 136 136 024 Pre-fail Always - 218 (Average 220)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 42
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 950
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 128 128 020 Pre-fail Offline - 18
9 Power_On_Hours 0x0012 094 094 000 Old_age Always - 43212
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 42
192 Power-Off_Retract_Count 0x0032 098 098 000 Old_age Always - 2442
193 Load_Cycle_Count 0x0012 098 098 000 Old_age Always - 2442
194 Temperature_Celsius 0x0002 095 095 000 Old_age Always - 63 (Min/Max 10/66)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 950
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 12
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 50
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 104 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 104 occurred at disk power-on lifetime: 43201 hours (1800 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 43 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 f0 b0 37 50 40 08 2d+04:11:28.791 READ FPDMA QUEUED
60 90 08 10 6f 79 40 08 2d+04:11:26.027 READ FPDMA QUEUED
61 08 00 00 08 10 40 08 2d+04:11:26.026 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 08 2d+04:11:26.020 FLUSH CACHE EXT
ec 00 01 00 00 00 00 08 2d+04:11:26.020 IDENTIFY DEVICE
Error 103 occurred at disk power-on lifetime: 43201 hours (1800 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 43 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 20 b0 37 50 40 08 2d+04:11:25.658 READ FPDMA QUEUED
60 20 30 08 d5 1a 40 08 2d+04:11:23.004 READ FPDMA QUEUED
60 20 28 00 34 7d 40 08 2d+04:11:23.003 READ FPDMA QUEUED
60 08 18 00 38 50 40 08 2d+04:11:22.905 READ FPDMA QUEUED
60 08 10 f8 37 50 40 08 2d+04:11:22.898 READ FPDMA QUEUED
Error 102 occurred at disk power-on lifetime: 43177 hours (1799 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 43 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 90 e0 d5 ac 40 08 1d+04:11:36.359 READ FPDMA QUEUED
61 10 98 e8 88 86 40 08 1d+04:11:33.595 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 08 1d+04:11:33.594 FLUSH CACHE EXT
47 00 01 12 00 00 a0 08 1d+04:11:33.592 READ LOG DMA EXT
47 00 01 00 00 00 a0 08 1d+04:11:33.591 READ LOG DMA EXT
Error 101 occurred at disk power-on lifetime: 43177 hours (1799 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 43 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 60 e0 d5 ac 40 08 1d+04:11:33.426 READ FPDMA QUEUED
61 08 58 e8 25 17 40 08 1d+04:11:30.656 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 08 1d+04:11:30.656 FLUSH CACHE EXT
ea 00 00 00 00 00 a0 08 1d+04:11:30.640 FLUSH CACHE EXT
61 c0 28 28 25 17 40 08 1d+04:11:30.614 WRITE FPDMA QUEUED
Error 100 occurred at disk power-on lifetime: 43177 hours (1799 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 43 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 c0 f0 d4 ac 40 08 1d+04:11:30.409 READ FPDMA QUEUED
60 68 c8 80 d6 ac 40 08 1d+04:11:27.645 READ FPDMA QUEUED
60 18 b8 d8 c9 ac 40 08 1d+04:11:27.645 READ FPDMA QUEUED
ea 00 00 00 00 00 a0 08 1d+04:11:27.641 FLUSH CACHE EXT
60 20 90 d0 d4 ac 40 08 1d+04:11:27.627 READ FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2486 -
# 2 Short offline Completed without error 00% 2483 -
# 3 Short offline Completed without error 00% 2483 -
# 4 Short offline Completed without error 00% 2442 -
# 5 Short offline Completed without error 00% 2434 -
# 6 Short offline Completed without error 00% 242 -
# 7 Short offline Completed without error 00% 234 -
# 8 Short offline Completed without error 00% 234 -
# 9 Short offline Completed without error 00% 229 -
#10 Short offline Completed without error 00% 229 -
#11 Short offline Completed without error 00% 226 -
#12 Short offline Completed without error 00% 226 -
#13 Short offline Completed without error 00% 136 -
#14 Short offline Completed without error 00% 26 -
#15 Short offline Completed without error 00% 18 -
#16 Short offline Completed without error 00% 1 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
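To make the dump above easier to scan, here is a rough Python sketch that pulls the sector-related attributes out of the `smartctl -a` attribute table and reports those with a non-zero raw value. The set of IDs (5, 196, 197, 198) is my own choice of attributes often watched for failing sectors, not an official list, and the function name is hypothetical:

```python
# Attribute IDs to watch (an assumption, not an official list).
WATCHED_IDS = {5, 196, 197, 198}


def flag_attributes(smartctl_text, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for watched IDs with raw value > 0."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) not in watched:
            continue
        raw = fields[9]  # RAW_VALUE column
        if raw.isdigit() and int(raw) > 0:
            flagged[fields[1]] = int(raw)
    return flagged


# Rows copied from the attribute table above.
sample = """\
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 950
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 950
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 12
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 50
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
"""
for name, raw in flag_attributes(sample).items():
    print(name, raw)
```

On these rows it reports the four watched attributes, all non-zero, even though the overall self-assessment line says PASSED.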
Message edited by depart on 11-07-2022 at 11:22:21