[Linux] Repairing bad sectors with smartmontools
Stage 1: Check the disk
1) First, run a full (extended) self-test on the disk you want to check
ls:~# smartctl -t long /dev/hda
Home page is http://smartmontools.sourceforge.net/
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 96 minutes for test to complete.
Test will complete after Thu Nov 6 18:26:57 2008
Use smartctl -X to abort test.
2) When the test has finished, use the -l selftest option to view the self-test results
ls:~# smartctl -l selftest /dev/hda
smartctl version 5.36 [mipsel-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 7877 2086648
# 2 Extended offline Completed: read failure 90% 7455 2086648
The results show that the disk has a read error (Completed: read failure), with the first error at LBA 2086648.
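If you would rather not copy the LBA by hand, it can be pulled out of the log with a little awk. This is only a minimal sketch: the function name first_error_lba is made up for illustration, and it assumes the log layout shown above, where the failing LBA is the last field of a "read failure" line.

```shell
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and prints
# the LBA_of_first_error of the first (most recent) entry with a read failure.
first_error_lba() {
    awk '/^# *[0-9]+ / && /read failure/ { print $NF; exit }'
}

# Sample lines copied from the log above.
sample='# 1 Extended offline Completed: read failure 90% 7877 2086648
# 2 Extended offline Completed: read failure 90% 7455 2086648'

printf '%s\n' "$sample" | first_error_lba
```

On a real system the input would come from something like `smartctl -l selftest /dev/hda | first_error_lba`.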
3) Use the -A option to check the disk's SMART attributes for errors
ls:~# smartctl -A /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
...
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
Attribute 197, Current_Pending_Sector, has a raw value of 1: one sector is waiting to be remapped, so the disk really does have a problem.
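The same check can be scripted. The sketch below assumes the attribute table layout shown above (ID in the first column, raw value in the last); pending_sectors is a hypothetical name, not part of smartmontools.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# raw value of attribute 197 (Current_Pending_Sector).
pending_sectors() {
    awk '$1 == 197 { print $NF; exit }'
}

# Sample lines copied from the attribute table above.
attrs='ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0'

printf '%s\n' "$attrs" | pending_sectors
```

A non-zero result means at least one sector is still pending.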
Stage 2: Repair the bad sector
1) First, use fdisk to list the partition layout of the problem disk in sector units
ls:~# fdisk -lu /dev/hda
Disk /dev/hda: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders, total 625142448 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 63 787184 393561 83 Linux
/dev/hda2 787185 1799279 506047+ 82 Linux swap / Solaris
/dev/hda3 1799280 625137344 311669032+ 83 Linux
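From the self-test results above, we know the problem is at LBA 2086648.
That LBA lies inside /dev/hda3 (sectors 1799280 to 625137344), and 2086648 - 1799280 = 287368, so the bad sector is at sector offset 287368 within /dev/hda3.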
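Finding the partition that contains an LBA is easy to get wrong by eye, so here is a rough sketch that does the lookup from `fdisk -lu` output. The function name locate_lba is hypothetical, and the parsing assumes the old fdisk column layout shown above (device, optional "*" boot flag, start, end).

```shell
# Hypothetical helper: reads `fdisk -lu` output on stdin, takes the LBA as $1,
# and prints "<partition> <start sector> <offset inside the partition>".
locate_lba() {
    awk -v lba="$1" '
        /^\/dev\// {
            # Skip the optional "*" in the Boot column.
            i = ($2 == "*") ? 3 : 2
            first = $i + 0
            last = $(i + 1) + 0
            if (lba + 0 >= first && lba + 0 <= last) {
                printf "%s %.0f %.0f\n", $1, first, lba - first
                exit
            }
        }'
}

# Partition table copied from the fdisk output above.
layout='/dev/hda1 63 787184 393561 83 Linux
/dev/hda2 787185 1799279 506047+ 82 Linux swap / Solaris
/dev/hda3 1799280 625137344 311669032+ 83 Linux'

printf '%s\n' "$layout" | locate_lba 2086648
```

For the table above this reports /dev/hda3 with offset 287368, matching the manual subtraction.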
2) Use tune2fs to find the block size of the file system on that partition
ls:~# tune2fs -l /dev/hda3 |grep -i block
Block count: 77917258
Block size: 4096
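If the block size is needed in a script, it can be extracted from the tune2fs listing. A small sketch, assuming the "Block size:" label shown above; fs_block_size is a made-up helper name.

```shell
# Hypothetical helper: reads `tune2fs -l` output on stdin and prints the
# file system block size in bytes.
fs_block_size() {
    awk -F: '/^Block size/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Sample lines copied from the tune2fs output above.
info='Block count: 77917258
Block size: 4096'

printf '%s\n' "$info" | fs_block_size
```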
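3) Next, work out which file system block contains the bad LBA.
The formula is: b = (int)((L-S)*512/B)
b = file system block number
B = file system block size (in bytes)
L = LBA of the bad sector
S = starting sector of the partition, as shown by fdisk -lu
(int) = take the integer part
So we get: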
b = (int)(287368*512/4096) = (int)(35921.0) = 35921
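The formula maps directly onto shell integer arithmetic, which already truncates toward zero. A minimal sketch; lba_to_fs_block is a hypothetical name, and the 512-byte sector size is the one reported by fdisk above.

```shell
# Hypothetical helper implementing b = (int)((L-S)*512/B).
# $1 = L (bad LBA), $2 = S (partition start sector), $3 = B (block size in bytes)
lba_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Values from this article: L=2086648, S=1799280, B=4096.
lba_to_fs_block 2086648 1799280 4096   # prints 35921
```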
4) Use debugfs to find the affected file
ls:~# debugfs
debugfs 1.40-WIP (14-Nov-2006)
debugfs: open /dev/hda3
debugfs: icheck 35921
Block Inode number
35921 28966966
debugfs: ncheck 28966966
Inode Pathname
28966966 /var/log/cpu-usage-exec.log
The affected file is /var/log/cpu-usage-exec.log.
5) Then use dd to fill that block with zeros (the part of the file stored in that block is lost)
ls:~# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=35921
4096 bytes (4.1 kB) copied, 0.000452 seconds, 9.1 MB/s
ls:~# sync
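Because a dd with the wrong device or seek value overwrites good data, it may help to print the command first and compare it with the numbers worked out earlier (the bad block is block 35921 of /dev/hda3, block size 4096). The sketch below only prints the command and never runs it; zero_block_cmd is a hypothetical name.

```shell
# Hypothetical dry-run helper: prints the dd command that would zero one
# file system block, without executing it.
# $1 = partition device, $2 = block size in bytes, $3 = block number
zero_block_cmd() {
    printf 'dd if=/dev/zero of=%s bs=%s count=1 seek=%s\n' "$1" "$2" "$3"
}

zero_block_cmd /dev/hda3 4096 35921
```

Once the printed line looks right, it can be typed in by hand as root, followed by sync.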
Stage 3: Check again
1) Use smartctl -A to check the SMART attribute that had the problem
ls:~# smartctl -A /dev/hda
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART attribute 197 has dropped to 0, so it is back to normal.
2) Run the full self-test once more
ls:~# smartctl -t long /dev/hda
.... wait for the test to complete ....
ls:~# smartctl -l selftest /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 7896 -
# 2 Extended offline Completed: read failure 90% 7877 2086648
# 3 Extended offline Completed: read failure 90% 7455 2086648
Test #1 completed without error, so the problem is solved.
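For unattended use, the final check can be turned into an exit status. This is a sketch under the assumption that the newest entry is always numbered 1 and that a clean run reads "Completed without error", as in the log above; latest_selftest_ok is a made-up name.

```shell
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and
# succeeds only if entry # 1 (the most recent test) completed without error.
latest_selftest_ok() {
    awk '/^# *1 / { found = 1; ok = ($0 ~ /Completed without error/); exit }
         END { exit !(found && ok) }'
}

# Log lines copied from the output above.
after='# 1 Extended offline Completed without error 00% 7896 -
# 2 Extended offline Completed: read failure 90% 7877 2086648
# 3 Extended offline Completed: read failure 90% 7455 2086648'

if printf '%s\n' "$after" | latest_selftest_ok; then
    echo "latest self-test passed"
else
    echo "latest self-test did not pass"
fi
```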
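Afterword
1) debugfs is slow
2) This post is just a translation and a hands-on run-through; the original material is in the smartmontools BadBlockHowTo
3) For a sector that is physically damaged, I suspect this still won't help... probably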