Short story: one of the HDDs on my remote server failed a long SMART test. Usually this means it's time to replace the disk, but in this particular case immediate replacement was not possible. I had to do something to keep the bad blocks from affecting server performance.
smartctl is your best friend when investigating problems with disks.
Here is what we have:
root@host:~# smartctl -x /dev/sdx
-- snip --
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Failed in segment --> 7 8766 1312560352 [0x3 0x5d 0x1]
--snip---
Background scan results log
Status: waiting until BMS interval timer expires
Accumulated power on time, hours:minutes 8767:54 [526074 minutes]
Number of background scans performed: 2096, scan progress: 0.00%
Number of background medium scans performed: 2096
# when lba(hex) [sk,asc,ascq] reassign_status
1 8528:47 000000004e3c14ff [3,11,0] Require Write or Reassign Blocks command
2 8528:46 000000004e3c14fe [3,11,0] Require Write or Reassign Blocks command
3 8528:46 000000004e3c14fd [3,11,0] Require Write or Reassign Blocks command
--- snip ---
31 8528:45 000000004e3c14e0 [3,11,0] Require Write or Reassign Blocks command
Clearly, there’s something wrong with the block in position LBA 1312560352.
Let’s confirm it by attempting to read that sector.
root@host:~# dd of=/dev/null if=/dev/sdx skip=1312560352 count=1 bs=512
dd: error reading '/dev/sdx': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 5.66903 s, 0.0 kB/s
The usual way to “fix” “Offline Uncorrectable” errors on a SATA disk is to write something to the affected sector so the drive can reallocate that sector. Let’s give it a go. There is nothing to lose; the data is gone anyway.
root@host:~# dd if=/dev/zero of=/dev/sdx seek=1312560352 count=1 bs=512
dd: error writing '/dev/sdx': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 2.83703 s, 0.0 kB/s
Nope, not working. :-( It would probably work if I could overwrite all the bad sectors in one go, but I haven't managed to figure out how to do that with dd. Using a larger block size didn't help.
Just a couple of comments on the dd parameters.
bs=512 - 512 is specified because this particular disk uses 512-byte logical sectors. If the drive used 4k logical sectors, bs=4096 should be used instead.
skip=1312560352 and seek=1312560352 - please note that reading and writing use different parameters: skip is an offset into the input and seek is an offset into the output, both counted in blocks of bs bytes. The if and of arguments are also swapped between the two commands.
1312560352 comes from the LBA_first_err column in the SMART Self-test log. You can also obtain it from the Background scan results log by converting the values in the lba(hex) column to decimal.
root@host:~# printf "%d\n" 0x000000004e3c14e0
1312560352
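When several entries from the Background scan results log need converting, a small shell function saves retyping. This is only a minimal sketch: hex_to_dec is a hypothetical helper name, not part of smartmontools, and it assumes the shell's printf accepts hexadecimal constants (bash's builtin does).

```shell
# hex_to_dec is a hypothetical helper: prints the decimal value of each hex LBA given.
hex_to_dec() {
    for lba in "$@"; do
        # Strip an optional 0x prefix, then let printf parse the value as hexadecimal.
        printf '%d\n' "0x${lba#0x}"
    done
}

# Lowest and highest LBAs from the log excerpt above.
hex_to_dec 000000004e3c14e0 000000004e3c14ff
```

The values can be pasted exactly as they appear in the lba(hex) column; the leading zeros do not matter.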
Okay, let's bring out the heavy artillery: the sg_reassign tool from the sg3-utils package.
It is easy to use. Let’s tell the drive to reassign one of the failed sectors.
root@host:~# sg_reassign --address=1312560352 /dev/sdx
It looks different now.
root@host:~# smartctl -x /dev/sdx | grep 000000004e3c14e0
31 8528:45 000000004e3c14e0 [3,11,0] Reassigned by app, has valid data
Looks better. Let's do the same for all the other failed sectors by generating the commands and piping them to bash.
root@host:~# smartctl -x /dev/sdx | grep "Require Write or Reassign Blocks command" | awk '{ print("sg_reassign -a 0x"$3" /dev/sdx") }' | bash
smartctl -x /dev/sdx now shows that all sectors have been reassigned.
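Piping generated commands straight into bash is quick, but it gives no chance to review what will run. A more cautious variant, sketched here under the assumption that the log lines look like the excerpt above, prints the commands first. list_pending_lbas is a hypothetical helper name, and the sample text stands in for real smartctl output.

```shell
# list_pending_lbas is a hypothetical helper: reads `smartctl -x` output on stdin
# and prints the hex LBA of every entry still waiting for a write or reassign.
list_pending_lbas() {
    awk '/Require Write or Reassign Blocks command/ { print "0x" $3 }'
}

# Dry run against a saved excerpt: print the commands instead of executing them.
sample='    1  8528:47  000000004e3c14ff  [3,11,0]  Require Write or Reassign Blocks command
   31  8528:45  000000004e3c14e0  [3,11,0]  Reassigned by app, has valid data'

printf '%s\n' "$sample" | list_pending_lbas | while read -r lba; do
    echo "sg_reassign --address=$lba /dev/sdx"
done
```

Once the printed list looks right, replacing the sample with the live smartctl -x output and dropping the echo would run the same commands as the one-liner.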
Let's try reading and writing in the same area as before.
root@host:~# dd of=/dev/null if=/dev/sdx skip=1312560353 count=1 bs=512
1+0 records in
1+0 records out
512 bytes copied, 0.0454159 s, 11.3 kB/s
root@host:~# dd if=/dev/zero of=/dev/sdx seek=1312560352 count=1 bs=512
1+0 records in
1+0 records out
512 bytes copied, 0.0463502 s, 11.0 kB/s
It works now; the drive is “fixed.” The bad sectors are still present, but the drive won’t attempt to read or write to them.
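The dd checks above touch only a sector or two. To re-read a whole span of LBAs, a loop along these lines could be used. It is a rough sketch: count_unreadable is a hypothetical helper name, it assumes 512-byte logical sectors as on this disk, and it assumes the affected sectors lie between the lowest and highest LBA listed in the log.

```shell
# count_unreadable is a hypothetical helper: tries to read each sector in
# [first, last] from the given device or file and prints how many reads failed.
count_unreadable() {
    dev=$1 first=$2 last=$3 failed=0
    lba=$first
    while [ "$lba" -le "$last" ]; do
        # Read exactly one 512-byte sector; dd exits non-zero on an I/O error.
        if ! dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
            failed=$((failed + 1))
        fi
        lba=$((lba + 1))
    done
    echo "$failed"
}

# Example (left commented out; it needs root and the real device):
# count_unreadable /dev/sdx 1312560352 1312560383
```

A result of 0 would mean every sector in the span could be read. Reads may be served from the page cache, so this is a sanity check rather than proof of the media's health.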
One last thing: run a ZFS scrub to restore any data that may have been in these sectors.
root@host:~# zpool scrub tank