“Fixing” bad blocks on a SAS HDD by reallocating them

Short story: one of the HDDs in my remote server failed a long SMART self-test. Usually this means it is time to replace the disk, but in this particular case immediate replacement was not possible. I had to do something to keep the bad blocks from affecting server performance.

smartctl is your best friend for investigating disk problems. Here is what we have:

 root@host:~# smartctl -x /dev/sdx
 -- snip --
 SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       7    8766        1312560352 [0x3 0x5d 0x1]
-- snip --
Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 8767:54 [526074 minutes]
    Number of background scans performed: 2096,  scan progress: 0.00%
    Number of background medium scans performed: 2096

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 8528:47  000000004e3c14ff  [3,11,0]   Require Write or Reassign Blocks command
   2 8528:46  000000004e3c14fe  [3,11,0]   Require Write or Reassign Blocks command
   3 8528:46  000000004e3c14fd  [3,11,0]   Require Write or Reassign Blocks command
   --- snip ---
   31 8528:45  000000004e3c14e0  [3,11,0]   Require Write or Reassign Blocks command

Clearly, something is wrong with the block at LBA 1312560352.
Let’s confirm it by attempting to read that sector.

root@host:~# dd of=/dev/null if=/dev/sdx skip=1312560352 count=1 bs=512
dd: error reading '/dev/sdx': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 5.66903 s, 0.0 kB/s

The usual way to “fix” “Offline Uncorrectable” errors on a SATA disk is to write something to the affected sector so the drive can reallocate that sector. Let’s give it a go. There is nothing to lose; the data is gone anyway.

root@host:~# dd if=/dev/zero of=/dev/sdx seek=1312560352 count=1 bs=512
dd: error writing '/dev/sdx': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 2.83703 s, 0.0 kB/s

Nope, not working. :-( It would probably work if I could overwrite all the bad sectors in one go, but I haven’t figured out how to do that with dd. Using a larger block size didn’t help.

Just a couple of comments on the dd parameters.
bs=512 - 512 is specified because this particular disk uses 512-byte logical sectors. If the drive used 4k logical sectors, bs=4096 would be needed instead.
skip=1312560352 and seek=1312560352 - note that reading and writing use different parameters: skip offsets the input (reading), while seek offsets the output (writing). The if and of arguments are swapped between the two commands as well. The value 1312560352 comes from the LBA_first_err column in the SMART Self-test log. You can also obtain it from the Background scan results log by converting a value in the lba(hex) column to decimal.

root@host:~# printf "%d\n" 0x000000004e3c14e0
1312560352
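
To convert every pending entry at once rather than one value at a time, the same printf conversion can be wrapped in a small helper. This is only a sketch: the function name pending_lbas is made up for illustration, and it assumes the log lines look exactly like the smartctl output shown above (hex LBA in the third column).

```shell
# Print the decimal LBA of every background-scan entry that still
# needs a write or reassign. Reads `smartctl -x` output on stdin.
# (pending_lbas is a hypothetical helper name, not part of smartmontools.)
pending_lbas() {
    awk '/Require Write or Reassign Blocks command/ { print $3 }' |
    while read -r hex; do
        # printf understands the 0x prefix and prints the value in decimal
        printf '%d\n' "0x$hex"
    done
}

# Example use:
#   smartctl -x /dev/sdx | pending_lbas
```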

Okay, let’s bring in the heavy artillery: the sg_reassign tool from the sg3-utils package.
It is easy to use. Let’s tell the drive to reassign one of the failed sectors.

root@host:~# sg_reassign --address=1312560352 /dev/sdx

It looks different now.

root@host:~# smartctl -x /dev/sdx | grep 000000004e3c14e0
  31 8528:45  000000004e3c14e0  [3,11,0]   Reassigned by app, has valid data

Looks better. Let’s do the same for all the other failed sectors by generating the commands and piping them to bash:

root@host:~# smartctl -x /dev/sdx | awk '/Require Write or Reassign Blocks command/ { print "sg_reassign -a 0x" $3 " /dev/sdx" }' | bash

smartctl -x /dev/sdx now shows that all sectors have been reassigned.
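
Piping generated commands straight into bash leaves no chance to review them first. A more cautious variant, sketched below, only prints the sg_reassign commands so they can be inspected before being run. The function name reassign_cmds is made up for illustration, and it assumes the same log layout as above.

```shell
# Build one sg_reassign command per pending LBA.
# Reads `smartctl -x` output on stdin and takes the device as $1.
# Prints the commands instead of executing them (dry run).
# (reassign_cmds is a hypothetical helper name.)
reassign_cmds() {
    awk -v dev="$1" '/Require Write or Reassign Blocks command/ {
        printf "sg_reassign --address=0x%s %s\n", $3, dev
    }'
}

# Review first:
#   smartctl -x /dev/sdx | reassign_cmds /dev/sdx
# Then execute once the list looks right:
#   smartctl -x /dev/sdx | reassign_cmds /dev/sdx | sh
```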

Let’s try reading from and writing to the same sector as before.

root@host:~# dd of=/dev/null if=/dev/sdx skip=1312560352 count=1 bs=512
1+0 records in
1+0 records out
512 bytes copied, 0.0454159 s, 11.3 kB/s
root@host:~# dd if=/dev/zero of=/dev/sdx seek=1312560352 count=1 bs=512
1+0 records in
1+0 records out
512 bytes copied, 0.0463502 s, 11.0 kB/s

It works now; the drive is “fixed”. The bad sectors are still physically present, but their LBAs have been remapped to spare sectors, so the drive no longer tries to read or write the damaged ones.
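
Checking a single sector by hand is fine, but the scan log listed a whole run of them. The read test can be repeated over the range with a small helper like the one below. This is a sketch only: check_lba is a made-up name, and the default block size of 512 matches this particular drive, not drives in general.

```shell
# Succeed if exactly one logical block can be read at the given LBA.
# Arguments: device (or file), LBA, logical block size (default 512).
# The body runs in a subshell so its variables do not leak.
# (check_lba is a hypothetical helper name.)
check_lba() (
    bs=${3:-512}
    # Count the bytes dd actually delivered; a failed or short read gives < bs
    n=$(( $(dd if="$1" skip="$2" count=1 bs="$bs" 2>/dev/null | wc -c) ))
    [ "$n" -eq "$bs" ]
)

# Example: re-check the LBA range from the scan log
# (0x4e3c14e0 to 0x4e3c14ff):
#   for lba in $(seq 1312560352 1312560383); do
#       check_lba /dev/sdx "$lba" || echo "LBA $lba is still unreadable"
#   done
```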

One last thing: run a ZFS scrub so the pool can repair, from its redundant copies, any data that was stored in these sectors.

root@host:~# zpool scrub tank