IDE Seek Errors after kernel upgrade

Josip Loncaric josip at icase.edu
Mon Jun 18 12:42:04 EDT 2001


Mark Hahn wrote:
>
> > It is possible that SeekComplete errors are due to some difficulty that
> > the drive has in tracking the servo signal in a few spots.  Not
> > accessing those spots gets around the problem.
> 
> no.  badcrc's have nothing to do with any disk operation -
> they're strictly a cable/mode/noise problem.

The original post was mine.  BadCRC errors and SeekComplete errors are
NOT related.  They happened at different times (and also on different
nodes); I only listed them together to save space.  Knowing that
cable/mode/noise problems cause BadCRC errors does not say anything
about SeekComplete errors, which are probably due to servo tracking
problems.

Today's drives have extremely high track density, so servo tracking
requires very high precision.  The largest source of tracking errors is
runout (deviation from the ideal track shape).  The servo control
algorithm estimates the repeatable runout and compensates using a
feedforward signal.  Some less-than-ideal designs estimate the
compensation parameters only at power-up, so if such a drive is on for
months at a time, its mechanical parameters could drift away from the
estimates.  Unlike SCSI drives, most IDE drives are designed for light
duty (e.g. being on only 11 hours/day).  Using them 24 hours/day, 365
days/year can create mechanical problems faster than the manufacturer
expected.
As the drive's ball bearings wear, non-repeatable runout (NRRO) can
become an insurmountable problem for the servo tracking algorithm.  For
this reason (and to reduce noise and cost) some recent IDE drives use
fluid dynamic bearings, which are expected to reduce NRRO by an order of
magnitude.
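
To make the feedforward idea concrete, here is a toy Python sketch.  It
is not taken from any drive firmware: the function names, the 64 servo
samples per revolution, the sinusoidal runout shape and all amplitudes
are assumptions chosen for illustration.  It estimates repeatable runout
by averaging the position error per servo sample over many revolutions,
then compares the tracking residual when that estimate is fresh and when
it is stale because the runout has drifted.

```python
import math
import random


def simulate_pes(revs, sectors, rro_amp, rro_phase, nrro_sigma, seed):
    """Position error samples, one list per revolution (arbitrary units)."""
    rng = random.Random(seed)
    pes = []
    for _ in range(revs):
        rev = []
        for k in range(sectors):
            angle = 2.0 * math.pi * k / sectors
            rro = rro_amp * math.sin(angle + rro_phase)  # same every revolution
            nrro = rng.gauss(0.0, nrro_sigma)            # differs every revolution
            rev.append(rro + nrro)
        pes.append(rev)
    return pes


def estimate_rro(pes):
    """Average each servo sample across revolutions; NRRO averages toward zero."""
    revs = len(pes)
    return [sum(rev[k] for rev in pes) / revs for k in range(len(pes[0]))]


def residual_rms(pes, feedforward):
    """RMS tracking error left after subtracting the feedforward estimate."""
    errors = [rev[k] - feedforward[k] for rev in pes for k in range(len(rev))]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical calibration at power-up.
powerup = simulate_pes(50, 64, 1.0, 0.0, 0.05, seed=1)
ff = estimate_rro(powerup)

# Hypothetical state months later: the runout amplitude and phase have drifted.
later = simulate_pes(50, 64, 1.3, 0.2, 0.05, seed=2)

print("no compensation:", residual_rms(powerup, [0.0] * 64))
print("fresh estimate: ", residual_rms(powerup, ff))
print("stale estimate: ", residual_rms(later, ff))
print("re-estimated:   ", residual_rms(later, estimate_rro(later)))
```

In this model the fresh estimate leaves only the non-repeatable part,
while the stale estimate leaves a much larger residual.  Averaging
cannot remove NRRO at all, which is why larger NRRO from worn bearings
is a problem that no amount of recalibration fixes.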

A few comments regarding hard disk reliability, the way I understand it:

(1) Embedded servo signals are written at the factory using high
precision machines.  This process cannot be duplicated by the drive
itself.

(2) Some checking is done and a factory list of bad blocks is
generated.  If the drive is within tolerance, it is shipped.

(3) Today's IDE drives can map out a small number of bad blocks
automatically.  Once the number of bad blocks exceeds the drive's spare
capacity, the OS will start to see them.

(4) When bad blocks (or SeekComplete errors) are found, you have three
choices:
    (i)   map them out using Linux 'e2fsck -c ...' or 'mkswap -c ...'
    (ii)  if you have IBM drives, use IBM's Disk Fitness Test to check
          the drive, map out bad blocks and zero the disk.  Afterwards,
          the drive can continue to map out bad blocks as they develop,
          hiding them from the OS for a while.
    (iii) if neither (i) nor (ii) provides a long-term fix, replace the
          drive

(5) When you return a drive under warranty, you'll get a remanufactured
replacement drive.  "Remanufactured" probably means that it was
subjected to some testing at the factory, had its factory list of bad
blocks updated, and if it tested within tolerance, was shipped.  This
process is similar to what IBM's Disk Fitness Test does, so the
replacement drives have a similar chance of being bad.  A bad drive may
need to be replaced several times before a good drive is found.

(6) Finally, the drive(s) might be OK and the problem may lie
elsewhere.  If a kernel upgrade degraded drive reliability, most likely
the problem is in software, not hardware.
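
As a rough illustration of point (3), the Python sketch below models a
drive with a fixed pool of spare sectors.  The class, its method names
and the pool size are hypothetical, and real drives differ in detail
(for example, in when a sector actually gets remapped).  The only thing
it is meant to show is that bad blocks stay hidden until the spares run
out, after which the OS sees the errors.

```python
class RemappingDrive:
    """Toy model of automatic bad-block remapping (hypothetical, simplified)."""

    def __init__(self, spare_sectors):
        self.spares_left = spare_sectors
        self.remapped = {}        # bad block number -> index of spare used
        self.visible_bad = set()  # bad blocks the drive could not hide

    def mark_bad(self, block):
        """Record that a block has gone bad; remap it if a spare is left."""
        if block in self.remapped or block in self.visible_bad:
            return
        if self.spares_left > 0:
            self.remapped[block] = len(self.remapped)
            self.spares_left -= 1
        else:
            self.visible_bad.add(block)

    def read(self, block):
        """Return where the data came from, or fail as the OS would see it."""
        if block in self.visible_bad:
            raise IOError("unrecoverable error at block %d" % block)
        return "spare" if block in self.remapped else "original"


drive = RemappingDrive(spare_sectors=2)  # pool size chosen only for the demo
for bad_block in (10, 20, 30):
    drive.mark_bad(bad_block)

for block in (5, 10, 20, 30):
    try:
        print(block, drive.read(block))
    except IOError as err:
        print(block, "error:", err)
```

With two spares and three bad blocks, the first two are served from
spares and the third is reported as an error.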

Sincerely,
Josip

P.S. http://www.storage.ibm.com/hardsoft/diskdrdl/library/technolo.htm

-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134
