WRITE_DMA errors on SATA drive under 5.3-RELEASE - FreeBSD

  1. #1

    WRITE_DMA errors on SATA drive under 5.3-RELEASE

    I've gotten two messages like the ones below today on my production server
    (5.3-RELEASE):

    messages:Feb 27 14:48:17 freebie kernel: ad10: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4848803
    messages:Feb 27 14:48:17 freebie kernel: ad10: FAILURE - WRITE_DMA timed out

    What do these messages mean? The referenced drive is one of two identical SATA
    drives on the server; it holds /tmp and /var. I don't recall seeing
    these messages before.

    Is there a way to work backwards from the LBA to the filesystem so that
    I can see which file was being referenced when this occurred?

    --
    Anthony


    Anthony Atkielski Guest

  2. #2

    Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

    On Sun, Feb 27, 2005 at 03:53:30PM +0100, Anthony Atkielski wrote:
    > messages:Feb 27 14:48:17 freebie kernel: ad10: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4848803
    > messages:Feb 27 14:48:17 freebie kernel: ad10: FAILURE - WRITE_DMA timed out
    [...]
    > Is there a way to work backwards from the LBA to the filesystem so that
    > I can see which file was being referenced when this occurred?
    Theoretically, one could use 'fsdb -r' in a scripted manner, to
    generate a mapping of file names to blocks (relative to the partition
    of the file system you are mapping). Once you have the blocks, you'll
    need to do some arithmetic to map those blocks to LBA address ranges
    (perhaps via GEOM or using data in disklabels). Finally, you'll have
    to locate the range for a particular LBA address and work backwards
    up to the inode #, and then to the filename(s) that link to that inode.
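The arithmetic step described above can be sketched roughly as follows. This is only an illustration, not an existing utility: the function name, the 512-byte sector size, and the 2048-byte fragment size are assumptions (the real values would have to be read from the disklabel and from the file system itself, for instance with dumpfs), and the partition start of 63 sectors in the usage line is made up.

```python
SECTOR_SIZE = 512  # bytes per LBA sector; assumed, check the drive


def lba_to_fragment(lba, part_start_lba, frag_size=2048, sector_size=SECTOR_SIZE):
    """Map an absolute LBA to a partition-relative fragment.

    Returns (fragment number, byte offset inside that fragment).
    part_start_lba is the first sector of the partition, taken from
    the slice table or disklabel. frag_size is the file system's
    fragment size in bytes (assumed 2048 here).
    """
    if lba < part_start_lba:
        raise ValueError("LBA lies before the start of the partition")
    byte_offset = (lba - part_start_lba) * sector_size
    return divmod(byte_offset, frag_size)


# Hypothetical usage: LBA from the kernel message, partition assumed
# to start at sector 63.
print(lba_to_fragment(4848803, 63))
```

The fragment number is what would then have to be matched against the block lists that a scripted fsdb run produces.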

    Perhaps there's already a system utility or port for this? It would be
    really useful!
    > Anthony
    Cheers,
    -cpghost.

    --
    Cordula's Web. http://www.cordula.ws/
    cpghost@cordula.ws Guest

  3. #3

    Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

    cpghost@cordula.ws writes:
    > Theoretically, one could use 'fsdb -r' in a scripted manner, to
    > generate a mapping of file names to blocks (relative to the partition
    > of the file system you are mapping). Once you have the blocks, you'll
    > need to do some arithmetic to map those blocks to LBA address ranges
    > (perhaps via GEOM or using data in disklabels). Finally, you'll have
    > to locate the range for a particular LBA address and work backwards
    > up to the inode #, and then to the filename(s) that link to that inode.
    Sounds complicated. Surely I'm not the first person to wish for such a
    utility ... in UNIXland, there seems to be a command for just about
    every conceivable purpose (?).
    > Perhaps there's already a system utility or port for this? It would be
    > really useful!
    I'm mainly worried about exactly what the system was trying to write at
    the time. It's not clear from the message whether the write succeeded
    or not.

    --
    Anthony


    Anthony Atkielski Guest

  4. #4

    Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

    On Sun, Feb 27, 2005 at 05:19:32PM +0100, Anthony Atkielski wrote:
    > cpghost@cordula.ws writes:
    >
    > > Theoretically, one could use 'fsdb -r' in a scripted manner, to
    > > generate a mapping of file names to blocks (relative to the partition
    > > of the file system you are mapping). Once you have the blocks, you'll
    > > need to do some arithmetic to map those blocks to LBA address ranges
    > > (perhaps via GEOM or using data in disklabels). Finally, you'll have
    > > to locate the range for a particular LBA address and work backwards
    > > up to the inode #, and then to the filename(s) that link to that inode.
    >
    > Sounds complicated. Surely I'm not the first person to wish for such a
    > utility ... in UNIXland, there seems to be a command for just about
    > every conceivable purpose (?).
    Or you could write the missing ones :-).

    Actually, it's not that hard. You need three mappings:

    1. (lba address, (filesystem, block #))
    2. ((filesystem, block #), (filesystem, inode #))
    3. ((filesystem, inode #), (list of filenames linking to inode #))

    Each of those mappings could be done and displayed by a single
    utility. Combining all three into a lba2filenames program would
    then be trivial.
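To make the "combining is trivial" point concrete, here is a minimal sketch of how the three mappings might be chained. Everything in it is hypothetical: lba2filenames does not exist as a tool, the function and parameter names are invented, and the lookup tables in the usage lines are toy data rather than anything read from a real disk.

```python
def lba_to_filenames(lba, lba_to_block, block_to_inode, inode_to_names):
    """Chain the three mappings: LBA -> block -> inode -> filenames.

    lba_to_block:   callable, lba -> (filesystem, block #)
    block_to_inode: dict, (filesystem, block #) -> inode #
    inode_to_names: dict, (filesystem, inode #) -> list of filenames
    """
    fs, block = lba_to_block(lba)
    inode = block_to_inode.get((fs, block))
    if inode is None:
        # Free space or file system metadata: no file owns this block.
        return []
    return sorted(inode_to_names.get((fs, inode), []))


# Toy data, purely illustrative.
toy_blocks = {("/var", 100): 7}
toy_names = {("/var", 7): ["/var/log/messages"]}
print(lba_to_filenames(400, lambda lba: ("/var", lba // 4), toy_blocks, toy_names))
```

The hard part is filling the two tables from the real file system; the composition itself really is only a few lines.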
    > > Perhaps there's already a system utility or port for this? It would be
    > > really useful!
    >
    > I'm mainly worried about exactly what the system was trying to write at
    > the time. It's not clear from the message whether the write succeeded
    > or not.
    Yes, that's exactly my concern too.
    > --
    > Anthony
    -cpghost.

    --
    Cordula's Web. http://www.cordula.ws/
    cpghost@cordula.ws Guest

  5. #5

    Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

    On Sun, 27 Feb 2005 15:53:30 +0100, in sentex.lists.freebsd.questions
    you wrote:
     

    Could be a bad sector on the drive, or bad cable. Hard to say. Try
    /usr/ports/sysutils/smartmontools/

    It can read all sorts of info off the drive and help you narrow down
    what the problem might be.


    ---Mike
    --------------------------------------------------------
    Mike Tancsa, Sentex communications http://www.sentex.net
    Providing Internet Access since 1994
    net, (http://www.tancsa.com)
    Mike Guest

  6. #6

    Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

    cpghost@cordula.ws writes:
     

    Seems like it would be straightforward with adequate documentation.

    --
    Anthony


    Anthony Guest

  7. #7

    Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

    Mike Tancsa writes:
     

    Wow! That is a very cool tool. There's even a Windows port so I can
    use it on my XP machine.

    The two SATA drives show no errors. The older IDE drive (which contains
    the filesystem root) shows the stuff below. There have been over 1000
    read errors over the lifetime of the disk, but the disk had some hard
    times back in December when it was in my overheated old server, so that
    might account for part of that. The most recent errors look like they
    might correlate with what I saw today (unfortunately, I'm not sure how
    to interpret them):

    ======================================================================
    smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/

    === START OF INFORMATION SECTION ===
    Device Model: SAMSUNG SV4002H
    Serial Number: 0413J1FR932555
    Firmware Version: QP100-07
    Device is: In smartctl database [for details use: -P show]
    ATA Version is: 6
    ATA Standard is: ATA/ATAPI-6 T13 1410D revision 1
    Local Time is: Sun Feb 27 22:52:54 2005 CET

    ==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    The SMART RETURN STATUS return value (smartmontools -H option/Directive)
    can not be retrieved with this version of ATAng, please do not rely on this value
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status: (0x00) Offline data collection activity
    was never started.
    Auto Offline Data Collection: Disabled.
    Self-test execution status: ( 0) The previous self-test routine completed
    without error or no self-test has ever
    been run.
    Total time to complete Offline
    data collection: (1560) seconds.
    Offline data collection
    capabilities: (0x1b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    Offline surface scan supported.
    Self-test supported.
    No Conveyance Self-test supported.
    No Selective Self-test supported.
    SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
    Error logging capability: (0x01) Error logging supported.
    No General Purpose Logging support.
    Short self-test routine
    recommended polling time: ( 1) minutes.
    Extended self-test routine
    recommended polling time: ( 8) minutes.

    SMART Attributes Data Structure revision number: 9
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 1050
    4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 55
    5 Reallocated_Sector_Ct 0x0033 253 253 009 Pre-fail Always - 0
    7 Seek_Error_Rate 0x000b 253 253 051 Pre-fail Always - 0
    8 Seek_Time_Performance 0x0024 253 253 000 Old_age Offline - 0
    9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 2968364
    12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 54
    194 Temperature_Celsius 0x0022 175 145 000 Old_age Always - 21
    197 Current_Pending_Sector 0x0033 253 253 009 Pre-fail Always - 0
    198 Offline_Uncorrectable 0x0031 253 253 009 Pre-fail Offline - 0
    199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
    200 Multi_Zone_Error_Rate 0x000b 100 100 051 Pre-fail Always - 0
    201 Soft_Read_Error_Rate 0x000b 100 100 051 Pre-fail Always - 1

    SMART Error Log Version: 1
    Warning: ATA error count 22 inconsistent with error log pointer 4

    ATA Error Count: 22 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 22 occurred at disk power-on lifetime: 23324 hours (971 days + 20 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    04 88 05 01 00 00 a0

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    a1 00 05 01 00 00 a0 00 49d+16:22:20.296 IDENTIFY PACKET DEVICE
    ec 00 05 01 00 00 b0 00 49d+16:22:20.296 IDENTIFY DEVICE
    a1 00 05 01 00 00 b0 00 49d+16:22:20.296 IDENTIFY PACKET DEVICE
    c4 00 19 7f 01 06 e0 ff 49d+16:22:06.296 READ MULTIPLE
    c4 00 01 40 00 00 e0 00 49d+16:20:45.296 READ MULTIPLE

    Error 21 occurred at disk power-on lifetime: 23324 hours (971 days + 20 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    04 88 05 01 00 00 a0

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    a1 00 05 01 00 00 a0 00 49d+16:20:17.296 IDENTIFY PACKET DEVICE
    ec 00 05 01 00 00 b0 00 49d+16:20:17.296 IDENTIFY DEVICE
    a1 00 05 01 00 00 b0 00 49d+16:20:17.296 IDENTIFY PACKET DEVICE
    ca 00 0c 5f 61 38 e0 ff 49d+16:20:04.296 WRITE DMA
    e7 00 00 00 00 00 e0 00 49d+16:19:33.296 FLUSH CACHE

    Error 20 occurred at disk power-on lifetime: 23283 hours (970 days + 3 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    04 88 05 01 00 00 a0

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    a1 00 05 01 00 00 a0 00 49d+09:02:47.296 IDENTIFY PACKET DEVICE
    ec 00 05 01 00 00 b0 00 49d+09:02:47.296 IDENTIFY DEVICE
    a1 00 05 01 00 00 b0 00 49d+09:02:47.296 IDENTIFY PACKET DEVICE
    c4 00 1a ff cd 06 e0 ff 49d+09:02:34.296 READ MULTIPLE
    c4 00 20 df cd 06 e0 ff 07:57:42.000 READ MULTIPLE

    Error 19 occurred at disk power-on lifetime: 23281 hours (970 days + 1 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    04 88 05 01 00 00 a0

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    a1 00 05 01 00 00 a0 00 07:50:43.000 IDENTIFY PACKET DEVICE
    ec 00 05 01 00 00 b0 00 07:50:43.000 IDENTIFY DEVICE
    a1 00 05 01 00 00 b0 00 07:50:43.000 IDENTIFY PACKET DEVICE
    c4 00 07 98 01 06 e0 ff 07:50:43.000 READ MULTIPLE
    e3 00 00 40 00 00 a0 00 07:50:43.000 IDLE

    Error 18 occurred at disk power-on lifetime: 23272 hours (969 days + 16 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    04 88 05 01 00 00 a0

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    b0 d5 01 01 4f c2 e0 00 05:59:56.000 SMART READ LOG
    b0 d1 01 01 4f c2 e0 00 05:59:56.000 SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
    b0 d0 00 00 4f c2 e0 00 05:59:56.000 SMART READ DATA
    b0 da 00 00 4f c2 e0 00 05:59:56.000 SMART RETURN STATUS
    b0 da 00 00 4f c2 e0 00 05:59:56.000 SMART RETURN STATUS

    SMART Self-test log structure revision number 1
    No self-tests have been logged. [To run self-tests, use: smartctl -t]


    Device does not support Selective Self Tests/Logging
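For reading the command rows in the error log above, a small decoder may help. It is a sketch resting on the standard ATA 28-bit LBA register layout (SN holds LBA bits 0-7, CL bits 8-15, CH bits 16-23, the low nibble of DH bits 24-27, and bit 6 of DH selects LBA mode); the function names are invented for illustration and are not part of smartmontools.

```python
LBA_MODE_BIT = 0x40  # bit 6 of the Device/Head register


def decode_lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the SN, CL, CH and DH registers."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def parse_command_row(row):
    """Parse one 'CR FR SC SN CL CH DH DC ...' row of the error log.

    Only the first eight hex fields are used; the timestamp and the
    command name that follow are ignored. 'lba' is None when the row
    is not in LBA mode (e.g. the IDENTIFY rows with DH=a0).
    """
    cr, fr, sc, sn, cl, ch, dh, dc = (int(tok, 16) for tok in row.split()[:8])
    lba = decode_lba28(sn, cl, ch, dh) if dh & LBA_MODE_BIT else None
    return {"command": cr, "sector_count": sc, "lba": lba}


# The WRITE DMA row logged under Error 21:
print(parse_command_row("ca 00 0c 5f 61 38 e0 ff 49d+16:20:04.296 WRITE DMA"))
# -> {'command': 202, 'sector_count': 12, 'lba': 3694943}
```

Decoded this way, the WRITE DMA row addresses LBA 3694943 on the Samsung drive, which is not the LBA=4848803 that the kernel reported for ad10.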

    --
    Anthony


    Anthony Guest

  8. #8

    Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

    On Sun, 27 Feb 2005 23:09:50 +0100, in sentex.lists.freebsd.questions
    you wrote:
     
    >The two SATA drives show no errors. The older IDE drive (which contains
    >the filesystem root) shows the stuff below. There have been over 1000
    >[...]
    >Device does not support Selective Self Tests/Logging


    Try running some of the tests on the SATA drives as well as run the
    monitoring daemon. With any luck, it will provide a little more
    information about the error condition you are seeing.

    ---Mike
    --------------------------------------------------------
    Mike Tancsa, Sentex communications http://www.sentex.net
    Providing Internet Access since 1994
    net, (http://www.tancsa.com)
    Mike Guest

  9. #9

    Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

    At 3:53 PM +0100 2/27/05, Anthony Atkielski wrote: 

    First question: which SATA controller are you using? And what is
    the make&model of the hard drives that you are using?

    Note: There have been several different threads on different mailing
    lists from users having WRITE_DMA errors similar to this. At least
    some of the problem is in the code which handles disk I/O. The
    developer who works the most on that code is in the middle of a
    fairly major set of improvements to it, as is described in the
    thread with a subject of:

    UPDATE2: ATA mkIII first official patches - please test!

    on the freebsd-current and freebsd-stable mailing list. That major
    set of improvements is still being tested, but it does solve some
    ATA/SATA issues for many users. Which issues you are running into
    will depend on which SATA controller you have, and the make&model
    of SATA hard-disks that you have attached to the controller.

    I realize that none of that info really helps you right now, but
    I just thought I would say that it may be you're not having any
    hardware problems. Or at least, not on the disk itself. It might
    be a problem with the disk-controller, or it might be fairly minor
    timing-problems that come up under certain kinds of load.

    Of course, it still *could* be your hard disk... Also note that I
    am not an expert on hard disks or disk I/O. It's just that I've
    suffered through many similar problems, and I know that Søren has
    been working on the newer, improved code for handling ATA/SATA.

    --
    Garance Alistair Drosehn = netel.rpi.edu
    Senior Systems Programmer or org
    Rensselaer Polytechnic Institute or edu
    Garance Guest

  10. #10

    RE: WRITE_DMA errors on SATA drive under 5.3-RELEASE


     
    > Wow! That is a very cool tool. There's even a Windows port so I can
    > use it on my XP machine.
    >
    > The two SATA drives show no errors. The older IDE drive (which contains
    > the filesystem root) shows the stuff below. There have been over 1000
    > read errors over the lifetime of the disk, but the disk had some hard
    > times back in December when it was in my overheated old server, so that
    > might account for part of that. The most recent errors look like they
    > might correlate with what I saw today (unfortunately, I'm not sure how
    > to interpret them):

    Rule of thumb on IDE hard drives: if they show more than a few errors
    with a tool like smartmon, they need to be thrown in the garbage.

    Heat is the number one enemy of hard drives. If this drive overheated,
    particularly over a long time period, resistance values and semiconductor
    values can shift, permanently, in the electronics of the drive. So even
    if the heads and platters are still good, you're on borrowed time with the
    circuit board. And since it's the circuit board that's dodgy, the drive
    surface isn't failing, so the problems aren't going to register with
    S.M.A.R.T.

    Despite S.M.A.R.T., the vast majority of IDE hard drives that fail, fail
    without warning.

    Ted

    Ted Guest

  11. #11

    Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

    Ted Mittelstaedt writes:
     

    Seems prudent to me, but right now I don't have the budget to replace
    this drive (yes, 40 GB IDE drives are cheap, but I don't have even
    that).

    --
    Anthony


    Anthony Guest

  12. #12

    Re: WRITE_DMA errors on SATA drive under 5.3-RELEASE

    Garance A Drosihn writes:
     

    The controller is built into the Asus P4P800-E motherboard, and is
    based on the Intel ICH5R southbridge chipset. There's also a Promise
    20378 RAID controller on board but I do NOT use it (disabled in BIOS).
     

    The SATA drives are two identical Western Digital WD1200JD 120-GB
    drives, 7200 RPM. Device ad10 holds /tmp and /var; device ad12 holds
    /usr.

    There is also a third drive, an older Samsung SV4002H (40 GB), connected
    to the primary IDE controller. This drive holds the root /.

    Although the error messages I've seen name ad10 (the first SATA drive),
    smartctl says that no errors have occurred on either of these
    drives--whereas it does show a log of errors on the third drive (ad0)
    that seem to correspond mysteriously to the errors in the message.
     

    So I've surmised. The problem seems to be quite rare, but since this is
    a production server I worry about disk writes not being completed; I
    have no easy way to tell whether writes were actually lost or not.
     

    I don't think there are any hardware problems at all. This isn't a
    terribly exotic configuration. It's probably a bug or configuration
    problem.

    --
    Anthony


    Anthony Guest
