Current-Users archive
Mysterious bit errors on healthy media in i386/4.99.72
  I am trying to run -current off a USB stick on my eeepc900, in order 
to build some confidence in it before wiping the SSD, and I am seeing 
weird behaviour that I can't explain:
  After a kernel panic[0] in the middle of a bulk-install of large amounts 
of bloat[1], I thought the state of my pkg tree seemed a bit too messed up 
for having been caused by that panic alone, so I pkg_admin check'd my 
entire tree -- and it complained about a whole lot of MD5 mismatches.
  I decided to wipe the entire pkg tree, and simply copy it from the build 
host over NFS, /usr/pkg and /var/db/pkg* and all. While this induced yet 
another kernel panic, I got far enough to confirm my suspicion: I get a LOT 
of mysterious bit errors on the usb stick when I install the packages (or 
just cp -pR them) over NFS. A binary diff shows that the files indeed 
differ from those on the build host. The build box was spewing out a 
few (from memory) "ex0: transmit underrun (20) @9000" messages during the 
copy, but the copy succeeded, and with all the checksumming going on in the 
network layers it seems unlikely that the bit errors came from there.
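  For anyone who wants to repeat the comparison, here is a minimal sketch 
(not the tool I actually used) that reports the offset and XOR mask of every 
differing byte between two copies of a file, and counts the flipped bits. 
The function names and the example paths are made up for illustration.

```python
def bit_diffs(path_a, path_b, chunk=65536):
    """Return a list of (offset, xor_mask) for every byte that differs."""
    diffs = []
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            ca, cb = a.read(chunk), b.read(chunk)
            if not ca and not cb:
                break  # both files exhausted
            if ca != cb:
                for i, (x, y) in enumerate(zip(ca, cb)):
                    if x != y:
                        # x ^ y has a 1 in every bit position that flipped
                        diffs.append((offset + i, x ^ y))
            if len(ca) != len(cb):
                raise ValueError("files differ in length")
            offset += len(ca)
    return diffs


def flipped_bits(diffs):
    """Total number of flipped bits across all differing bytes."""
    return sum(bin(mask).count("1") for _, mask in diffs)


# Hypothetical usage: the copy on the stick vs. the build host's copy
# mounted over NFS (both paths are assumptions).
# for off, mask in bit_diffs("/usr/pkg/bin/irssi",
#                            "/mnt/buildhost/usr/pkg/bin/irssi"):
#     print(f"offset {off:#x}: xor mask {mask:#04x}")
```

A mask with a single bit set means exactly one bit flipped in that byte. 
Comparing the masks and offsets across several files shows whether the 
errors follow a pattern or look random.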
  Seeing as these USB sticks are managed NAND that *should* report errors, 
I tried dd:ing the stick full of zeroes on an OS where I know umass itself 
works[2], dd:ing it back and comparing it to an image of zero bits only -- 
no problems. I couldn't reproduce the bit errors or cause any IO failures 
in OtherOS(tm) at all. However, the usb stick *seemed* to work fine with 
the eeepc's usb controller.
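  The zero-fill test can also be done with a non-constant pattern. With all 
zeroes, a block that lands at the wrong offset is indistinguishable from a 
correct one; a pattern derived from the block index catches that case too. 
Below is a rough sketch under my own assumptions (the function names, the 
block size and the SHA-256 based pattern are all just illustrative choices). 
Note that a read straight after the write may be served from the buffer 
cache, so the stick should be unmounted or replugged between the two steps.

```python
import hashlib
import os


def pattern_block(index, size):
    """Deterministic pseudo-random block derived from its block index."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{index}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])


def write_pattern(path, blocks, size=65536):
    """Write `blocks` pattern blocks to path and push them to the device."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(pattern_block(i, size))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, blocks, size=65536):
    """Return the indices of blocks that do not match the pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(size) != pattern_block(i, size):
                bad.append(i)
    return bad
```

Running write_pattern() on one OS and verify_pattern() on the other would 
also show on which side the corruption happens.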
  Last night I tried to re-install the entire system onto the usb stick on 
the build host instead (which has an Intel usb controller that seemed to 
work as well) just to see what would happen if I eliminated the network 
from the equation, but that failed early on with infinite loops of reset 
failures after stalled bulk transfers (this rings a bell) -- so I guess umass is 
fucked there too, after all. :-/ I never saw these failures on the eeepc, 
however.
  What's bugging me is that I am not seeing random mayhem, completely 
broken files, wildly inconsistent meta-data or any of the other epic 
failures that I would expect from a completely broken umass or a sneaky 
-current file system bug. I only get small bit errors. I diffed a hexdump 
of /usr/pkg/bin/irssi -- it had a single bit error in the middle of it. 
And these bit errors don't seem reproducible in OtherOS. The eeepc is 
brand spanking new and doesn't exhibit stability problems in Linux, so 
while it can't be ruled out until tested, I would be surprised if it has 
faulty RAM.
  Can anyone think of anything in -current that can cause small bit errors 
like this? Has anyone experienced similar problems? The USB stick is a 
4GB Kingston DataTraveler 2.0, if that means anything...
[0] Actually, this whole process repeated itself twice
[1] aka GNOME ;o)
[2] I have never seen usb2 umass work on any NetBSD machine, so I don't
    really trust it even when it seems to work.
Best regards,
ali:)