Port-alpha archive


Re: writing cdhdtape to CD



On Sun, 7 Jun 2009, Manuel Bouyer wrote:

On Sat, Jun 06, 2009 at 10:58:08AM +0200, Anders Hogrelius wrote:

This problem has been there a long time, at least since 3.0. It does not
only affect the CS20/DS20L but also the DS20 and 264DP. It is also the
reason why I gave up on trying to use NetBSD on my production boxes. I
suspect it might be related in some way to the problem Michael described
too, as I ran into that problem when I tried to boot my boxes with a
non-MP kernel. I didn't dig deeper into the cause; however, I can say
that it is not driver specific: the same problem occurs regardless of whether
your disk is hooked up to the internal SCSI chain or to a card in the PCI
slots. It doesn't seem to be SCSI specific either as for me it thrashed
the filesystem on disks hooked up to the ATA controller too when I tried
that.

Can you give more details on the issue ? I have NetBSD 5.0 running on a
DS20 and several XP1000, and don't have issues with it. All are UP though.

Both of my CS20 systems have 2GB of memory and the Symbios Logic 53c1010 SCSI controller.

Both ran quite well with 3.x and 4.0 with the SCSI drives I had been using for quite some time. One CS20 had a 72GB Hitachi drive (later replaced with a 72GB Compaq drive) and the other had an 18GB Seagate Cheetah. Once the problems I found with SMP in 3.x had been fixed, I was able to run MP on both for the duration of 3.x and 4.0.

The disk problems began showing up after I upgraded the drives to 140GB Fujitsu drives (MBA3147NC). Because I didn't need to reboot very often, I didn't concern myself with the disk I/O problems.

After the work Andy did on 4.99.x for locking, I tried an MP kernel a few times (it may have been on another MP alpha, I can't remember for sure). I found a problem with the TLB shootdown corrupting the pool_cache, which resulted in one of the CPUs looping. I tried using a different way of dealing with the TLB shootdown stuff, and was able to get a kernel that would run for a while, but eventually panicked (and I don't recall where it panicked). I kind of gave up pursuing that at the time, and sometime later noted that someone had addressed some kind of problem with corruption in the pool cache code (can't remember the details of that, either - age is getting to me). I attempted an MP kernel again, and I think I got panics similar to the previously mentioned ones. The problem there appears related to the TLB shootdown code again, but I haven't had the time to delve into that yet (some day I'll get there, if no one else figures it out before then). These things occurred prior to the netbsd-5 branch, so the netbsd-5 kernels (and -current) are not capable of running MP at this time. [The GENERIC.MP kernel does appear to run fine with only 1 CPU enabled.]

Back to the disk I/O problems: while running a 5.0 kernel, I had one of my CS20s crash for some reason (can't remember what it was now, since I got sidetracked with the recovery). When rebooting, I ran into the disk I/O problem, and along with that found that something appeared to have scrambled one of the inode blocks on the root partition. After tracking down exactly which disk blocks contained those inodes, I was able to determine that the data was not close to what it should be. [Note to the person whose fsck clobbered the disk when it had problems reading: when my fsck fails during the preen on bootup, I am usually very careful about running an fsck that modifies the disk until I'm sure what it's complaining about and what it's going to do to fix it. That saved me from clobbering the disk more than the one block of inodes did. And that block contained files in the /.sysinst directory, so I didn't lose anything at that time.]
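
For anyone wanting to repeat the "which disk blocks hold those inodes" step, the following is a rough Python sketch of the usual FFS arithmetic, modeled on the ino_to_cg, ino_to_fsba and ino_to_fsbo macros in ufs/ffs/fs.h. It is not necessarily how the blocks were tracked down in the case described above. The function name and the sample numbers are made up for illustration; real values for ipg, fpg, iblkno, frag, inopb and fsize come from the superblock (for example as printed by dumpfs). The sketch assumes the cylinder groups have no extra start offset (fs_old_cgoffset is zero), and the result is relative to the start of the partition, not the whole disk.

```python
def inode_location(ino, *, ipg, fpg, iblkno, frag, inopb, fsize, sector=512):
    """Locate an inode on an FFS partition (simplified sketch).

    ipg    : inodes per cylinder group
    fpg    : fragments per cylinder group
    iblkno : offset of the inode area within a cylinder group, in fragments
    frag   : fragments per filesystem block
    inopb  : inodes per filesystem block
    fsize  : fragment size in bytes

    ASSUMPTION: cylinder group c starts at fragment c * fpg, i.e. there is
    no per-group start offset.
    """
    cg = ino // ipg                      # cylinder group holding the inode
    index_in_cg = ino % ipg              # inode index within that group
    # Fragment address of the filesystem block that holds the inode.
    frag_addr = cg * fpg + iblkno + (index_in_cg // inopb) * frag
    slot = ino % inopb                   # position of the inode in that block
    # Same address in disk sectors, relative to the start of the partition.
    sector_addr = frag_addr * (fsize // sector)
    return cg, frag_addr, slot, sector_addr


# Made-up geometry, for illustration only.
geometry = dict(ipg=23040, fpg=94064, iblkno=56, frag=8, inopb=128, fsize=2048)
print(inode_location(2, **geometry))
print(inode_location(23340, **geometry))
```

The sector address (plus the partition's starting offset from the disklabel) is what one would hand to dd to dump the block read-only before letting fsck modify anything.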

I continued running the 5.0 kernel after that, and again got a panic (this one was somewhere in the UDP checksum code, if I remember). Again, I experienced the disk I/O problems and once the disk I/O was working correctly, I found that another block of inodes had gotten overwritten with data similar to what happened previously. I don't know if the bad data is due to a caching issue with the disks, or a problem with the NetBSD kernel, or a problem due to the disk I/O problems.

One thing I need to do sometime is see what data corruption occurs while the disk I/O problems are happening (but when that happens, I have problems running some programs, so it may be hard to capture any specific data to determine exactly what is getting corrupted).
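
One possible way to characterize that kind of corruption, assuming a known-good copy of a block is available to compare against, is to list the byte ranges where a fresh read differs from the good copy. The Python below is only a sketch of that idea; the function name diff_runs is hypothetical and nothing here is specific to NetBSD or to the failure described above.

```python
def diff_runs(expected: bytes, actual: bytes):
    """Return (offset, length) pairs for each run of differing bytes."""
    if len(expected) != len(actual):
        raise ValueError("buffers must be the same length")
    runs = []
    start = None                          # start of the current differing run
    for i, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:                 # run extends to the end of the buffer
        runs.append((start, len(expected) - start))
    return runs


# Tiny illustration with made-up data.
print(diff_runs(b"abcdef", b"abXXeF"))
```

Whether the differing runs line up on sector or block boundaries, or are scattered, might help distinguish a misdirected write from in-memory corruption.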

--
Michael L. Hitch                        mhitch%montana.edu@localhost
Computer Consultant
Information Technology Center
Montana State University        Bozeman, MT     USA

