port-sparc64: Ultra 30 almost working

Subject: Ultra 30 almost working
To: None <port-sparc64@netbsd.org>
From: Geoff Adams <gadams@avernus.com>
List: port-sparc64
Date: 12/09/2001 04:19:09
I've loaded an Ultra 30 with -current (kernels built from sources from 
December 5 and from an hour ago), and it's almost usable. It comes up 
into single user mode (sometimes), and every operation I've attempted so 
far works (after some effort). However, processes frequently get "stuck."

I'll describe the symptoms in some detail, in hopes that someone can 
help me figure out how to debug the problem.

If I, for example, 'fsck /dev/rsd0a', it will print out the first line 
of output, and hang. If I drop into ddb with '+++++' and then continue, 
the next couple of lines of fsck output appear, and it hangs again. Repeat 
the ddb-continue cycle by typing '+++++<CR>', and get the next line or 
two, etc.

Most processes get stuck like this sooner or later. It seems that it 
might be related to disk activity. Untarring /usr to disk, for instance, 
gets stuck many dozens of times, usually with the disk activity light 
lit. When I drop into the debugger and return, the disk just resumes 
churning away. Doing a 'df' will hang the first time after filesystem 
write activity (as will sync, so the cause is probably similar), while a 
subsequent 'df' (after clearing up the first one via '+++++<CR>') will 
print its output just fine.

It's not just disk activity, though. The same problem occurs if I 
netboot the machine and never mount a local disk. For some reason, 
probing SCSI devices takes over a half hour (when it completes at all) 
if I netboot, while the same kernel, booted from disk, probes the SCSI 
devices just fine. Also, some non-disk-related processes, such as 
'ifconfig -a', exhibit this behavior.

When a process is hung, I can still interact with other processes. For 
instance, if the tar process hangs, I can hit ^Z, and after a 
ddb-continue cycle, I'll see the "suspended" message. I can then 'bg' 
the tar, and it will continue in the background for a few seconds 
(before it hangs again), during which time I can do something else, such 
as 'ls'. The ddb-continue cycle allows background processes to continue, 
just as well as when they were in the foreground.

Traces in ddb don't seem interesting, since I'm not actually breaking 
into the debugger during execution of whatever is causing the stoppage. 
In fact, it seems as if there's nothing really causing the processes to 
stop, but rather some interrupt is being missed, or something is not 
occurring to cause the process to be switched in from the run queue. If 
there's some interesting piece of data I can provide, please let me know.

I haven't been able to infer much meaning from 'ps -alxww' output, 
either. For instance, a hung 'sync' shows up in wait channel "getblk," 
status "D," with a running time of 71582788:15.99. I'm guessing it's 
hung before it's even gotten started. A hung 'tar', on the other hand, 
shows up in "biowait," status "D," with 0:00.40 on the clock.
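
As a rough sanity check on that odd 'sync' running time, the arithmetic 
below converts it to seconds and compares it with 2^32. It assumes ps 
prints the time as minutes:seconds.hundredths; the variable names are 
only illustrative. This is a sketch, not a diagnosis:

```python
# Running time that ps reported for the hung 'sync'.
# Assumption: the format is MINUTES:SECONDS.HUNDREDTHS.
minutes, seconds, hundredths = 71582788, 15, 99

# Work in integer hundredths of a second to avoid floating-point rounding.
reported_cs = (minutes * 60 + seconds) * 100 + hundredths
wrap_cs = 2**32 * 100  # 2^32 seconds, expressed in hundredths

# How far the reported value falls short of 2^32 seconds.
shortfall_cs = wrap_cs - reported_cs
print(f"reported: {reported_cs / 100:.2f} s")
print(f"2^32 s minus reported: {shortfall_cs} hundredths of a second")
```

The reported time falls one hundredth of a second short of 2^32 seconds. 
That would be consistent with a slightly negative value being displayed 
as an unsigned 32-bit count of seconds, rather than with real 
accumulated CPU time, though it doesn't prove that's what happened.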

Is there anything anyone can think of that I could look at to narrow 
down the problem? Does this pattern ring any bells?

This is so close to working, but this problem makes the machine 
completely unusable.

Thanks,
- Geoff