Port-vax archive


Re: NetBSD/vax - worth continuing?



It has been a month since I last updated this thread. Since I at least have some observations to offer now, I figured I should post them.

On 2016-09-21 09:59, Johnny Billquist wrote:
On 2016-09-21 08:26, Anders Magnusson wrote:
Den 2016-09-20 kl. 23:55, skrev Johnny Billquist:
On 2016-09-20 21:24, Anders Magnusson wrote:
Den 2016-09-20 kl. 19:12, skrev Johnny Billquist:
Hi, Ragge...

Tried work that out many times, but never gotten far. You want console
access to a hung or crashed system? :-)
If you cannot get into DDB then something really evil has happened. Is
this the case?

No. Seems I must have really mis-stated this. The system hangs as in
the OS stalls. The hardware is working fine, and I can break into DDB
as well. If I were to make a guess, it appears that all processes that
do disk I/O stall.
Other things continue running. But as most things touch disk sooner or
later, pretty much everything draws to a standstill.
You are using MSCP, eh?  My guess would be that it loses MSCP
buffers somewhere, then.

Yes, MSCP.
Hmm... Losing buffers. That's an interesting idea I hadn't considered.
Could be.

After lots of experimentation and playing around, I think the problem is not related to losing buffers. I'll try to explain my observations, but this requires describing my setup as well, so bear with me.

The machine is a real VAX 8650, with 60 Megs of memory, and eight RA73 disk drives, and one ethernet.

Disk drives are connected to two UDA-50, and ethernet is DELUA.
The machine has two Unibuses. The first Unibus has one UDA-50 and the DELUA. The second Unibus only has one UDA-50. Disks are numbered ra0 to ra7, with ra0-ra3 on UDA-50 #0, and ra4-ra7 on UDA-50 #1.

/, swap, /var and /home are all on ra0.
/usr is now on ra1
/usr/src is on ra2
/usr/src/external/gpl3 is on ra5
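
To make it easier to see which file systems end up behind which controller, the layout above can be written down as a small table. This is only an illustrative sketch of the setup described in this mail; the names `MOUNTS`, `controller_for` and `mounts_on_controller` are made up for the example and are not NetBSD interfaces. Since each UDA-50 sits on its own Unibus here, the controller number is also the Unibus number.

```python
# Illustrative model of the disk layout described in this mail.
# All names here are hypothetical helpers, not NetBSD interfaces.

# Current file system placement, as listed above.
MOUNTS = {
    "/": "ra0",
    "swap": "ra0",
    "/var": "ra0",
    "/home": "ra0",
    "/usr": "ra1",
    "/usr/src": "ra2",
    "/usr/src/external/gpl3": "ra5",
}


def controller_for(disk):
    """Return the UDA-50 index for a disk: ra0-ra3 -> 0, ra4-ra7 -> 1."""
    unit = int(disk[2:])
    if not 0 <= unit <= 7:
        raise ValueError(f"unknown disk {disk!r}")
    return unit // 4


def mounts_on_controller(ctlr):
    """Return the sorted mount points whose disk sits on the given UDA-50."""
    return sorted(m for m, d in MOUNTS.items() if controller_for(d) == ctlr)


if __name__ == "__main__":
    for ctlr in (0, 1):
        print(f"UDA-50 #{ctlr}: {', '.join(mounts_on_controller(ctlr))}")
```

With the current placement, only /usr/src/external/gpl3 lives behind the second controller, so any access below that directory goes over the second Unibus.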

Earlier I had a ccd disk, which consisted of ra4-ra7, and this held all of /usr.

In between I also tried having /usr/src on ra4.

In short, I have had various setups for the disks, but what I have been changing around is on which controller the different file systems have been located.

Now, with ccd, the system gets stuck in uvn_fp2 when I'm running cvs. It does not happen right away, but eventually it always happens.

Having skipped ccd, and just working on disks on the first UDA-50, the system seems to not have any problems. But when I do disk operations on the second UDA-50, sooner or later, the process gets stuck in biowait, and never recovers.

Now, I have tried this on different disks, and with different controllers, so I think the problem is not there. I have not tried replacing the Unibus adapter as such.

However, it seems the problem is somehow related to the controller on the second bus. Either we have some bug in the NetBSD code, or I have some other problem that I haven't noticed. I've tried exercising the disks through VMS, and haven't seen any problem there, but I'm sure this testing has not been very thorough.

The machine runs both VMS and Ultrix fine, and passes all the diagnostics I've thrown at it so far. Unfortunately I do not have any diagnostics for the RA73 drives. The MSCP disk diagnostics I have do not recognize the RA73 drives (too new), so they do not show up that way. If anyone has DS diagnostics for MSCP drives newer than around 1990, I would be interested in getting copies.


But for now, this really smells as if we have some kind of issue with additional Unibuses in NetBSD. An interesting detail is that looking at vmstat -i, I can see that uba0 has generated some interrupts, but uba1 never generates any interrupts.

Ragge, what interrupts would the Unibus adapter generate, and does it make sense that only one of the adapters is generating interrupts?


In the end, I have not been able to run the actual tests with cvs that I intended to, since the machine always hangs sooner or later, while working on the disk. And since /usr/src is so big, I need at least two ra73 to hold it. I could allocate another ra73 on the first UDA-50, and see if I can get through all the work then, but since I have some data on the other disks, this is a bit messy.

And really, seems like we have a problem that needs solving here.

Next time it happens, please get a process list (ps axl or from ddb) so
we can get further on diagnosing it.

Sure. That is easy.

Done a lot more than that, but it certainly seems like it gets stuck on disk, but only for disks on the second controller/second Unibus.

In fact, the machine is partially stuck right now and in ddb.

Here is ps from ddb:
db> ps
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
9866     1 3   0        80           812f1d40            telnetd netio
7388     1 3   0        80           80e8b7e0            telnetd netio
7220     1 3   0   1000000           81548aa0                 df vcache
6190     1 3   0        80           812f1800            telnetd netio
7785     1 3   0        80           81548560               tcsh pause
6934     1 3   0        80           80be22c0               tcsh pause
6966     1 3   0        80           80be2020             pickup kqueue
3234     1 3   0         0           80be2560               find vcache
921      1 3   0        80           815ed2a0           postdrop netio
3880     1 3   0        80           81836aa0           sendmail pipe_rd
3650     1 3   0        80           812f1aa0                tee pipe_rd
3453     1 3   0        80           81548d40                 sh wait
1088     1 3   0        80           81548800                 sh wait
1092     1 3   0        80           82dbe000               cron pipe_rd
27046    1 3   0         0           83aa72a0                cvs vcache
25830    1 3   0        80           8393a7e0               tcsh ttyraw
26574    1 3   0        80           81836800               tcsh pause
28141    1 3   0        80           82194d40              login wait
26357    1 3   0        80           80e8ba80            telnetd select
21901    1 3   0         0           815ed7e0               find biowait
16289    1 3   0        80           80e8b000           postdrop netio
22585    1 3   0        80           821942c0           sendmail pipe_rd
21391    1 3   0        80           80be2aa0                tee pipe_rd
23186    1 3   0        80           812f1020                 sh wait
21144    1 3   0        80           812f1560                 sh wait
19187    1 3   0        80           838ceaa0               cron pipe_rd
12339    1 3   0         0           82194020               find biowait
12025    1 3   0        80           815ed000           postdrop netio
10843    1 3   0        80           82194800           sendmail pipe_rd
10822    1 3   0        80           82dbea80                tee pipe_rd
11165    1 3   0        80           83aa7000                 sh wait
10919    1 3   0        80           80be2800                 sh wait
12324    1 3   0        80           80e8b2a0               cron pipe_rd
1939     1 3   0        80           83aa77e0              getty ttyraw
1882     1 3   0        80           83b11d40              getty ttyraw
2007     1 3   0        80           82dbe540               cron nanoslp
463      1 3   0        80           812f12c0              inetd kqueue
1759     1 3   0        80           838ce2c0               qmgr kqueue
1747     1 3   0        80           815ed540             master kqueue
1381     1 3   0        80           82dbe2a0               sshd select
1338     1 3   0        80           8393aa80              rwhod select
1012     1 3   0        80           82dbe7e0               ntpd pause
1168     1 3   0        80           82dbed20          rpc.lockd select
862      1 3   0        80           82e7caa0          rpc.statd select
1148     5 3   0        80           82e7c020              slave nfsd
1148     4 3   0        80           82e7c2c0              slave nfsd
1148     3 3   0        80           82e7c560              slave nfsd
1148     2 3   0        80           82e7c800              slave nfsd
1148     1 3   0        80           82e7cd40             master select
349      1 3   0        80           838ce020             mountd select
1068     1 3   0        80           838ce800            rpcbind select
985      1 3   0        80           838ce560            syslogd kqueue
1        1 3   0        80           83b11aa0               init wait
0       39 3   0       200           8393a000              nfsio nfsiod
0       38 3   0       200           8393a2a0              nfsio nfsiod
0       37 3   0       200           8393a540              nfsio nfsiod
0       36 3   0       200           8393ad20              nfsio nfsiod
0       35 3   0       200           83b48000            physiod physiod
0       34 3   0       200           83aa7a80           aiodoned aiodoned
0       33 3   0       200           83aa7d20            ioflush tstile
0       32 3   0       200           83b482a0           pgdaemon pgdaemon
0       29 3   0       200           83b11800              unpgc unpgc
0       28 3   0       200           83b11560          nd6_timer nd6_timer
0       27 3   0       200           83b112c0           rt_timer rt_timer
0       26 3   0       200           83b11020        vmem_rehash vmem_rehash
0       17 3   0       200           83b48540            mscp_wq mscp_wq
0       16 3   0       200           83b487e0            mscp_wq mscp_wq
0       15 3   0       200           83b48a80         pmfsuspend pmfsuspend
0       14 3   0       200           83b48d20           pmfevent pmfevent
0       13 3   0       200           83b6a020         sopendfree sopendfr
0       12 3   0       200           83b6a2c0           nfssilly nfssilly
0       11 3   0       200           83b6a560            cachegc cachegc
0       10 3   0       200           83b6a800              vrele vrele
0        9 3   0       200           83b6aaa0             vdrain vdrain
0        8 3   0       200           83b6ad40          modunload mod_unld
0        7 3   0       200           83b80000            xcall/0 xcall
0        6 1   0       200           83b802a0          softser/0
0        5 1   0       200           83b80540          softclk/0
0        4 1   0       200           83b807e0          softbio/0
0        3 1   0       200           83b80a80          softnet/0
0    >   2 7   0       201           83b80d20             idle/0
0        1 3   0       200           802cac40            swapper uvm
db>
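
For anyone who wants to summarise a listing like this without reading it line by line, here is a minimal sketch that tallies the WAIT column. It assumes the column layout shown above (PID, LID, S, CPU, FLAGS, LWP address, NAME, optional WAIT); the function names are made up for the example, and the sample is just a few lines copied from the listing. Treating vcache and tstile as likely knock-on effects of the stuck biowait processes is an assumption of the sketch, not something the listing proves.

```python
from collections import Counter

# A few lines copied from the ddb "ps" listing above.
SAMPLE = """\
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
7220     1 3   0   1000000           81548aa0                 df vcache
3234     1 3   0         0           80be2560               find vcache
27046    1 3   0         0           83aa72a0                cvs vcache
21901    1 3   0         0           815ed7e0               find biowait
12339    1 3   0         0           82194020               find biowait
0       33 3   0       200           83aa7d20            ioflush tstile
0        4 1   0       200           83b807e0          softbio/0
0    >   2 7   0       201           83b80d20             idle/0
"""


def parse_ddb_ps_line(line):
    """Return (name, wait) for one data line, or None for headers and prompts."""
    # The ">" marks the currently running LWP; drop it so the columns line up.
    fields = line.replace(">", " ").split()
    # Expect PID LID S CPU FLAGS LWP-address NAME [WAIT], with a numeric PID.
    if len(fields) < 7 or not fields[0].isdigit():
        return None
    name = fields[6]
    wait = fields[7] if len(fields) > 7 else ""
    return name, wait


def wait_counts(text):
    """Count how many LWPs sleep on each wait channel ("" means no channel)."""
    parsed = (parse_ddb_ps_line(line) for line in text.splitlines())
    return Counter(wait for _name, wait in filter(None, parsed))


def waiting_on(text, channels=("biowait", "vcache", "tstile")):
    """List (name, wait) pairs for LWPs sleeping on the given channels."""
    parsed = filter(None, (parse_ddb_ps_line(l) for l in text.splitlines()))
    return [(name, wait) for name, wait in parsed if wait in channels]


if __name__ == "__main__":
    for wait, count in wait_counts(SAMPLE).most_common():
        print(f"{wait or '(none)':10} {count}")
    print(waiting_on(SAMPLE, channels=("biowait",)))
```

The interesting entries in the full listing are the two find processes in biowait, plus df, find and cvs in vcache and ioflush in tstile.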

Fixing BDPs would also help to improve Unibus speed I assume.  That
wouldn't be too much work.

One potential issue is that the kernel is spending a damn large amount
of time in the system these days. Performance is really sluggish, while
the same hardware with another OS really performs much better.

I still don't have any final numbers here, but I can say that in Ultrix, doing a cvs update on usr/src takes about 2h. NetBSD gets stuck (for me) after maybe 12h, and at that point it has hardly started going through the files yet...

If I ever get the system to work right, I will be able to provide more interesting numbers comparing to Ultrix.

	Johnny

--
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: bqt%softjar.se@localhost             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol

