Subject: buffer priority [Re: unified buffers and responsibility]
To: None <tech-kern@netbsd.org>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-kern
Date: 06/13/2002 00:42:57

Hi,
I've experimented a bit with the problem of X freezing while a large
cp is running and the target disk is the same as the system disk.

One cause of the problem is that some data gets paged out when it
shouldn't be (I see activity on the system disk, clearly related to the
cp, when doing a large cp on another disk).
Even setting filemax low (under 10%) doesn't help, and top still reports
about 30M allocated to files (of 128M - 70M when kernel and buffer cache
are allocated).

The second problem is I/O priority: the buffers of a large batch I/O have
the same priority as a one-buffer I/O on which a process is blocked.
This also kills interactive performance (and the disksort() routines
probably make this even worse).
On my system my test partition is the last one in the disklabel, so I
changed disksort to use this simple algorithm: the lower the partition
number, the higher the priority of the buffer.
+void
+disksort_pri(struct buf_queue *bufq, struct buf *bp)
+{
+       int part = DISKPART(bp->b_dev);
+       struct buf *bq, *nbq;
+
+       bq = BUFQ_FIRST(bufq);
+       if (bq == NULL) {
+               BUFQ_INSERT_TAIL(bufq, bp);
+               return;
+       }
+
+       while ((nbq = BUFQ_NEXT(bq)) != NULL) {
+               if (part < DISKPART(nbq->b_dev))
+                       goto insert;
+               bq = nbq;
+       }
+insert:        BUFQ_INSERT_AFTER(bufq, bq, bp);
+}

This helps a lot. There is still some slowdown, but the system is now usable
when a cp is running (without this, X will freeze completely until the cp
completes).

So I think we need something to prioritize I/O at a disk level (not partition
level). Even for server use I'm afraid this can cause problems (I'm thinking
about my mail server, on which some users have mailboxes of more than 100M).
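
A minimal sketch of what ordering by a per-request priority (rather than by
partition number) might look like. It is a userland model, not kernel code:
struct io_req, struct io_queue, ioq_insert_pri() and ioq_remove_head() are
hypothetical names, and "lower value = more urgent" is an assumed convention.
Unlike disksort_pri() above, it may insert ahead of the current head of the
queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buf; not the kernel structure. */
struct io_req {
	int prio;               /* assumed: lower value = more urgent */
	long blkno;             /* block number, carried along but not used for ordering */
	struct io_req *next;
};

/* Hypothetical per-disk queue: a singly linked list kept in priority order. */
struct io_queue {
	struct io_req *head;
};

/*
 * Insert rq in front of the first queued request that is strictly less
 * urgent. Requests of equal priority therefore stay in FIFO order.
 */
void
ioq_insert_pri(struct io_queue *q, struct io_req *rq)
{
	struct io_req **pp;

	assert(q != NULL && rq != NULL);
	for (pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
		if (rq->prio < (*pp)->prio)
			break;
	}
	rq->next = *pp;
	*pp = rq;
}

/* Detach and return the most urgent request, or NULL if the queue is empty. */
struct io_req *
ioq_remove_head(struct io_queue *q)
{
	struct io_req *rq;

	assert(q != NULL);
	rq = q->head;
	if (rq != NULL) {
		q->head = rq->next;
		rq->next = NULL;
	}
	return rq;
}
```

With strict priority ordering like this, a steady stream of urgent requests
can delay batch requests indefinitely, and the seek ordering that disksort()
provides is lost; a real implementation would have to deal with both.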

Now I don't have much of an idea of which algorithm to use, nor of
how to implement it. Probably something like the process scheduler, but
for I/O, with processes doing a lot of I/O having their I/O priority lowered.
At which level it should be implemented is another problem. Maybe doing it
in the pagedaemon would be enough, as it seems the problem is mostly caused by
writes (sequential reads probably can't lead to large buf queues at the
disk level).
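
The scheduler-like idea above could be sketched as follows, purely as an
illustration. Everything here is an assumption: the struct io_acct
accounting record, the IOPRI_* range, the IO_FREE_BLKS threshold and the
halving decay are invented for the example and do not correspond to any
existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define IOPRI_MIN	0	/* most urgent (assumed convention) */
#define IOPRI_MAX	7	/* least urgent */
#define IO_FREE_BLKS	16UL	/* recent usage below this is not penalized */

/* Hypothetical per-process I/O accounting record. */
struct io_acct {
	unsigned long recent;	/* decayed count of blocks recently queued */
};

/* Charge a process for nblks blocks it has just queued. */
void
io_acct_charge(struct io_acct *a, unsigned long nblks)
{
	assert(a != NULL);
	a->recent += nblks;
}

/* Called periodically: halve the usage so that old I/O is forgotten. */
void
io_acct_decay(struct io_acct *a)
{
	assert(a != NULL);
	a->recent /= 2;
}

/*
 * Derive an I/O priority from recent usage: one step less urgent for each
 * halving needed to bring the usage below IO_FREE_BLKS, capped at IOPRI_MAX.
 */
int
io_acct_prio(const struct io_acct *a)
{
	unsigned long u;
	int prio = IOPRI_MIN;

	assert(a != NULL);
	for (u = a->recent; u >= IO_FREE_BLKS && prio < IOPRI_MAX; u /= 2)
		prio++;
	return prio;
}
```

In this model a process blocked on a single buffer stays at IOPRI_MIN, while
a large cp quickly drifts to IOPRI_MAX and recovers only gradually once it
stops queueing I/O.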

I already faced this problem in 1.5.x (on the mail server mentioned above),
but it's worse with UBC because:
- the buf queue can be larger
- a sequential write can push program data out of RAM

Any ideas?

-- 
Manuel Bouyer <bouyer@antioche.eu.org>