Subject: UBC problems (seen on alpha)
To: None <tech-kern@netbsd.org, port-alpha@netbsd.org>
From: Matthias Drochner <M.Drochner@fz-juelich.de>
List: port-alpha
Date: 01/14/2005 19:46:56
The mysterious compiler problem and the I/O problem
I reported yesterday are caused by an unfortunate
interaction between UBC code and the md trap handling.
I can only observe it on alpha, but it looks general
enough that other CPU types might be affected too.

The symptom is that file writes past EOF occasionally
fail with EINVAL.
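
For reference, the kind of write in question looks like
this from userland. This is only a minimal sketch: the
function name and the path passed in are made up for
illustration, and it is not a reliable reproducer, since
the failure depends on how the kernel copy happens to
touch the buffer.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Extend a freshly truncated file by writing at its end.
 * Returns 0 on success or the errno value of the failing call.
 * The helper name and the path are hypothetical.
 */
int
write_past_eof(const char *path)
{
	static const char data[] = "bytes that extend the file\n";
	ssize_t n;
	int fd, error = 0;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return errno;

	/* The file is empty, so every byte written lands past EOF. */
	n = write(fd, data, sizeof(data) - 1);
	if (n == -1)
		error = errno;	/* EINVAL in the failing case */
	else if ((size_t)n != sizeof(data) - 1)
		error = EIO;	/* short write, not expected here */

	close(fd);
	unlink(path);
	return error;
}
```

On an unaffected system write_past_eof() returns 0; in the
failing case described here it would return EINVAL.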

Here is what happens, as far as I understand it:

write() dispatches into a filesystem-dependent
vop_write function. This function (e.g. the ffs_write
implemented in ufs_readwrite.c, but other filesystems
are similar in this respect) calls ubc_alloc() to get
some virtual address space in the UBC, followed by a
uiomove() to copy the user data into the kernel-space
UBC buffer. uiomove() returns EINVAL, which propagates
back up the chain to the user.

The reason uiomove() fails is the following:
The UBC buffer is not mapped into kvm yet. uiomove()
calls copyin(), which arranges with the trap handler
to handle page faults appropriately and calls memcpy().
A page fault happens on the UBC kernel address.
uvm_fault() is called, which dispatches to ubc_fault(),
which (indirectly) calls the uvn_get() pager op.
uvn_get() goes into the filesystem-dependent vop_getpages
method. These methods, while filesystem-dependent, have in
common that genfs_getpages() is called. Here is where the
EINVAL really originates: the bounds check in genfs_vnops.c,
lines 516..523. This tests whether the request is within the
bounds of the file.
There is a flag "PGO_PASTEOF" which allows writes past
EOF, exactly the case we are talking about. This flag
is not set.
It should have been set in ubc_fault() if the fault
was a write fault, which is assumed if VM_PROT_WRITE
was passed as access_type to uvm_fault().
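
The decision described above can be modelled in a few
lines of userland C. This is a simplified sketch, not the
kernel code: all MODEL_* constants and model_* functions
are hypothetical stand-ins, and the bounds check is
reduced to a single offset comparison.

```c
#include <errno.h>

/* Hypothetical stand-ins; the values are not the kernel's. */
#define MODEL_PROT_READ		0x1
#define MODEL_PROT_WRITE	0x2
#define MODEL_PGO_PASTEOF	0x1

/*
 * Model of the bounds check: a request at or beyond EOF is
 * rejected unless the past-EOF flag was passed down.
 */
static int
model_getpages(long offset, long eof, int flags)
{
	if (offset >= eof && (flags & MODEL_PGO_PASTEOF) == 0)
		return EINVAL;
	return 0;
}

/*
 * Model of the fault path: the past-EOF flag is derived
 * solely from the access type reported by the trap handler.
 */
static int
model_ubc_fault(long offset, long eof, int access_type)
{
	int flags = 0;

	if (access_type & MODEL_PROT_WRITE)
		flags |= MODEL_PGO_PASTEOF;
	return model_getpages(offset, eof, flags);
}
```

In this model, a fault at offset 8192 of an 8192-byte file
succeeds when reported with MODEL_PROT_WRITE and yields
EINVAL when reported with MODEL_PROT_READ, even though the
caller is doing a write() in both cases.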

And here is where it gets alpha-specific. The fault
was really a read fault. Evidently the alpha memcpy()
did a read from the destination address first, to
deal with alignment.
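
To illustrate how a copy can read its destination first,
here is a sketch of a byte copy on a machine model that
can only load and store whole 64-bit words. That
restriction is an assumption made for illustration; this
is not the actual alpha memcpy(), and all model_* names
are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum model_access { MODEL_NONE, MODEL_READ, MODEL_WRITE };

struct model_mem {
	uint64_t words[4];		/* destination, word access only */
	enum model_access first;	/* first access to the destination */
};

static uint64_t
model_load(struct model_mem *m, size_t idx)
{
	if (m->first == MODEL_NONE)
		m->first = MODEL_READ;
	return m->words[idx];
}

static void
model_store(struct model_mem *m, size_t idx, uint64_t v)
{
	if (m->first == MODEL_NONE)
		m->first = MODEL_WRITE;
	m->words[idx] = v;
}

/*
 * Copy len bytes to byte offset off of the destination using
 * whole-word accesses only.  Returns 0, or -1 if out of range.
 */
static int
model_copy(struct model_mem *m, size_t off, const unsigned char *src,
    size_t len)
{
	if (off > sizeof(m->words) || len > sizeof(m->words) - off)
		return -1;

	while (len > 0) {
		size_t idx = off / 8, byte = off % 8, n = 8 - byte, i;
		uint64_t w = 0;

		if (n > len)
			n = len;
		/* A partial word must be read first so that the
		 * bytes we do not overwrite are preserved. */
		if (n != 8)
			w = model_load(m, idx);
		for (i = 0; i < n; i++) {
			unsigned int shift = 8 * (unsigned int)(byte + i);

			w &= ~((uint64_t)0xff << shift);
			w |= (uint64_t)src[i] << shift;
		}
		model_store(m, idx, w);
		off += n;
		src += n;
		len -= n;
	}
	return 0;
}
```

With an unaligned start (say off = 3) the first access to
the destination is a load, so the first fault on an
unmapped destination page would be a read fault. With an
aligned full-word copy the first access is a store.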

So we have a design problem here. Every piece of
code worked as it was supposed to, yet ubc_fault()
just can't use the access_type to tell whether past-EOF
writes are allowed. A flag set somewhere in the upper
layer, where we still know that we are doing a write(),
should do it, but thinking of SMP and concurrency this
doesn't look good. And having the md trap handler
lie to uvm_fault() can't be right either...
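
The concurrency worry can be made concrete with a toy
model of a single shared "write in progress" flag. This is
only a sketch of one naive variant: the flag, the model_*
functions and the interleavings are hypothetical, and the
two "threads" are simulated by calling things in a fixed
order.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical shared flag set by the upper (write) layer. */
static bool model_write_in_progress;

static void
model_write_begin(void)
{
	model_write_in_progress = true;
}

static void
model_write_end(void)
{
	model_write_in_progress = false;
}

/* A fault past EOF is allowed only while the flag is set. */
static int
model_fault_past_eof(void)
{
	return model_write_in_progress ? 0 : EINVAL;
}

/* Writer B finishes and clears the flag before A's fault. */
static int
model_lost_flag(void)
{
	int error;

	model_write_begin();		/* writer A */
	model_write_begin();		/* writer B */
	model_write_end();		/* B is done, flag cleared */
	error = model_fault_past_eof();	/* A faults: flag is gone */
	model_write_end();		/* A is done */
	return error;
}

/* An unrelated reader faults while a writer has the flag set. */
static int
model_leaked_flag(void)
{
	int error;

	model_write_begin();		/* writer */
	error = model_fault_past_eof();	/* reader's fault past EOF */
	model_write_end();
	return error;
}
```

In this model model_lost_flag() returns EINVAL for a
legitimate write, and model_leaked_flag() returns 0 for a
read past EOF that should have been rejected.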

best regards
Matthias