Subject: weird random occasional EFAULT error from open(".", O_RDONLY)
To: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-smp
Date: 10/11/2004 16:15:54
[[ note I'm not subscribed to tech-smp at the moment, and I'm not even
sure this is SMP related, so I've set followups to go to tech-kern ]]

Of late I've been seeing some very weird random occasional EFAULT errors
from an open(".", O_RDONLY) call in pax.

I cannot reliably re-create the conditions leading up to this error, nor
is it reproducible immediately after it has happened.

However it _seems_ to happen only on my dual-CPU Alpha running with an
SMP 1.6.x kernel, and only when the current working directory is a newly
made directory that I've just "cd"ed into, e.g. to extract a file.  Most
often this happens when unpacking distfiles into a clean package-obj
directory, but occasionally it happens when I run "pax" by hand to view
the contents of an archive or similar.

However my focus on the Alpha and SMP is probably due simply to the fact
I've been using it more for tasks that evoke the problem than my other
machines in the past wee while.

Simply repeating the invocation of "pax" results in success.

Here's a recent example:

15:23 [39] $ pax -vzf /build/woods/building/NetBSD-1.6.x-alpha-release-no-g/source/sets/gnusrc.tgz
(null): Can't open current working directory. (Bad address)
15:23 [40] $ pax -vzf /build/woods/building/NetBSD-1.6.x-alpha-release-no-g/source/sets/gnusrc.tgz
drwxr-xr-x  2 woods    wheel          0 Sep 29 12:50 usr/src/gnu
drwxr-xr-x  2 woods    wheel          0 Sep 29 12:52 usr/src/gnu/CVS
-rw-r--r--  1 woods    wheel        167 Sep 29 12:52 usr/src/gnu/CVS/Entries
^?

(FYI, the "(null)" artifact has been reported and is already fixed)

The times I've purposefully tried to reproduce the error by creating a
new directory, cd'ing into it, and then running "pax", I've never had
any problem -- instead the problem seems to sneak up on me and cause
trouble at the worst of times (e.g. when building release sets where
restarting the failed operation is "difficult").

I suppose I could write a shell script to try to recreate this scenario
repeatedly and see if I can't find some other correlating factor, but my
guess is it's some SMP locking problem deep in the kernel (vnode?) code.
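
A small C loop may exercise the same sequence more tightly than a shell
script would.  The following is only a sketch under stated assumptions:
the function name count_open_failures(), the "efault-test" directory
prefix, and the choice of base directory are all hypothetical and not
taken from pax; the only part that mirrors the report is the
open(".", O_RDONLY) made right after chdir() into a newly created
directory.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from pax): create `iterations` fresh
 * directories under `base`, chdir() into each one, and immediately call
 * open(".", O_RDONLY).  Returns the number of failed open() calls, or -1
 * if the test itself could not be set up.
 */
int
count_open_failures(const char *base, int iterations)
{
	char path[1024];
	int failures = 0;
	int home, fd, i;

	/* Remember the starting directory so we can return to it. */
	home = open(".", O_RDONLY);
	if (home == -1)
		return -1;

	for (i = 0; i < iterations; i++) {
		snprintf(path, sizeof(path), "%s/efault-test.%ld.%d",
		    base, (long)getpid(), i);
		if (mkdir(path, 0700) == -1 || chdir(path) == -1) {
			fprintf(stderr, "%s: %s\n", path, strerror(errno));
			(void)rmdir(path);	/* harmless if mkdir() failed */
			close(home);
			return -1;
		}

		/* The call under test: open the brand-new working directory. */
		fd = open(".", O_RDONLY);
		if (fd == -1) {
			failures++;
			fprintf(stderr, "iteration %d: open(\".\"): %s\n",
			    i, strerror(errno));
		} else {
			close(fd);
		}

		/* Step back out and remove the directory again. */
		if (fchdir(home) == -1) {
			close(home);
			return -1;
		}
		(void)rmdir(path);
	}
	close(home);
	return failures;
}
```

Calling something like count_open_failures("/tmp", 100000) from a
trivial main(), possibly from several processes at once to keep both
CPUs busy, would be one way to look for a correlating factor.  A return
of zero would not prove the problem absent, since the failure is rare.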

One other thing I've thought of is to instrument the open() system call
so that it can report more specific details via the console when this
particular scenario arises, but I'm not sure that would help either, or
what should be reported and how deep the diagnostic would have to delve.
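
Short of touching the kernel, a userland stand-in for that kind of
reporting could at least record what the process can see at the moment
of failure.  This is a hedged sketch only: open_cwd_verbose() is a
hypothetical name, it is not how pax opens the directory, and it shows
nothing about kernel-internal state such as vnode locks.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: try open(".", O_RDONLY); on failure, log the
 * errno together with what stat(".") and getcwd() report at that moment.
 * Returns the descriptor on success, or -1 with errno preserved.
 */
int
open_cwd_verbose(FILE *log)
{
	struct stat sb;
	char cwd[1024];
	int fd, saved;

	fd = open(".", O_RDONLY);
	if (fd != -1)
		return fd;
	saved = errno;	/* later calls may overwrite errno */

	fprintf(log, "open(\".\") failed: %s (errno %d), pid %ld\n",
	    strerror(saved), saved, (long)getpid());

	/* Does a different path-based call on "." fail the same way? */
	if (stat(".", &sb) == 0)
		fprintf(log, "  stat(\".\"): dev %lld ino %lld mode %o\n",
		    (long long)sb.st_dev, (long long)sb.st_ino,
		    (unsigned)sb.st_mode);
	else
		fprintf(log, "  stat(\".\") also failed: %s\n",
		    strerror(errno));

	/* Can the working directory still be resolved to a name? */
	if (getcwd(cwd, sizeof(cwd)) != NULL)
		fprintf(log, "  getcwd(): %s\n", cwd);
	else
		fprintf(log, "  getcwd() also failed: %s\n",
		    strerror(errno));

	errno = saved;
	return -1;
}
```

Whether stat() and getcwd() succeed right after the failed open() might
help narrow down where to add diagnostics in the kernel itself.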

If anyone can offer any suggestions of how to better track this down I'd
very much appreciate the help.

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>