tech-kern archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: Problems with hangs under NetBSD-5.x
Hello. I believe I can now reliably reproduce the problem. I believe
I can also say that the problem is ffs with quotas. I have a machine in
the hung state at the moment, and when I get some time, I'll extract a back
trace and file a pr on the issue. If it's a simple locking problem that
can be fixed by code inspection, I may even also have a patch.
What I figured out was that the only difference between the machines I
have that hang and those that don't is that on the ones that hang, quotas
are enabled and used on some of the filesystems. So, I wrote some scripts
to exercise the filesystem with quotas, by concurently creating, deleting
and writing to files in the filesystem at the same time. I do not know
exactly how long it took to reproduce the hang, but I can say it did
reproduce it, and that should give us some useful data.
So, if you're using quotas on an FFS filesystem, with softdep in my
case, but I don't think that's necessary, you're probably seeing the same
hang.
-Brian
On Jul 28, 8:18am, Sverre Froyen wrote:
} Subject: Re: Problems with hangs under NetBSD-5.x
} --Boundary-00=_niwbKnEpOJU/9Ux
} Content-Type: text/plain;
} charset="iso-8859-15"
} Content-Transfer-Encoding: 7bit
} Content-Disposition: inline
}
} Hi,
}
} I see similar hangs on a NetBSD 5.0 Web server. Initially I thought they
were
} related to WAPBL -- so I disabled that. Last night, however, the system was
} hung again. I now believe my hangs may be the same ones that Brian are
} seeing.
}
} Does this warrant a PR? The information is pretty limited. I've attached
the
} output from ps.
}
} See also comments below.
}
} Thanks,
} Sverre
}
} On Wed July 15 2009 14:21:40 Brian Buhrow wrote:
} > Hello. Under NetBSD-5, I've been seeing weird hangs on one of my
} > production systems.
} > Here's what I'm seeing:
} >
} > What I see is that the machines kernel runs fine. Processes which
} > don't write to the filesystem run fine. Processes which write to the
} > filesystem, either directly or indirectly via something like syslog
} > eventually hang. I'm not sure where they hang, because I don't have ddb in
} > this kernel and I can't let the machine sit around when it happens so I
} > can't get it into ddb and look. I thought it might be the lockf problem
} > which was recently fixed up in the NetBSD-5 branch, but I've got those
} > fixes, and the problem still occurs.
}
} I can access (tested using ls) all file systems except one. Processes that
} attempt access to the "hung" file system end up in state "D" with wchan
} "tstile".
}
} > Here are some data points:
} >
} > 1. My system is a NetBSD-5 system with netbsd-5 sources as of July 10,
} > 2009.
}
} My system is compiled from the netbsd-5-0 branch in May. Are there changes
} since then that might fix this?
}
} > 2. I have a number of machines running this code in production without a
} > problem.
} >
} > 3. this particular machine is doing a lot of web serving.
}
} Same here.
}
} > 4. It is a pair of hyperthreaded Intel cpus, so it shows up as 4 cpus, but
} > it's actually only 2.
}
} Same here.
}
} > 5. I'm not using WAPBL, in fact, I don't have it built in my kernel. I
} > am,however, using ffs with softdep.
}
} No softdep and I've now seen this with and without WAPBL.
}
} > 6. Processes which don't write to the filesystem, either directly or
} > indirectly such as via syslog, do not hang. Only processes which touch the
} > filesystem hang.
}
} Looks like reads are hanging too -- but perhaps it hangs on the meta data
} update (atime?). E.g., /bin/ls hangs with no output.
}
} > 7. On my system, it takes about 18-21 hours of normal operation for the
} > problem to show up.
}
} It takes much longer here. Last time it was OK for a month.
}
} > 8. I'm only seeing this behavior on one system. One difference between
} > it an the others running 5.x is that it has more logical CPUs than the
} > others.
}
} Did you attempt to turn off hyperthreading for this system before downgrading
} to NetBSD 4?
}
} > I've seen some reports that there may be some issues with WAPBL, but
} > since I don't have that as an option in my kernel, I don't think the
} > problem is simply a WAPBL issue.
}
} I have seen it with and without WAPBL.
}
} > Here is the dmesg for the afflicted machine, for what it's worth.
} >
} > Unfortunately, this is a production machine, so I can't do a lot of
} > testing with it. In fact, I hav to drop it back to 4.x today because it's
} > beginning to impact our operations. Under 4.x it ran continuously without
} > a hitch for over a year.
} >
} > Any thoughts or ideas on how to get to the bottom of this would be
} > greatly appreciated.
}
} Same here.
}
} <dmesg output deleted>
}
} --Boundary-00=_niwbKnEpOJU/9Ux
} Content-Type: text/plain;
} charset="ISO-8859-1";
} name="ps-hang"
} Content-Transfer-Encoding: 7bit
} Content-Disposition: attachment;
} filename="ps-hang"
}
} USER PID %CPU %MEM VSZ RSS TTY STAT STARTED TIME COMMAND
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
} root 0 0.0 1.5 0 31828 ? OKl 25Jun09 173:31.56 [system
0 0 0 0 0 0 0 31828 - OKl ? 173:31.56 [system
} root 1 0.0 0.0 2932 784 ? Is 25Jun09 0:00.02 init
0 1 0 43 85 0 2932 784 wait Is ? 0:00.02 init
} root 125 0.0 0.0 2932 852 ? Ss 25Jun09 0:04.01 /usr/sb
0 125 1 0 85 0 2932 852 kqueue Ss ? 0:04.01 /usr/sb
} root 238 0.0 0.3 5476 5672 ? Ds 25Jun09 1:29.55 /usr/sb
0 238 1 0 85 0 5476 5672 vm_map Ds ? 1:29.55 /usr/sb
} root 255 0.0 0.1 5936 1592 ? Is 25Jun09 0:00.01 /usr/sb
0 255 1 0 85 0 5936 1592 select Is ? 0:00.01 /usr/sb
} root 286 0.0 0.0 2900 916 ? Is 25Jun09 0:04.78 /usr/sb
0 286 1 0 85 0 2900 916 nanoslp Is ? 0:04.78 /usr/sb
} root 341 0.0 0.1 4796 1524 ? Ds 25Jun09 0:11.09 /usr/li
0 341 1 0 85 0 4796 1524 vm_map Ds ? 0:11.09 /usr/li
} postfix 347 0.0 0.1 4796 1716 ? I 25Jun09 0:04.43 qmgr -l
12 347 341 0 85 0 4796 1716 kqueue I ? 0:04.43 qmgr -l
} root 370 0.0 0.0 2972 980 ? Ss 25Jun09 0:00.04 /usr/sb
0 370 1 0 85 0 2972 980 kqueue Ss ? 0:00.04 /usr/sb
} postfix 2910 0.0 0.0 0 0 ? ZW - 0:00.00 (pickup
12 2910 341 0 0 0 0 0 - ZW ? 0:00.00 (pickup
} www 5046 0.0 0.2 22936 3864 ? D 3:18AM 0:00.00 /usr/pk
99 5046 13735 0 117 0 22936 3864 tstile D ? 0:00.00 /usr/pk
} www 5261 0.0 0.2 22936 3904 ? D 3:18AM 0:00.00 /usr/pk
99 5261 13735 0 117 0 22936 3904 tstile D ? 0:00.00 /usr/pk
} www 5869 0.0 0.1 17476 2512 ? I Sun12AM 0:00.00 /usr/pk
99 5869 13735 0 85 0 17476 2512 socket I ? 0:00.00 /usr/pk
} www 7013 0.0 0.2 22936 4636 ? D 10:05PM 0:02.02 /usr/pk
99 7013 13735 0 117 0 22936 4636 tstile D ? 0:02.02 /usr/pk
} www 7113 0.0 0.2 22936 3876 ? D 3:16AM 0:00.00 /usr/pk
99 7113 13735 0 117 0 22936 3876 tstile D ? 0:00.00 /usr/pk
} root 7351 0.0 0.0 2824 672 ? I 3:15AM 0:00.00 tee /va
0 7351 25824 0 85 0 2824 672 piperd I ? 0:00.00 tee /va
} www 7618 0.0 0.2 22936 4112 ? D 3:15AM 0:00.01 /usr/pk
99 7618 13735 0 117 0 22936 4112 tstile D ? 0:00.01 /usr/pk
} www 7695 0.0 0.8 33176 16864 ? D 2:09AM 0:00.65 /usr/pk
99 7695 13735 0 117 0 33176 16864 tstile D ? 0:00.65 /usr/pk
} www 8361 0.0 0.2 22936 3856 ? D 3:16AM 0:00.00 /usr/pk
99 8361 13735 0 117 0 22936 3856 tstile D ? 0:00.00 /usr/pk
} root 8763 0.0 0.1 4796 1808 ? I 3:17AM 0:00.01 postdro
0 8763 21466 0 85 0 4796 1808 netio I ? 0:00.01 postdro
} www 10348 0.0 0.2 22936 3896 ? D 3:17AM 0:00.00 /usr/pk
99 10348 13735 0 117 0 22936 3896 tstile D ? 0:00.00 /usr/pk
} root 11255 0.0 0.5 20248 10388 ? Ss 30Jun09 0:45.02 httpd:
0 11255 1 0 85 0 20248 10388 select Ss ? 0:45.02 httpd:
} www 11295 0.0 0.2 22936 3900 ? D 3:18AM 0:00.00 /usr/pk
99 11295 13735 0 117 0 22936 3900 tstile D ? 0:00.00 /usr/pk
} www 12468 0.0 0.2 22936 3880 ? D 3:16AM 0:00.00 /usr/pk
99 12468 13735 0 117 0 22936 3880 tstile D ? 0:00.00 /usr/pk
} www 12944 0.0 0.2 22936 3876 ? D 3:18AM 0:00.00 /usr/pk
99 12944 13735 0 117 0 22936 3876 tstile D ? 0:00.00 /usr/pk
} root 13735 0.0 0.5 22936 11280 ? Ss 30Jun09 0:51.75 /usr/pk
0 13735 1 0 85 0 22936 11280 select Ss ? 0:51.75 /usr/pk
} www 14441 0.0 0.2 22936 3844 ? D 3:18AM 0:00.00 /usr/pk
99 14441 13735 0 117 0 22936 3844 tstile D ? 0:00.00 /usr/pk
} root 15076 0.0 0.4 16152 8512 ? Ss 30Jun09 0:43.58 httpd:
0 15076 1 0 85 0 16152 8512 select Ss ? 0:43.58 httpd:
} www 15452 0.0 1.4 40444 30096 ? I 12:04AM 0:16.53 httpd:
99 15452 15076 0 85 0 40444 30096 socket I ? 0:16.53 httpd:
} www 16020 0.0 0.2 22936 3880 ? D 3:17AM 0:00.00 /usr/pk
99 16020 13735 0 117 0 22936 3880 tstile D ? 0:00.00 /usr/pk
} www 16837 0.0 0.1 20248 2560 ? I Sun12AM 0:00.00 httpd:
99 16837 11255 0 85 0 20248 2560 socket I ? 0:00.00 httpd:
} www 16847 0.0 0.2 22936 3864 ? D 3:19AM 0:00.00 /usr/pk
99 16847 13735 0 117 0 22936 3864 tstile D ? 0:00.00 /usr/pk
} www 17100 0.0 0.2 22936 3876 ? D 3:16AM 0:00.00 /usr/pk
99 17100 13735 0 117 0 22936 3876 tstile D ? 0:00.00 /usr/pk
} www 17472 0.0 0.2 22936 3844 ? D 3:18AM 0:00.00 /usr/pk
99 17472 13735 0 117 0 22936 3844 tstile D ? 0:00.00 /usr/pk
} www 17557 0.0 1.2 36404 25148 ? I 3:05AM 0:11.10 httpd:
99 17557 11255 0 85 0 36404 25148 socket I ? 0:11.10 httpd:
} www 17633 0.0 0.1 16152 2532 ? I Sun12AM 0:00.00 httpd:
99 17633 15076 0 85 0 16152 2532 socket I ? 0:00.00 httpd:
} www 17727 0.0 0.2 22936 3844 ? D 3:17AM 0:00.00 /usr/pk
99 17727 13735 0 117 0 22936 3844 tstile D ? 0:00.00 /usr/pk
} root 18172 0.0 0.0 2948 992 ? Is 3:15AM 0:00.00 /bin/sh
0 18172 21446 0 85 0 2948 992 wait Is ? 0:00.00 /bin/sh
} www 18578 0.0 0.2 22936 4116 ? D 3:15AM 0:00.01 /usr/pk
99 18578 13735 0 117 0 22936 4116 tstile D ? 0:00.01 /usr/pk
} www 18715 0.0 0.2 22936 3892 ? D 3:15AM 0:00.01 /usr/pk
99 18715 13735 0 117 0 22936 3892 tstile D ? 0:00.01 /usr/pk
} www 18717 0.0 0.2 22936 3856 ? D 3:15AM 0:00.01 /usr/pk
99 18717 13735 0 117 0 22936 3856 tstile D ? 0:00.01 /usr/pk
} root 18937 0.0 0.0 2948 996 ? Ss 5:47AM 0:00.00 sh -c p
0 18937 22389 0 85 0 2948 996 wait Ss ? 0:00.00 sh -c p
} www 19024 0.0 0.2 22936 3852 ? D 3:16AM 0:00.00 /usr/pk
99 19024 13735 0 117 0 22936 3852 tstile D ? 0:00.00 /usr/pk
} root 19154 0.0 0.1 5556 2388 ? Ds 5:00AM 0:00.02 ftpd: v
0 19154 370 0 117 0 5556 2388 tstile Ds ? 0:00.02 ftpd: v
} www 19208 0.0 0.2 22936 3844 ? D 3:15AM 0:00.00 /usr/pk
99 19208 13735 0 117 0 22936 3844 tstile D ? 0:00.00 /usr/pk
} www 19389 0.0 0.2 22936 3840 ? D 3:16AM 0:00.00 /usr/pk
99 19389 13735 0 117 0 22936 3840 tstile D ? 0:00.00 /usr/pk
} root 19481 0.0 0.4 28820 8244 ? D 3:15AM 0:00.43 /usr/lo
0 19481 24007 0 117 0 28820 8244 tstile D ? 0:00.43 /usr/lo
} root 19606 0.0 0.1 6260 1700 ? D 4:15AM 0:00.00 /usr/lo
0 19606 25934 0 117 0 6260 1700 tstile D ? 0:00.00 /usr/lo
} www 19777 0.0 1.1 36404 22408 ? I 3:11AM 0:05.43 httpd:
99 19777 11255 0 85 0 36404 22408 socket I ? 0:05.43 httpd:
} root 19818 0.0 0.0 2900 1036 ? I 4:15AM 0:00.00 cron: r
0 19818 286 0 85 0 2900 1036 piperd I ? 0:00.00 cron: r
} www 19889 0.0 1.1 33260 23132 ? D 2:45AM 0:06.37 httpd:
99 19889 15076 0 117 0 33260 23132 tstile D ? 0:06.37 httpd:
} www 19897 0.0 1.8 46644 37232 ? I 2:55AM 0:46.55 httpd:
99 19897 11255 0 85 0 46644 37232 socket I ? 0:46.55 httpd:
} www 20584 0.0 0.2 22936 3844 ? D 3:15AM 0:00.01 /usr/pk
99 20584 13735 0 117 0 22936 3844 tstile D ? 0:00.01 /usr/pk
} www 20646 0.0 0.2 22936 3844 ? D 3:15AM 0:00.01 /usr/pk
99 20646 13735 0 117 0 22936 3844 tstile D ? 0:00.01 /usr/pk
} www 20661 0.0 0.2 22936 3888 ? D 3:18AM 0:00.00 /usr/pk
99 20661 13735 0 117 0 22936 3888 tstile D ? 0:00.00 /usr/pk
} www 21336 0.0 0.8 32152 16184 ? D 2:09AM 0:00.40 /usr/pk
99 21336 13735 0 117 0 32152 16184 tstile D ? 0:00.40 /usr/pk
} www 21379 0.0 0.2 22936 3884 ? D 3:18AM 0:00.00 /usr/pk
99 21379 13735 0 117 0 22936 3884 tstile D ? 0:00.00 /usr/pk
} root 21446 0.0 0.1 2900 1072 ? I 3:15AM 0:00.00 cron: r
0 21446 286 0 85 0 2900 1072 piperd I ? 0:00.00 cron: r
} root 21466 0.0 0.1 4796 1828 ? I 3:17AM 0:00.01 sendmai
0 21466 21446 0 85 0 4796 1828 piperd I ? 0:00.01 sendmai
} www 21638 0.0 0.2 16152 3264 ? I 3:15AM 0:00.00 httpd:
99 21638 15076 0 85 0 16152 3264 socket I ? 0:00.00 httpd:
} www 21880 0.0 0.2 22936 3860 ? D 3:18AM 0:00.00 /usr/pk
99 21880 13735 0 117 0 22936 3860 tstile D ? 0:00.00 /usr/pk
} www 22139 0.0 0.2 22936 3884 ? D 3:15AM 0:00.01 /usr/pk
99 22139 13735 0 117 0 22936 3884 tstile D ? 0:00.01 /usr/pk
} www 22289 0.0 1.0 36248 20244 ? D 5:51AM 0:01.32 /usr/pk
99 22289 13735 0 117 0 36248 20244 vm_map D ? 0:01.32 /usr/pk
} www 22295 0.0 0.2 22936 3864 ? D 3:19AM 0:00.00 /usr/pk
99 22295 13735 0 117 0 22936 3864 tstile D ? 0:00.00 /usr/pk
} www 22318 0.0 0.2 22936 3880 ? D 3:15AM 0:00.01 /usr/pk
99 22318 13735 0 117 0 22936 3880 tstile D ? 0:00.01 /usr/pk
} root 22389 0.0 0.1 3396 1320 ? S 5:47AM 0:00.01 rshd -L
0 22389 370 0 85 0 3396 1320 select S ? 0:00.01 rshd -L
} www 22442 0.0 0.2 22936 3896 ? D 3:15AM 0:00.01 /usr/pk
99 22442 13735 0 117 0 22936 3896 tstile D ? 0:00.01 /usr/pk
} root 22511 0.0 0.1 2948 1112 ? I 3:15AM 0:00.01 /bin/sh
0 22511 25824 0 85 0 2948 1112 wait I ? 0:00.01 /bin/sh
} www 22995 0.0 0.2 22936 3856 ? D 3:16AM 0:00.00 /usr/pk
99 22995 13735 0 117 0 22936 3856 tstile D ? 0:00.00 /usr/pk
} root 23143 0.0 0.1 4796 1840 ? I 3:15AM 0:00.01 sendmai
0 23143 25824 0 85 0 4796 1840 piperd I ? 0:00.01 sendmai
} www 23158 0.0 0.2 22936 3844 ? D 3:17AM 0:00.00 /usr/pk
99 23158 13735 0 117 0 22936 3844 tstile D ? 0:00.00 /usr/pk
} root 23412 0.0 0.0 2900 1036 ? I 3:15AM 0:00.00 cron: r
0 23412 286 0 85 0 2900 1036 piperd I ? 0:00.00 cron: r
} www 23723 0.0 0.2 22936 3864 ? D 3:18AM 0:00.00 /usr/pk
99 23723 13735 0 117 0 22936 3864 tstile D ? 0:00.00 /usr/pk
} root 23815 0.0 0.0 2952 848 ? O 5:47AM 0:00.00 ps -axu
0 23815 18937 0 43 0 2952 848 - O ? 0:00.00 ps -axu
} www 23859 0.0 0.2 22936 3900 ? D 3:17AM 0:00.00 /usr/pk
99 23859 13735 0 117 0 22936 3900 tstile D ? 0:00.00 /usr/pk
} root 24007 0.0 0.0 2948 1000 ? I 3:15AM 0:00.00 /bin/sh
0 24007 18172 0 85 0 2948 1000 wait I ? 0:00.00 /bin/sh
} www 24092 0.0 0.2 22936 3860 ? D 3:18AM 0:00.00 /usr/pk
99 24092 13735 0 117 0 22936 3860 tstile D ? 0:00.00 /usr/pk
} www 24211 0.0 0.2 22936 3848 ? D 3:18AM 0:00.00 /usr/pk
99 24211 13735 0 117 0 22936 3848 tstile D ? 0:00.00 /usr/pk
} www 24285 0.0 0.2 22936 3840 ? D 3:16AM 0:00.00 /usr/pk
99 24285 13735 0 117 0 22936 3840 tstile D ? 0:00.00 /usr/pk
} www 24290 0.0 0.2 22936 3872 ? D 3:18AM 0:00.00 /usr/pk
99 24290 13735 0 117 0 22936 3872 tstile D ? 0:00.00 /usr/pk
} www 24609 0.0 0.2 22936 3848 ? D 3:18AM 0:00.00 /usr/pk
99 24609 13735 0 117 0 22936 3848 tstile D ? 0:00.00 /usr/pk
} www 24648 0.0 0.2 22936 3864 ? D 3:19AM 0:00.00 /usr/pk
99 24648 13735 0 117 0 22936 3864 tstile D ? 0:00.00 /usr/pk
} www 25610 0.0 0.2 22936 3856 ? D 3:16AM 0:00.00 /usr/pk
99 25610 13735 0 117 0 22936 3856 tstile D ? 0:00.00 /usr/pk
} www 25624 0.0 0.2 22936 3884 ? D 3:18AM 0:00.00 /usr/pk
99 25624 13735 0 117 0 22936 3884 tstile D ? 0:00.00 /usr/pk
} www 25803 0.0 0.2 22936 4508 ? D 2:46AM 0:00.05 /usr/pk
99 25803 13735 0 117 0 22936 4508 tstile D ? 0:00.05 /usr/pk
} root 25824 0.0 0.0 2948 1028 ? Is 3:15AM 0:00.00 /bin/sh
0 25824 23412 0 85 0 2948 1028 wait Is ? 0:00.00 /bin/sh
} root 25934 0.0 0.0 2948 992 ? Is 4:15AM 0:00.00 /bin/sh
0 25934 19818 0 85 0 2948 992 wait Is ? 0:00.00 /bin/sh
} root 26352 0.0 0.0 2896 980 ? D 3:15AM 0:00.75 find /
0 26352 22511 0 117 0 2896 980 tstile D ? 0:00.75 find /
} www 26664 0.0 0.2 22936 3880 ? D 3:17AM 0:00.00 /usr/pk
99 26664 13735 0 117 0 22936 3880 tstile D ? 0:00.00 /usr/pk
} root 26847 0.0 0.1 4796 1808 ? I 3:15AM 0:00.01 postdro
0 26847 23143 0 85 0 4796 1808 netio I ? 0:00.01 postdro
} www 27657 0.0 0.2 22936 3860 ? D 3:17AM 0:00.00 /usr/pk
99 27657 13735 0 117 0 22936 3860 tstile D ? 0:00.00 /usr/pk
} www 27802 0.0 1.0 38296 21644 ? D 2:12AM 0:00.73 /usr/pk
99 27802 13735 0 117 0 38296 21644 tstile D ? 0:00.73 /usr/pk
} root 369 0.0 0.0 2948 876 ttyE0- I 25Jun09 0:00.01 /bin/sh
0 369 1 0 85 0 2948 876 wait I ttyE0- 0:00.01 /bin/sh
} mysql 429 0.0 1.9 124556 38976 ttyE0- Il 25Jun09 631:30.41 /usr/pk
98 429 369 0 85 0 124556 38976 select Il ttyE0- 631:30.41 /usr/pk
} root 524 0.0 0.0 2912 868 ttyE0 Is+ 25Jun09 0:00.00 /usr/li
0 524 1 0 85 0 2912 868 ttyraw Is+ ttyE0 0:00.00 /usr/li
} root 419 0.0 0.0 2912 868 ttyE1 Is+ 25Jun09 0:00.00 /usr/li
0 419 1 0 85 0 2912 868 ttyraw Is+ ttyE1 0:00.00 /usr/li
} root 470 0.0 0.0 2912 868 ttyE2 Is+ 25Jun09 0:00.00 /usr/li
0 470 1 0 85 0 2912 868 ttyraw Is+ ttyE2 0:00.00 /usr/li
} root 428 0.0 0.0 2912 868 ttyE3 Is+ 25Jun09 0:00.01 /usr/li
0 428 1 0 85 0 2912 868 ttyraw Is+ ttyE3 0:00.01 /usr/li
}
} --Boundary-00=_niwbKnEpOJU/9Ux--
>-- End of excerpt from Sverre Froyen
Home |
Main Index |
Thread Index |
Old Index