netbsd-users: Re: /var CORRUPTED : remote fix ?

Subject: Re: /var CORRUPTED : remote fix ?
To: NetBSD Users <netbsd-users@netbsd.org>
From: Asmodehn Shade <asmodehn@free.fr>
List: netbsd-users
Date: 07/15/2005 02:01:52

Thank you everyone for all the answers.

I have managed to fix this, although not the way I expected :-) And I
now understand the whole story. Here is what happened, in chronological
order (writing it up also helps me review it, so I may as well share
the experience):

0 ) On my home server I had already installed, among other things,
apache, postfix with cyrus-imapd, and jabberd. Apache, postfix and cyrus
were running; jabberd wasn't. All the logs were going to /var, and my
IMAP folders were also in /var (a bad idea, I guess, when /var is only
64 MB).
My NetBSD was 2.0.1_STABLE and /var was mounted with softupdates
(NB: / wasn't).
I was connected over ssh during everything that follows.

1 ) I managed to accidentally DoS my old apache (pkgsrc 2005-Q1) with a
dumb download manager (and maybe some router oddities as well), so I
decided to upgrade to the latest one asap. Because this had filled my
logs (along with some Code Red II attack), my mailbox was full even
though I did not have many mails in it... Quite upset :-/ I decided to
update everything.

2 ) Connect through ssh, grab the latest src / xsrc (release-2-0) /
pkgsrc (2005 Q2). Build my kernel, reboot, build userland, merge /etc,
reboot. Thanks to my handy script ^^ this was done pretty quickly (at
least for me, not for my PC). Reboot and everything works fine. Clean
the http log, and create a dummy file to prevent this Code Red II side
effect. OK, space is back in my mailbox :-)

3 ) Now the pkgsrc update. A pkg_chk -c showed me which packages were
out of date. Then I needed to install the latest pkg_install package to
be able to update my packages. I updated and tested the "critical"
packages (apache, postfix, cyrus-imap) one by one. Then I felt very
confident about this update (read: far too confident), and decided, as
usual, to do a quick full update of all the other packages inside a
screen session:
 > pkg_chk -u

4 ) Then "shit happened"... I found out that the standard "make update"
procedure had changed: in the dependency graph, when a node is rebuilt,
all its parents seem to be updated as well. And the update is still the
usual deinstall / build / install, so if something goes wrong there is
a greater chance of ending up with many missing packages. And so my
jabber (which was NOT in my pkgchk.conf, btw) was updated, rebuilt and
installed, and I ended up with a DB that was now too big for my /var...
mounted with softupdates, as a reminder. NB: it could be because of
jabber or because of any other package; I just spotted jabber because
it was installing when I reattached my screen.
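
A minimal sketch of a pre-flight check before such a bulk update,
assuming a POSIX sh. The names free_kb, guarded_update and min_kb, and
the 20 MB threshold, are made up for illustration. The -n flag is, as
far as the pkg_chk manual describes it, a dry run that only lists the
actions it would take; check it on your own version before relying on
it.

```shell
#!/bin/sh
# Print the available space, in KB, on the filesystem holding "$1".
# df -P selects the POSIX output format (one line per filesystem);
# column 4 of that format is the "Available" figure.
free_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical guard: refuse to start a bulk update when /var is low.
# min_kb is an arbitrary example threshold, not a recommended value.
guarded_update() {
    min_kb=${1:-20480}
    avail=$(free_kb /var)
    if [ "$avail" -lt "$min_kb" ]; then
        echo "only ${avail} KB free on /var, not updating" >&2
        return 1
    fi
    # Dry run (assumed flag): show what would be rebuilt, change nothing.
    pkg_chk -u -n
}
```

Calling guarded_update first and reading its list would at least have
shown that jabber was about to be rebuilt.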

5 ) I spent a lot of time cleaning up this mess: shutting down all the
running servers, commenting out their rc.conf entries in case of a
reboot, reinstalling the packages one by one. Once the installed
packages were fine and jabber was removed (I will have a look at that
later...), I found out that even though my /var had some free space, my
mailbox wasn't working, and syslogd began to complain with a
"/var : file system full" error, along with some other stuff. I tried
an fsck -f /var, which gave me the results in my last post. Then I
planned to unmount it to run a full "fsck -fpy /dev/sd0f".

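When a small /var reports "file system full", it helps to see what is
actually using the space. A rough sketch, assuming a POSIX sh; the
function name biggest_in is made up for illustration:

```shell
#!/bin/sh
# List the ten largest entries directly under a directory, sizes in KB,
# largest first. Entries that cannot be read are silently skipped.
biggest_in() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n 10
}
```

Running "df -k /var" followed by "biggest_in /var" shows how full the
filesystem is and which directories (logs, mail spool, package
databases) account for it. If df and du disagree badly, as they did
here, the free-block counts themselves may be wrong, and only fsck can
settle that.
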
6 ) Then more problems: umount /var hung my server. I had to get
someone to reboot it. I thought this "hot" reboot would force an fsck
on restart and fix the problem. But the problem was still there, so I
tried umount -f /var, which hung it again (or so I thought; maybe it
was just taking far too long, or maybe it was because of ssh, I don't
know). Get someone at home -> reboot again. Then I posted on
netbsd-users. After all the answers I got, I tried "cp -RPrf /var
/var.new" -> hung again -> get someone -> reboot.

7 ) Then more problems: I could not connect anymore, no ping, no ssh,
nothing... -> get someone (fearing the worst: I am in NZ, the server is
in FR, and I don't want to go back for a hard drive...) -> quick
analysis: the reboot didn't complete (zzz-like screen). Then I
understood that when I said "reboot", the guy over there was pushing
the reset/power button, which I had set up (hardware, apm, apmd,
powerd, I can't remember how, but it looks like I didn't use acpi) to
gently halt or reboot the system. Or maybe it hadn't rebooted at all
because I had disabled the button in hardware, but I think the guy
would have noticed that... anyway.

8 ) So that is surely why my disk problem was not fixed yet. Let's
"hot-unplug" it. The restart took longer than before (good sign), then
I finally got ssh back and could run fsck -f -> FIXED, no more problems
on the disk!!! Then I restarted all the servers one by one and tested
them. Everything looked fine except cyrus-imapd (cyradm unable to
connect to localhost). OK, let's look at the configuration to see
whether we can put that somewhere other than /var. I chose to create a
/usr/local/var because I have a lot of space there. "make update",
"/etc/rc.d/cyrus start". Fine, I managed to create the mailboxes and
set everything up the same as before.

End )
Now everything works fine, and I will remember to pay close attention
to which packages write to /var.
I haven't lost any data except my old mails, and my mailbox was
unavailable for a few days because I DIDN'T KNOW / THINK ABOUT ADDING
fsck_flags=-y to /etc/rc.conf. I should have guessed, though...
I will also remember not to trust softupdates for /var. I think my root
filesystem (/) wasn't softupdated because I was aware of some kind of
bug...
I will also remember to be extremely careful with "make update" from
now on; I have already seen some posts about it.
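
For reference, the setting suggested in the quoted reply below is a
single line in /etc/rc.conf. This is only a sketch of that fragment.
Whether the boot-time check honours it depends on the NetBSD version,
as the reply notes, and -y answers "yes" to every repair question, so
fsck may discard data on a badly damaged filesystem.

```shell
# /etc/rc.conf (fragment)
# Extra flags passed to fsck by the boot-time filesystem check.
# -y: assume "yes" for every repair prompt, so no console is needed.
fsck_flags="-y"
```

With this in place, a machine that nobody can reach on the console has
a chance of repairing /var by itself on the next reboot.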

So I feel much better now :-) BUT I have to say that a tool to
dynamically move and resize partitions and slices would be really
helpful sometimes...

OK, this has been useful, for me at least. I have changed my CVS branch
to release-2 now, and I saw some differences from release-2-0. I hope
this was a good idea; I am going to update my OS tomorrow :-) Time to
sleep now.

I also hope this story makes sense to you and will be helpful to
someone out there. I was actually thinking about adding a NetBSD blog
to my website ^^ but I have far too many things to do for now...

Thank you again, everyone, for my beloved OS, which I understand a
little more every day :-)
--
Asmodehn

Christos Zoulas wrote:

>In article <e0cef0cc050711235222200375@mail.gmail.com>,
>Asmodehn Shade  <asmodehn@gmail.com> wrote:
>  
>
>>Hi all,
>>
>>I just had a weird network problem on a remote server, and after all
>>the update I have done, my var got corrupted. Here is the "fsck -f
>>/var" result :
>>
>>** /dev/rsd0f (NO WRITE)
>>** Last Mounted on /var
>>** Phase 1 - Check Blocks and Sizes
>>INCORRECT BLOCK COUNT I=12103 (2 should be 0)
>>CORRECT? no
>>
>>INCORRECT BLOCK COUNT I=12229 (2 should be 0)
>>CORRECT? no
>>
>>INCORRECT BLOCK COUNT I=12276 (2 should be 0)
>>CORRECT? no
>>
>>** Phase 2 - Check Pathnames
>>** Phase 3 - Check Connectivity
>>** Phase 4 - Check Reference Counts
>>** Phase 5 - Check Cyl groups
>>FREE BLK COUNT(S) WRONG IN SUPERBLK
>>SALVAGE? no
>>
>>SUMMARY INFORMATION BAD
>>SALVAGE? no
>>
>>BLK(S) MISSING IN BIT MAPS
>>SALVAGE? no
>>
>>2793 files, 42878 used, 21890 free (346 frags, 2693 blocks, 0.5% fragmentation)
>>
>>How can I fix that remotely (ssh) ? This computer is overseas, and
>>there is no way for me to get there and use the single user mode...
>>
>>While /var is mounted I am unable to fix it as fsck access only to
>>/dev/rsd0f which is read only. And when I try to unmount /var my
>>systems hangs, and I have to get someone overseas to push the power
>>button... and the raw reboot doesn't seem to fix that..
>>
>>Please help! I don't really know what to do...
>>    
>>
>
>There are many ways to accomplish this. The easiest one is to add
>fsck_flags=-y to /etc/rc.conf. If that does not work because your
>NetBSD is too old, then umount -f /var, and then fsck -y it.
>
>christos
>
>  
>

