Current-Users archive


Re: Severe netbsd-6 NFS server-side performance issues



        Hello.  From the description, it sounds like the amr(4) driver is
really getting wedged somewhere and this is what's causing your problem.  The
question is whether the driver is the problem, the firmware on the raid
device or just the combination of the two.  The amr driver puts a number of
transactions in flight between itself and the raid controller depending on
what it thinks the raid controller can handle.  It sounds like what's
happening is that, over time, the capacity of the raid controller is
leaking away due to bugs in the firmware or some interaction between the
firmware and the amr driver itself.  Converting to FreeBSD might do the
trick for you, but, then again, it might not.  I think what I would suggest
is looking at the changes the FreeBSD folks have made to the amr driver
since the two drivers diverged and see if you can figure out which changes
may account for the behavior you're seeing now.  The fact that it worked
well for several weeks and then stopped working really makes me think the
problem is with the raid device rather than the amr(4) driver itself.
However, having said that, it may be that the FreeBSD driver knows about
the proclivities of this particular device and can work around its oddities
much better.  If you're willing to pore over driver code and cvs/subversion
logs for the two drivers, I'm guessing you'll find the relatively minor
change required to make things work again with your current setup.  I've
done this with a number of NetBSD drivers, including wm(4) and pdcsata(4),
with excellent results!  If nothing else, you may be able to enable some
debugging that tells you what's going wrong and gives you an idea of how to
fix it.  I take it you have no test raid array to work on between
maintenance windows?
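
To make the "leaking capacity" idea concrete, here is a minimal sketch in
Python.  It is a toy model, not the amr(4) code: SlotPool, run and
lose_every are hypothetical names, and a "lost completion" simply stands in
for whatever firmware or driver bug might be eating the slots.

```python
from dataclasses import dataclass, field


@dataclass
class SlotPool:
    """Toy model of a driver's pool of in-flight command slots.

    Hypothetical illustration only; this is not how amr(4) is written.
    """
    total: int  # slots the driver believes the controller can handle
    in_flight: set = field(default_factory=set)
    next_id: int = 0

    def submit(self):
        """Claim a slot for a new command, or return None if all are busy."""
        if len(self.in_flight) >= self.total:
            return None
        cmd = self.next_id
        self.next_id += 1
        self.in_flight.add(cmd)
        return cmd

    def complete(self, cmd):
        """Release the slot held by a command the controller finished."""
        self.in_flight.discard(cmd)

    def free_slots(self):
        return self.total - len(self.in_flight)


def run(pool, rounds, lose_every):
    """Fill the pool each round, then complete the submitted commands.

    Every `lose_every`-th completion is dropped (0 means none are dropped).
    Returns the number of free slots after each round.
    """
    history = []
    submitted = 0
    for _ in range(rounds):
        batch = []
        while True:
            cmd = pool.submit()
            if cmd is None:
                break
            batch.append(cmd)
        for cmd in batch:
            submitted += 1
            if lose_every and submitted % lose_every == 0:
                continue  # completion never arrives; the slot stays claimed
            pool.complete(cmd)
        history.append(pool.free_slots())
    return history


if __name__ == "__main__":
    print(run(SlotPool(total=8), rounds=6, lose_every=5))
```

With no lost completions the pool stays at full capacity.  With even an
occasional loss the free-slot count never recovers, which would be
consistent with a machine that runs well for weeks and then crawls until
the controller is power-cycled.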

-thanks
-Brian
On Jun 4,  6:06pm, Hauke Fath wrote:
} Subject: Re: Severe netbsd-6 NFS server-side performance issues
} Friday's server maintenance window resulted in more information, but not
} more clarity.
} 
} At 9:30 +0200 on 31.05.2012, Hauke Fath wrote:
} >>Or is nfsd really trashing the system?
} >>It could be an nfsd regression too.
} >
} >The load certainly goes away when I turn off nfsd.  ;)
} >
} >Maybe I should do that, then run bonnie to check whether local disk
} >bandwidth has changed.
} 
} That's what I started with: Switched nfsd off, then ran bonnie++ on the
} RAID with results like those I got in single-user. My take: As long as nfsd
} stays out of the way, the machine is fine (but see below).
} 
} Next, I restricted nfs access to one (NetBSD) client machine, mounted a
} share there, and ran bonnie++ on it. 'systat vmstat' on the server gave
} me > 30 MBytes/sec. I tried from an Ubuntu 10 client - same result, more or
} less.
} 
} In the end, I re-enabled nfs access on the server, set up a 4-way bonnie run
} on an Ubuntu client, and left for the weekend.(*)
} 
} Checking back on Saturday, I found most of the processes on the server in
} D. The console had a lone
} 
} login: amr0: bad status (not active; 0x040)
} 
} and "systat vmstat" gave 100% disk bandwidth at 100 KBytes/sec.
} 
} I got a 'ps axl', but the serial console truncates output after the 80th
} column, even when you re-direct the output to a file, and most of the
} daemons start with /usr. grep(1) didn't work.
} 
} A reboot got stuck, I broke into the debugger and got
} <http://la.causeuse.org/hauke/NetBSD/netbsd-6-nfsd/ddb-venediger-nfsd.out.gz>.
} The "reboot 0x04" wedged the machine solidly.
} 
} A coworker reset the server on Monday morning, but found he had to actually
} power it down to un-wedge the MegaRAID. Now, the machine is back to its
} usual 10 MBytes/sec / 100% I/O / load 10 state.
} 
} What next? At this point, I am seriously contemplating giving FreeBSD a
} try. Their amr(4) appears to have seen a lot of updates that went past the
} NetBSD counterpart, and they ship nfs4 support, too ...
} 
}       hauke
} 
} 
} (*) I had disabled my "reboot the machine if nfsd found in D for more than
} 60 sec" script at this point.
} 
} 
} -- 
}      The ASCII Ribbon Campaign                    Hauke Fath
} ()     No HTML/RTF in email            Institut für Nachrichtentechnik
} /\     No Word docs in email                     TU Darmstadt
}      Respect for open standards              Ruf +49-6151-16-3281
>-- End of excerpt from Hauke Fath



