Subject: Re: kern/34461: multiple problems; ioapic related?
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Bert Kiers <kiersb@xs4all.net>
List: netbsd-bugs
Date: 09/04/2006 18:02:19
The following reply was made to PR kern/34461; it has been noted by GNATS.

From: Bert Kiers <kiersb@xs4all.net>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org, kiersb@xs4all.net
Subject: Re: kern/34461: multiple problems; ioapic related?
Date: Mon, 4 Sep 2006 16:05:17 +0200

 On Mon, Sep 04, 2006 at 12:20:02PM +0000, Rui Paulo wrote:
 > The following reply was made to PR kern/34461; it has been noted by GNATS.
 > 
 > From: Rui Paulo <rpaulo@fnop.net>
 > To: gnats-bugs@NetBSD.org
 > Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
 > 	netbsd-bugs@netbsd.org
 > Subject: Re: kern/34461: multiple problems; ioapic related?
 > Date: Mon, 4 Sep 2006 13:16:53 +0100
 > 
 >  On Sep 4, 2006, at 1:35 AM, kiersb@xs4all.net wrote:
 >  
 >  >> Number:         34461
 >  >> Category:       kern
 >  >> Synopsis:       multiple problems; ioapic related?
 >  >> Confidential:   no
 >  >> Severity:       critical
 >  >> Priority:       high
 >  >> Responsible:    kern-bug-people
 >  >> State:          open
 >  >> Class:          sw-bug
 >  >> Submitter-Id:   net
 >  >> Arrival-Date:   Mon Sep 04 00:35:00 +0000 2006
 >  >> Originator:     kiersb@xs4all.net
 >  >> Release:        NetBSD 4.99.1
 >  >> Organization:
 >  > 	XS4All
 >  >> Environment:
 >  > 	
 >  > 	
 >  > System: NetBSD raidtest2 4.99.1 NetBSD 4.99.1 (GENERIC.MPACPI) #0:  
 >  > Wed Aug 16 17:48:50 CEST 2006 kiers@kleurtjes:/disk2/obj/sys/arch/ 
 >  > i386/compile/GENERIC.MPACPI i386
 >  > Architecture: i386
 >  > Machine: i386
 >  >> Description:
 >  > 	
 >  > While rsyncing 1.3 TB from an ufs file system to an lfs file system  
 >  > on same box, after around 200 MB:
 >  >
 >  > db{0}> bt
 >  > Xintr_lapic_ipi() at netbsd:Xintr_lapic_ipi+0x7
 >  > --- interrupt ---
 >  > Bad frame pointer: 0xc3077318
 >  > 0x30:
 >  >
 >  > then, after rebooting and issueing save rsync command, after 392 GB:
 >  >
 >  > sd5(isp0:0:21:0): unable to allocate scsipi_xfer
 >  >
 >  > and nothing special in kernel messages; after this sd5 message, the  
 >  > computer is *very* slow,
 >  >
 >  > root@raidtest2:~# uptime
 >  > 12:58AM  up  9:11, 2 users, load averages: 8.66, 8.59, 8.52
 >  > took around two minutes. Even typing 'uptime' on the serial console  
 >  > echoes around 1 char in two seconds.
 >  
 >  Can you monitor disk I/O with iostat at the same time you are doing a  
 >  copy ?
 
 
 before starting rsync, parity reconstruction is working on raid1:
 (37% complete)
 
 device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
 fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd1          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd2          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd3          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd4          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd5          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd7         16.00    674   0.61    10.54       0.00      0   0.61     0.00
 sd8         16.00    675   0.61    10.55       0.00      0   0.61     0.00
 sd9         16.00    674   0.56    10.54       0.00      0   0.56     0.00
 sd10        16.00    674   0.59    10.54       0.00      0   0.59     0.00
 sd11        16.00    673   0.56    10.52       0.00      0   0.56     0.00
 sd12        16.00    674   0.55    10.54       0.00      0   0.55     0.00
 sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid0        0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid1        0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00
 
 during root@raidtest2:/disk0/pub# rsync --size-only -r --progress /disk1/pub/* .:
 (i/o to disks sd0..sd5 is very bursty)
 
 device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
 fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd0         10.69    339   0.85     3.53      12.25    479   0.85     5.73
 sd1         12.37    259   0.75     3.13      13.78    422   0.75     5.68
 sd2         10.80    340   0.88     3.58      12.27    476   0.88     5.71
 sd3         13.20    249   0.71     3.20      14.31    407   0.71     5.69
 sd4         10.69    374   0.88     3.91      11.94    491   0.88     5.72
 sd5         13.20    241   0.63     3.10      14.35    408   0.63     5.72
 sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd7         16.00    593   0.61     9.27       0.00      0   0.61     0.00
 sd8         16.00    593   0.58     9.27       0.00      0   0.58     0.00
 sd9         16.00    593   0.57     9.27       0.00      0   0.57     0.00
 sd10        16.00    593   0.61     9.27       0.00      0   0.61     0.00
 sd11        16.00    591   0.61     9.24       0.00      0   0.61     0.00
 sd12        16.00    592   0.59     9.25       0.00      0   0.59     0.00
 sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid0        0.00      0   1.01     0.00      50.19    489   1.01    23.97
 raid1       64.00      4   0.01     0.25       0.00      0   0.01     0.00
 raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00
 
 
 now, after 20 GB done, system is very unresponsive, but nothing in messages or
 dmesg:
 (iostat -w 1, rsync and systat vmstat: no output, ssh to box time-out, but
 serial console is 'only' very very slow)
 
 device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
 fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd0         11.37    277   0.74     3.08      12.57    374   0.74     4.59
 sd1         11.88    257   0.65     2.99      13.07    362   0.65     4.63
 sd2         11.80    258   0.67     2.98      13.03    365   0.67     4.65
 sd3         11.24    271   0.81     2.98      12.58    377   0.81     4.63
 sd4         11.82    247   0.69     2.85      13.14    362   0.69     4.65
 sd5         11.74    256   0.70     2.94      12.98    362   0.70     4.59
 sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd7         16.00    530   0.50     8.28       0.00      0   0.50     0.00
 sd8         16.00    530   0.57     8.28       0.00      0   0.57     0.00
 sd9         16.00    530   0.57     8.28       0.00      0   0.57     0.00
 sd10        16.00    530   0.56     8.28       0.00      0   0.56     0.00
 sd11        16.00    530   0.51     8.28       0.00      0   0.51     0.00
 sd12        16.00    530   0.51     8.28       0.00      0   0.51     0.00
 sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid0        0.00      0   1.00     0.00      36.31    535   1.00    18.96
 raid1        0.00      0   0.00     0.00       0.00      0   0.00     0.00
 raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00
 
 systat vmstat at this moment:
 
     4 users    Load  9.34  9.10  8.04                  Mon Sep  4 15:39:01
 
 Proc:r  d  s  w     Csw    Trp    Sys   Int   Sof    Flt      PAGING   SWAPPING
         2 10       6435     23    117 13250   119     11      in  out   in  out
                                                         ops
   20.5% Sy   0.0% Us   0.0% Ni   3.3% In  76.2% Id    pages
 |    |    |    |    |    |    |    |    |    |    |
 ==========%%                                                              forks
                                                                           fkppw
            memory totals (in kB)           23237 Interrupts               fksvm
           real  virtual     free             100 cpu0 softclock           pwait
 Active 1227868  1227868     1620              13 cpu0 softnet             relck
 All    2052948  2052948   134672                 cpu0 softserial          rlkok
                                              100 cpu0 timer               noram
 Namei         Sys-cache     Proc-cache           FPU flush IPI            ndcpy
     Calls     hits    %     hits     %           FPU synch IPI            fltcp
         8        8  100                     2146 TLB shootdown I          zfod
                                                  cpu1 softnet             cow
 Disks: seeks xfers bytes %busy               100 cpu1 timer            64 fmin
    fd0                                           FPU flush IPI         85 ftarg
    ld0                                           FPU synch IPI     150021 itarg
    md0                                      2512 TLB shootdown I     1038 wired
    cd0                                           cpu2 softnet             pdfre
    sd0         685 8390K  76.9               100 cpu2 timer               pdscn
    sd1         669 8330K  72.6                   FPU flush IPI
    sd2         671 8349K  73.4                   FPU synch IPI
    sd3         679 8339K  73.8              2403 TLB shootdown I
    sd4         669 8296K  71.6                   cpu3 softnet
    sd5         669 8301K  72.2               100 cpu3 timer
    sd6                                           FPU flush IPI
    sd7         556 8896K  57.6                   FPU synch IPI
    sd8         556 8893K  58.0              2412 TLB shootdown I
    sd9         556 8893K  58.2                   ioapic0 pin 4
   sd10         556 8890K  58.6                   ioapic0 pin 6
   sd11         556 8893K  59.8             13238 ioapic2 pin 0
   sd12         556 8893K  61.4                13 ioapic2 pin 6
   sd13                                           ioapic1 pin 6
   sd14                                           ioapic0 pin 15
   sd15
  raid0         575   20M 100.0
  raid1
  raid2
 
 root@raidtest2:~# date
 Mon Sep  4 15:58:02 CEST 2006
 root@raidtest2:~# 
 
 took 1 minute, and see that vmstat is frozen for 20 minutes already.
 
 root@raidtest2:~# netstat -anfinet|grep EST
 tcp        0      0  194.109.0.22.22        194.109.0.97.63281     ESTABLISHED
 tcp        0      0  194.109.0.22.22        194.109.0.97.63327     ESTABLISHED
 tcp        0     80  194.109.0.22.22        194.109.0.97.63328     ESTABLISHED
 
 on the console took 30 seconds, just to see the connections are still there and now
 (16:00) suddenly the system starts going again; rsync, iostat and systat just continue
 
 >  
 >  >
 >  > dmesg available on http://dia.zepam.nl/pr-20060904-1-dmesg.txt
 >  >
 >  > Not production system, so I can test stuff.
 >  >> How-To-Repeat:
 >  > 	
 >  >> Fix:
 >  > 	
 >  >
 >  >> Unformatted:
 >  >  	
 >  >  	
 >  
 >  
 >  
 >  	-- Rui Paulo
 >  
 >  
 > 
 
 -- 
 Bert Kiers
 XS4All UNIX systeembeheerder, lockpicker & techno anarchist