Subject: Re: 1.6 woes
To: Jukka Marin <jmarin@pyy.jmp.fi>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: port-sparc
Date: 07/29/2002 17:56:34
On Sun, Jul 28, 2002 at 11:17:21PM +0300, Jukka Marin wrote:
> On Sun, Jul 28, 2002 at 09:09:30PM +0300, Jukka Marin wrote:
> > It happened at least twice or thrice on one machine.  I'm trying it on
> > another Sparc now.  cc1 has used 91 minutes of CPU time and it's still
> > running.  Must be something wrong with gcc or the source or NetBSD...
> > 
> > CPU states:  4.5% user,  0.0% nice, 94.6% system,  0.9% interrupt,  0.0% idle
> 
> Getting worse..
> 
> CPU states:  2.7% user,  0.0% nice, 96.4% system,  0.9% interrupt,  0.0% idle
> Memory: 27M Act, 12M Inact, 132K Wired, 1800K Exec, 12M File, 1208K Free
> Swap: 128M Total, 5472K Used, 123M Free
> 
>   PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
> 25046 root      61    0    22M   22M RUN      215:36 90.72% 90.72% cc1

Well, I set up kernel profiling. With an almost idle machine the kernel
time is already high: 12% system just for running top through an ssh session:
load averages:  0.23,  0.64,  0.66                                     17:47:38
23 processes:  22 sleeping, 1 on processor
CPU states:  0.9% user,  0.0% nice, 12.8% system,  0.5% interrupt, 85.8% idle
Memory: 16M Act, 1936K Wired, 2336K Exec, 9956K File, 37M Free
Swap: 31M Total, 31M Free

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
  217 bouyer    31    0   200K  952K CPU        0:30  5.66%  5.66% top
  209 bouyer     2    0   464K 1792K select     0:20  3.76%  3.76% sshd
  160 root      18  -12   776K 1788K pause      0:33  3.22%  3.22% ntpd
[all others below at 0% CPU]
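
(For the record, this is the standard kgmon(8)/gprof(1) workflow on a kernel
built with config's -p option; the steps below are from memory, so check the
man pages before relying on them:)

  config -p MYKERNEL        # profiled objects go into ../compile/MYKERNEL.PROF
  # build, install and boot the profiled kernel, then:
  kgmon -b                  # start collecting samples
  # ... let the workload run ...
  kgmon -h ; kgmon -p       # stop collecting and dump ./gmon.out
  gprof /netbsd gmon.out    # post-process against the running kernel image
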
The profiling results show:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 85.11     27.67    27.67                             idle
  9.38     30.72     3.05    26880   113.47   113.47  sun4_vcache_flush_page
  2.18     31.43     0.71                             Lgot_result
  0.34     31.54     0.11      331   332.33   332.33  pmap_count_ptes
  0.31     31.64     0.10    16509     6.06     6.06  pvhead
  0.25     31.72     0.08      426   187.79  7523.74  me_alloc
  0.22     31.79     0.07                             sparc_interrupt44c
  0.18     31.85     0.06     3874    15.49    15.49  callout_reset
  0.15     31.90     0.05     1806    27.69    27.69  memcpy
  0.15     31.95     0.05      508    98.43  6407.70  mmu_pagein
  0.12     31.99     0.04     3308    12.09    27.58  nfs_timer
  0.09     32.02     0.03     3323     9.03    44.74  softclock
  0.09     32.05     0.03     3308     9.07    48.75  hardclock
  0.09     32.08     0.03      331    90.63   504.72  fill_kproc2
  0.09     32.11     0.03       12  2500.00  2500.00  sun4_vcache_flush_segment
  0.09     32.14     0.03                             Lbcopy_doubles
  0.06     32.16     0.02     2442     8.19     8.19  pv_syncflags4_4c
  0.06     32.18     0.02     1775    11.27    11.27  lockmgr
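
(A note on reading the flat profile: idle takes 85% of the samples, which
matches the 85.8% idle reported by top above, so the sampling looks sane.
Of the remaining ~15%, sun4_vcache_flush_page alone accounts for 9.4%, i.e.
well over half of all non-idle kernel time.  Also note me_alloc: 187.79
us/call self but 7523.74 us/call total, so nearly all of its time goes to
its callees; mmu_pagein looks similar (98 vs 6408 us/call) and presumably
sits above me_alloc in the call graph, which is not shown here.)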

If I read things properly, the killer is sun4_vcache_flush_page(), and
it seems to be called from me_alloc() most of the time.
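
To check that the numbers are consistent with that reading, here is a quick
back-of-the-envelope model (plain userland C, not the real
sys/arch/sparc/sparc/pmap.c; PGS_PER_SEG and the stub names are my own
placeholders, assuming 64 4 KB pages per 256 KB segment as on a sun4c): if
every one of the 426 me_alloc calls ends up stealing a PMEG and flushing the
victim segment one page at a time, that gives about 27000 page flushes,
which is the right ballpark for the 26880 sun4_vcache_flush_page calls in
the profile.

/*
 * Back-of-the-envelope check only -- not kernel code, and not the real
 * pmap.  Model: every me_alloc() that steals a PMEG flushes the victim
 * segment from the virtually indexed cache one page at a time, i.e. one
 * sun4_vcache_flush_page() call per page.
 */
#include <stdio.h>

#define PGS_PER_SEG	64	/* assumed: 4 KB pages in a 256 KB segment */
#define ME_ALLOC_CALLS	426	/* from the flat profile above */

static unsigned long nflush;

static void
flush_page_stub(void)		/* stand-in for sun4_vcache_flush_page() */
{
	nflush++;
}

static void
steal_pmeg_stub(void)		/* stand-in for the PMEG-stealing path of me_alloc() */
{
	int i;

	for (i = 0; i < PGS_PER_SEG; i++)
		flush_page_stub();
}

int
main(void)
{
	int i;

	for (i = 0; i < ME_ALLOC_CALLS; i++)
		steal_pmeg_stub();
	/* prints 27264 -- same ballpark as the 26880 flushes sampled */
	printf("modelled flushes: %lu\n", nflush);
	return 0;
}

Of course the flat profile alone doesn't prove who the caller is; the ratio
just makes the call-graph attribution to me_alloc() plausible.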

To put it more plainly: it looks like my patch is what's causing this
excessive system time.

I'm doing another run of profiling with a kernel compile running.

--
Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
--