Subject: amd64 low-memory freezes -- possible culprit?
To: None <fvdl@netbsd.org, port-amd64@netbsd.org>
From: Christopher SEKIYA <sekiya@netbsd.org>
List: port-amd64
Date: 05/29/2004 11:03:13
I've been digging into the "amd64 freezes for many seconds under low memory
conditions" problem, and I think I've got a line on it.

The test system has 3/4 GB (768 MB) of RAM, /tmp is an mfs, and it is running
a profiling kernel.  I've filled up the file buffer such that top reports:

  Memory: 457M Act, 234M Inact, 4380K Wired, 16M Exec, 590M File, 13M Free

(numbers fudged a bit as the above was copied a bit after the test -- the
"13M Free" should be around 2048k)

Let's see what happens when we run it out of memory:

  [10:51:17] monkey:/$ kgmon -b; dd if=/dev/zero of=/tmp/s6 bs=1024k count=20; kgmon -h; kgmon -p; kgmon -r
  kgmon: kernel profiling is running.
  /tmp: write failed, file system is full
  dd: /tmp/s6: No space left on device
  11+0 records in
  10+0 records out
  10485760 bytes transferred in 33.624 secs (311853 bytes/sec)
  kgmon: kernel profiling is off.
  kgmon: kernel profiling is off.
  kgmon: kernel profiling is off.

(The "file system is full" message can be ignored; the point is that the
system paused for about half a minute trying to reorganize itself.)

gprof says:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 63.95      7.70     7.70 1002719489     0.00     0.00  pmap_pdes_valid
 35.38     11.96     4.26      380    11.21    31.47  pmap_do_remove
  0.33     12.00     0.04     4775     0.01     0.01  copyout
  0.17     12.02     0.02     1760     0.01     0.01  copyin
  0.08     12.03     0.01    16689     0.00     0.00  pmap_clear_attrs
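
A quick sanity check of those figures, as a rough sketch.  It assumes that
essentially all pmap_pdes_valid calls are made on behalf of pmap_do_remove,
which the flat profile suggests but does not prove; the variable names are
only for illustration.

```python
# Figures copied from the gprof flat profile above.
pdes_valid_calls = 1_002_719_489
pdes_valid_self_s = 7.70
do_remove_calls = 380
do_remove_self_ms = 11.21    # self ms/call
do_remove_total_ms = 31.47   # total ms/call (self + descendants)

# Assumption: every pmap_pdes_valid call is made on behalf of pmap_do_remove.
calls_per_remove = pdes_valid_calls / do_remove_calls

# Time gprof attributes to pmap_do_remove's descendants, in seconds.
child_s = (do_remove_total_ms - do_remove_self_ms) * do_remove_calls / 1000.0

# Average cost of one pmap_pdes_valid call, in nanoseconds.
ns_per_check = pdes_valid_self_s / pdes_valid_calls * 1e9

print(f"pmap_pdes_valid calls per pmap_do_remove: {calls_per_remove:,.0f}")
print(f"descendant time of pmap_do_remove: {child_s:.2f} s")
print(f"average cost per pmap_pdes_valid call: {ns_per_check:.1f} ns")
```

The descendant time of pmap_do_remove comes out at about 7.70 seconds, which
matches the self time of pmap_pdes_valid, so each individual check is cheap
and the cost is in the roughly 2.6 million checks per pmap_do_remove call.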

In contrast, a test case that _doesn't_ pause results in the following
gprof output:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 51.52      0.17     0.17    20575     0.01     0.01  copyout
 36.36      0.29     0.12     6860     0.02     0.02  copyin

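pmap_pdes_valid and pmap_do_remove together account for over 99% of the
profiled time, with pmap_pdes_valid called about a billion times, so I'm
guessing that the code that implements the four-level page table is causing
the hangs.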
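
To illustrate how a four-level walk could add up like that, here is a toy
model.  It is not the NetBSD pmap code; the function names and the "one walk
per page" behaviour are assumptions made purely for illustration.  The only
hard facts in it are that amd64 uses 4 KiB pages and 512 entries per table
level.

```python
PAGE_SHIFT = 12                    # 4 KiB pages on amd64
LEVEL_BITS = 9                     # 512 entries per table at each level
PAGES_PER_LEAF = 1 << LEVEL_BITS   # pages mapped by one lowest-level table


def checks_per_page(npages, upper_levels=3):
    """Hypothetical: every page triggers its own walk of the upper levels."""
    return npages * upper_levels


def checks_per_leaf_table(start_page, npages, upper_levels=3):
    """Hypothetical: the upper levels are walked once per leaf table."""
    if npages == 0:
        return 0
    first = start_page // PAGES_PER_LEAF
    last = (start_page + npages - 1) // PAGES_PER_LEAF
    return (last - first + 1) * upper_levels


# Removing a 1 GiB virtual range, starting at page 0.
npages = (1 << 30) >> PAGE_SHIFT
naive = checks_per_page(npages)
batched = checks_per_leaf_table(0, npages)
print(f"pages in range: {npages}")
print(f"checks, one walk per page: {naive}")
print(f"checks, one walk per leaf table: {batched}")
```

In this model the per-page walk does 512 times as many upper-level checks as
the per-table walk, so a removal over a large range could plausibly reach the
call counts in the profile.  Whether pmap_do_remove actually behaves this way
needs to be confirmed against the source.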

Comments?  Thoughts?

-- Chris
	GPG key FEB9DE7F (91AF 4534 4529 4BCC 31A5  938E 023E EEFB FEB9 DE7F)