Re: Panic on a -current from 13/12/2018

To: Chavdar Ivanov <ci4ic4%gmail.com@localhost>
Subject: Re: Panic on a -current from 13/12/2018
From: Masanobu SAITOH <msaitoh%execsw.org@localhost>
Date: Tue, 18 Dec 2018 20:13:43 +0900

Hi!

On 2018/12/17 19:38, Chavdar Ivanov wrote:

I went through a series of tests. It is indeed at that point that the
panic takes place; the two parts of the screen dump are at

http://ci4ic4.tx0.org/nb-panic-wm-03.png and
http://ci4ic4.tx0.org/nb-panic-wm-04.png .


 Thanks. This is the workaround code for a broken lapic timer
counter, which was added in:

	http://mail-index.netbsd.org/source-changes/2017/11/23/msg089946.html
	http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/arch/x86/x86/lapic.c.diff?r1=1.63&r2=1.64&f=h

Your VM is configured to act as KVM
(see the System -> Acceleration(L) tab, or "Paravirt provider=" in the .vbox file).

I set up my vm to KVM and

VirtualBox gives three Intel NIC options:

Intel PRO/1000 MT Desktop (82540EM)
Intel PRO/1000 T Server   (82543GC)
Intel PRO/1000 MT Server  (82545EM)

I was able to get a panic with the same kernel from 13/12/2018 only
when I select the second option:


 I changed my VM's setting to use 82543GC and tried hibernation
three times, but I couldn't reproduce the problem. Still, the problem
must exist because you ran into it.

 The possibilities are:
	a) VirtualBox's lapic emulation is buggy.
	b) Our workaround code is not perfect, or something else in our code is wrong.
	c) any others

I suspect this problem is not from if_wm.c. but from

There was a VirtualBox upgrade a few weeks ago, perhaps the problem is there.



 I read vbox/src/VBox/Devices/Network/DevE1000.cpp. One of the
differences between the 82543GC emulation and the other two is that
it generates an interrupt when a chip reset occurs. If the other network
device emulations work well, I suspect that the reset timing in vbox
is wrong and that it prevents the lapic timer from being updated.

 Workarounds are:
	a) Don't use KVM mode; use "Default" or another provider.
	   On my Windows 7 VirtualBox, "Default" leaves the
	   CPUID2_RAZ bit unset, so NetBSD recognizes that
	   it's not on KVM.
	b) Use a NIC emulation other than 82543GC.
	c) any others
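
For context on workaround a): a guest usually learns which hypervisor it
runs on from CPUID. Bit 31 of ECX in leaf 1 (the bit NetBSD calls
CPUID2_RAZ) says a hypervisor is present, and leaf 0x40000000 returns a
12-byte vendor signature. The sketch below is a minimal userland
illustration of that decision, not NetBSD's actual code. classify_guest()
and the enum are hypothetical names, and the signature bytes are passed in
as an argument instead of being read with the CPUID instruction.

```c
#include <string.h>

/* Hypothetical result type; NetBSD keeps its own vm_guest values. */
enum guest_kind { GUEST_NONE, GUEST_KVM, GUEST_VBOX, GUEST_OTHER };

/*
 * hypervisor_bit: CPUID leaf 1, ECX bit 31 (nonzero if set).
 * sig:            12 signature bytes from CPUID leaf 0x40000000
 *                 (EBX, ECX, EDX concatenated).
 */
static enum guest_kind
classify_guest(int hypervisor_bit, const char sig[12])
{
	/* Without the hypervisor bit the signature leaf is not consulted. */
	if (!hypervisor_bit)
		return GUEST_NONE;

	/* KVM pads its 9-character signature with three NUL bytes. */
	if (memcmp(sig, "KVMKVMKVM\0\0\0", 12) == 0)
		return GUEST_KVM;

	if (memcmp(sig, "VBoxVBoxVBox", 12) == 0)
		return GUEST_VBOX;

	return GUEST_OTHER;
}
```

With the bit clear, classify_guest() returns GUEST_NONE whatever the
signature says. That matches the observation above: when the bit is not
set, the guest does not conclude it is on KVM, so the KVM timer workaround
is not enabled.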

BTW, when I use 82543GC emulation, I got the following bug:

makphy0 at wm0 phy 0: Marvell 88E1000 Gigabit PHY, rev. 0
makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
makphy1 at wm0 phy 1: Marvell 88E1000 Gigabit PHY, rev. 0
makphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

(snip)

makphy31 at wm0 phy 31: Marvell 88E1000 Gigabit PHY, rev. 0
makphy31: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
ifmedia_match: multiple match for 0x20/0xfbff9ff, selected instance 0


This _IS_ a bug in VirtualBox's 82543GC emulation.
DevE1000Phy.cpp line 568 says:

	/* Note: A single PHY is supported, ignore PHYADR */

So I recommend all users not to use 82543GC emulation until this PHY
bug is fixed.
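
To illustrate why ignoring PHYADR produces makphy0 through makphy31, here
is a small sketch of a generic MII probe. It is a simplified model under
assumptions, not the code in mii.c or DevE1000Phy.cpp: the two read
functions and count_phys() are hypothetical, and 0x7949 is only an
example status value. The idea is that a probe reads the status register
at each of the 32 management addresses and attaches a PHY wherever the
answer is neither all zeros nor all ones.

```c
#include <stdint.h>

#define MII_NPHY	32	/* MII management addresses 0..31 */
#define MII_BMSR	0x01	/* basic mode status register */

typedef uint16_t (*mii_read_fn)(int phy, int reg);

/* Model of a well-behaved bus: one PHY answers, the rest read as 0xffff. */
static uint16_t
read_single_phy(int phy, int reg)
{
	(void)reg;
	return phy == 1 ? 0x7949 : 0xffff;
}

/* Model of an emulation that ignores the PHY address entirely. */
static uint16_t
read_ignoring_phyadr(int phy, int reg)
{
	(void)phy;
	(void)reg;
	return 0x7949;
}

/* Count the addresses at which a probe would attach a PHY. */
static int
count_phys(mii_read_fn rd)
{
	int found = 0;

	for (int phy = 0; phy < MII_NPHY; phy++) {
		uint16_t bmsr = rd(phy, MII_BMSR);

		/* All zeros or all ones means nothing answered here. */
		if (bmsr != 0x0000 && bmsr != 0xffff)
			found++;
	}
	return found;
}
```

count_phys(read_single_phy) finds one PHY, while
count_phys(read_ignoring_phyadr) finds 32, which is the pattern in the
dmesg output above.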

......
-rw------- 1 root wheel   2199810 Dec 17 09:24 netbsd.9
-rw------- 1 root wheel 147348504 Dec 17 09:24 netbsd.9.core
/var/crash # gdb netbsd.9
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64--netbsd".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from netbsd.9...(no debugging symbols found)...done.
(gdb) target kvm netbsd.9.core
0xffffffff80222d75 in cpu_reboot ()
(gdb) bt
#0  0xffffffff80222d75 in cpu_reboot ()
#1  0xffffffff8076e6f7 in db_reboot_cmd ()
#2  0xffffffff8076ee92 in db_command ()
#3  0xffffffff8076f20c in db_command_loop ()
#4  0xffffffff80772b80 in db_trap ()
#5  0xffffffff8021f5c2 in kdb_trap ()
#6  0xffffffff802244b1 in trap ()
#7  0xffffffff8021d568 in alltraps ()
#8  0xffffffff8021de45 in breakpoint ()
#9  0xffffffff809d54b0 in vpanic ()
#10 0xffffffff809d5550 in panic ()
#11 0xffffffff802514f0 in lapic_delay ()
#12 0xffffffff80353270 in wm_gmii_i82543_readreg ()
#13 0xffffffff807b1aa5 in makphy_status ()
#14 0xffffffff807b1cf7 in makphy_service ()
#15 0xffffffff807a826c in mii_tick ()
#16 0xffffffff80360926 in wm_tick ()
#17 0xffffffff809b6b96 in callout_softclock ()
#18 0xffffffff809aaa55 in softint_dispatch ()
#19 0xffffffff8021d21f in Xsoftintr ()


  I rebuilt the kernel (on a different physical host, but there may
have been an update on the 14th there) and tried to get a panic with
the .gdb kernel, but it never happened.

Obviously it is not a problem for me or anyone running NetBSD as a
VirtualBox guest, as using vioif / virtio is almost twice as fast,
but I reported the panic thinking it may be relevant in other use
cases.


 Thank you for your report!

On Mon, 17 Dec 2018 at 07:49, Masanobu SAITOH <msaitoh%execsw.org@localhost> wrote:


On 2018/12/17 1:09, Chavdar Ivanov wrote:

I have no idea. As I said, it is running under VirtualBox on a Windows
10 host; I put the host in hibernation whilst the NetBSD guest is
running.


I tested today's -current on VirtualBox 5.2.22 on Windows 7 64bit
(on Core i7-2600). I tried hibernating (Shutdown -> Hibernate(H)) a few times
but I couldn't reproduce the problem yet.

          while (deltat > 0) {
                  xtick = lapic_gettick();
                  if (lapic_broken_periodic && xtick == 0 && otick == 0) {
                          lapic_initclocks();
                          xtick = lapic_gettick();
                          if (xtick == 0)
                                  panic("lapic timer stopped ticking");   <=========== here!
                  }


If that panic is from this, lapic_broken_periodic must be true, but it's set only
when the VM is KVM:

                 /*
                  * Apply workaround for broken periodic timer under KVM
                  */
                 if (vm_guest == VM_GUEST_KVM) {
                         lapic_broken_periodic = true;
                         lapic_timecounter.tc_quality = -100;
                         aprint_debug_dev(ci->ci_dev,
                             "applying KVM timer workaround\n");
                 }


   Could you try to reproduce the problem and capture the panic message?
ci4ic4-panic-01.png shows the backtrace, but the backtrace has pushed the
panic message itself off the screen.

   Regards.

Previously it survived this, using the Intel Desktop NIC
emulation within VirtualBox, even my ssh connections (from the host to
the guest) remained active. I switched the NIC emulation for the
NetBSD guest to virtio-net, now it behaves as before, surviving a
hibernation.

There was a VirtualBox upgrade a few weeks ago, perhaps the problem is there.

On Sun, 16 Dec 2018 at 15:55, SAITOH Masanobu <msaitoh%execsw.org@localhost> wrote:


Hi.

On 2018/12/16 18:09, Chavdar Ivanov wrote:

Repeated this morning. Happens when the host hibernates when the
machine is running. The initial trace is slightly different, but the
lines with wm_gmii are the same, so for now I will switch to a
different NIC emulator.


In your .png:

vpanic()
lapic_delay()
wm_gmii_mdic_readreg()
.
.
.


There is no panic message itself, but I suspect it's:

static void
lapic_delay(unsigned int usec)
{
          int32_t xtick, otick;
          int64_t deltat;         /* XXX may want to be 64bit */

          otick = lapic_gettick();

          if (usec <= 0)
                  return;
          if (usec <= 25)
                  deltat = lapic_delaytab[usec];
          else
                  deltat = (lapic_frac_cycle_per_usec * usec) >> 32;

          while (deltat > 0) {
                  xtick = lapic_gettick();
                  if (lapic_broken_periodic && xtick == 0 && otick == 0) {
                          lapic_initclocks();
                          xtick = lapic_gettick();
                          if (xtick == 0)
                                  panic("lapic timer stopped ticking");   <=========== here!
                  }
                  if (xtick > otick)
                          deltat -= lapic_tval - (xtick - otick);
                  else
                          deltat -= otick - xtick;
                  otick = xtick;

                  x86_pause();
          }
}
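
The accounting in that loop can be tried out in userland. The following
is a minimal sketch under assumptions, not kernel code: struct fake_lapic,
fake_gettick() and sim_delay() are hypothetical stand-ins, the re-arm done
by lapic_initclocks() is modelled as having no effect on a stopped timer,
and a spin limit is added so the simulation always terminates. It shows
that a counter stuck at zero either hits the panic path (workaround on) or
never makes progress (workaround off).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the LAPIC current-count register. */
struct fake_lapic {
	int32_t tval;	/* reload value of the periodic timer */
	int32_t count;	/* current count, runs down toward zero */
	int32_t step;	/* ticks elapsed per read; 0 models a stopped timer */
};

enum delay_result { DELAY_DONE, DELAY_PANIC, DELAY_HUNG };

static int32_t
fake_gettick(struct fake_lapic *t)
{
	int32_t now = t->count;

	if (t->step > 0) {
		t->count -= t->step;
		if (t->count <= 0)
			t->count += t->tval;	/* periodic reload */
	}
	return now;
}

static enum delay_result
sim_delay(struct fake_lapic *t, int64_t deltat, bool broken_periodic,
    int max_spins)
{
	int32_t xtick, otick = fake_gettick(t);
	int spins = 0;

	while (deltat > 0) {
		if (spins++ >= max_spins)
			return DELAY_HUNG;	/* simulation-only guard */

		xtick = fake_gettick(t);
		if (broken_periodic && xtick == 0 && otick == 0) {
			/* The kernel re-arms the timer here; the model does not. */
			xtick = fake_gettick(t);
			if (xtick == 0)
				return DELAY_PANIC;
		}
		if (xtick > otick)	/* counter wrapped past zero */
			deltat -= t->tval - (xtick - otick);
		else
			deltat -= otick - xtick;
		otick = xtick;
	}
	return DELAY_DONE;
}
```

For example, a timer initialised as { 1000, 1000, 7 } completes a
5000-tick delay, while one initialised as { 1000, 0, 0 } returns
DELAY_PANIC with broken_periodic set and DELAY_HUNG without it.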


What causes it?

And yes, it used to survive many hibernations of the hosts before. I
only had to adjust the time after waking the host up.

On Sat, 15 Dec 2018 at 10:59, Chavdar Ivanov <ci4ic4%gmail.com@localhost> wrote:


Hi,

On 8.99.27 AMD64 running under VirtualBox I got this morning the panic
in http://ci4ic4.tx0.org/ci4ic4-panic-01.png

I have the  coredump, if it is of interest. I thought it might be
useful, as it is apparently in the wm driver.

Chavdar
--
----



--
-----------------------------------------------
                  SAITOH Masanobu (msaitoh%execsw.org@localhost
                                   msaitoh%netbsd.org@localhost)

