NetBSD-Bugs archive


port-amd64/48142: i8254 timer stops working during boot - system lockup during boot



>Number:         48142
>Category:       port-amd64
>Synopsis:       i8254 timer stops working during boot - system lockup during boot
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-amd64-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Aug 21 13:00:00 +0000 2013
>Originator:     Dr. Wolfgang Stukenbrock
>Release:        NetBSD 6.1
>Organization:
Dr. Nagler & Company GmbH
>Environment:
        
        
System: NetBSD test-s0 5.1.2 NetBSD 5.1.2 (NSW-WS) #3: Fri Dec 21 15:15:43 CET 2012 wgstuken@test-s0:/usr/src/sys/arch/amd64/compile/NSW-WS amd64
Architecture: x86_64
Machine: amd64
>Description:
        While initializing the hardware on x86 systems,
        sys/arch/x86/x86/cpu.c calls i8254_delay() directly, bypassing the
        DELAY() macro, which may call this function or something else, e.g.
        lapic_delay() from sys/arch/x86/x86/lapic.c. I'm not sure whether
        this is by design or by error ...
        Once the lapic is set up, the i8254 counter's reload value is set to
        '0', which as far as I understand means a full cycle (on the 8254 a
        count of 0 stands for 65536).
        OK - the time source then runs slower than before, but it is still
        running. All calls to i8254_delay() will wait longer than before,
        but that does not really hurt here, and nobody has noticed this as
        a problem before.

        On our Supermicro X8DAH (with only one CPU installed), the system
        hangs during startup with some kernel configurations while starting
        the "other" CPUs.

        I've debugged into it and found that the i8254 timer has stopped
        counting, for unknown reasons.
        This happens after the ISA subsystem is initialized.
        If i8254_delay() is used after this, e.g. at the end of isaattach()
        for debugging purposes, it never returns.
        At the start of isaattach() the timer is still working fine.
        No indication of the problem is reported; the user just sits there,
        wondering ...

        The problem is triggered by the finsio driver at port 0x4e, but I'm
        not sure whether it is the fault of this driver or whether the timer
        registers are also visible at ports other than 0x40 and 0x43 on this
        board.
        It has not been tested on other motherboards yet; perhaps others
        are affected too.
>How-To-Repeat:
        Add "finsio0 at isa? port 0x4e" to the kernel configuration on a
        Supermicro X8DAH board.
        The kernel will freeze during startup.
>Fix:
        I'm not 100% sure, because my knowledge of the constraints during
        startup is too limited.
        Perhaps replacing the i8254_delay() call in arch/x86/x86/cpu.c with
        DELAY() would be a good idea.
        It solves the problem for me, but I'm not sure whether there are
        other side effects.

        Another way to work around the problem is to catch the case where
        the timer stops counting inside i8254_delay() in
        sys/arch/x86/isa/clock.c.
        If we assume that each loop iteration takes longer than one timer
        tick, we can decrement the remaining count by one each time we read
        the same tick value again. The loop then still waits at least as
        long as requested, and it can no longer loop forever.
        This approach somewhat slows down bringing up the "other" CPUs, but
        it works fine too, without accessing "other resources" as the first
        fix would ...

        remark: if compiled as XEN, the DELAY() macro would use xen_delay().
                I'm not sure whether this is OK, or whether
                sys/arch/x86/x86/cpu.c is part of a XEN kernel at all.

        remark: i8254_delay() is not called directly again later, so the
                second approach only slows down the boot process. (At least
                I've found no other references to it in the sources.)

        Here is a patch for sys/arch/x86/isa/clock.c that uses the second
        approach.
        Feel free to use it, or to replace i8254_delay() with DELAY() in
        sys/arch/x86/x86/cpu.c instead.
        Perhaps it would make sense to add some code that reports the
        problem to the user the first time it happens, but on very, very
        fast future systems such a message might be misleading ...

--- clock.c     2013/08/21 12:43:37     1.1
+++ clock.c     2013/08/21 12:48:26
@@ -482,6 +482,9 @@
                cur_tick = gettick();
                if (cur_tick > initial_tick)
                        delta = rtclock_tval - (cur_tick - initial_tick);
+// avoid looping forever if timer stops counting for any reason
+               else if (cur_tick == initial_tick)
+                       delta = 1;
                else
                        delta = initial_tick - cur_tick;
                if (delta < 0 || delta >= rtclock_tval / 2) {
@@ -500,6 +503,9 @@
                cur_tick = gettick();
                if (cur_tick > initial_tick)
                        remaining -= rtclock_tval - (cur_tick - initial_tick);
+// avoid looping forever if timer stops counting for any reason
+               else if (cur_tick == initial_tick)
+                       remaining -= 1;
                else
                        remaining -= initial_tick - cur_tick;
 #endif

>Unformatted:
        
        

