Re: kern/59497: Panic in ucompoll

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,stix%stix.id.au@localhost
Subject: Re: kern/59497: Panic in ucompoll
From: "Christoph Badura via gnats" <gnats-admin%NetBSD.org@localhost>
Date: Thu, 3 Jul 2025 22:15:01 +0000 (UTC)

The following reply was made to PR kern/59497; it has been noted by GNATS.

From: Christoph Badura <bad%bsd.de@localhost>
To: 
Cc: gnats-bugs%netbsd.org@localhost, kern-bug-people%netbsd.org@localhost,
	gnats-admin%netbsd.org@localhost
Subject: Re: kern/59497: Panic in ucompoll
Date: Fri, 4 Jul 2025 00:13:27 +0200

 On Thu, Jul 03, 2025 at 09:25:18PM +1000, Paul Ripke wrote:
 > On Tue, Jul 01, 2025 at 11:00:02PM +0000, Christoph Badura via gnats wrote:
 > >  On Tue, Jul 01, 2025 at 09:20:00AM +0000, stix%stix.id.au@localhost wrote:
 > I'm really not sure - it's old, and it was cheap. I have used it for the
 > serial console on an old Sun SPARCserver 5, but that system now has dodgy RAM
 > that needs replacing.

 The photo of the chip that you sent me privately makes it clear that it is a
 genuine PL-2303HX.  Good for you, I guess.  Bad for us, as it suggests we
 have a bug in our driver that causes the disconnects.

 > I was considering shopping around for a USB FTDI-based serial adapter -
 > but I wonder if there are also fakes of those on the market...

 I think there are also fakes on the market.  Genuine FTDI fobs seem to be
 available mostly via Mouser, Farnell, etc.  I ended up buying a couple at
 ~USD25 from Farnell earlier this year, before I could hunt down a source
 for genuine Prolific fobs, which cost basically the same.

 > >  [...] I'd like to try to reproduce this locally.
 > 
 > That could be challenging. I had it hooked up to a Tandy Color Computer (coco1)
 > at 38400 baud, via alligator clips, and the software was drivewire.py:
 > 
 > https://github.com/n6il/pyDriveWire
 > 
 > Basically doing remote floppy disk access over the serial port.

 Well, I could just try out pyDriveWire without a CoCo (or anything else)
 connected and see if that provokes the crash, too.

 > >  Could you try to disassemble the ucompoll() until the offending
 > >  instruction?
 > 
 > That's easy, it's a tiny function:
 > 
 > (gdb) x/20i ucompoll
 >    0xffffffff804960a5 <ucompoll>:       push   %rbp
 >    0xffffffff804960a6 <ucompoll+1>:     mov    %rsp,%rbp
 >    0xffffffff804960a9 <ucompoll+4>:     push   %r13
 >    0xffffffff804960ab <ucompoll+6>:     push   %r12
 >    0xffffffff804960ad <ucompoll+8>:     mov    %esi,%r12d
 >    0xffffffff804960b0 <ucompoll+11>:    mov    %rdx,%r13
 >    0xffffffff804960b3 <ucompoll+14>:    mov    %edi,%eax
 >    0xffffffff804960b5 <ucompoll+16>:    shr    $0xc,%eax
 >    0xffffffff804960b8 <ucompoll+19>:    movzbl %dil,%esi
 >    0xffffffff804960bc <ucompoll+23>:    and    $0x3ff00,%eax
 >    0xffffffff804960c1 <ucompoll+28>:    or     %eax,%esi
 >    0xffffffff804960c3 <ucompoll+30>:    mov    $0xffffffff81896660,%rdi
 >    0xffffffff804960ca <ucompoll+37>:    call   0xffffffff80e42be0 <device_lookup_private>
 >    0xffffffff804960cf <ucompoll+42>:    mov    0xe8(%rax),%rdi		<------
 >    0xffffffff804960d6 <ucompoll+49>:    mov    0x168(%rdi),%rax
 >    0xffffffff804960dd <ucompoll+56>:    mov    0x60(%rax),%rax
 >    0xffffffff804960e1 <ucompoll+60>:    mov    %r13,%rdx
 >    0xffffffff804960e4 <ucompoll+63>:    mov    %r12d,%esi
 >    0xffffffff804960e7 <ucompoll+66>:    pop    %r12
 >    0xffffffff804960e9 <ucompoll+68>:    pop    %r13
 > 
 > >  Could you try to find out if TS_CANCEL is set in tp->t_state?
 > 
 > Yeah, I was actually wondering how to do that. I can't figure out for the
 > life of me how to switch between cpu stacks in gdb. I realize most of the
 > kernel debugging I've done has been on single cpu machines...
 > 
 > However, doesn't this imply sc is null?

 Yes, that has to be the ``tp = sc->sc_tty'' assignment.
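
 For anyone reading along, here is a rough userland sketch of what the
 instructions before the call compute.  It only mirrors the arithmetic
 visible in the disassembly (shift right by 12, mask with 0x3ff00, OR in
 the low byte); the function name is made up for illustration, and I am
 assuming the masks correspond to the usual minor()/unit extraction
 rather than quoting the headers.  The result is the unit number handed
 to device_lookup_private(), whose return value in %rax is then
 dereferenced at offset 0xe8 by the marked instruction.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: reproduces the unit computation seen in the
 * disassembly of ucompoll().  Only the low 32 bits of the device
 * number are used there (%edi), so a 32-bit argument is taken here.
 */
static uint32_t
sketch_unit_from_dev(uint32_t dev)
{
	uint32_t hi = (dev >> 12) & 0x3ff00u;	/* shr $0xc; and $0x3ff00 */
	uint32_t lo = dev & 0xffu;		/* movzbl %dil,%esi */

	return hi | lo;				/* or %eax,%esi */
}

/* Small self-check with hand-computed values. */
static void
sketch_unit_selfcheck(void)
{
	/* Bits 8..19 of dev (assumed to be the major) are ignored. */
	assert(sketch_unit_from_dev(0x00004202u) == 2);
	/* Bits 20 and up contribute to the unit, shifted down by 12. */
	assert(sketch_unit_from_dev(0x00100001u) == 0x101);
	/* The top bit is masked off and does not change the unit. */
	assert(sketch_unit_from_dev(0x80000002u) == 2);
}
```

 If the lookup finds no attached device for that unit, %rax is NULL and
 the load from 0xe8(%rax) faults, which matches the panic.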

 Do you have the kernel messages from right before the panic?  I.e. print
 the contents of msgbuf.  Your original mail only showed what was syslogged,
 didn't it?

 What I'm wondering is if the panic happened between a "ucom2:
 detached\nuplcom1: detached" and a subsequent "uplcom1 at uhub1 port 8".

 sc being null implies the device being detached, if I remember things
 correctly.  Which makes the situation somewhat worse, because detaching
 the device should revoke the open vnode for the device.

 Maybe spec_poll() needs to check if sn->sn_gone is set after calling
 spec_io_enter()?

 https://nxr.netbsd.org/xref/src/sys/miscfs/specfs/spec_vnops.c#1378
 https://nxr.netbsd.org/xref/src/sys/miscfs/specfs/spec_vnops.c#618?

 But maybe that is papering over the symptoms.  I haven't stared long
 enough at the code.

 > >  This might be relatively easy to work around.
 > >  
 > >  ucycom(4) has (https://nxr.netbsd.org/xref/src/sys/dev/usb/ucycom.c#897):
 > >  
 > >  	if (sc->sc_dying)
 > >  		return EIO;
 > >  
 > >  of course, it should return POLLHUP.
 > >  
 > >  uhso has (https://nxr.netbsd.org/xref/src/sys/dev/usb/uhso.c#1791):
 > >  
 > >  	if (!device_is_active(sc->sc_dev))
 > >  		return POLLHUP;
 > >  
 > >  So apparently there is no agreement how this should be handled.
 > >  
 > >  Could you try adding
 > >  
 > >  	if (sc->sc_dying)
 > >  		return POLLHUP;
 > >  
 > >  before line 853 in ucom.c and see if that makes the symptoms go away?
 > 
 > or perhaps:
 > 
 >   if (sc == NULL)
 >     return POLLHUP;
 > 
 > ?

 That certainly would avoid the crash.  But I think it is just papering
 over the symptoms.

 Or maybe it and the other two places should return POLLERR like spec_poll()
 does?
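
 To make the two guards discussed above concrete, here is a minimal
 userland mock, not the kernel code.  Everything prefixed sketch_ is
 hypothetical: the softc is reduced to two fields, the unit table stands
 in for the autoconf lookup, and sc_ready_events stands in for whatever
 the tty layer would report.  It returns POLLHUP in both cases; whether
 POLLHUP or POLLERR is the right answer is exactly the open question.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for the driver softc. */
struct sketch_softc {
	int sc_dying;		/* set once detach has begun */
	int sc_ready_events;	/* stand-in for the tty layer's answer */
};

#define SKETCH_NUNITS 4
static struct sketch_softc *sketch_units[SKETCH_NUNITS];

/* Stand-in for the unit lookup: NULL means "no such device attached". */
static struct sketch_softc *
sketch_lookup(unsigned unit)
{
	return unit < SKETCH_NUNITS ? sketch_units[unit] : NULL;
}

static int
sketch_poll(unsigned unit, int events)
{
	struct sketch_softc *sc = sketch_lookup(unit);

	if (sc == NULL)		/* device already detached */
		return POLLHUP;
	if (sc->sc_dying)	/* detach in progress */
		return POLLHUP;
	return events & sc->sc_ready_events;
}

/* Small self-check covering the attached, dying and detached cases. */
static void
sketch_poll_selfcheck(void)
{
	struct sketch_softc sc = { 0, POLLIN };

	sketch_units[1] = &sc;
	assert(sketch_poll(1, POLLIN | POLLOUT) == POLLIN);
	sc.sc_dying = 1;
	assert(sketch_poll(1, POLLIN) == POLLHUP);
	sketch_units[1] = NULL;
	assert(sketch_poll(1, POLLIN) == POLLHUP);
}
```

 Note that in the mock nothing can detach the device between the lookup
 and the checks.  In the kernel that window exists, which is why I
 suspect a check like this only hides the problem.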

 > >  But maybe the right fix would be to make ttycancel() deal with any pending
 > >  select()s too?  Or something similar that ties in with the d_cancel
 > >  framework?
 > 
 > Yeah, I haven't studied the code that much as yet.

 What a rabbit hole!

 I'm sorry, I don't have time right now or in the next 2 weeks to dive
 into it.  But you do have a local workaround, I think.  And if you can
 debug this further, we would greatly appreciate it.

 --chris
