NetBSD-Bugs archive
Re: kern/59497: Panic in ucompoll
The following reply was made to PR kern/59497; it has been noted by GNATS.
From: Christoph Badura <bad%bsd.de@localhost>
To:
Cc: gnats-bugs%netbsd.org@localhost, kern-bug-people%netbsd.org@localhost,
gnats-admin%netbsd.org@localhost
Subject: Re: kern/59497: Panic in ucompoll
Date: Fri, 4 Jul 2025 00:13:27 +0200
On Thu, Jul 03, 2025 at 09:25:18PM +1000, Paul Ripke wrote:
> On Tue, Jul 01, 2025 at 11:00:02PM +0000, Christoph Badura via gnats wrote:
> > On Tue, Jul 01, 2025 at 09:20:00AM +0000, stix%stix.id.au@localhost wrote:
> I'm really not sure - it's old, and it was cheap. I have used it for the
> serial console on an old Sun SPARCserver 5, but that system now has dodgy RAM
> that needs replacing.
The photo of the chip that you sent me privately makes it clear that it is a
genuine PL-2303HX. Good for you, I guess. Bad for us as it suggests we
have a bug in our driver that causes the disconnects.
> I was considering shopping around for a USB FTDI-based serial adapter -
> but I wonder if there are also fakes of those on the market...
I think there are also fakes on the market. Genuine FTDI fobs seem to be
available mostly via Mouser, Farnell, etc. I ended up buying a couple at
~USD25 from Farnell earlier this year, before I could hunt down a source
for genuine Prolific fobs -- which cost basically the same.
> > [...] I'd like to try to reproduce this locally.
>
> That could be challenging. I had it hooked up to a Tandy Color Computer (coco1)
> at 38400 baud, via alligator clips, and the software was drivewire.py:
>
> https://github.com/n6il/pyDriveWire
>
> Basically doing remote floppy disk access over the serial port.
Well, I could just try out pyDriveWire without a CoCo (or anything else)
connected and see if that provokes the crash, too.
> > Could you try to disassemble the ucompoll() until the offending
> > instruction?
>
> That's easy, it's a tiny function:
>
> (gdb) x/20i ucompoll
> 0xffffffff804960a5 <ucompoll>: push %rbp
> 0xffffffff804960a6 <ucompoll+1>: mov %rsp,%rbp
> 0xffffffff804960a9 <ucompoll+4>: push %r13
> 0xffffffff804960ab <ucompoll+6>: push %r12
> 0xffffffff804960ad <ucompoll+8>: mov %esi,%r12d
> 0xffffffff804960b0 <ucompoll+11>: mov %rdx,%r13
> 0xffffffff804960b3 <ucompoll+14>: mov %edi,%eax
> 0xffffffff804960b5 <ucompoll+16>: shr $0xc,%eax
> 0xffffffff804960b8 <ucompoll+19>: movzbl %dil,%esi
> 0xffffffff804960bc <ucompoll+23>: and $0x3ff00,%eax
> 0xffffffff804960c1 <ucompoll+28>: or %eax,%esi
> 0xffffffff804960c3 <ucompoll+30>: mov $0xffffffff81896660,%rdi
> 0xffffffff804960ca <ucompoll+37>: call 0xffffffff80e42be0 <device_lookup_private>
> 0xffffffff804960cf <ucompoll+42>: mov 0xe8(%rax),%rdi <------
> 0xffffffff804960d6 <ucompoll+49>: mov 0x168(%rdi),%rax
> 0xffffffff804960dd <ucompoll+56>: mov 0x60(%rax),%rax
> 0xffffffff804960e1 <ucompoll+60>: mov %r13,%rdx
> 0xffffffff804960e4 <ucompoll+63>: mov %r12d,%esi
> 0xffffffff804960e7 <ucompoll+66>: pop %r12
> 0xffffffff804960e9 <ucompoll+68>: pop %r13
>
> > Could you try to find out if TS_CANCEL is set in tp->t_state?
>
> Yeah, I was actually wondering how to do that. I can't figure out for the
> life of me how to switch between cpu stacks in gdb. I realize most of the
> kernel debugging I've done has been on single cpu machines...
>
> However, doesn't this imply sc is null?
Yes, that has to be the ``tp = sc->sc_tty'' assignment.
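To spell out that reading of the disassembly: the instructions before the
call build the unit number from dev (first argument, in %edi), and the
marked load at ucompoll+42 reads offset 0xe8 of whatever
device_lookup_private() returned in %rax. Here is a minimal userland
sketch of just that arithmetic. The names unit_from_dev, softc_model and
faulting_address are made up for the illustration, and the struct is only
padding up to the one offset visible in the listing; it is not the
driver's real softc layout.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Transcription of ucompoll+14 .. ucompoll+28:
 *   eax = dev >> 12; esi = dev & 0xff; eax &= 0x3ff00; esi |= eax
 * The function name is hypothetical.
 */
static uint32_t
unit_from_dev(uint32_t dev)
{
	uint32_t hi = (dev >> 12) & 0x3ff00;
	uint32_t lo = dev & 0xff;

	return hi | lo;
}

/*
 * Stand-in for the softc. Only the offset of the member loaded at
 * ucompoll+42 (0xe8) comes from the disassembly; the padding and the
 * member type are assumptions.
 */
struct softc_model {
	unsigned char	pad[0xe8];
	void		*sc_tty;
};

/*
 * Address read by "mov 0xe8(%rax),%rdi", given the value that
 * device_lookup_private() returned in %rax.
 */
static uintptr_t
faulting_address(uintptr_t sc)
{
	return sc + offsetof(struct softc_model, sc_tty);
}
```

With a null sc, faulting_address(0) is 0xe8, i.e. a null pointer
dereference at a small offset, which is consistent with the fault being
on the marked instruction.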
Do you have the kernel messages right before the panic? I.e. print the
contents of msgbuf. Your original mail only showed what was syslogged,
didn't it?
What I'm wondering is whether the panic happened between a "ucom2:
detached\nuplcom1: detached" and a subsequent "uplcom1 at uhub1 port 8".
sc being null implies the device being detached, if I remember things
correctly. Which makes the situation somewhat worse, because detaching
the device should revoke the open vnode for the device.
Maybe spec_poll() needs to check if sn->sn_gone is set after calling
spec_io_enter()?
https://nxr.netbsd.org/xref/src/sys/miscfs/specfs/spec_vnops.c#1378
https://nxr.netbsd.org/xref/src/sys/miscfs/specfs/spec_vnops.c#618?
But maybe that is papering over the symptoms. I haven't stared long
enough at the code.
> > This might be relatively easy to work around.
> >
> > ucycom(4) has (https://nxr.netbsd.org/xref/src/sys/dev/usb/ucycom.c#897):
> >
> > if (sc->sc_dying)
> > return EIO;
> >
> > of course, it should return POLLHUP.
> >
> > uhso has (https://nxr.netbsd.org/xref/src/sys/dev/usb/uhso.c#1791):
> >
> > if (!device_is_active(sc->sc_dev))
> > return POLLHUP;
> >
> > So apparently there is no agreement how this should be handled.
> >
> > Could you try adding
> >
> > if (sc->sc_dying)
> > return POLLHUP;
> >
> > before line 853 in ucom.c and see if that makes the symptoms go away?
>
> or perhaps:
>
> if (sc == NULL)
> return POLLHUP;
>
> ?
That certainly would avoid the crash. But I think it is just papering
over the symptoms.
Or maybe it and the other two places should return POLLERR like spec_poll()
does?
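To make the options concrete, here is a small userland model of the guard
being discussed. Everything in it is a stand-in: model_softc,
model_lookup and model_poll are made-up names, the table replaces the
real unit lookup, and the value returned for a missing or dying device is
a parameter so that POLLHUP and POLLERR can be compared side by side. It
is a sketch of the idea, not driver code.

```c
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical softc: just the state the guard looks at. */
struct model_softc {
	bool	sc_dying;	/* set once detach has started */
	int	sc_ready;	/* events the tty layer would report */
};

#define MODEL_NUNITS 4

/* Hypothetical unit table; a NULL slot models a detached device. */
static struct model_softc *model_units[MODEL_NUNITS];

/* Stand-in for the unit lookup: returns NULL for absent units. */
static struct model_softc *
model_lookup(unsigned int unit)
{
	if (unit >= MODEL_NUNITS)
		return NULL;
	return model_units[unit];
}

/*
 * Poll entry point with the guard applied before sc is dereferenced.
 * gone_revents is POLLHUP or POLLERR, whichever convention is chosen.
 */
static int
model_poll(unsigned int unit, int events, int gone_revents)
{
	struct model_softc *sc = model_lookup(unit);

	if (sc == NULL || sc->sc_dying)
		return gone_revents;

	return events & sc->sc_ready;
}
```

A guard like this only keeps the poll routine from dereferencing a null
pointer. It does not explain how poll was reached after the detach in the
first place, which is why I would not call it a fix.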
> > But maybe the right fix would be to make ttycancel() deal with any pending
> > select()s too? Or something similar that ties in with the d_cancel
> > framework?
>
> Yeah, I haven't studied the code that much as yet.
What a rabbit hole!
I'm sorry, I don't have time right now or in the next 2 weeks to dive down
into it. But you do have a local workaround, I think. And if you can
debug this further, we would greatly appreciate it.
--chris