Re: kern/59497: Panic in ucompoll

To: gnats-bugs%netbsd.org@localhost
Subject: Re: kern/59497: Panic in ucompoll
From: Paul Ripke <stix%stix.id.au@localhost>
Date: Thu, 3 Jul 2025 21:25:18 +1000

On Tue, Jul 01, 2025 at 11:00:02PM +0000, Christoph Badura via gnats wrote:
>  On Tue, Jul 01, 2025 at 09:20:00AM +0000, stix%stix.id.au@localhost wrote:
>  > Crash appears due to intermittent disconnect/reconnect of a uplcom device while open.
>  
>  Are you sure this is a genuine Prolific device?  I've tried to get some
>  Prolific USB serial fobs at the start of the year and found that the market
>  is swamped with buggy fake prolific chips.  Even supposedly reputable
>  manufacturers had fake chips on the fobs that claimed to be PL2303HX /
>  PL2303HXD.  In the end I managed to get some fobs with genuine Prolific
>  chips for some USD 20 per fob.  The fake ones all sold for about USD 3-4 and
>  were easily identifiable by the missing part number and Prolific logo on the
>  SSOP chip.

I'm really not sure - it's old, and it was cheap. I have used it for the
serial console on an old Sun SPARCserver 5, but that system now has dodgy RAM
that needs replacing.

>  The real ones also don't periodically disconnect/reconnect. :-)

I should hope not :)
I was considering shopping around for a USB FTDI-based serial adapter -
but I wonder if there are also fakes of those on the market...

>  Of course, using the fake chips shouldn't crash the system.

Indeed.

>  Obviously you were running a process that had the corresponding ttyUX open
>  when the crash happened.  Otherwise it wouldn't have been triggered from
>  the select(2) code.  Can you please describe what command exactly you were
>  running and what its command line options and other configuration settings
>  were.  I'd like to try to reproduce this locally.

That could be challenging. I had it hooked up to a Tandy Color Computer (coco1)
at 38400 baud, via alligator clips, and the software was drivewire.py:

https://github.com/n6il/pyDriveWire

Basically doing remote floppy disk access over the serial port.

>  > crash> bt
>  > __kernel_end() at 0
>  > kern_reboot() at sys_reboot
>  > vpanic() at vpanic+0x18d
>  > panic() at vprintf
>  > trap() at startlwp
>  > --- trap (number 6) ---
>  > ucompoll() at ucompoll+0x2a
>  > cdev_poll() at cdev_poll+0x87
>  > spec_poll() at spec_poll+0x6a
>  > VOP_POLL() at VOP_POLL+0x5d
>  > sel_do_scan() at sel_do_scan+0x3ba
>  > selcommon() at selcommon+0x309
>  > sys___select50() at sys___select50+0x75
>  > syscall() at syscall+0x1fc
>  > --- syscall (number 417) ---
>  > syscall+0x1fc:
>  > 
>  > Have core and kernel with symbols.
>  
>  Could you try to disassemble the ucompoll() until the offending
>  instruction?

That's easy, it's a tiny function:

(gdb) x/20i ucompoll
   0xffffffff804960a5 <ucompoll>:       push   %rbp
   0xffffffff804960a6 <ucompoll+1>:     mov    %rsp,%rbp
   0xffffffff804960a9 <ucompoll+4>:     push   %r13
   0xffffffff804960ab <ucompoll+6>:     push   %r12
   0xffffffff804960ad <ucompoll+8>:     mov    %esi,%r12d
   0xffffffff804960b0 <ucompoll+11>:    mov    %rdx,%r13
   0xffffffff804960b3 <ucompoll+14>:    mov    %edi,%eax
   0xffffffff804960b5 <ucompoll+16>:    shr    $0xc,%eax
   0xffffffff804960b8 <ucompoll+19>:    movzbl %dil,%esi
   0xffffffff804960bc <ucompoll+23>:    and    $0x3ff00,%eax
   0xffffffff804960c1 <ucompoll+28>:    or     %eax,%esi
   0xffffffff804960c3 <ucompoll+30>:    mov    $0xffffffff81896660,%rdi
   0xffffffff804960ca <ucompoll+37>:    call   0xffffffff80e42be0 <device_lookup_private>
   0xffffffff804960cf <ucompoll+42>:    mov    0xe8(%rax),%rdi		<------
   0xffffffff804960d6 <ucompoll+49>:    mov    0x168(%rdi),%rax
   0xffffffff804960dd <ucompoll+56>:    mov    0x60(%rax),%rax
   0xffffffff804960e1 <ucompoll+60>:    mov    %r13,%rdx
   0xffffffff804960e4 <ucompoll+63>:    mov    %r12d,%esi
   0xffffffff804960e7 <ucompoll+66>:    pop    %r12
   0xffffffff804960e9 <ucompoll+68>:    pop    %r13
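
Reading the listing: the instructions from ucompoll+14 to ucompoll+28 build
the unit number passed to device_lookup_private() in %esi, and the marked load
at ucompoll+42 is the first use of the returned pointer in %rax. A minimal
user-space sketch of just that arithmetic follows. The name
model_unit_from_dev is made up for illustration, and the sketch assumes only
what the disassembly shows, not the macro used in the source.

```c
#include <stdint.h>

/*
 * User-space model of the unit computation in the ucompoll() prologue.
 * model_unit_from_dev is a hypothetical name, not a kernel function.
 */
static uint32_t
model_unit_from_dev(uint64_t dev)
{
	uint32_t d = (uint32_t)dev;		/* mov %edi,%eax (low 32 bits) */
	uint32_t hi = (d >> 12) & 0x3ff00;	/* shr $0xc,%eax; and $0x3ff00,%eax */
	uint32_t lo = d & 0xff;			/* movzbl %dil,%esi */

	return hi | lo;				/* or %eax,%esi */
}
```

Whatever unit comes out of this, the crash is in the instruction after the
call, so the interesting part is what the lookup returned, not the unit
itself.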

>  Could you try to find out if TS_CANCEL is set in tp->t_state?

Yeah, I was actually wondering how to do that. I can't figure out for the
life of me how to switch between CPU stacks in gdb. I realize most of the
kernel debugging I've done has been on single-CPU machines...

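However, doesn't this imply sc is NULL? cd_devs is 0x0 and cd_ndevs is 0, so
I'd expect device_lookup_private() to have returned NULL, and the marked load
at ucompoll+42 dereferences 0xe8(%rax):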

(gdb) p ucom_cd
$9 = {
  cd_list = {
    le_next = 0xffffffff818966a0 <umidi_cd>,
    le_prev = 0xffffffff81896620 <ugen_cd>
  },
  cd_attach = {
    lh_first = 0xffffffff81815260 <ucom_ca>
  },
  cd_devs = 0x0,
  cd_name = 0xffffffff813e59e8 "ucom",
  cd_class = DV_DULL,
  cd_ndevs = 0,
  cd_attrs = 0x0
}
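
A simplified model of why an empty cfdriver would give a NULL softc. This is
a sketch under assumptions: it assumes the lookup bounds-checks the unit
against cd_ndevs before indexing cd_devs, and every model_* name below is a
hypothetical stand-in, not the kernel's definition.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the kernel structures. */
struct model_device {
	void *dv_private;		/* driver softc */
};

struct model_cfdriver {
	struct model_device **cd_devs;	/* unit-indexed array, may be NULL */
	int cd_ndevs;			/* number of slots in cd_devs */
};

/* Return the softc for a unit, or NULL if no such device is attached. */
static void *
model_lookup_private(const struct model_cfdriver *cd, int unit)
{
	if (unit < 0 || unit >= cd->cd_ndevs)
		return NULL;		/* cd_ndevs == 0 always ends up here */
	if (cd->cd_devs[unit] == NULL)
		return NULL;		/* slot exists but device is gone */
	return cd->cd_devs[unit]->dv_private;
}
```

With cd_devs = 0x0 and cd_ndevs = 0 as printed above, any unit falls into the
first branch, so the caller gets NULL back and has to check for it.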

>  This might be relatively easy to work around.
>  
>  ucycom(4) has (https://nxr.netbsd.org/xref/src/sys/dev/usb/ucycom.c#897):
>  
>  	if (sc->sc_dying)
>  		return EIO;
>  
>  of course, it should return POLLHUP.
>  
>  uhso has (https://nxr.netbsd.org/xref/src/sys/dev/usb/uhso.c#1791):
>  
>  	if (!device_is_active(sc->sc_dev))
>  		return POLLHUP;
>  
>  So apparently there is no agreement how this should be handled.
>  
>  Could you try adding
>  
>  	if (sc->sc_dying)
>  		return POLLHUP;
>  
>  before line 853 in ucom.c and see if that makes the symptoms go away?

or perhaps:

  if (sc == NULL)
    return POLLHUP;

?
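
To make the shape of that check concrete, here is a user-space sketch, not a
patch against ucom.c. The struct, the sc_poll callback and both function names
are hypothetical; only the NULL test and the POLLHUP return come from the
suggestion above.

```c
#include <poll.h>
#include <stddef.h>

/* Hypothetical softc: just a callback standing in for the tty poll path. */
struct model_softc {
	int (*sc_poll)(struct model_softc *, int);
};

/* Example callback: reports the device as readable, nothing else. */
static int
model_always_readable(struct model_softc *sc, int events)
{
	(void)sc;
	return events & POLLIN;
}

/* Poll entry point that tolerates a device detached while still open. */
static int
model_ucompoll(struct model_softc *sc, int events)
{
	if (sc == NULL)
		return POLLHUP;		/* lookup failed: device is gone */
	return sc->sc_poll(sc, events);
}
```

The point of the design is that the guard sits before the first dereference
of sc, which is where the real function faulted.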

>  But maybe the right fix would be to make ttycancel() deal with any pending
>  select()s too?  Or something similar that ties in with the d_cancel
>  framework?

Yeah, I haven't studied the code that much as yet.

-- 
Paul Ripke
"Great minds discuss ideas, average minds discuss events, small minds
 discuss people."
-- Disputed: Often attributed to Eleanor Roosevelt. 1948.
