Re: kern/58898: Crash when testing camera in chromium

> Date: Thu, 12 Dec 2024 18:46:43 +0100
> From: Robert Bagdan <kikadf.01%gmail.com@localhost>
> 
> I don't have any changes in the NetBSD tree; I'm on the 10.0 branch,
> upgraded with `sysupgrade auto'.  Here is the relevant part of my
> /etc/release:
> Build information:
>          Build date   Fri Dec  6 20:38:43 UTC 2024
>            Built by   builder%localhost.NetBSD.org@localhost
>            Build ID   202412081740Z
> 
> Because of the gdb output, I just cloned the NetBSD src from GitHub;
> this is the git log output:
> commit 313a9eb5b1b8b899e5611e78063c2949affc811d (HEAD -> netbsd-10,
> origin/netbsd-10)
> Author: snj <snj%NetBSD.org@localhost>
> Date:   Fri Dec 6 20:38:43 2024 +0000
> 
>    tickets 1024-1026

Thanks, glad to confirm that!  It was suspicious because the stack
trace in gdb didn't make sense, and gdb warned that the source was
newer than the object file.

> Some dates and sizes:
> /var/crash $ ls -l netbsd.4*
> -rw-------  1 root  wheel    2428478 Dec  9 13:02 netbsd.4
> -rw-------  1 root  wheel  583708184 Dec  9 13:02 netbsd.4.core
> ls -l /netbsd
> -rw-r--r--  1 root  wheel  29528376 Dec  9 12:48 /netbsd
> 
> I shared `dmesg -M netbsd.4.core -N netbsd.4 > dmesg.txt' in
> https://pastebin.com/raw/jWLjzQbw .  This is all that I got.
> 
> I ran `gdb /netbsd --eval-command="target kvm netbsd.4.core"', then
> your suggested gdb commands, and shared the output in
> https://pastebin.com/raw/wQL0uN0z .

Thanks!

So it looks like the crash is here:

   2479 	/* dequeue all buffers */
   2480 	while (SIMPLEQ_FIRST(&vs->vs_ingress) != NULL)
=> 2481 		SIMPLEQ_REMOVE_HEAD(&vs->vs_ingress, entries);
   2482 	while (SIMPLEQ_FIRST(&vs->vs_egress) != NULL)
   2483 		SIMPLEQ_REMOVE_HEAD(&vs->vs_egress, entries);

https://nxr.netbsd.org/xref/src/sys/dev/video.c?r=1.45#2481

Here's the disassembly:

   0xffffffff80db14df <+166>:   call   0xffffffff8023fac0 <mutex_enter>
   0xffffffff80db14e4 <+171>:   mov    0xa0(%rbx),%rax
   0xffffffff80db14eb <+178>:   lea    0xa0(%rbx),%rdx
   0xffffffff80db14f2 <+185>:   jmp    0xffffffff80db1504 <videoclose+203>
=> 0xffffffff80db14f4 <+187>:   mov    0x8(%rax),%rax
   0xffffffff80db14f8 <+191>:   mov    %rax,0xa0(%rbx)
   0xffffffff80db14ff <+198>:   test   %rax,%rax
   0xffffffff80db1502 <+201>:   je     0xffffffff80db1578 <videoclose+319>
   0xffffffff80db1504 <+203>:   test   %rax,%rax
   0xffffffff80db1507 <+206>:   jne    0xffffffff80db14f4 <videoclose+187>
...
   0xffffffff80db1578 <+319>:   mov    %rdx,0xa8(%rbx)
   0xffffffff80db157f <+326>:   jmp    0xffffffff80db1509 <videoclose+208>

(The DPRINTF has presumably been compiled away, so there's no `if
(vs->vs_streaming)' branch in the code.)

We have:

(gdb) print &((struct video_softc *)0)->sc_stream_in.vs_ingress
$2 = (struct sample_queue *) 0xa0
(gdb) print &((struct video_buffer *)0)->entries->sqe_next
$3 = (struct video_buffer **) 0x8
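
That pins the faulting instruction down as the inlined
SIMPLEQ_REMOVE_HEAD.  For reference, SIMPLEQ_FIRST and
SIMPLEQ_REMOVE_HEAD in sys/sys/queue.h expand to roughly this (quoted
from memory, so check the tree for the exact text):

   #define	SIMPLEQ_FIRST(head)	((head)->sqh_first)

   #define	SIMPLEQ_REMOVE_HEAD(head, field) do {			\
   	if (((head)->sqh_first =				\
   	    (head)->sqh_first->field.sqe_next) == NULL)		\
   		(head)->sqh_last = &(head)->sqh_first;		\
   } while (/*CONSTCOND*/0)

That matches the disassembly: `mov 0x8(%rax),%rax' loads
vb->entries.sqe_next, `mov %rax,0xa0(%rbx)' stores it back to
vs_ingress.sqh_first, and the branch to <+319> resets sqh_last to
&sqh_first once the queue is empty.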

And the crash is:

[   512.917593] trap type 4 code 0 rip 0xffffffff80db14f4 cs 0x8 rflags 0x10206 cr2 0x7dbd48c2e600 ilevel 0 rsp 0xffffc18242975d40
[   512.917593] curlwp 0xffffe53d60a3ca00 pid 3297.3302 lowest kstack 0xffffc182429712c0

So one of the vs_ingress buffers has been corrupted: rax holds the
pointer to a struct video_buffer, and the CPU faulted at

	mov	0x8(%rax),%rax

with cr2=0x7dbd48c2e600 as the fault address.  That's not a kernel
virtual address (on amd64 the kernel lives in the upper half of the
address space, 0xffff800000000000 and up, while this address is in the
user half), so the kernel has no business dereferencing it (except via
copyout/copyin or similar).

Now this cr2 value is interesting, because according to gdb:

(gdb) print *(struct video_softc *)video_cd.cd_devs[0]->dv_private
$1 = {... sc_stream_in = {...
      ... vs_ingress = {sqh_first = 0x5fffffe53d0009, ...}, ...}, ...}

That is _also_ not a kva, but it's not the one in cr2!


Now I checked the ioctls VIDIOC_ENUM_FRAMESIZES and
VIDIOC_ENUM_FRAMEINTERVALS, and they don't look like they do anything
interesting.  But videoread looks racy:

   1826 	mutex_enter(&vs->vs_lock);
...
   1845 		vb = SIMPLEQ_FIRST(&vs->vs_egress);
   1846 	} else {
   1847 	        vb = SIMPLEQ_FIRST(&vs->vs_egress);
   1848 	}
...
   1858 	mutex_exit(&vs->vs_lock);
   1859 
   1860 	len = uimin(uio->uio_resid, vb->vb_buf->bytesused - vs->vs_bytesread);
   1861 	offset = vb->vb_buf->m.offset + vs->vs_bytesread;
   1862 
   1863 	if (scatter_io_init(&vs->vs_data, offset, len, &sio)) {
   1864 		err = scatter_io_uiomove(&sio, uio);
   1865 		if (err == EFAULT)
   1866 			return EFAULT;
   1867 		vs->vs_bytesread += (len - sio.sio_resid);
...
   1874 	if (vs->vs_bytesread >= vb->vb_buf->bytesused) {
   1875 		mutex_enter(&vs->vs_lock);
   1876 		vb = video_stream_dequeue(vs);
   1877 		video_stream_enqueue(vs, vb);
   1878 		mutex_exit(&vs->vs_lock);

https://nxr.netbsd.org/xref/src/sys/dev/video.c?r=1.45#1826

As far as I can tell, there is nothing preventing two threads from
entering videoread concurrently and then handling the _same_ vb from
the egress queue.  I don't see an obvious limit to how much damage
that could cause, since the vb determines the uiomove pointers and
lengths.
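
For instance, here is one interleaving the current locking permits (a
hypothetical schedule, not necessarily the one Chromium hit):

   thread A: mutex_enter; vb = SIMPLEQ_FIRST(&vs->vs_egress); mutex_exit
   thread B: mutex_enter; vb = SIMPLEQ_FIRST(&vs->vs_egress); mutex_exit
             (thread B gets the same vb)
   A and B:  compute len/offset from vb and vs->vs_bytesread, uiomove,
             and update vs->vs_bytesread, all without holding vs_lock,
             so the updates race and len/offset can be computed from a
             stale vs_bytesread
   A and B:  both observe vs_bytesread >= bytesused, and both do the
             dequeue/enqueue at the end, recycling two buffers when
             only one was consumed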

It is important to drop the lock across uiomove, because
copyin/copyout can block if the user's pages are swapped out: paging
them back in requires I/O, which can take an indefinite amount of
time.  But it is also critical for access to each vb here to be
serialized.  Maybe there should be a vs_iobusy `lock', taken and
released by videoread, handled under vs_lock and signalled by
vs_read_cv or something, so only one videoread can be running at a
time.
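
Here is a rough sketch of that idea.  The names vs_iobusy, vs_read_cv,
and the helper functions are all made up, vs_read_cv would need
cv_init/cv_destroy next to vs_lock, and none of this is tested:

   /* Wait until we are the only reader of this stream. */
   static int
   video_stream_acquire_reader(struct video_stream *vs)
   {
   	int err;

   	mutex_enter(&vs->vs_lock);
   	while (vs->vs_iobusy) {
   		/* Interruptible, so a signal can abort the read. */
   		err = cv_wait_sig(&vs->vs_read_cv, &vs->vs_lock);
   		if (err) {
   			mutex_exit(&vs->vs_lock);
   			return err;
   		}
   	}
   	vs->vs_iobusy = true;
   	mutex_exit(&vs->vs_lock);
   	return 0;
   }

   /* Let the next reader, if any, proceed. */
   static void
   video_stream_release_reader(struct video_stream *vs)
   {
   	mutex_enter(&vs->vs_lock);
   	vs->vs_iobusy = false;
   	cv_broadcast(&vs->vs_read_cv);
   	mutex_exit(&vs->vs_lock);
   }

videoread would take this before it first looks at vs_egress and
release it after the final dequeue/enqueue, so that the vb and
vs_bytesread are never shared between two concurrent reads.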

This doesn't necessarily explain your crash -- I don't know whether
Chromium tries to read from the same /dev/video instance in more than
one thread -- but it's definitely a bug that we need to fix.

