Re: kern/43199: read(2) returns bad size in multithreaded programs

To: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/43199: read(2) returns bad size in multithreaded programs
From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock%nagler-company.com@localhost>
Date: Mon, 26 Apr 2010 09:43:27 +0200

Hi, sorry, I had no time to look at my mail over the weekend.

I don't know exactly what kind of fd it is.

It is either a socket or a pipe, because the bacula-sd daemon seems to run the inter-daemon communication on that fd. (The 4-byte reads are length information in the protocol for the next data block.) Because all parts of the Bacula backup system can be distributed across different hosts, I tend to assume that it is a socket.
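The read pattern described above (a 4-byte length, then a data block of that size) can be sketched roughly as follows. This is only an illustration, not Bacula's actual code: the helper names read_full() and read_frame() are hypothetical, and the network byte order of the length field is an assumption. The assert marks the property this PR is about: read(2) must never return more bytes than were requested.

```c
#include <arpa/inet.h>	/* ntohl, htonl */
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>	/* ssize_t */
#include <unistd.h>	/* read, write, pipe, close */

/*
 * Read exactly len bytes from fd, tolerating short reads.
 * Returns 0 on success, -1 on error or premature EOF.
 * (Hypothetical helper, not taken from Bacula.)
 */
static int
read_full(int fd, void *buf, size_t len)
{
	unsigned char *p = buf;

	while (len > 0) {
		ssize_t n = read(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (n == 0)
			return -1;		/* EOF before len bytes arrived */

		/* read(2) must never return more than was requested. */
		assert((size_t)n <= len);

		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Read one frame: a 4-byte length (assumed to be in network byte
 * order), followed by that many payload bytes.
 * Returns the payload length, or -1 on error or oversized frame.
 */
static ssize_t
read_frame(int fd, void *buf, size_t bufsize)
{
	uint32_t netlen, len;

	if (read_full(fd, &netlen, sizeof(netlen)) != 0)
		return -1;
	len = ntohl(netlen);
	if (len > bufsize)
		return -1;		/* frame does not fit the buffer */
	if (read_full(fd, buf, len) != 0)
		return -1;
	return (ssize_t)len;
}
```

A reader built like this trusts the return value of read(2) to advance its buffer pointer, so a count larger than the request would make it run past the end of the buffer.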


The problem is reproducible.

As soon as a parallel backup is started, it takes only a short time (normally less than a minute) until one of three things happens: the daemon aborts (with one error message written to the closed fd 2, a really great idea, so I don't know the contents of the message yet); the kernel freezes and a hard reset is required; or the system panics inside the uvm subsystem with a kernel page fault. If the system drops into DDB, it shows a stack frame with sys_read in it. Sync is impossible (it hangs), and I have failed to get a core so far.


Here is the output of the trace command for a crash:

uvm_fault(0xffff800058213850, 0x0, 1) -> e
kernel: page fault trap, code=0

Stopped in pid 11845.6 (bacula-sd) at netbsd:uvm_map_lookup_entry+0x4d:   movq    0x40(%rax),%r9
db{2}> trace
uvm_map_lookup_entry() at netbsd:uvm_map_lookup_entry+0x4d
uvm_unmap_remove() at netbsd:uvm_unmap_remove+0x55
uvmspace_free() at netbsd:uvmspace_free+0x9a
dofileread() at netbsd:dofileread+0x1a5
sys_read() at netbsd:sys_read+0x8f
syscall_fancy() at netbsd:syscall_fancy+0x16e
uvm_fault(0xffff800058213850, 0x0, 1) -> e
kernel: page fault trap, code=0
Faulted in DDB; continuing...
db{2}>

Remark: I've added a call to panic() in dofileread(), near the end of the routine just before the assignment of the return value, which fires if the number of bytes about to be returned is larger than the number of bytes requested. That panic was not hit here! Either there was a jump to the "out:" label before it, or something else went wrong. So in this crash the number of bytes requested is at least the number of bytes the kernel was returning to the program.
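The check described in the remark can be modelled in user space roughly like this. It is a sketch only, not the patch itself and not kernel code: struct xfer, xfer_count() and xfer_check() are hypothetical stand-ins, and the assumption is that the returned byte count is derived as "bytes requested minus bytes still outstanding".

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the transfer bookkeeping; this is NOT
 * the kernel's struct uio.
 */
struct xfer {
	size_t requested;	/* bytes the caller asked for */
	size_t resid;		/* bytes still outstanding after the transfer */
};

/*
 * Compute the number of bytes transferred.  Returns 1 and stores the
 * count in *cnt if it is sane, 0 if the residual is larger than the
 * request (the unsigned subtraction would then wrap around to a huge
 * "bytes read" value).
 */
static int
xfer_count(const struct xfer *x, size_t *cnt)
{
	if (x->resid > x->requested)
		return 0;
	*cnt = x->requested - x->resid;
	return 1;
}

/*
 * Models the added panic: abort if the count about to be returned
 * would exceed the number of bytes requested.
 */
static void
xfer_check(const struct xfer *x)
{
	size_t cnt = 0;

	assert(xfer_count(x, &cnt));
	assert(cnt <= x->requested);
}
```

With the values from the ktrace line quoted further down (0x7ae0 bytes requested, 4629 returned), the residual would be 0x7ae0 - 4629 bytes and the check passes.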

I'm not really familiar with the multi-thread implementation in the NetBSD kernel, but it looks to me as if the problem is tied to some aspect of parallel work on multiple threads. There has been no problem so far when we run the backup of all systems and filesystems sequentially, but that is not even a workaround, because it takes too much time.


W. Stukenbrock

Andrew Doran wrote:

The following reply was made to PR kern/43199; it has been noted by GNATS.

From: Andrew Doran <ad%NetBSD.org@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
        netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/43199: read(2) returns bad size in multithreaded programs
Date: Fri, 23 Apr 2010 15:04:51 +0000

 >   26576      4 bacula-sd 1272016811.944594476 read(0xe, 0x69d040, 0x7ae0) = 4629

What type of file is descriptor 0xe in your example above?

 Is it a pipe, or a regular file or a socket or ...?

Thanks.
