NetBSD-Bugs archive


Re: kern/43199: read(2) returns bad size in multithreaded programs



Hi, sorry, I had no time to look at my mail over the weekend.

I don't know the exact kind of the fd.
It is either a socket or a pipe, because the bacula-sd daemon seems to run the inter-daemon communication on that fd (the 4-byte reads are length information in the protocol for the next data block). Since all parts of the Bacula backup system can be distributed across different hosts, I tend to assume that it is a socket.
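
For illustration, a reader of such a length-prefixed stream could look roughly like the sketch below. It assumes the 4-byte length is an unsigned integer in network byte order; that encoding and the names read_full and read_frame are assumptions made for this example and are not taken from the Bacula sources. The loop also rejects a read(2) result larger than the size requested, which is the symptom this PR is about.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly n bytes from fd. Returns 0 on success, -1 on error or EOF.
 * Hypothetical helper, not part of Bacula. */
int read_full(int fd, void *buf, size_t n)
{
    unsigned char *p = buf;
    size_t off = 0;

    while (off < n) {
        ssize_t r = read(fd, p + off, n - off);
        if (r < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry */
            return -1;
        }
        if (r == 0)
            return -1;              /* EOF in the middle of a frame */
        if ((size_t)r > n - off) {
            errno = EIO;            /* kernel returned more than requested */
            return -1;
        }
        off += (size_t)r;
    }
    return 0;
}

/* Read one frame: 4-byte length (assumed network byte order), then payload.
 * Returns the payload length, or -1 on error. */
ssize_t read_frame(int fd, void *buf, size_t cap)
{
    uint32_t be_len;

    if (read_full(fd, &be_len, sizeof be_len) != 0)
        return -1;

    uint32_t len = ntohl(be_len);
    if (len > cap) {
        errno = EMSGSIZE;           /* payload would not fit in the buffer */
        return -1;
    }
    if (read_full(fd, buf, len) != 0)
        return -1;
    return (ssize_t)len;
}
```

A caller would hand read_frame() the descriptor and a buffer at least as large as the biggest expected data block. Short reads are normal on sockets and pipes, so only a result larger than the request is treated as an error.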

The problem is reproducible.

As soon as a parallel backup is started, it takes only a short time (normally less than a minute) until one of three things happens: the daemon aborts (with an error message written to the closed fd 2, really a great idea ..., so I don't know the contents of the message yet), the kernel freezes and a hard reset is required, or the system panics inside the uvm subsystem with a kernel page fault. If the system drops into DDB, it shows a stack frame with sys_read in it. Sync is impossible (it hangs); I have failed to get a core so far.

Here is the output of the trace command for a crash:

uvm_fault(0xffff800058213850, 0x0, 1) -> e
kernel: page fault trap, code=0
Stopped in pid 11845.6 (bacula-sd) at netbsd:uvm_map_lookup_entry+0x4d: movq     0x40(%rax),%r9
db{2}> trace
uvm_map_lookup_entry() at netbsd:uvm_map_lookup_entry+0x4d
uvm_unmap_remove() at netbsd:uvm_unmap_remove+0x55
uvmspace_free() at netbsd:uvmspace_free+0x9a
dofileread() at netbsd:dofileread+0x1a5
sys_read() at netbsd:sys_read+0x8f
syscall_fancy() at netbsd:syscall_fancy+0x16e
uvm_fault(0xffff800058213850, 0x0, 1) -> e
kernel: page fault trap, code=0
Faulted in DDB; continuing...
db{2}>

Remark: I've added a call to panic() in dofileread() near the end of the routine, just before the assignment of the return value, which fires if the number of bytes about to be returned is larger than the number of bytes requested. That one has not been hit here! Either there was a jump to the "out:" label before that point or something else went wrong. So in this crash the number of requested bytes is at least the number of bytes the kernel was returning to the program.


I'm not really familiar with the multi-thread implementation in the NetBSD kernel, but it looks to me as if the problem is tied to some aspect of parallel work on multiple threads. There has been no problem so far when we run the backups of all systems and filesystems sequentially, but that is not an option even as a workaround, because it takes too much time ....

W. Stukenbrock

Andrew Doran wrote:

The following reply was made to PR kern/43199; it has been noted by GNATS.

From: Andrew Doran <ad%NetBSD.org@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
        netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/43199: read(2) returns bad size in multithreaded programs
Date: Fri, 23 Apr 2010 15:04:51 +0000

 >   26576      4 bacula-sd 1272016811.944594476 read(0xe, 0x69d040, 0x7ae0) = 4629
What type of file is descriptor 0xe in your example above?
 Is it a pipe, or a regular file or a socket or ...?
Thanks.



