NetBSD-Bugs archive
Re: kern/43199: read(2) returns bad size in multithreaded programs
The following reply was made to PR kern/43199; it has been noted by GNATS.
From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock%nagler-company.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost,
netbsd-bugs%NetBSD.org@localhost,
Wolfgang.Stukenbrock%nagler-company.com@localhost
Subject: Re: kern/43199: read(2) returns bad size in multithreaded programs
Date: Mon, 26 Apr 2010 09:43:27 +0200
Hi, sorry, I had no time to look at my mail over the weekend.
I don't know the exact type of the fd.
It is either a socket or a pipe, because the bacula-sd daemon seems to
run the inter-daemon communication on that fd (the 4-byte reads are
length information in the protocol for the next data block).
Since it is possible to distribute all parts of the Bacula backup
system across different hosts, I tend to assume that it is a socket.
The problem is reproducible.
As soon as a parallel backup is started, it takes only a short time
(normally less than a minute) until one of three things happens: the
daemon aborts (with an error message written to the closed fd 2, a
really great idea, so I don't know the contents of the message yet),
the kernel freezes and a hard reset is required, or the system panics
inside the UVM subsystem with a kernel page fault.
If the system drops into DDB, it shows a stack frame with sys_read in
it. Sync is impossible (it hangs), and I have failed to get a core dump
so far.
Here is the output of the trace command for a crash:
uvm_fault(0xffff800058213850, 0x0, 1) -> e
kernel: page fault trap, code=0
Stopped in pid 11845.6 (bacula-sd) at netbsd:uvm_map_lookup_entry+0x4d:
movq 0x40(%rax),%r9
db{2}> trace
uvm_map_lookup_entry() at netbsd:uvm_map_lookup_entry+0x4d
uvm_unmap_remove() at netbsd:uvm_unmap_remove+0x55
uvmspace_free() at netbsd:uvmspace_free+0x9a
dofileread() at netbsd:dofileread+0x1a5
sys_read() at netbsd:sys_read+0x8f
syscall_fancy() at netbsd:syscall_fancy+0x16e
uvm_fault(0xffff800058213850, 0x0, 1) -> e
kernel: page fault trap, code=0
Faulted in DDB; continuing...
db{2}>
Remark: I've added a call to panic() in dofileread() near the end of the
routine, just before the assignment of the return value, which fires if
the number of bytes about to be returned is larger than the number of
bytes requested. That panic has not been hit here!
Either there was a jump to the "out:" label earlier, or something else
went wrong. So in this crash the number of requested bytes is at least
the number of bytes the kernel was returning to the program.
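The same invariant can also be checked on the userland side. What
follows is only a sketch under stated assumptions, not the kernel patch
described above and not Bacula code: checked_read() and
read_length_prefix() are hypothetical names, and the 4-byte length
header is assumed to be in network byte order.

```c
#include <sys/types.h>
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * checked_read(): hypothetical wrapper around read(2).
 * Retries on EINTR and aborts if the kernel reports more bytes
 * than were requested, which must never happen.
 */
ssize_t
checked_read(int fd, void *buf, size_t nbytes)
{
	ssize_t n;

	do {
		n = read(fd, buf, nbytes);
	} while (n == -1 && errno == EINTR);

	if (n > 0 && (size_t)n > nbytes) {
		fprintf(stderr, "read(%d) returned %zd, requested %zu\n",
		    fd, n, nbytes);
		abort();	/* leaves a core dump for inspection */
	}
	return n;
}

/*
 * read_length_prefix(): hypothetical helper that reads a 4-byte
 * length header, looping over short reads.
 * Assumption: the header is sent in network byte order.
 * Returns 0 on success, -1 on EOF or error.
 */
int
read_length_prefix(int fd, uint32_t *len)
{
	unsigned char hdr[4];
	size_t got = 0;
	uint32_t be;

	while (got < sizeof(hdr)) {
		ssize_t n = checked_read(fd, hdr + got, sizeof(hdr) - got);
		if (n <= 0)
			return -1;
		got += (size_t)n;
	}
	memcpy(&be, hdr, sizeof(be));
	*len = ntohl(be);
	return 0;
}
```

With a wrapper like this, an oversized return value stops the process at
the point of the bad read instead of corrupting memory later.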
I'm not really familiar with the multithreading implementation in the
NetBSD kernel, but it looks to me as if the problem is tied to some
aspect of parallel work in multiple threads.
There has been no problem so far if we run the backup of all systems
and filesystems sequentially, but this is not even a solution as a
workaround, because that takes too much time ....
W. Stukenbrock
Andrew Doran wrote:
> The following reply was made to PR kern/43199; it has been noted by GNATS.
>
> From: Andrew Doran <ad%NetBSD.org@localhost>
> To: gnats-bugs%NetBSD.org@localhost
> Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
> netbsd-bugs%netbsd.org@localhost
> Subject: Re: kern/43199: read(2) returns bad size in multithreaded programs
> Date: Fri, 23 Apr 2010 15:04:51 +0000
>
> > 26576 4 bacula-sd 1272016811.944594476 read(0xe, 0x69d040, 0x7ae0) = 4629
>
> What type of file is descriptor 0xe in your example above?
> Is it a pipe, or a regular file or a socket or ...?
>
> Thanks.
>
>