Port-xen archive


Re: Instability issues with NetBSD-9, xen-4.11 and the xbdb backend driver



	Hello.  OK, I'm making progress on this issue.  The problem is not,
as I thought, random corruption, which is good.  There are two problems
I've identified, and there are undoubtedly more:

1.  If the page granting request fails, then xbdback_io_error is called, as
it should be.  It sends a notice back to the domu that there was an error.
However, the xbdi instance continues to run, and the request is processed
again as if it had just come in from the domu.  This time it's corrupted,
because various functions have already touched it and marked it bad.
Specifically, xbdback_getreq reads the request and decides that it is
invalid because the operation field was overwritten with the request ID
number during the previous error handling and cleanup.  This is when the
response that triggers the panic on the domu is sent.
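
To make the sequence above concrete, here is a minimal stand-alone sketch.
It is not the xbdback code: the struct, the step names and OP_MAX are all
hypothetical, and the only things taken from the description above are that
the error path sends a reply, overwrites the operation field with the
request ID, and then lets the instance keep running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of one backend instance; not the real xbdi struct. */
enum step { STEP_GETREQ, STEP_MAP, STEP_DONE };

#define OP_MAX 1	/* assumed: operation values above this are invalid */

struct sim {
	int operation;		/* operation field of the request */
	int id;			/* request ID */
	bool stop_on_error;	/* true models the corrected behaviour */
	int replies;		/* replies sent to the guest so far */
	enum step step;		/* next step to run */
};

static void
send_reply(struct sim *s, const char *why)
{
	s->replies++;
	printf("reply %d for id %d: %s\n", s->replies, s->id, why);
}

/* Error path: reply, clobber the request, then choose the next step. */
static void
io_error(struct sim *s)
{
	send_reply(s, "grant mapping failed");
	s->operation = s->id;	/* cleanup overwrites the operation field */
	s->step = s->stop_on_error ? STEP_DONE : STEP_GETREQ;
}

/* Run one request whose mapping always fails; return the reply count. */
static int
run_request(bool stop_on_error)
{
	struct sim s = { .operation = 0, .id = 42,
	    .stop_on_error = stop_on_error, .replies = 0,
	    .step = STEP_GETREQ };

	while (s.step != STEP_DONE) {
		switch (s.step) {
		case STEP_GETREQ:
			if (s.operation > OP_MAX) {
				/* stale request re-read after the error */
				send_reply(&s, "invalid operation");
				s.step = STEP_DONE;
			} else
				s.step = STEP_MAP;
			break;
		case STEP_MAP:
			io_error(&s);	/* simulate a failed page grant */
			break;
		case STEP_DONE:
			break;
		}
	}
	assert(s.replies >= 1);	/* every request gets at least one reply */
	return s.replies;
}
```

In this model, run_request(false) sends two replies for a single request
(the error, then a bogus "invalid operation"), while run_request(true) sends
only the error reply.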

So, the root cause of the problem is the transient xen page granting
errors that are occurring.  However, the error handling that results from
these errors is incorrect and is causing the domus to panic.  I agree with
Michael that the xen_map_shm function needs to be corrected to retry the
requests when they fail, and that we need to verify that we're passing good
page mapping flags to xen in the first place.  However, I'm going to work on
correcting the error handling that results from these errors first, since
it's easier to fix the error paths if I can see them fail in real time.
Once I have a fix for that, I'll work on the mapping issue itself.
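
As a rough illustration of the retry idea only, here is a generic bounded
retry wrapper.  Everything in it is hypothetical: map_fn, flaky_map and
map_with_retry are made-up names, and it assumes every failure is transient
and worth retrying, which would have to be checked against the actual error
codes xen returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapping callback: returns 0 on success, nonzero on error. */
typedef int (*map_fn)(void *ctx);

/* Test double that fails a fixed number of times before succeeding. */
struct flaky {
	int failures_left;	/* remaining simulated failures */
	int calls;		/* how many times the mapper was called */
};

static int
flaky_map(void *ctx)
{
	struct flaky *f = ctx;

	f->calls++;
	if (f->failures_left > 0) {
		f->failures_left--;
		return -1;	/* simulated transient failure */
	}
	return 0;
}

/*
 * Call fn up to max_tries times and stop at the first success.
 * Returns 0 on success, otherwise the last error seen (or -1 if
 * max_tries is not positive and fn was never called).
 */
static int
map_with_retry(map_fn fn, void *ctx, int max_tries)
{
	int err = -1;

	assert(fn != NULL);
	for (int i = 0; i < max_tries; i++) {
		err = fn(ctx);
		if (err == 0)
			return 0;
	}
	return err;
}
```

With a mapper that fails twice, map_with_retry(flaky_map, &f, 3) succeeds on
the third call; with one that fails five times, it gives up after three
calls and returns the error, which would then go down the error path.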

-thanks
-Brian
On Nov 15,  7:39pm, Brian Buhrow wrote:
} Subject: Re: Instability issues with NetBSD-9, xen-4.11 and the xbdb backe
} 	Hello Manuel.  So I'm still trying to get my head around how this code
} actually works.  However, I think there are multiple paths to
} xbdback_send_reply() in error handling cases, which might be causing part
} of the problem.
} 
} 	These errors are, of course, a symptom of the fact that I'm getting
} data corruption on the requests coming from the domu into the dom0.
} Whatever is corrupting the data, it seems to happen only when files on the
} domu are very large.  After looking at reams of output, I see a lot of
} messages like:
} 
} Nov 13 18:02:06 xen-hardconnect /netbsd: [ 2971.7600968] xbdback_evthandler domain 5: cont 0xdeadbeef
} 
} I don't think I should ever see a continuation with that address!
} Interestingly, I start seeing those messages long before anything
} apparently goes wrong, which suggests the corruption begins long before
} anything notices it.  That makes sense.
} 	In looking at the code, however, I really don't understand how the
} memory associated with the xbd_io portions is managed.  It looks like we
} set the pointer to NULL in many cases without first freeing that memory.
} How is it that this doesn't leak memory wildly?
} I'm sure I'm just not understanding the code yet, so any pointers or notes
} would be helpful.
} -thanks
} -Brian
>-- End of excerpt from Brian Buhrow



