Re: Dom0 ballooning: crash on guest start
On 2 May 2011 02:35, Jean-Yves Migeon <jeanyves.migeon%free.fr@localhost> wrote:
> On 01.05.2011 04:19, Jean-Yves Migeon wrote:
>> On 29.04.2011 15:02, Christoph Egger wrote:
>>>>> balloon0: inflate 512 => inflated by 512
>>>>> uvm_fault(0xffffffff80c6b220, 0xffffffff81400000, 1) -> e
>>>>> fatal page fault in supervisor mode
>>>>> trap type 6 code 0 rip ffffffff8054593d cs e030 rflags 10216 cr2
>>>> Okay, this happens when balloon(4) has already been inflated by a
>>>> good share. Out of interest, does enabling XENDEBUG_SYNC in
>>>> arch/xen/x86/x86_xpmap.c change the result at all?
>> I can reliably trigger the issue. It's due to the domain doing P2M
>> page frame number translations via the xpmap_phys_to_machine_mapping
>> array for specific PFNs in the 3GiB range.
>> For amd64, the indices run from 0xbd000 to 0xbd200 (one P2M page,
>> mapping physical pages at 3024MiB => 3026MiB).
>> Now, why this 2MiB hole sits right there in the pseudo-physical map
>> is the next question. These PFNs have no direct connection to machine
>> (real) addresses, and nothing like this appears in the dom0/hypervisor
>> domain_build code (xen/arch/x86/domain_build.c).
> I am more and more convinced that the issue lies in the early stages of
> boot: xpmap_phys_to_machine_mapping (the "P2M array") is first populated
> by the hypervisor when launching dom0, then used/updated by the domain
> as necessary, without requiring the hypervisor's help.
> When tracking the content of the P2M array during startup, the
> aforementioned addresses (0xffffffff81400000 here) have correct values
> up to entering init_x86_64(). Then, upon leaving
> pmap_prealloc_lowmem_ptps(), the array's content starts looking
> suspicious (entries are 0 rather than ~0, the typical value of
> INVALID_P2M_ENTRY); then, right after the pmap_growkernel() call,
> reading the address fires unrecoverable page faults.
> The hole is just one page (4k) long. "Manually" forcing the code to
> jump above it makes balloon(4) happy again.
> IMHO, all the black-magic (nefarious? :o) code involving 1:1 memory
> mappings and kernel relocation at this early stage probably has a bug
> hiding somewhere, and incorrectly maps certain
> xpmap_phys_to_machine_mapping pages, ultimately leading to a fault.
I think this points to the need for more modularity in x86/pmap.c wrt
the Xen MD bits. I'm not promising to look into it right now :-)