CVS commit: pkgsrc/sysutils/xenkernel411



Module Name:    pkgsrc
Committed By:   bouyer
Date:           Thu Mar  7 11:13:27 UTC 2019

Modified Files:
        pkgsrc/sysutils/xenkernel411: Makefile distinfo
Added Files:
        pkgsrc/sysutils/xenkernel411/patches: patch-XSA284 patch-XSA285
            patch-XSA287 patch-XSA288 patch-XSA290-1 patch-XSA290-2
            patch-XSA291 patch-XSA292 patch-XSA293-1 patch-XSA293-2
            patch-XSA294
Removed Files:
        pkgsrc/sysutils/xenkernel411/patches: patch-XSA269 patch-XSA275-1
            patch-XSA275-2 patch-XSA276-1 patch-XSA276-2 patch-XSA277
            patch-XSA278 patch-XSA279 patch-XSA280-1 patch-XSA280-2
            patch-XSA282-1 patch-XSA282-2 patch-zz-JBeulich patch-zz-bouyer

Log Message:
Update to 4.11.1nb1.
PKGREVISION is deliberately set to 1 because this is not a stock 4.11.1 kernel
(it includes security patches).
4.11.1 already includes all security patches up to XSA282.
Apply the official patches for XSA284, XSA285, XSA287, XSA288, XSA290, XSA291,
XSA292, XSA293 and XSA294.
Other changes since 4.11.0 are mostly bug fixes; there are no new features.


To generate a diff of this commit:
cvs rdiff -u -r1.3 -r1.4 pkgsrc/sysutils/xenkernel411/Makefile
cvs rdiff -u -r1.2 -r1.3 pkgsrc/sysutils/xenkernel411/distinfo
cvs rdiff -u -r1.1 -r0 pkgsrc/sysutils/xenkernel411/patches/patch-XSA269 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA275-1 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA275-2 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA276-1 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA276-2 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA277 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA278 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA279 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA280-1 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA280-2 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA282-1 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA282-2 \
    pkgsrc/sysutils/xenkernel411/patches/patch-zz-JBeulich \
    pkgsrc/sysutils/xenkernel411/patches/patch-zz-bouyer
cvs rdiff -u -r0 -r1.1 pkgsrc/sysutils/xenkernel411/patches/patch-XSA284 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA285 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA287 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA288 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA290-1 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA290-2 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA291 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA292 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA293-1 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA293-2 \
    pkgsrc/sysutils/xenkernel411/patches/patch-XSA294

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.

Modified files:

Index: pkgsrc/sysutils/xenkernel411/Makefile
diff -u pkgsrc/sysutils/xenkernel411/Makefile:1.3 pkgsrc/sysutils/xenkernel411/Makefile:1.4
--- pkgsrc/sysutils/xenkernel411/Makefile:1.3   Wed Nov 28 14:00:49 2018
+++ pkgsrc/sysutils/xenkernel411/Makefile       Thu Mar  7 11:13:26 2019
@@ -1,6 +1,6 @@
-# $NetBSD: Makefile,v 1.3 2018/11/28 14:00:49 bouyer Exp $
+# $NetBSD: Makefile,v 1.4 2019/03/07 11:13:26 bouyer Exp $
 
-VERSION=       4.11.0
+VERSION=       4.11.1
 PKGREVISION=   1
 DISTNAME=      xen-${VERSION}
 PKGNAME=       xenkernel411-${VERSION}

Index: pkgsrc/sysutils/xenkernel411/distinfo
diff -u pkgsrc/sysutils/xenkernel411/distinfo:1.2 pkgsrc/sysutils/xenkernel411/distinfo:1.3
--- pkgsrc/sysutils/xenkernel411/distinfo:1.2   Wed Nov 28 14:00:49 2018
+++ pkgsrc/sysutils/xenkernel411/distinfo       Thu Mar  7 11:13:26 2019
@@ -1,26 +1,23 @@
-$NetBSD: distinfo,v 1.2 2018/11/28 14:00:49 bouyer Exp $
+$NetBSD: distinfo,v 1.3 2019/03/07 11:13:26 bouyer Exp $
 
-SHA1 (xen411/xen-4.11.0.tar.gz) = 32b0657002bcd1992dcb6b7437dd777463f3b59a
-RMD160 (xen411/xen-4.11.0.tar.gz) = a2195b67ffd4bc1e6fc36bfc580ee9efe1ae708c
-SHA512 (xen411/xen-4.11.0.tar.gz) = 33d431c194f10d5ee767558404a1f80a66b3df019012b0bbd587fcbc9524e1bba7ea04269020ce891fe9d211d2f81c63bf78abedcdbe1595aee26251c803a50a
-Size (xen411/xen-4.11.0.tar.gz) = 25131533 bytes
+SHA1 (xen411/xen-4.11.1.tar.gz) = aeb45f3b05aaa73dd2ef3a0c533a975495b58c17
+RMD160 (xen411/xen-4.11.1.tar.gz) = c0eaf57cfbd4f762e8367bcf88e99912d2089084
+SHA512 (xen411/xen-4.11.1.tar.gz) = c1655c5decdaed95a2b9a99652318cfc72f6cfdae957cfe60d635f7787e8850f33e8fafc4c4b8d61fb579c9b9d93028a6382903e71808a0418b931e76d72a649
+Size (xen411/xen-4.11.1.tar.gz) = 25152217 bytes
 SHA1 (patch-Config.mk) = 9372a09efd05c9fbdbc06f8121e411fcb7c7ba65
-SHA1 (patch-XSA269) = baf135f05bbd82fea426a807877ddb1796545c5c
-SHA1 (patch-XSA275-1) = 7097ee5e1c073a0029494ed9ccf8c786d6c4034f
-SHA1 (patch-XSA275-2) = e286286a751c878f5138e3793835c61a11cf4742
-SHA1 (patch-XSA276-1) = 0b1e4b7620bb64f3a82671a172810c12bad91154
-SHA1 (patch-XSA276-2) = ef0e94925f1a281471b066719674bba5ecca8a61
-SHA1 (patch-XSA277) = 845afbe1f1cfdad5da44029f2f3073e1d45ef259
-SHA1 (patch-XSA278) = f344db46772536bb914ed32f2529424342cb81b0
-SHA1 (patch-XSA279) = 6bc022aba315431d916b2d9f6ccd92942e74818a
-SHA1 (patch-XSA280-1) = 401627a7cc80d77c4ab4fd9654a89731467b0bdf
-SHA1 (patch-XSA280-2) = 8317f7d8664fe32a938470a225ebb33a78edfdc6
-SHA1 (patch-XSA282-1) = e790657be970c71ee7c301b7f16bd4e4d282586a
-SHA1 (patch-XSA282-2) = 8919314eadca7e5a16104db1c2101dc702a67f91
+SHA1 (patch-XSA284) = dfab3d5f51cef2ac2e201988e2c8ffbe6066ad89
+SHA1 (patch-XSA285) = 99b2864579d7a09b2d3c911f2d4f4bae23f9e42e
+SHA1 (patch-XSA287) = 834156c50c47d683e64793a5e6874a21b2999b94
+SHA1 (patch-XSA288) = 8551dc11ecb1a3912b5708b0db65533038f60390
+SHA1 (patch-XSA290-1) = 21bcc513e9ff1aa10fa62fcf1aca1e5f3558622c
+SHA1 (patch-XSA290-2) = be394879eeb98917690d284c10e04ee432e83df3
+SHA1 (patch-XSA291) = 00b2949e1d2567e5d9bf823bdd69c31be2300800
+SHA1 (patch-XSA292) = a887098d4b38567d0c8ab3170c15a08b47cbe835
+SHA1 (patch-XSA293-1) = 7e46dab8b44cc1b129e5717502e26094f96e67b9
+SHA1 (patch-XSA293-2) = 02eeb9533fa22ee99699319cc0194045fa26fef5
+SHA1 (patch-XSA294) = 8f7dd8ba100c3b93cb6f48c72b403a3cf43c09e7
 SHA1 (patch-xen_Makefile) = 465388d80de414ca3bb84faefa0f52d817e423a6
 SHA1 (patch-xen_Rules.mk) = c743dc63f51fc280d529a7d9e08650292c171dac
 SHA1 (patch-xen_arch_x86_Rules.mk) = 0bedfc53a128a87b6a249ae04fbdf6a053bfb70b
 SHA1 (patch-xen_arch_x86_boot_build32.mk) = b82c20de9b86ddaa9d05bbc1ff28f970eb78473c
 SHA1 (patch-xen_tools_symbols.c) = 6070b3b5ccc38a196283cfc1c52f5d87858beb18
-SHA1 (patch-zz-JBeulich) = d9eed028cbaf24cfd3fd725fdbb8d0264a19d615
-SHA1 (patch-zz-bouyer) = fb8a8e27d1879663d2f4dd198484626eaf20dd50

Added files:

Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA284
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA284:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA284   Thu Mar  7 11:13:26 2019
@@ -0,0 +1,33 @@
+$NetBSD: patch-XSA284,v 1.1 2019/03/07 11:13:26 bouyer Exp $
+
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: gnttab: set page refcount for copy-on-grant-transfer
+
+Commit 5cc77f9098 ("32-on-64: Fix domain address-size clamping,
+implement"), which introduced this functionality, took care of clearing
+the old page's PGC_allocated, but failed to set the bit (and install the
+associated reference) on the newly allocated one. Furthermore the "mfn"
+local variable was never updated, and hence the wrong MFN was passed to
+guest_physmap_add_page() (and back to the destination domain) in this
+case, leading to an IOMMU mapping into an unowned page.
+
+Ideally the code would use assign_pages(), but the call to
+gnttab_prepare_for_transfer() sits in the middle of the actions
+mirroring that function.
+
+This is XSA-284.
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Acked-by: George Dunlap <george.dunlap%citrix.com@localhost>
+
+--- xen/common/grant_table.c.orig
++++ xen/common/grant_table.c
+@@ -2183,6 +2183,8 @@ gnttab_transfer(
+             page->count_info &= ~(PGC_count_mask|PGC_allocated);
+             free_domheap_page(page);
+             page = new_page;
++            page->count_info = PGC_allocated | 1;
++            mfn = page_to_mfn(page);
+         }
+ 
+         spin_lock(&e->page_alloc_lock);
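
The two added lines above restore the usual allocation invariant on the freshly
allocated copy: a page carrying PGC_allocated must also carry the reference that
flag represents, and the MFN handed back to the destination domain must be the
copy's, not the stale original's. As a rough, self-contained illustration of that
invariant (plain C, not Xen code; COUNT_MASK, PGC_ALLOCATED and struct page are
simplified stand-ins for PGC_count_mask, PGC_allocated and struct page_info):

#include <assert.h>

#define COUNT_MASK    0x00ffffffUL   /* stand-in for PGC_count_mask */
#define PGC_ALLOCATED (1UL << 31)    /* stand-in for PGC_allocated  */

struct page { unsigned long count_info; unsigned long mfn; };

/* Invariant the fix restores: an allocated page owns at least one reference. */
static void check_alloc_invariant(const struct page *pg)
{
    if ( pg->count_info & PGC_ALLOCATED )
        assert((pg->count_info & COUNT_MASK) >= 1);
}

/* Sketch of the fixed transfer path: retire the old page, make the fresh
 * copy "live", and return the copy's MFN rather than the stale one. */
static unsigned long transfer_to_copy(struct page *old, struct page *copy)
{
    old->count_info &= ~(COUNT_MASK | PGC_ALLOCATED);
    copy->count_info = PGC_ALLOCATED | 1;
    check_alloc_invariant(copy);
    return copy->mfn;
}
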
Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA285
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA285:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA285   Thu Mar  7 11:13:26 2019
@@ -0,0 +1,45 @@
+$NetBSD: patch-XSA285,v 1.1 2019/03/07 11:13:26 bouyer Exp $
+
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: IOMMU/x86: fix type ref-counting race upon IOMMU page table construction
+
+When arch_iommu_populate_page_table() gets invoked for an already
+running guest, simply looking at page types once isn't enough, as they
+may change at any time. Add logic to re-check the type after having
+mapped the page, unmapping it again if needed.
+
+This is XSA-285.
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Tentatively-Acked-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+
+--- xen/drivers/passthrough/x86/iommu.c.orig
++++ xen/drivers/passthrough/x86/iommu.c
+@@ -68,6 +68,27 @@ int arch_iommu_populate_page_table(struct domain *d)
+                 rc = hd->platform_ops->map_page(d, gfn, mfn,
+                                                 IOMMUF_readable |
+                                                 IOMMUF_writable);
++
++                /*
++                 * We may be working behind the back of a running guest, which
++                 * may change the type of a page at any time.  We can't prevent
++                 * this (for instance, by bumping the type count while mapping
++                 * the page) without causing legitimate guest type-change
++                 * operations to fail.  So after adding the page to the IOMMU,
++                 * check again to make sure this is still valid.  NB that the
++                 * writable entry in the iommu is harmless until later, when
++                 * the actual device gets assigned.
++                 */
++                if ( !rc && !is_hvm_domain(d) &&
++                     ((page->u.inuse.type_info & PGT_type_mask) !=
++                      PGT_writable_page) )
++                {
++                    rc = hd->platform_ops->unmap_page(d, gfn);
++                    /* If the type changed yet again, simply force a retry. */
++                    if ( !rc && ((page->u.inuse.type_info & PGT_type_mask) ==
++                                 PGT_writable_page) )
++                        rc = -ERESTART;
++                }
+             }
+             if ( rc )
+             {
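
The hunk above implements a map-then-recheck pattern: because a running PV guest
can change a page's type at any time, the type is re-examined after the IOMMU
mapping has been installed, and the mapping is torn down (or the operation
retried) if the page is no longer plain writable. A minimal sketch of that
control flow, with iommu_map(), iommu_unmap(), PGT_TYPE_MASK, PGT_WRITABLE_PAGE
and ERESTART as hypothetical stand-ins for the real Xen hooks and constants:

#define PGT_TYPE_MASK     (7UL << 29)   /* stand-in for PGT_type_mask     */
#define PGT_WRITABLE_PAGE (5UL << 29)   /* stand-in for PGT_writable_page */
#define ERESTART          85            /* illustrative errno-style value */

struct page { unsigned long type_info; unsigned long mfn; };

/* Hypothetical stand-ins for the IOMMU map/unmap hooks; always succeed here. */
static int iommu_map(unsigned long mfn)   { (void)mfn; return 0; }
static int iommu_unmap(unsigned long mfn) { (void)mfn; return 0; }

static int map_if_still_writable(struct page *pg)
{
    int rc = iommu_map(pg->mfn);

    /* The guest may have changed the type while we were mapping. */
    if ( !rc && (pg->type_info & PGT_TYPE_MASK) != PGT_WRITABLE_PAGE )
    {
        rc = iommu_unmap(pg->mfn);
        /* If the type changed yet again, force the caller to retry. */
        if ( !rc && (pg->type_info & PGT_TYPE_MASK) == PGT_WRITABLE_PAGE )
            rc = -ERESTART;
    }

    return rc;
}
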
Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA287
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA287:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA287   Thu Mar  7 11:13:26 2019
@@ -0,0 +1,330 @@
+$NetBSD: patch-XSA287,v 1.1 2019/03/07 11:13:26 bouyer Exp $
+
+From 67620c1ccb13f7b58645f48248ba1f408b021fdc Mon Sep 17 00:00:00 2001
+From: George Dunlap <george.dunlap%citrix.com@localhost>
+Date: Fri, 18 Jan 2019 15:00:34 +0000
+Subject: [PATCH] steal_page: Get rid of bogus struct page states
+
+The original rules for `struct page` required the following invariants
+at all times:
+
+- refcount > 0 implies owner != NULL
+- PGC_allocated implies refcount > 0
+
+steal_page, in a misguided attempt to protect against unknown races,
+violates both of these rules, thus introducing other races:
+
+- Temporarily, the count_info has the refcount go to 0 while
+  PGC_allocated is set
+
+- It explicitly returns the page PGC_allocated set, but owner == NULL
+  and page not on the page_list.
+
+The second one meant that page_get_owner_and_reference() could return
+NULL even after having successfully grabbed a reference on the page,
+leading the caller to leak the reference (since "couldn't get ref" and
+"got ref but no owner" look the same).
+
+Furthermore, rather than grabbing a page reference to ensure that the
+owner doesn't change under its feet, it appears to rely on holding
+d->page_alloc lock to prevent this.
+
+Unfortunately, this is ineffective: page->owner remains non-NULL for
+some time after the count has been set to 0; meaning that it would be
+entirely possible for the page to be freed and re-allocated to a
+different domain between the page_get_owner() check and the count_info
+check.
+
+Modify steal_page to instead follow the appropriate access discipline,
+taking the page through series of states similar to being freed and
+then re-allocated with MEMF_no_owner:
+
+- Grab an extra reference to make sure we don't race with anyone else
+  freeing the page
+
+- Drop both references and PGC_allocated atomically, so that (if
+successful), anyone else trying to grab a reference will fail
+
+- Attempt to reset Xen's mappings
+
+- Reset the rest of the state.
+
+Then, modify the two callers appropriately:
+
+- Leave count_info alone (it's already been cleared)
+- Call free_domheap_page() directly if appropriate
+- Call assign_pages() rather than open-coding a partial assign
+
+With all callers to assign_pages() now passing in pages with the
+type_info field clear, tighten the respective assertion there.
+
+This is XSA-287.
+
+Signed-off-by: George Dunlap <george.dunlap%citrix.com@localhost>
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+---
+ xen/arch/x86/mm.c        | 84 ++++++++++++++++++++++++++++------------
+ xen/common/grant_table.c | 20 +++++-----
+ xen/common/memory.c      | 19 +++++----
+ xen/common/page_alloc.c  |  2 +-
+ 4 files changed, 83 insertions(+), 42 deletions(-)
+
+diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
+index 6509035a5c..d8ff58c901 100644
+--- xen/arch/x86/mm.c.orig
++++ xen/arch/x86/mm.c
+@@ -3966,70 +3966,106 @@ int donate_page(
+     return -EINVAL;
+ }
+ 
++/*
++ * Steal page will attempt to remove `page` from domain `d`.  Upon
++ * return, `page` will be in a state similar to the state of a page
++ * returned from alloc_domheap_page() with MEMF_no_owner set:
++ * - refcount 0
++ * - type count cleared
++ * - owner NULL
++ * - page caching attributes cleaned up
++ * - removed from the domain's page_list
++ *
++ * If MEMF_no_refcount is not set, the domain's tot_pages will be
++ * adjusted.  If this results in the page count falling to 0,
++ * put_domain() will be called.
++ *
++ * The caller should either call free_domheap_page() to free the
++ * page, or assign_pages() to put it back on some domain's page list.
++ */
+ int steal_page(
+     struct domain *d, struct page_info *page, unsigned int memflags)
+ {
+     unsigned long x, y;
+     bool drop_dom_ref = false;
+-    const struct domain *owner = dom_xen;
++    const struct domain *owner;
++    int rc;
+ 
+     if ( paging_mode_external(d) )
+         return -EOPNOTSUPP;
+ 
+-    spin_lock(&d->page_alloc_lock);
+-
+-    if ( is_xen_heap_page(page) || ((owner = page_get_owner(page)) != d) )
++    /* Grab a reference to make sure the page doesn't change under our feet */
++    rc = -EINVAL;
++    if ( !(owner = page_get_owner_and_reference(page)) )
+         goto fail;
+ 
++    if ( owner != d || is_xen_heap_page(page) )
++        goto fail_put;
++
+     /*
+-     * We require there is just one reference (PGC_allocated). We temporarily
+-     * drop this reference now so that we can safely swizzle the owner.
++     * We require there are exactly two references -- the one we just
++     * took, and PGC_allocated. We temporarily drop both these
++     * references so that the page becomes effectively non-"live" for
++     * the domain.
+      */
+     y = page->count_info;
+     do {
+         x = y;
+-        if ( (x & (PGC_count_mask|PGC_allocated)) != (1 | PGC_allocated) )
+-            goto fail;
+-        y = cmpxchg(&page->count_info, x, x & ~PGC_count_mask);
++        if ( (x & (PGC_count_mask|PGC_allocated)) != (2 | PGC_allocated) )
++            goto fail_put;
++        y = cmpxchg(&page->count_info, x, x & ~(PGC_count_mask|PGC_allocated));
+     } while ( y != x );
+ 
+     /*
+-     * With the sole reference dropped temporarily, no-one can update type
+-     * information. Type count also needs to be zero in this case, but e.g.
+-     * PGT_seg_desc_page may still have PGT_validated set, which we need to
+-     * clear before transferring ownership (as validation criteria vary
+-     * depending on domain type).
++     * NB this is safe even if the page ends up being given back to
++     * the domain, because the count is zero: subsequent mappings will
++     * cause the cache attributes to be re-instated inside
++     * get_page_from_l1e().
++     */
++    if ( (rc = cleanup_page_cacheattr(page)) )
++    {
++        /*
++         * Couldn't fixup Xen's mappings; put things the way we found
++         * it and return an error
++         */
++        page->count_info |= PGC_allocated | 1;
++        goto fail;
++    }
++
++    /*
++     * With the reference count now zero, nobody can grab references
++     * to do anything else with the page.  Return the page to a state
++     * that it might be upon return from alloc_domheap_pages with
++     * MEMF_no_owner set.
+      */
++    spin_lock(&d->page_alloc_lock);
++
+     BUG_ON(page->u.inuse.type_info & (PGT_count_mask | PGT_locked |
+                                       PGT_pinned));
+     page->u.inuse.type_info = 0;
+-
+-    /* Swizzle the owner then reinstate the PGC_allocated reference. */
+     page_set_owner(page, NULL);
+-    y = page->count_info;
+-    do {
+-        x = y;
+-        BUG_ON((x & (PGC_count_mask|PGC_allocated)) != PGC_allocated);
+-    } while ( (y = cmpxchg(&page->count_info, x, x | 1)) != x );
++    page_list_del(page, &d->page_list);
+ 
+     /* Unlink from original owner. */
+     if ( !(memflags & MEMF_no_refcount) && !domain_adjust_tot_pages(d, -1) )
+         drop_dom_ref = true;
+-    page_list_del(page, &d->page_list);
+ 
+     spin_unlock(&d->page_alloc_lock);
++
+     if ( unlikely(drop_dom_ref) )
+         put_domain(d);
++
+     return 0;
+ 
++ fail_put:
++    put_page(page);
+  fail:
+-    spin_unlock(&d->page_alloc_lock);
+     gdprintk(XENLOG_WARNING, "Bad steal mfn %" PRI_mfn
+              " from d%d (owner d%d) caf=%08lx taf=%" PRtype_info "\n",
+              mfn_x(page_to_mfn(page)), d->domain_id,
+              owner ? owner->domain_id : DOMID_INVALID,
+              page->count_info, page->u.inuse.type_info);
+-    return -EINVAL;
++    return rc;
+ }
+ 
+ static int __do_update_va_mapping(
+diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
+index c0585d33f4..656fad1b42 100644
+--- xen/common/grant_table.c.orig
++++ xen/common/grant_table.c
+@@ -2179,7 +2179,7 @@ gnttab_transfer(
+             rcu_unlock_domain(e);
+         put_gfn_and_copyback:
+             put_gfn(d, gop.mfn);
+-            page->count_info &= ~(PGC_count_mask|PGC_allocated);
++            /* The count_info has already been cleaned */
+             free_domheap_page(page);
+             goto copyback;
+         }
+@@ -2202,10 +2202,9 @@ gnttab_transfer(
+ 
+             copy_domain_page(page_to_mfn(new_page), mfn);
+ 
+-            page->count_info &= ~(PGC_count_mask|PGC_allocated);
++            /* The count_info has already been cleared */
+             free_domheap_page(page);
+             page = new_page;
+-            page->count_info = PGC_allocated | 1;
+             mfn = page_to_mfn(page);
+         }
+ 
+@@ -2245,12 +2244,17 @@ gnttab_transfer(
+          */
+         spin_unlock(&e->page_alloc_lock);
+         okay = gnttab_prepare_for_transfer(e, d, gop.ref);
+-        spin_lock(&e->page_alloc_lock);
+ 
+-        if ( unlikely(!okay) || unlikely(e->is_dying) )
++        if ( unlikely(!okay || assign_pages(e, page, 0, MEMF_no_refcount)) )
+         {
+-            bool_t drop_dom_ref = !domain_adjust_tot_pages(e, -1);
++            bool drop_dom_ref;
+ 
++            /*
++             * Need to grab this again to safely free our "reserved"
++             * page in the page total
++             */
++            spin_lock(&e->page_alloc_lock);
++            drop_dom_ref = !domain_adjust_tot_pages(e, -1);
+             spin_unlock(&e->page_alloc_lock);
+ 
+             if ( okay /* i.e. e->is_dying due to the surrounding if() */ )
+@@ -2263,10 +2267,6 @@ gnttab_transfer(
+             goto unlock_and_copyback;
+         }
+ 
+-        page_list_add_tail(page, &e->page_list);
+-        page_set_owner(page, e);
+-
+-        spin_unlock(&e->page_alloc_lock);
+         put_gfn(d, gop.mfn);
+ 
+         TRACE_1D(TRC_MEM_PAGE_GRANT_TRANSFER, e->domain_id);
+diff --git a/xen/common/memory.c b/xen/common/memory.c
+index 4fb7962c79..f71163221f 100644
+--- xen/common/memory.c.orig
++++ xen/common/memory.c
+@@ -675,20 +675,22 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
+          * Success! Beyond this point we cannot fail for this chunk.
+          */
+ 
+-        /* Destroy final reference to each input page. */
++        /*
++         * These pages have already had owner and reference cleared.
++         * Do the final two steps: Remove from the physmap, and free
++         * them.
++         */
+         while ( (page = page_list_remove_head(&in_chunk_list)) )
+         {
+             unsigned long gfn;
+ 
+-            if ( !test_and_clear_bit(_PGC_allocated, &page->count_info) )
+-                BUG();
+             mfn = page_to_mfn(page);
+             gfn = mfn_to_gmfn(d, mfn_x(mfn));
+             /* Pages were unshared above */
+             BUG_ON(SHARED_M2P(gfn));
+             if ( guest_physmap_remove_page(d, _gfn(gfn), mfn, 0) )
+                 domain_crash(d);
+-            put_page(page);
++            free_domheap_page(page);
+         }
+ 
+         /* Assign each output page to the domain. */
+@@ -761,13 +763,16 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
+      * chunks succeeded.
+      */
+  fail:
+-    /* Reassign any input pages we managed to steal. */
++    /*
++     * Reassign any input pages we managed to steal.  NB that if the assign
++     * fails again, we're on the hook for freeing the page, since we've already
++     * cleared PGC_allocated.
++     */
+     while ( (page = page_list_remove_head(&in_chunk_list)) )
+         if ( assign_pages(d, page, 0, MEMF_no_refcount) )
+         {
+             BUG_ON(!d->is_dying);
+-            if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
+-                put_page(page);
++            free_domheap_page(page);
+         }
+ 
+  dying:
+diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
+index 482f0988f7..52da7762e3 100644
+--- xen/common/page_alloc.c.orig
++++ xen/common/page_alloc.c
+@@ -2221,7 +2221,7 @@ int assign_pages(
+     for ( i = 0; i < (1 << order); i++ )
+     {
+         ASSERT(page_get_owner(&pg[i]) == NULL);
+-        ASSERT((pg[i].count_info & ~(PGC_allocated | 1)) == 0);
++        ASSERT(!pg[i].count_info);
+         page_set_owner(&pg[i], d);
+         smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
+         pg[i].count_info = PGC_allocated | 1;
+-- 
+2.20.1
+
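
The central step of the rework above is the cmpxchg loop that insists on exactly
two references (the one just taken plus PGC_allocated) and drops both in a single
atomic update, so nobody can sneak in a new reference in between. A minimal,
self-contained sketch of that idiom using C11 atomics (COUNT_MASK, PGC_ALLOCATED
and struct page are simplified stand-ins, not the real Xen definitions):

#include <stdbool.h>
#include <stdatomic.h>

#define COUNT_MASK    0x00ffffffUL   /* stand-in for PGC_count_mask */
#define PGC_ALLOCATED (1UL << 31)    /* stand-in for PGC_allocated  */

struct page { _Atomic unsigned long count_info; };

/* Require "2 | PGC_allocated" and clear both the count and the flag at once;
 * on any interleaved change the compare-exchange fails and we re-evaluate. */
static bool drop_last_refs(struct page *pg)
{
    unsigned long x = atomic_load(&pg->count_info);

    do {
        if ( (x & (COUNT_MASK | PGC_ALLOCATED)) != (2 | PGC_ALLOCATED) )
            return false;   /* caller puts back the reference it took */
    } while ( !atomic_compare_exchange_weak(&pg->count_info, &x,
                                            x & ~(COUNT_MASK | PGC_ALLOCATED)) );

    return true;
}
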
Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA288
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA288:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA288   Thu Mar  7 11:13:27 2019
@@ -0,0 +1,310 @@
+$NetBSD: patch-XSA288,v 1.1 2019/03/07 11:13:27 bouyer Exp $
+
+From 5d3a02e320f88747b75e3794c2e694284ae64c3e Mon Sep 17 00:00:00 2001
+From: George Dunlap <george.dunlap%citrix.com@localhost>
+Date: Wed, 23 Jan 2019 11:57:46 +0000
+Subject: [PATCH] xen: Make coherent PV IOMMU discipline
+
+In order for a PV domain to set up DMA from a passed-through device to
+one of its pages, the page must be mapped in the IOMMU.  On the other
+hand, before a PV page may be used as a "special" page type (such as a
+pagetable or descriptor table), it _must not_ be writable in the IOMMU
+(otherwise a malicious guest could DMA arbitrary page tables into the
+memory, bypassing Xen's safety checks); and Xen's current rule is to
+have such pages not in the IOMMU at all.
+
+At the moment, in order to accomplish this, the code borrows HVM
+domain's "physmap" concept: When a page is assigned to a guest,
+guest_physmap_add_entry() is called, which for PV guests, will create
+a writable IOMMU mapping; and when a page is removed,
+guest_physmap_remove_entry() is called, which will remove the mapping.
+
+Additionally, when a page gains the PGT_writable page type, the page
+will be added into the IOMMU; and when the page changes away from a
+PGT_writable type, the page will be removed from the IOMMU.
+
+Unfortunately, borrowing the "physmap" concept from HVM domains is
+problematic.  HVM domains have a lock on their p2m tables, ensuring
+synchronization between modifications to the p2m; and all hypercall
+parameters must first be translated through the p2m before being used.
+
+Trying to mix this locked-and-gated approach with PV's lock-free
+approach leads to several races and inconsistencies:
+
+* A race between a page being assigned and it being put into the
+  physmap; for example:
+  - P1: call populate_physmap() { A = allocate_domheap_pages() }
+  - P2: Guess page A's mfn, and call decrease_reservation(A).  A is owned by the domain,
+        and so Xen will clear the PGC_allocated bit and free the page
+  - P1: finishes populate_physmap() { guest_physmap_add_entry() }
+
+  Now the domain has a writable IOMMU mapping to a page it no longer owns.
+
+* Pages start out as type PGT_none, but with a writable IOMMU mapping.
+  If a guest uses a page as a page table without ever having created a
+  writable mapping, the IOMMU mapping will not be removed; the guest
+  will have a writable IOMMU mapping to a page it is currently using
+  as a page table.
+
+* A newly-allocated page can be DMA'd into with no special actions on
+  the part of the guest; However, if a page is promoted to a
+  non-writable type, the page must be mapped with a writable type before
+  DMA'ing to it again, or the transaction will fail.
+
+To fix this, do away with the "PV physmap" concept entirely, and
+replace it with the following IOMMU discipline for PV guests:
+ - (type == PGT_writable) <=> in iommu (even if type_count == 0)
+ - Upon a final put_page(), check to see if type is PGT_writable; if so,
+   iommu_unmap.
+
+In order to achieve that:
+
+- Remove PV IOMMU related code from guest_physmap_*
+
+- Repurpose cleanup_page_cacheattr() into a general
+  cleanup_page_mappings() function, which will both fix up Xen
+  mappings for pages with special cache attributes, and also check for
+  a PGT_writable type and remove pages if appropriate.
+
+- For compatibility with current guests, grab-and-release a
+  PGT_writable_page type for PV guests in guest_physmap_add_entry().
+  This will cause most "normal" guest pages to start out life with
+  PGT_writable_page type (and thus an IOMMU mapping), but no type
+  count (so that they can be used as special cases at will).
+
+Also, note that there is one exception to the "PGT_writable => in
+iommu" rule: xenheap pages shared with guests may be given a
+PGT_writable type with one type reference.  This reference prevents
+the type from changing, which in turn prevents page from gaining an
+IOMMU mapping in get_page_type().  It's not clear whether this was
+intentional or not, but it's not something to change in a security
+update.
+
+This is XSA-288.
+
+Reported-by: Paul Durrant <paul.durrant%citrix.com@localhost>
+Signed-off-by: George Dunlap <george.dunlap%citrix.com@localhost>
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+---
+ xen/arch/x86/mm.c     | 95 +++++++++++++++++++++++++++++++++++++++----
+ xen/arch/x86/mm/p2m.c | 57 ++++++++++++--------------
+ 2 files changed, 111 insertions(+), 41 deletions(-)
+
+diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
+index d8ff58c901..ad8aacad68 100644
+--- xen/arch/x86/mm.c.orig
++++ xen/arch/x86/mm.c
+@@ -81,6 +81,22 @@
+  * OS's, which will generally use the WP bit to simplify copy-on-write
+  * implementation (in that case, OS wants a fault when it writes to
+  * an application-supplied buffer).
++ *
++ * PV domUs and IOMMUs:
++ * --------------------
++ * For a guest to be able to DMA into a page, that page must be in the
++ * domain's IOMMU.  However, we *must not* allow DMA into 'special'
++ * pages (such as page table pages, descriptor tables, &c); and we
++ * must also ensure that mappings are removed from the IOMMU when the
++ * page is freed.  Finally, it is inherently racy to make any changes
++ * based on a page with a non-zero type count.
++ *
++ * To that end, we put the page in the IOMMU only when a page gains
++ * the PGT_writeable type; and we remove the page when it loses the
++ * PGT_writeable type (not when the type count goes to zero).  This
++ * effectively protects the IOMMU status update with the type count we
++ * have just acquired.  We must also check for PGT_writable type when
++ * doing the final put_page(), and remove it from the iommu if so.
+  */
+ 
+ #include <xen/init.h>
+@@ -2275,19 +2291,79 @@ static int mod_l4_entry(l4_pgentry_t *pl4e,
+     return rc;
+ }
+ 
+-static int cleanup_page_cacheattr(struct page_info *page)
++/*
++ * In the course of a page's use, it may have caused other secondary
++ * mappings to have changed:
++ * - Xen's mappings may have been changed to accomodate the requested
++ *   cache attibutes
++ * - A page may have been put into the IOMMU of a PV guest when it
++ *   gained a writable mapping.
++ *
++ * Now that the page is being freed, clean up these mappings if
++ * appropriate.  NB that at this point the page is still "allocated",
++ * but not "live" (i.e., its refcount is 0), so it's safe to read the
++ * count_info, owner, and type_info without synchronization.
++ */
++static int cleanup_page_mappings(struct page_info *page)
+ {
+     unsigned int cacheattr =
+         (page->count_info & PGC_cacheattr_mask) >> PGC_cacheattr_base;
++    int rc = 0;
++    unsigned long mfn = mfn_x(page_to_mfn(page));
+ 
+-    if ( likely(cacheattr == 0) )
+-        return 0;
++    /*
++     * If we've modified xen mappings as a result of guest cache
++     * attributes, restore them to the "normal" state.
++     */
++    if ( unlikely(cacheattr) )
++    {
++        page->count_info &= ~PGC_cacheattr_mask;
+ 
+-    page->count_info &= ~PGC_cacheattr_mask;
++        BUG_ON(is_xen_heap_page(page));
+ 
+-    BUG_ON(is_xen_heap_page(page));
++        rc = update_xen_mappings(mfn, 0);
++    }
+ 
+-    return update_xen_mappings(mfn_x(page_to_mfn(page)), 0);
++    /*
++     * If this may be in a PV domain's IOMMU, remove it.
++     *
++     * NB that writable xenheap pages have their type set and cleared by
++     * implementation-specific code, rather than by get_page_type().  As such:
++     * - They aren't expected to have an IOMMU mapping, and
++     * - We don't necessarily expect the type count to be zero when the final
++     * put_page happens.
++     *
++     * Go ahead and attemp to call iommu_unmap() on xenheap pages anyway, just
++     * in case; but only ASSERT() that the type count is zero and remove the
++     * PGT_writable type for non-xenheap pages.
++     */
++    if ( (page->u.inuse.type_info & PGT_type_mask) == PGT_writable_page )
++    {
++        struct domain *d = page_get_owner(page);
++
++        if ( d && is_pv_domain(d) && unlikely(need_iommu(d)) )
++        {
++            int rc2 = iommu_unmap_page(d, mfn);
++
++            if ( !rc )
++                rc = rc2;
++        }
++
++        if ( likely(!is_xen_heap_page(page)) )
++        {
++            ASSERT((page->u.inuse.type_info &
++                    (PGT_type_mask | PGT_count_mask)) == PGT_writable_page);
++            /*
++             * Clear the type to record the fact that all writable mappings
++             * have been removed.  But if either operation failed, leave
++             * type_info alone.
++             */
++            if ( likely(!rc) )
++                page->u.inuse.type_info &= ~(PGT_type_mask | PGT_count_mask);
++        }
++    }
++
++    return rc;
+ }
+ 
+ void put_page(struct page_info *page)
+@@ -2303,7 +2379,7 @@ void put_page(struct page_info *page)
+ 
+     if ( unlikely((nx & PGC_count_mask) == 0) )
+     {
+-        if ( cleanup_page_cacheattr(page) == 0 )
++        if ( !cleanup_page_mappings(page) )
+             free_domheap_page(page);
+         else
+             gdprintk(XENLOG_WARNING,
+@@ -4020,9 +4096,10 @@ int steal_page(
+      * NB this is safe even if the page ends up being given back to
+      * the domain, because the count is zero: subsequent mappings will
+      * cause the cache attributes to be re-instated inside
+-     * get_page_from_l1e().
++     * get_page_from_l1e(), or the page to be added back to the IOMMU
++     * upon the type changing to PGT_writeable, as appropriate.
+      */
+-    if ( (rc = cleanup_page_cacheattr(page)) )
++    if ( (rc = cleanup_page_mappings(page)) )
+     {
+         /*
+          * Couldn't fixup Xen's mappings; put things the way we found
+diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
+index c53cab44d9..2b62bc61dd 100644
+--- xen/arch/x86/mm/p2m.c.orig
++++ xen/arch/x86/mm/p2m.c
+@@ -708,23 +708,9 @@ p2m_remove_page(struct p2m_domain *p2m, unsigned long gfn_l, unsigned long mfn,
+     p2m_type_t t;
+     p2m_access_t a;
+ 
++    /* IOMMU for PV guests is handled in get_page_type() and put_page(). */
+     if ( !paging_mode_translate(p2m->domain) )
+-    {
+-        int rc = 0;
+-
+-        if ( need_iommu(p2m->domain) )
+-        {
+-            for ( i = 0; i < (1 << page_order); i++ )
+-            {
+-                int ret = iommu_unmap_page(p2m->domain, mfn + i);
+-
+-                if ( !rc )
+-                    rc = ret;
+-            }
+-        }
+-
+-        return rc;
+-    }
++        return 0;
+ 
+     ASSERT(gfn_locked_by_me(p2m, gfn));
+     P2M_DEBUG("removing gfn=%#lx mfn=%#lx\n", gfn_l, mfn);
+@@ -769,26 +755,33 @@ guest_physmap_add_entry(struct domain *d, gfn_t gfn, mfn_t mfn,
+     int pod_count = 0;
+     int rc = 0;
+ 
++    /* IOMMU for PV guests is handled in get_page_type() and put_page(). */
+     if ( !paging_mode_translate(d) )
+     {
+-        if ( need_iommu(d) && t == p2m_ram_rw )
+-        {
+-            for ( i = 0; i < (1 << page_order); i++ )
+-            {
+-                rc = iommu_map_page(d, mfn_x(mfn_add(mfn, i)),
+-                                    mfn_x(mfn_add(mfn, i)),
+-                                    IOMMUF_readable|IOMMUF_writable);
+-                if ( rc != 0 )
+-                {
+-                    while ( i-- > 0 )
+-                        /* If statement to satisfy __must_check. */
+-                        if ( iommu_unmap_page(d, mfn_x(mfn_add(mfn, i))) )
+-                            continue;
++        struct page_info *page = mfn_to_page(mfn);
+ 
+-                    return rc;
+-                }
+-            }
++        /*
++         * Our interface for PV guests wrt IOMMU entries hasn't been very
++         * clear; but historically, pages have started out with IOMMU mappings,
++         * and only lose them when changed to a different page type.
++         *
++         * Retain this property by grabbing a writable type ref and then
++         * dropping it immediately.  The result will be pages that have a
++         * writable type (and an IOMMU entry), but a count of 0 (such that
++         * any guest-requested type changes succeed and remove the IOMMU
++         * entry).
++         */
++        if ( !need_iommu(d) || t != p2m_ram_rw )
++            return 0;
++
++        for ( i = 0; i < (1UL << page_order); ++i, ++page )
++        {
++            if ( get_page_and_type(page, d, PGT_writable_page) )
++                put_page_and_type(page);
++            else
++                return -EINVAL;
+         }
++
+         return 0;
+     }
+ 
+-- 
+2.20.1
+
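
The rule this patch establishes for PV guests is "(type == PGT_writable) <=> the
page is mapped in the IOMMU", with the final put_page() responsible for removing
the mapping. A short sketch of that cleanup check, using iommu_unmap(),
PGT_TYPE_MASK and PGT_WRITABLE_PAGE as hypothetical stand-ins for the real Xen
hook and constants:

#define PGT_TYPE_MASK     (7UL << 29)   /* stand-in for PGT_type_mask     */
#define PGT_WRITABLE_PAGE (5UL << 29)   /* stand-in for PGT_writable_page */

struct page { unsigned long type_info; unsigned long mfn; };

/* Hypothetical stand-in for iommu_unmap_page(); always succeeds here. */
static int iommu_unmap(unsigned long mfn) { (void)mfn; return 0; }

/* On the final reference drop, a still-writable page must lose its IOMMU
 * mapping; only once the unmap succeeds is the type cleared to record that. */
static int cleanup_on_final_put(struct page *pg)
{
    int rc = 0;

    if ( (pg->type_info & PGT_TYPE_MASK) == PGT_WRITABLE_PAGE )
    {
        rc = iommu_unmap(pg->mfn);
        if ( !rc )
            pg->type_info &= ~PGT_TYPE_MASK;
    }

    return rc;
}
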
Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA290-1
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA290-1:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA290-1 Thu Mar  7 11:13:27 2019
@@ -0,0 +1,239 @@
+$NetBSD: patch-XSA290-1,v 1.1 2019/03/07 11:13:27 bouyer Exp $
+
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: x86/mm: also allow L2 (un)validation to be preemptible
+
+Commit c612481d1c ("x86/mm: Plumbing to allow any PTE update to fail
+with -ERESTART") added assertions next to the {alloc,free}_l2_table()
+invocations to document (and validate in debug builds) that L2
+(un)validations are always preemptible.
+
+The assertion in free_page_type() was now observed to trigger when
+recursive L2 page tables get cleaned up.
+
+In particular put_page_from_l2e()'s assumption that _put_page_type()
+would always succeed is now wrong, resulting in a partially un-validated
+page left in a domain, which has no other means of getting cleaned up
+later on. If not causing any problems earlier, this would ultimately
+trigger the check for ->u.inuse.type_info having a zero count when
+freeing the page during cleanup after the domain has died.
+
+As a result it should be considered a mistake to not have extended
+preemption fully to L2 when it was added to L3/L4 table handling, which
+this change aims to correct.
+
+The validation side additions are done just for symmetry.
+
+This is part of XSA-290.
+
+Reported-by: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
+Tested-by: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+
+--- xen/arch/x86/mm.c.orig
++++ xen/arch/x86/mm.c
+@@ -1126,7 +1126,7 @@ get_page_from_l1e(
+ define_get_linear_pagetable(l2);
+ static int
+ get_page_from_l2e(
+-    l2_pgentry_t l2e, unsigned long pfn, struct domain *d)
++    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, int partial)
+ {
+     unsigned long mfn = l2e_get_pfn(l2e);
+     int rc;
+@@ -1141,7 +1141,8 @@ get_page_from_l2e(
+         return -EINVAL;
+     }
+ 
+-    rc = get_page_and_type_from_mfn(_mfn(mfn), PGT_l1_page_table, d, 0, 0);
++    rc = get_page_and_type_from_mfn(_mfn(mfn), PGT_l1_page_table, d,
++                                    partial, false);
+     if ( unlikely(rc == -EINVAL) && get_l2_linear_pagetable(l2e, pfn, d) )
+         rc = 0;
+ 
+@@ -1295,8 +1296,11 @@ void put_page_from_l1e(l1_pgentry_t l1e,
+  * NB. Virtual address 'l2e' maps to a machine address within frame 'pfn'.
+  * Note also that this automatically deals correctly with linear p.t.'s.
+  */
+-static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn)
++static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
++                             int partial, bool defer)
+ {
++    int rc = 0;
++
+     if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) || (l2e_get_pfn(l2e) == pfn) )
+         return 1;
+ 
+@@ -1311,13 +1315,27 @@ static int put_page_from_l2e(l2_pgentry_
+     else
+     {
+         struct page_info *pg = l2e_get_page(l2e);
+-        int rc = _put_page_type(pg, false, mfn_to_page(_mfn(pfn)));
++        struct page_info *ptpg = mfn_to_page(_mfn(pfn));
+ 
+-        ASSERT(!rc);
+-        put_page(pg);
++        if ( unlikely(partial > 0) )
++        {
++            ASSERT(!defer);
++            rc = _put_page_type(pg, true, ptpg);
++        }
++        else if ( defer )
++        {
++            current->arch.old_guest_ptpg = ptpg;
++            current->arch.old_guest_table = pg;
++        }
++        else
++        {
++            rc = _put_page_type(pg, true, ptpg);
++            if ( likely(!rc) )
++                put_page(pg);
++        }
+     }
+ 
+-    return 0;
++    return rc;
+ }
+ 
+ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
+@@ -1487,11 +1505,12 @@ static int alloc_l2_table(struct page_in
+     unsigned long  pfn = mfn_x(page_to_mfn(page));
+     l2_pgentry_t  *pl2e;
+     unsigned int   i;
+-    int            rc = 0;
++    int            rc = 0, partial = page->partial_pte;
+ 
+     pl2e = map_domain_page(_mfn(pfn));
+ 
+-    for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES; i++ )
++    for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
++          i++, partial = 0 )
+     {
+         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
+         {
+@@ -1501,23 +1520,33 @@ static int alloc_l2_table(struct page_in
+         }
+ 
+         if ( !is_guest_l2_slot(d, type, i) ||
+-             (rc = get_page_from_l2e(pl2e[i], pfn, d)) > 0 )
++             (rc = get_page_from_l2e(pl2e[i], pfn, d, partial)) > 0 )
+             continue;
+ 
+-        if ( unlikely(rc == -ERESTART) )
++        if ( rc == -ERESTART )
+         {
+             page->nr_validated_ptes = i;
+-            break;
++            page->partial_pte = partial ?: 1;
+         }
+-
+-        if ( rc < 0 )
++        else if ( rc == -EINTR && i )
++        {
++            page->nr_validated_ptes = i;
++            page->partial_pte = 0;
++            rc = -ERESTART;
++        }
++        else if ( rc < 0 && rc != -EINTR )
+         {
+             gdprintk(XENLOG_WARNING, "Failure in alloc_l2_table: slot %#x\n", i);
+-            while ( i-- > 0 )
+-                if ( is_guest_l2_slot(d, type, i) )
+-                    put_page_from_l2e(pl2e[i], pfn);
+-            break;
++            if ( i )
++            {
++                page->nr_validated_ptes = i;
++                page->partial_pte = 0;
++                current->arch.old_guest_ptpg = NULL;
++                current->arch.old_guest_table = page;
++            }
+         }
++        if ( rc < 0 )
++            break;
+ 
+         pl2e[i] = adjust_guest_l2e(pl2e[i], d);
+     }
+@@ -1797,28 +1826,50 @@ static int free_l2_table(struct page_inf
+     struct domain *d = page_get_owner(page);
+     unsigned long pfn = mfn_x(page_to_mfn(page));
+     l2_pgentry_t *pl2e;
+-    unsigned int  i = page->nr_validated_ptes - 1;
+-    int err = 0;
++    int rc = 0, partial = page->partial_pte;
++    unsigned int i = page->nr_validated_ptes - !partial;
+ 
+     pl2e = map_domain_page(_mfn(pfn));
+ 
+-    ASSERT(page->nr_validated_ptes);
+-    do {
+-        if ( is_guest_l2_slot(d, page->u.inuse.type_info, i) &&
+-             put_page_from_l2e(pl2e[i], pfn) == 0 &&
+-             i && hypercall_preempt_check() )
++    for ( ; ; )
++    {
++        if ( is_guest_l2_slot(d, page->u.inuse.type_info, i) )
++            rc = put_page_from_l2e(pl2e[i], pfn, partial, false);
++        if ( rc < 0 )
++            break;
++
++        partial = 0;
++
++        if ( !i-- )
++            break;
++
++        if ( hypercall_preempt_check() )
+         {
+-           page->nr_validated_ptes = i;
+-           err = -ERESTART;
++            rc = -EINTR;
++            break;
+         }
+-    } while ( !err && i-- );
++    }
+ 
+     unmap_domain_page(pl2e);
+ 
+-    if ( !err )
++    if ( rc >= 0 )
++    {
+         page->u.inuse.type_info &= ~PGT_pae_xen_l2;
++        rc = 0;
++    }
++    else if ( rc == -ERESTART )
++    {
++        page->nr_validated_ptes = i;
++        page->partial_pte = partial ?: -1;
++    }
++    else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
++    {
++        page->nr_validated_ptes = i + 1;
++        page->partial_pte = 0;
++        rc = -ERESTART;
++    }
+ 
+-    return err;
++    return rc;
+ }
+ 
+ static int free_l3_table(struct page_info *page)
+@@ -2138,7 +2189,7 @@ static int mod_l2_entry(l2_pgentry_t *pl
+             return -EBUSY;
+         }
+ 
+-        if ( unlikely((rc = get_page_from_l2e(nl2e, pfn, d)) < 0) )
++        if ( unlikely((rc = get_page_from_l2e(nl2e, pfn, d, 0)) < 0) )
+             return rc;
+ 
+         nl2e = adjust_guest_l2e(nl2e, d);
+@@ -2157,7 +2208,8 @@ static int mod_l2_entry(l2_pgentry_t *pl
+         return -EBUSY;
+     }
+ 
+-    put_page_from_l2e(ol2e, pfn);
++    put_page_from_l2e(ol2e, pfn, 0, true);
++
+     return rc;
+ }
+ 
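
The structural change above is the preemption discipline: instead of walking up
to 512 entries in one go, the loop records how far validation got and bails out
with -ERESTART so the hypercall can be continued later. A minimal, self-contained
sketch of that shape (validate_entry(), preempt_check(), ERESTART and struct
table are hypothetical stand-ins; the real code additionally tracks partially
handled entries):

#include <stdbool.h>

#define NR_ENTRIES 512
#define ERESTART   85   /* illustrative errno-style value */

struct table {
    unsigned int nr_validated;            /* like page->nr_validated_ptes */
    unsigned long entries[NR_ENTRIES];
};

/* Trivial stand-ins for get_page_from_l2e() and hypercall_preempt_check(). */
static int  validate_entry(unsigned long e) { (void)e; return 0; }
static bool preempt_check(void)             { return false; }

static int validate_table(struct table *t)
{
    unsigned int i;

    for ( i = t->nr_validated; i < NR_ENTRIES; i++ )
    {
        int rc;

        /* Only yield once at least one entry of this continuation is done. */
        if ( i > t->nr_validated && preempt_check() )
        {
            t->nr_validated = i;          /* resume here next time */
            return -ERESTART;
        }

        rc = validate_entry(t->entries[i]);
        if ( rc < 0 )
            return rc;                    /* caller unwinds the first i entries */
    }

    return 0;
}

The second half of the series (patch-XSA290-2 below) applies the same
preemption-check shape to the L3 validation and unvalidation loops.
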
Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA290-2
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA290-2:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA290-2 Thu Mar  7 11:13:27 2019
@@ -0,0 +1,73 @@
+$NetBSD: patch-XSA290-2,v 1.1 2019/03/07 11:13:27 bouyer Exp $
+
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: x86/mm: add explicit preemption checks to L3 (un)validation
+
+When recursive page tables are used at the L3 level, unvalidation of a
+single L4 table may incur unvalidation of two levels of L3 tables, i.e.
+a maximum iteration count of 512^3 for unvalidating an L4 table. The
+preemption check in free_l2_table() as well as the one in
+_put_page_type() may never be reached, so explicit checking is needed in
+free_l3_table().
+
+When recursive page tables are used at the L4 level, the iteration count
+at L4 alone is capped at 512^2. As soon as a present L3 entry is hit
+which itself needs unvalidation (and hence requiring another nested loop
+with 512 iterations), the preemption checks added here kick in, so no
+further preemption checking is needed at L4 (until we decide to permit
+5-level paging for PV guests).
+
+The validation side additions are done just for symmetry.
+
+This is part of XSA-290.
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+
+--- xen/arch/x86/mm.c.orig
++++ xen/arch/x86/mm.c
+@@ -1581,6 +1581,13 @@ static int alloc_l3_table(struct page_in
+     for ( i = page->nr_validated_ptes; i < L3_PAGETABLE_ENTRIES;
+           i++, partial = 0 )
+     {
++        if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
++        {
++            page->nr_validated_ptes = i;
++            rc = -ERESTART;
++            break;
++        }
++
+         if ( is_pv_32bit_domain(d) && (i == 3) )
+         {
+             if ( !(l3e_get_flags(pl3e[i]) & _PAGE_PRESENT) ||
+@@ -1882,15 +1889,25 @@ static int free_l3_table(struct page_inf
+ 
+     pl3e = map_domain_page(_mfn(pfn));
+ 
+-    do {
++    for ( ; ; )
++    {
+         rc = put_page_from_l3e(pl3e[i], pfn, partial, 0);
+         if ( rc < 0 )
+             break;
++
+         partial = 0;
+-        if ( rc > 0 )
+-            continue;
+-        pl3e[i] = unadjust_guest_l3e(pl3e[i], d);
+-    } while ( i-- );
++        if ( rc == 0 )
++            pl3e[i] = unadjust_guest_l3e(pl3e[i], d);
++
++        if ( !i-- )
++            break;
++
++        if ( hypercall_preempt_check() )
++        {
++            rc = -EINTR;
++            break;
++        }
++    }
+ 
+     unmap_domain_page(pl3e);
+ 
Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA291
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA291:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA291   Thu Mar  7 11:13:27 2019
@@ -0,0 +1,55 @@
+$NetBSD: patch-XSA291,v 1.1 2019/03/07 11:13:27 bouyer Exp $
+
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: x86/mm: don't retain page type reference when IOMMU operation fails
+
+The IOMMU update in _get_page_type() happens between recording of the
+new reference and validation of the page for its new type (if
+necessary). If the IOMMU operation fails, there's no point in actually
+carrying out validation. Furthermore, with this resulting in failure
+getting indicated to the caller, the recorded type reference also needs
+to be dropped again.
+
+Note that in case of failure of alloc_page_type() there's no need to
+undo the IOMMU operation: Only special types get handed to the function.
+The function, upon failure, clears ->u.inuse.type_info, effectively
+converting the page to PGT_none. The IOMMU mapping, however, solely
+depends on whether the type is PGT_writable_page.
+
+This is XSA-291.
+
+Reported-by: Igor Druzhinin <igor.druzhinin%citrix.com@localhost>
+Reported-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+
+--- xen/arch/x86/mm.c.orig
++++ xen/arch/x86/mm.c
+@@ -2751,6 +2751,13 @@ static int _get_page_type(struct page_in
+                 iommu_ret = iommu_map_page(d, gfn_x(gfn),
+                                            mfn_x(page_to_mfn(page)),
+                                            IOMMUF_readable|IOMMUF_writable);
++
++            if ( unlikely(iommu_ret) )
++            {
++                _put_page_type(page, false, NULL);
++                rc = iommu_ret;
++                goto out;
++            }
+         }
+     }
+ 
+@@ -2765,12 +2772,10 @@ static int _get_page_type(struct page_in
+         rc = alloc_page_type(page, type, preemptible);
+     }
+ 
++ out:
+     if ( (x & PGT_partial) && !(nx & PGT_partial) )
+         put_page(page);
+ 
+-    if ( !rc )
+-        rc = iommu_ret;
+-
+     return rc;
+ }
+ 
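
The point of the reordering above is that the writable type reference recorded by
_get_page_type() must not survive a failed IOMMU update, and validation must not
even be attempted in that case. The resulting control flow, sketched with
iommu_map_writable(), drop_type_ref() and validate_page() as hypothetical
stand-ins for the pieces the real function coordinates:

/* Hypothetical stand-ins; always succeed here. */
static int  iommu_map_writable(unsigned long mfn) { (void)mfn; return 0; }
static void drop_type_ref(unsigned long mfn)      { (void)mfn; }
static int  validate_page(unsigned long mfn)      { (void)mfn; return 0; }

static int get_writable_type(unsigned long mfn)
{
    int rc = iommu_map_writable(mfn);

    if ( rc )
    {
        drop_type_ref(mfn);   /* undo the reference taken earlier */
        return rc;            /* and report the failure to the caller */
    }

    return validate_page(mfn);
}
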
Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA292
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA292:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA292   Thu Mar  7 11:13:27 2019
@@ -0,0 +1,97 @@
+$NetBSD: patch-XSA292,v 1.1 2019/03/07 11:13:27 bouyer Exp $
+
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: x86/mm: properly flush TLB in switch_cr3_cr4()
+
+The CR3 values used for contexts run with PCID enabled uniformly have
+CR3.NOFLUSH set, resulting in the CR3 write itself to not cause any
+flushing at all. When the second CR4 write is skipped or doesn't do any
+flushing, there's nothing so far which would purge TLB entries which may
+have accumulated again if the PCID doesn't change; the "just in case"
+flush only affects the case where the PCID actually changes. (There may
+be particularly many TLB entries re-accumulated in case of a watchdog
+NMI kicking in during the critical time window.)
+
+Suppress the no-flush behavior of the CR3 write in this particular case.
+
+Similarly the second CR4 write may not cause any flushing of TLB entries
+established again while the original PCID was still in use - it may get
+performed because of unrelated bits changing. The flush of the old PCID
+needs to happen nevertheless.
+
+At the same time also eliminate a possible race with lazy context
+switch: Just like for CR4, CR3 may change at any time while interrupts
+are enabled, due to the __sync_local_execstate() invocation from the
+flush IPI handler. It is for that reason that the CR3 read, just like
+the CR4 one, must happen only after interrupts have been turned off.
+
+This is XSA-292.
+
+Reported-by: Sergey Dyasli <sergey.dyasli%citrix.com@localhost>
+Reported-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Tested-by: Sergey Dyasli <sergey.dyasli%citrix.com@localhost>
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+---
+v3: Adjust comments. Drop old_cr4 from the PGE check in the expression
+    controlling the invocation of invpcid_flush_single_context(), as PGE
+    is always clear there.
+v2: Decouple invpcid_flush_single_context() from 2nd CR4 write.
+
+--- xen/arch/x86/flushtlb.c.orig
++++ xen/arch/x86/flushtlb.c
+@@ -103,9 +103,8 @@ static void do_tlb_flush(void)
+ 
+ void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
+ {
+-    unsigned long flags, old_cr4;
++    unsigned long flags, old_cr4, old_pcid;
+     u32 t;
+-    unsigned long old_pcid = cr3_pcid(read_cr3());
+ 
+     /* This non-reentrant function is sometimes called in interrupt context. */
+     local_irq_save(flags);
+@@ -133,15 +132,38 @@ void switch_cr3_cr4(unsigned long cr3, u
+          */
+         invpcid_flush_all_nonglobals();
+ 
++    /*
++     * If we don't change PCIDs, the CR3 write below needs to flush this very
++     * PCID, even when a full flush was performed above, as we are currently
++     * accumulating TLB entries again from the old address space.
++     * NB: Clearing the bit when we don't use PCID is benign (as it is clear
++     * already in that case), but allows the if() to be more simple.
++     */
++    old_pcid = cr3_pcid(read_cr3());
++    if ( old_pcid == cr3_pcid(cr3) )
++        cr3 &= ~X86_CR3_NOFLUSH;
++
+     write_cr3(cr3);
+ 
+     if ( old_cr4 != cr4 )
+         write_cr4(cr4);
+-    else if ( old_pcid != cr3_pcid(cr3) )
+-        /*
+-         * Make sure no TLB entries related to the old PCID created between
+-         * flushing the TLB and writing the new %cr3 value remain in the TLB.
+-         */
++
++    /*
++     * Make sure no TLB entries related to the old PCID created between
++     * flushing the TLB and writing the new %cr3 value remain in the TLB.
++     *
++     * The write to CR4 just above has performed a wider flush in certain
++     * cases, which therefore get excluded here. Since that write is
++     * conditional, note in particular that it won't be skipped if PCIDE
++     * transitions from 1 to 0. This is because the CR4 write further up will
++     * have been skipped in this case, as PCIDE and PGE won't both be set at
++     * the same time.
++     *
++     * Note also that PGE is always clear in old_cr4.
++     */
++    if ( old_pcid != cr3_pcid(cr3) &&
++         !(cr4 & X86_CR4_PGE) &&
++         (old_cr4 & X86_CR4_PCIDE) <= (cr4 & X86_CR4_PCIDE) )
+         invpcid_flush_single_context(old_pcid);
+ 
+     post_flush(t);
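
Both added hunks revolve around the architectural CR3 layout with PCIDs enabled:
bits 0-11 of CR3 hold the PCID and bit 63 is the "no flush" hint, so reusing the
current PCID requires clearing that hint for the CR3 write itself to flush. A
minimal sketch of that adjustment (adjust_cr3_for_switch() is an illustrative
helper, not a Xen function):

#include <stdint.h>

#define X86_CR3_PCID_MASK UINT64_C(0xfff)        /* PCID lives in bits 0-11 */
#define X86_CR3_NOFLUSH   (UINT64_C(1) << 63)    /* "don't flush this PCID" */

static inline uint64_t cr3_pcid(uint64_t cr3)
{
    return cr3 & X86_CR3_PCID_MASK;
}

/* If the PCID is unchanged, the NOFLUSH hint must be dropped so the CR3
 * write purges TLB entries accumulated under that PCID in the meantime. */
static uint64_t adjust_cr3_for_switch(uint64_t old_cr3, uint64_t new_cr3)
{
    if ( cr3_pcid(old_cr3) == cr3_pcid(new_cr3) )
        new_cr3 &= ~X86_CR3_NOFLUSH;

    return new_cr3;
}
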
Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA293-1
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA293-1:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA293-1 Thu Mar  7 11:13:27 2019
@@ -0,0 +1,319 @@
+$NetBSD: patch-XSA293-1,v 1.1 2019/03/07 11:13:27 bouyer Exp $
+
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86/pv: Rewrite guest %cr4 handling from scratch
+
+The PV cr4 logic is almost impossible to follow, and leaks bits into guest
+context which definitely shouldn't be visible (in particular, VMXE).
+
+The biggest problem however, and source of the complexity, is that it derives
+new real and guest cr4 values from the current value in hardware - this is
+context dependent and an inappropriate source of information.
+
+Rewrite the cr4 logic to be invariant of the current value in hardware.
+
+First of all, modify write_ptbase() to always use mmu_cr4_features for IDLE
+and HVM contexts.  mmu_cr4_features *is* the correct value to use, and makes
+the ASSERT() obviously redundant.
+
+For PV guests, curr->arch.pv.ctrlreg[4] remains the guests view of cr4, but
+all logic gets reworked in terms of this and mmu_cr4_features only.
+
+Two masks are introduced; bits which the guest has control over, and bits
+which are forwarded from Xen's settings.  One guest-visible change here is
+that Xen's VMXE setting is no longer visible at all.
+
+pv_make_cr4() follows fairly closely from pv_guest_cr4_to_real_cr4(), but
+deliberately starts with mmu_cr4_features, and only alters the minimal subset
+of bits.
+
+The boot-time {compat_,}pv_cr4_mask variables are removed, as they are a
+remnant of the pre-CPUID policy days.  pv_fixup_guest_cr4() gains a related
+derivation from the policy.
+
+Another guest visible change here is that a 32bit PV guest can now flip
+FSGSBASE in its view of CR4.  While the {RD,WR}{FS,GS}BASE instructions are
+unusable outside of a 64bit code segment, the ability to modify FSGSBASE
+matches real hardware behaviour, and avoids the need for any 32bit/64bit
+differences in the logic.
+
+Overall, this patch shouldn't have a practical change in guest behaviour.
+VMXE will disappear from view, and an inquisitive 32bit kernel can now see
+FSGSBASE changing, but this new logic is otherwise bug-compatible with before.
+
+This is part of XSA-293
+
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+
+diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
+index b1e50d1..675152a 100644
+--- xen/arch/x86/domain.c.orig
++++ xen/arch/x86/domain.c
+@@ -733,49 +733,6 @@ int arch_domain_soft_reset(struct domain *d)
+     return ret;
+ }
+ 
+-/*
+- * These are the masks of CR4 bits (subject to hardware availability) which a
+- * PV guest may not legitimiately attempt to modify.
+- */
+-static unsigned long __read_mostly pv_cr4_mask, compat_pv_cr4_mask;
+-
+-static int __init init_pv_cr4_masks(void)
+-{
+-    unsigned long common_mask = ~X86_CR4_TSD;
+-
+-    /*
+-     * All PV guests may attempt to modify TSD, DE and OSXSAVE.
+-     */
+-    if ( cpu_has_de )
+-        common_mask &= ~X86_CR4_DE;
+-    if ( cpu_has_xsave )
+-        common_mask &= ~X86_CR4_OSXSAVE;
+-
+-    pv_cr4_mask = compat_pv_cr4_mask = common_mask;
+-
+-    /*
+-     * 64bit PV guests may attempt to modify FSGSBASE.
+-     */
+-    if ( cpu_has_fsgsbase )
+-        pv_cr4_mask &= ~X86_CR4_FSGSBASE;
+-
+-    return 0;
+-}
+-__initcall(init_pv_cr4_masks);
+-
+-unsigned long pv_guest_cr4_fixup(const struct vcpu *v, unsigned long guest_cr4)
+-{
+-    unsigned long hv_cr4 = real_cr4_to_pv_guest_cr4(read_cr4());
+-    unsigned long mask = is_pv_32bit_vcpu(v) ? compat_pv_cr4_mask : pv_cr4_mask;
+-
+-    if ( (guest_cr4 & mask) != (hv_cr4 & mask) )
+-        printk(XENLOG_G_WARNING
+-               "d%d attempted to change %pv's CR4 flags %08lx -> %08lx\n",
+-               current->domain->domain_id, v, hv_cr4, guest_cr4);
+-
+-    return (hv_cr4 & mask) | (guest_cr4 & ~mask);
+-}
+-
+ #define xen_vcpu_guest_context vcpu_guest_context
+ #define fpu_ctxt fpu_ctxt.x
+ CHECK_FIELD_(struct, vcpu_guest_context, fpu_ctxt);
+@@ -789,7 +746,7 @@ int arch_set_info_guest(
+     struct domain *d = v->domain;
+     unsigned long cr3_gfn;
+     struct page_info *cr3_page;
+-    unsigned long flags, cr4;
++    unsigned long flags;
+     unsigned int i;
+     int rc = 0, compat;
+ 
+@@ -978,9 +935,8 @@ int arch_set_info_guest(
+     v->arch.pv_vcpu.ctrlreg[0] &= X86_CR0_TS;
+     v->arch.pv_vcpu.ctrlreg[0] |= read_cr0() & ~X86_CR0_TS;
+ 
+-    cr4 = v->arch.pv_vcpu.ctrlreg[4];
+-    v->arch.pv_vcpu.ctrlreg[4] = cr4 ? pv_guest_cr4_fixup(v, cr4) :
+-        real_cr4_to_pv_guest_cr4(mmu_cr4_features);
++    v->arch.pv_vcpu.ctrlreg[4] =
++        pv_fixup_guest_cr4(v, v->arch.pv_vcpu.ctrlreg[4]);
+ 
+     memset(v->arch.debugreg, 0, sizeof(v->arch.debugreg));
+     for ( i = 0; i < 8; i++ )
+diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
+index 6509035..08634b7 100644
+--- xen/arch/x86/mm.c.orig
++++ xen/arch/x86/mm.c
+@@ -505,33 +505,13 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
+         v->arch.cr3 |= get_pcid_bits(v, false);
+ }
+ 
+-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v)
+-{
+-    const struct domain *d = v->domain;
+-    unsigned long cr4;
+-
+-    cr4 = v->arch.pv_vcpu.ctrlreg[4] & ~X86_CR4_DE;
+-    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
+-                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
+-
+-    if ( d->arch.pv_domain.pcid )
+-        cr4 |= X86_CR4_PCIDE;
+-    else if ( !d->arch.pv_domain.xpti )
+-        cr4 |= X86_CR4_PGE;
+-
+-    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
+-
+-    return cr4;
+-}
+-
+ void write_ptbase(struct vcpu *v)
+ {
+     struct cpu_info *cpu_info = get_cpu_info();
+     unsigned long new_cr4;
+ 
+     new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
+-              ? pv_guest_cr4_to_real_cr4(v)
+-              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
++              ? pv_make_cr4(v) : mmu_cr4_features;
+ 
+     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
+     {
+@@ -550,8 +530,6 @@ void write_ptbase(struct vcpu *v)
+         switch_cr3_cr4(v->arch.cr3, new_cr4);
+         cpu_info->pv_cr3 = 0;
+     }
+-
+-    ASSERT(is_pv_vcpu(v) || read_cr4() == mmu_cr4_features);
+ }
+ 
+ /*
+diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
+index b75ff6b..3965959 100644
+--- xen/arch/x86/pv/domain.c.orig
++++ xen/arch/x86/pv/domain.c
+@@ -97,6 +97,52 @@ static void release_compat_l4(struct vcpu *v)
+     v->arch.guest_table_user = pagetable_null();
+ }
+ 
++unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4)
++{
++    const struct cpuid_policy *p = v->domain->arch.cpuid;
++
++    /* Discard attempts to set guest controllable bits outside of the policy. */
++    cr4 &= ~((p->basic.tsc     ? 0 : X86_CR4_TSD)      |
++             (p->basic.de      ? 0 : X86_CR4_DE)       |
++             (p->feat.fsgsbase ? 0 : X86_CR4_FSGSBASE) |
++             (p->basic.xsave   ? 0 : X86_CR4_OSXSAVE));
++
++    /* Masks expected to be disjoint sets. */
++    BUILD_BUG_ON(PV_CR4_GUEST_MASK & PV_CR4_GUEST_VISIBLE_MASK);
++
++    /*
++     * A guest sees the policy subset of its own choice of guest controllable
++     * bits, and a subset of Xen's choice of certain hardware settings.
++     */
++    return ((cr4 & PV_CR4_GUEST_MASK) |
++            (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK));
++}
++
++unsigned long pv_make_cr4(const struct vcpu *v)
++{
++    const struct domain *d = v->domain;
++    unsigned long cr4 = mmu_cr4_features &
++        ~(X86_CR4_PCIDE | X86_CR4_PGE | X86_CR4_TSD);
++
++    /*
++     * PCIDE or PGE depends on the PCID/XPTI settings, but must not both be
++     * set, as it impacts the safety of TLB flushing.
++     */
++    if ( d->arch.pv_domain.pcid )
++        cr4 |= X86_CR4_PCIDE;
++    else if ( !d->arch.pv_domain.xpti )
++        cr4 |= X86_CR4_PGE;
++
++    /*
++     * TSD is needed if either the guest has elected to use it, or Xen is
++     * virtualising the TSC value the guest sees.
++     */
++    if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
++        cr4 |= X86_CR4_TSD;
++
++    return cr4;
++}
++
+ int switch_compat(struct domain *d)
+ {
+     struct vcpu *v;
+@@ -191,7 +237,7 @@ int pv_vcpu_initialise(struct vcpu *v)
+     /* PV guests by default have a 100Hz ticker. */
+     v->periodic_period = MILLISECS(10);
+ 
+-    v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
++    v->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(v, 0);
+ 
+     if ( is_pv_32bit_domain(d) )
+     {
+diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
+index ce2ec76..4abbc14 100644
+--- xen/arch/x86/pv/emul-priv-op.c.orig
++++ xen/arch/x86/pv/emul-priv-op.c
+@@ -32,6 +32,7 @@
+ #include <asm/hypercall.h>
+ #include <asm/mc146818rtc.h>
+ #include <asm/p2m.h>
++#include <asm/pv/domain.h>
+ #include <asm/pv/traps.h>
+ #include <asm/shared.h>
+ #include <asm/traps.h>
+@@ -785,8 +786,8 @@ static int write_cr(unsigned int reg, unsigned long val,
+     }
+ 
+     case 4: /* Write CR4 */
+-        curr->arch.pv_vcpu.ctrlreg[4] = pv_guest_cr4_fixup(curr, val);
+-        write_cr4(pv_guest_cr4_to_real_cr4(curr));
++        curr->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
++        write_cr4(pv_make_cr4(curr));
+         ctxt_switch_levelling(curr);
+         return X86EMUL_OKAY;
+     }
+diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
+index ec81d78..c8aa8a5 100644
+--- xen/include/asm-x86/domain.h.orig
++++ xen/include/asm-x86/domain.h
+@@ -610,17 +610,6 @@ bool update_secondary_system_time(struct vcpu *,
+ void vcpu_show_execution_state(struct vcpu *);
+ void vcpu_show_registers(const struct vcpu *);
+ 
+-/* Clean up CR4 bits that are not under guest control. */
+-unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
+-
+-/* Convert between guest-visible and real CR4 values. */
+-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v);
+-
+-#define real_cr4_to_pv_guest_cr4(c)                         \
+-    ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
+-             X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
+-             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
+-
+ #define domain_max_vcpus(d) (is_hvm_domain(d) ? HVM_MAX_VCPUS : MAX_VIRT_CPUS)
+ 
+ static inline struct vcpu_guest_context *alloc_vcpu_guest_context(void)
+diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
+index 4fea764..4e4710c 100644
+--- xen/include/asm-x86/pv/domain.h.orig
++++ xen/include/asm-x86/pv/domain.h
+@@ -59,6 +59,23 @@ int pv_vcpu_initialise(struct vcpu *v);
+ void pv_domain_destroy(struct domain *d);
+ int pv_domain_initialise(struct domain *d);
+ 
++/*
++ * Bits which a PV guest can toggle in its view of cr4.  Some are loaded into
++ * hardware, while some are fully emulated.
++ */
++#define PV_CR4_GUEST_MASK \
++    (X86_CR4_TSD | X86_CR4_DE | X86_CR4_FSGSBASE | X86_CR4_OSXSAVE)
++
++/* Bits which a PV guest may observe from the real hardware settings. */
++#define PV_CR4_GUEST_VISIBLE_MASK \
++    (X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT)
++
++/* Given a new cr4 value, construct the resulting guest-visible cr4 value. */
++unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4);
++
++/* Create a cr4 value to load into hardware, based on vcpu settings. */
++unsigned long pv_make_cr4(const struct vcpu *v);
++
+ #else  /* !CONFIG_PV */
+ 
+ #include <xen/errno.h>
+@@ -68,6 +85,8 @@ static inline int pv_vcpu_initialise(struct vcpu *v) { return -EOPNOTSUPP; }
+ static inline void pv_domain_destroy(struct domain *d) {}
+ static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
+ 
++static inline unsigned long pv_make_cr4(const struct vcpu *v) { return ~0ul; }
++
+ #endif        /* CONFIG_PV */
+ 
+ void paravirt_ctxt_switch_from(struct vcpu *v);
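
The interplay of the two masks is easiest to see in isolation.  The following
is only a standalone sketch of the pv_fixup_guest_cr4() arithmetic, not the
patch itself: the CR4 bit values are the architectural positions rather than
anything taken from this diff, the CPUID-policy clamping step is omitted, and
fixup_guest_cr4()/main() are invented names.

    #include <stdio.h>

    #define X86_CR4_TSD         (1ul << 2)
    #define X86_CR4_DE          (1ul << 3)
    #define X86_CR4_PAE         (1ul << 5)
    #define X86_CR4_MCE         (1ul << 6)
    #define X86_CR4_OSFXSR      (1ul << 9)
    #define X86_CR4_OSXMMEXCPT  (1ul << 10)
    #define X86_CR4_FSGSBASE    (1ul << 16)
    #define X86_CR4_OSXSAVE     (1ul << 18)

    /* Bits the PV guest may toggle in its own view of CR4. */
    #define PV_CR4_GUEST_MASK \
        (X86_CR4_TSD | X86_CR4_DE | X86_CR4_FSGSBASE | X86_CR4_OSXSAVE)

    /* Bits the guest merely observes from Xen's own setting. */
    #define PV_CR4_GUEST_VISIBLE_MASK \
        (X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT)

    /*
     * Keep only the guest-controllable bits of the requested value and merge
     * in the visible subset of Xen's features.  (The real function first
     * discards guest-controllable bits not offered by the CPUID policy.)
     */
    static unsigned long fixup_guest_cr4(unsigned long guest_cr4,
                                         unsigned long mmu_cr4_features)
    {
        return (guest_cr4 & PV_CR4_GUEST_MASK) |
               (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK);
    }

    int main(void)
    {
        unsigned long host = X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR;

        /*
         * The guest keeps its OSXSAVE choice; an attempt to see VMXE
         * (bit 13) or any other bit outside the masks is simply dropped.
         */
        printf("%#lx\n", fixup_guest_cr4(X86_CR4_OSXSAVE | (1ul << 13), host));
        return 0;
    }

Because neither mask includes VMXE, Xen's own VMXE setting can no longer leak
into the guest's view of CR4, which is the guest-visible change the
description mentions.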
Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA293-2
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA293-2:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA293-2 Thu Mar  7 11:13:27 2019
@@ -0,0 +1,262 @@
+$NetBSD: patch-XSA293-2,v 1.1 2019/03/07 11:13:27 bouyer Exp $
+
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86/pv: Don't have %cr4.fsgsbase active behind a guest kernel's back
+
+Currently, a 64bit PV guest can appear to set and clear FSGSBASE in %cr4, but
+the bit remains set in hardware.  Therefore, the {RD,WR}{FS,GS}BASE instructions
+are usable even when the guest kernel believes that they are disabled.
+
+The FSGSBASE feature isn't currently supported in Linux, and its context
+switch path has some optimisations which rely on userspace being unable to use
+the WR{FS,GS}BASE instructions.  Xen's current behaviour undermines this
+expectation.
+
+In 64bit PV guest context, always load the guest kernel's setting of FSGSBASE
+into %cr4.  This requires adjusting how Xen uses the {RD,WR}{FS,GS}BASE
+instructions.
+
+ * Delete the cpu_has_fsgsbase helper.  It is no longer safe, as users need to
+   check %cr4 directly.
+ * The raw __rd{fs,gs}base() helpers are only safe to use when %cr4.fsgsbase
+   is set.  Comment this property.
+ * The {rd,wr}{fs,gs}{base,shadow}() and read_msr() helpers are updated to use
+   the current %cr4 value to determine which mechanism to use.
+ * toggle_guest_mode() and save_segments() are updated to avoid reading
+   fs/gsbase if the values in hardware cannot be stale WRT struct vcpu.  A
+   consequence of this is that the write_cr() path needs to cache the current
+   bases, as subsequent context switches will skip saving the values.
+ * write_cr4() is updated to ensure that the shadow %cr4.fsgsbase value is
+   observed in a safe way WRT the hardware setting, if an interrupt happens to
+   hit in the middle.
+ * pv_make_cr4() is updated for 64bit PV guests to use the guest kernel's
+   choice of FSGSBASE.
+
+This is part of XSA-293
+
+Reported-by: Andy Lutomirski <luto%kernel.org@localhost>
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+
+diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
+index 675152a..29f892c 100644
+--- xen/arch/x86/domain.c.orig
++++ xen/arch/x86/domain.c
+@@ -1433,7 +1433,8 @@ static void save_segments(struct vcpu *v)
+     regs->fs = read_sreg(fs);
+     regs->gs = read_sreg(gs);
+ 
+-    if ( cpu_has_fsgsbase && !is_pv_32bit_vcpu(v) )
++    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
++    if ( (read_cr4() & X86_CR4_FSGSBASE) && !is_pv_32bit_vcpu(v) )
+     {
+         v->arch.pv_vcpu.fs_base = __rdfsbase();
+         if ( v->arch.flags & TF_kernel_mode )
+diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
+index 3965959..228a174 100644
+--- xen/arch/x86/pv/domain.c.orig
++++ xen/arch/x86/pv/domain.c
+@@ -140,6 +140,16 @@ unsigned long pv_make_cr4(const struct vcpu *v)
+     if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
+         cr4 |= X86_CR4_TSD;
+ 
++    /*
++     * The {RD,WR}{FS,GS}BASE are only useable in 64bit code segments.  While
++     * we must not have CR4.FSGSBASE set behind the back of a 64bit PV kernel,
++     * we do leave it set in 32bit PV context to speed up Xen's context switch
++     * path.
++     */
++    if ( !is_pv_32bit_domain(d) &&
++         !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) )
++        cr4 &= ~X86_CR4_FSGSBASE;
++
+     return cr4;
+ }
+ 
+@@ -375,7 +385,8 @@ void toggle_guest_mode(struct vcpu *v)
+ {
+     ASSERT(!is_pv_32bit_vcpu(v));
+ 
+-    if ( cpu_has_fsgsbase )
++    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
++    if ( read_cr4() & X86_CR4_FSGSBASE )
+     {
+         if ( v->arch.flags & TF_kernel_mode )
+             v->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
+diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
+index 4abbc14..312c1ee 100644
+--- xen/arch/x86/pv/emul-priv-op.c.orig
++++ xen/arch/x86/pv/emul-priv-op.c
+@@ -786,6 +786,17 @@ static int write_cr(unsigned int reg, unsigned long val,
+     }
+ 
+     case 4: /* Write CR4 */
++        /*
++         * If this write will disable FSGSBASE, refresh Xen's idea of the
++         * guest bases now that they can no longer change.
++         */
++        if ( (curr->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) &&
++             !(val & X86_CR4_FSGSBASE) )
++        {
++            curr->arch.pv_vcpu.fs_base = __rdfsbase();
++            curr->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
++        }
++
+         curr->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
+         write_cr4(pv_make_cr4(curr));
+         ctxt_switch_levelling(curr);
+@@ -835,14 +846,15 @@ static int read_msr(unsigned int reg, uint64_t *val,
+     case MSR_FS_BASE:
+         if ( is_pv_32bit_domain(currd) )
+             break;
+-        *val = cpu_has_fsgsbase ? __rdfsbase() : curr->arch.pv_vcpu.fs_base;
++        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdfsbase()
++                                               : curr->arch.pv_vcpu.fs_base;
+         return X86EMUL_OKAY;
+ 
+     case MSR_GS_BASE:
+         if ( is_pv_32bit_domain(currd) )
+             break;
+-        *val = cpu_has_fsgsbase ? __rdgsbase()
+-                                : curr->arch.pv_vcpu.gs_base_kernel;
++        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdgsbase()
++                                               : curr->arch.pv_vcpu.gs_base_kernel;
+         return X86EMUL_OKAY;
+ 
+     case MSR_SHADOW_GS_BASE:
+diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
+index ecb0149..a353d76 100644
+--- xen/arch/x86/setup.c.orig
++++ xen/arch/x86/setup.c
+@@ -1567,7 +1567,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
+ 
+     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
+ 
+-    if ( cpu_has_fsgsbase )
++    if ( boot_cpu_has(X86_FEATURE_FSGSBASE) )
+         set_in_cr4(X86_CR4_FSGSBASE);
+ 
+     if ( opt_invpcid && cpu_has_invpcid )
+diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
+index b237da1..861cb0a 100644
+--- xen/include/asm-x86/cpufeature.h.orig
++++ xen/include/asm-x86/cpufeature.h
+@@ -90,7 +90,6 @@
+ #define cpu_has_xsaves          boot_cpu_has(X86_FEATURE_XSAVES)
+ 
+ /* CPUID level 0x00000007:0.ebx */
+-#define cpu_has_fsgsbase        boot_cpu_has(X86_FEATURE_FSGSBASE)
+ #define cpu_has_bmi1            boot_cpu_has(X86_FEATURE_BMI1)
+ #define cpu_has_hle             boot_cpu_has(X86_FEATURE_HLE)
+ #define cpu_has_avx2            boot_cpu_has(X86_FEATURE_AVX2)
+diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
+index afbeb7f..1ba6ee3 100644
+--- xen/include/asm-x86/msr.h.orig
++++ xen/include/asm-x86/msr.h
+@@ -120,6 +120,14 @@ static inline uint64_t rdtsc_ordered(void)
+                         : "=a" (low), "=d" (high) \
+                         : "c" (counter))
+ 
++/*
++ * On hardware supporting FSGSBASE, the value loaded into hardware is the
++ * guest kernel's choice for 64bit PV guests (Xen's choice for Idle, HVM and
++ * 32bit PV).
++ *
++ * Therefore, the {RD,WR}{FS,GS}BASE instructions are only safe to use if
++ * %cr4.fsgsbase is set.
++ */
+ static inline unsigned long __rdfsbase(void)
+ {
+     unsigned long base;
+@@ -150,7 +158,7 @@ static inline unsigned long rdfsbase(void)
+ {
+     unsigned long base;
+ 
+-    if ( cpu_has_fsgsbase )
++    if ( read_cr4() & X86_CR4_FSGSBASE )
+         return __rdfsbase();
+ 
+     rdmsrl(MSR_FS_BASE, base);
+@@ -162,7 +170,7 @@ static inline unsigned long rdgsbase(void)
+ {
+     unsigned long base;
+ 
+-    if ( cpu_has_fsgsbase )
++    if ( read_cr4() & X86_CR4_FSGSBASE )
+         return __rdgsbase();
+ 
+     rdmsrl(MSR_GS_BASE, base);
+@@ -174,7 +182,7 @@ static inline unsigned long rdgsshadow(void)
+ {
+     unsigned long base;
+ 
+-    if ( cpu_has_fsgsbase )
++    if ( read_cr4() & X86_CR4_FSGSBASE )
+     {
+         asm volatile ( "swapgs" );
+         base = __rdgsbase();
+@@ -188,7 +196,7 @@ static inline unsigned long rdgsshadow(void)
+ 
+ static inline void wrfsbase(unsigned long base)
+ {
+-    if ( cpu_has_fsgsbase )
++    if ( read_cr4() & X86_CR4_FSGSBASE )
+ #ifdef HAVE_AS_FSGSBASE
+         asm volatile ( "wrfsbase %0" :: "r" (base) );
+ #else
+@@ -200,7 +208,7 @@ static inline void wrfsbase(unsigned long base)
+ 
+ static inline void wrgsbase(unsigned long base)
+ {
+-    if ( cpu_has_fsgsbase )
++    if ( read_cr4() & X86_CR4_FSGSBASE )
+ #ifdef HAVE_AS_FSGSBASE
+         asm volatile ( "wrgsbase %0" :: "r" (base) );
+ #else
+@@ -212,7 +220,7 @@ static inline void wrgsbase(unsigned long base)
+ 
+ static inline void wrgsshadow(unsigned long base)
+ {
+-    if ( cpu_has_fsgsbase )
++    if ( read_cr4() & X86_CR4_FSGSBASE )
+     {
+         asm volatile ( "swapgs\n\t"
+ #ifdef HAVE_AS_FSGSBASE
+diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
+index 2bd9e69..8e253dc 100644
+--- xen/include/asm-x86/processor.h.orig
++++ xen/include/asm-x86/processor.h
+@@ -305,11 +305,31 @@ static inline unsigned long read_cr4(void)
+ 
+ static inline void write_cr4(unsigned long val)
+ {
++    struct cpu_info *info = get_cpu_info();
++
+     /* No global pages in case of PCIDs enabled! */
+     ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
+ 
+-    get_cpu_info()->cr4 = val;
+-    asm volatile ( "mov %0,%%cr4" : : "r" (val) );
++    /*
++     * On hardware supporting FSGSBASE, the value in %cr4 is the kernel's
++     * choice for 64bit PV guests, which impacts whether Xen can use the
++     * instructions.
++     *
++     * The {rd,wr}{fs,gs}base() helpers use info->cr4 to work out whether it
++     * is safe to execute the {RD,WR}{FS,GS}BASE instruction, falling back to
++     * the MSR path if not.  Some users require interrupt safety.
++     *
++     * If FSGSBASE is currently or about to become clear, reflect this in
++     * info->cr4 before updating %cr4, so an interrupt which hits in the
++     * middle won't observe FSGSBASE set in info->cr4 but clear in %cr4.
++     */
++    info->cr4 = val & (info->cr4 | ~X86_CR4_FSGSBASE);
++
++    asm volatile ( "mov %[val], %%cr4"
++                   : "+m" (info->cr4) /* Force ordering without a barrier. */
++                   : [val] "r" (val) );
++
++    info->cr4 = val;
+ }
+ 
+ /* Clear and set 'TS' bit respectively */
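
The non-obvious part of the new write_cr4() is the intermediate value stored
into the per-CPU cache before the hardware write.  Below is a minimal
standalone sketch of just that ordering; shadow_cr4, hw_cr4 and
write_cr4_sketch() are stand-ins invented for the illustration, whereas the
real code updates get_cpu_info()->cr4 and %cr4 and uses an asm memory operand
to keep the compiler from reordering the stores.

    #include <stdio.h>

    #define X86_CR4_FSGSBASE  (1ul << 16)

    static unsigned long shadow_cr4;   /* stands in for get_cpu_info()->cr4 */
    static unsigned long hw_cr4;       /* stands in for the real %cr4 */

    static void write_cr4_sketch(unsigned long val)
    {
        /*
         * If FSGSBASE is clear in either the old cached value or the new
         * value, make the cache show it clear *before* touching hardware.
         * An interrupt hitting here therefore only ever sees FSGSBASE set
         * in the cache when it really is set in hardware.
         */
        shadow_cr4 = val & (shadow_cr4 | ~X86_CR4_FSGSBASE);

        hw_cr4 = val;                  /* the "mov val, %cr4" step */

        shadow_cr4 = val;              /* cache catches up afterwards */
    }

    int main(void)
    {
        shadow_cr4 = hw_cr4 = X86_CR4_FSGSBASE;
        write_cr4_sketch(0);           /* the guest kernel disables FSGSBASE */
        printf("shadow=%#lx hw=%#lx\n", shadow_cr4, hw_cr4);
        return 0;
    }

This matters because the {rd,wr}{fs,gs}base() helpers consult the cached copy
to decide between the {RD,WR}{FS,GS}BASE instructions and the MSR path, and
some of those users require interrupt safety.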
Index: pkgsrc/sysutils/xenkernel411/patches/patch-XSA294
diff -u /dev/null pkgsrc/sysutils/xenkernel411/patches/patch-XSA294:1.1
--- /dev/null   Thu Mar  7 11:13:27 2019
+++ pkgsrc/sysutils/xenkernel411/patches/patch-XSA294   Thu Mar  7 11:13:27 2019
@@ -0,0 +1,73 @@
+$NetBSD: patch-XSA294,v 1.1 2019/03/07 11:13:27 bouyer Exp $
+
+From: Jan Beulich <JBeulich%suse.com@localhost>
+Subject: x86/pv: _toggle_guest_pt() may not skip TLB flush for shadow mode guests
+
+For shadow mode guests (e.g. PV ones forced into that mode as L1TF
+mitigation, or during migration) update_cr3() -> sh_update_cr3() may
+result in a change to the (shadow) root page table (compared to the
+previous one when running the same vCPU with the same PCID). This can,
+first and foremost, be a result of memory pressure on the shadow memory
+pool of the domain. Shadow code legitimately relies on the original
+(prior to commit 5c81d260c2 ["xen/x86: use PCID feature"]) behavior of
+the subsequent CR3 write to flush the TLB of entries still left from
+walks with an earlier, different (shadow) root page table.
+
+Restore the flushing behavior, also for the second CR3 write on the exit
+path to guest context when XPTI is active. For the moment accept that
+this will introduce more flushes than are strictly necessary - no flush
+would be needed when the (shadow) root page table doesn't actually
+change, but this information isn't readily (i.e. without introducing a
+layering violation) available here.
+
+This is XSA-294.
+
+Reported-by: XXX PERSON <XXX EMAIL>
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Tested-by: Juergen Gross <jgross%suse.com@localhost>
+Reviewed-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+
+diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
+index b75ff6b..528413a 100644
+--- xen/arch/x86/pv/domain.c.orig
++++ xen/arch/x86/pv/domain.c
+@@ -296,21 +296,35 @@ int pv_domain_initialise(struct domain *d)
+ static void _toggle_guest_pt(struct vcpu *v)
+ {
+     const struct domain *d = v->domain;
++    struct cpu_info *cpu_info = get_cpu_info();
++    unsigned long cr3;
+ 
+     v->arch.flags ^= TF_kernel_mode;
+     update_cr3(v);
+     if ( d->arch.pv_domain.xpti )
+     {
+-        struct cpu_info *cpu_info = get_cpu_info();
+-
+         cpu_info->root_pgt_changed = true;
+         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
+                            (d->arch.pv_domain.pcid
+                             ? get_pcid_bits(v, true) : 0);
+     }
+ 
+-    /* Don't flush user global mappings from the TLB. Don't tick TLB clock. */
+-    write_cr3(v->arch.cr3);
++    /*
++     * Don't flush user global mappings from the TLB. Don't tick TLB clock.
++     *
++     * In shadow mode, though, update_cr3() may need to be accompanied by a
++     * TLB flush (for just the incoming PCID), as the top level page table may
++     * have changed behind our backs. To be on the safe side, suppress the
++     * no-flush unconditionally in this case. The XPTI CR3 write, if enabled,
++     * will then need to be a flushing one too.
++     */
++    cr3 = v->arch.cr3;
++    if ( shadow_mode_enabled(d) )
++    {
++        cr3 &= ~X86_CR3_NOFLUSH;
++        cpu_info->pv_cr3 &= ~X86_CR3_NOFLUSH;
++    }
++    write_cr3(cr3);
+ 
+     if ( !(v->arch.flags & TF_kernel_mode) )
+         return;
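
The fix itself is just the removal of the no-flush hint from the CR3 value
being loaded.  As a standalone sketch (X86_CR3_NOFLUSH models the
architectural bit 63 that turns a PCID-enabled CR3 write into a non-flushing
switch; make_guest_cr3_write() and its boolean parameter are invented for the
illustration):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define X86_CR3_NOFLUSH  (1ull << 63)

    static uint64_t make_guest_cr3_write(uint64_t cr3, bool shadow_mode)
    {
        /*
         * Normally the guest-table switch keeps NOFLUSH so user global
         * mappings survive.  Under shadow paging the (shadow) root table may
         * have been rebuilt behind our back, so the write must flush: strip
         * the hint unconditionally in that case.
         */
        if ( shadow_mode )
            cr3 &= ~X86_CR3_NOFLUSH;

        return cr3;
    }

    int main(void)
    {
        uint64_t cr3 = X86_CR3_NOFLUSH | 0x1234000 | 1;   /* PCID 1 */

        printf("%#llx\n", (unsigned long long)make_guest_cr3_write(cr3, false));
        printf("%#llx\n", (unsigned long long)make_guest_cr3_write(cr3, true));
        return 0;
    }

The same stripping is applied to cpu_info->pv_cr3, so that with XPTI active
the second CR3 write on the exit path to guest context flushes as well.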


