Subject: Re: replacing page mappings behind uvm
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-kern
Date: 12/01/2007 13:28:35
On Sat, Dec 01, 2007 at 07:04:59PM +0900, YAMAMOTO Takashi wrote:
> > I played with this, it seems to work. See the attached privcmd.c
> > There is a wired entry accounting issue I still need to solve,
> > which I think is related to MMAPBATCH, but it looks promising.
> 
> i wonder how many objects are used with the usage pattern of xentools.

I'd say a few tens for qemu-dm. But it's dynamic: they are mmap'ed and then
unmapped. I'll add a counter to watch it.
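
Something like this quick hack, just to get an idea of the numbers
(privcmd_nobjs is a name I'm making up here, and as it's only a debug
counter I won't bother locking it):

	static int privcmd_nobjs;	/* live privcmd objects, for debug */

	/* in privcmd_map_obj(), once the object has been created */
	privcmd_nobjs++;

	/* in the object's pgo_detach routine, when the last ref goes away */
	privcmd_nobjs--;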

> 
> > 	case IOCTL_PRIVCMD_MMAP:
> 
> > 		for (i = 0; i < mcmd->num; i++) {
> 
> > 			va = mentry.va & ~PAGE_MASK;
> > 			ma = mentry.mfn <<  PGSHIFT; /* XXX ??? */
> > 			error  = privcmd_map_obj(vmm, va,
> > 			    (mentry.npages << PGSHIFT), &obj);
> 
> do you need an object per mentry?

I think so, because the VAs may not be contiguous at this point. An
optimisation could be to merge contiguous VAs into a single object and map
entry, but let's look at that later :)

> 
> > 			if (error)
> > 				return error;
> > 
> > 			for (j = 0; j < mentry.npages; j++) {
> > 				//printf("remap va 0x%lx to 0x%lx\n", va, ma);
> > 				maddr[j] = ma;
> > 				ma += PAGE_SIZE;
> > 			}
> > 			simple_lock(&obj->uobj.vmobjlock);
> > 			obj->maddr = maddr;
> > 			obj->npages = mentry.npages;
> > 			obj->domid = mcmd->dom;
> > 			simple_unlock(&obj->uobj.vmobjlock);
> 
> are you assuming callers are serialized and bug-free?
> otherwise the object should be set up before being mapped.

It would probably be better to pass maddr et al. to privcmd_map_obj() too,
so the object is fully set up before it's entered in the map.
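
That is, change the prototype to something like (rough sketch, the exact
types may still change):

	static int
	privcmd_map_obj(struct vm_map *map, vaddr_t start, paddr_t *maddr,
	    int npages, int domid, struct privcmd_object **nobj);

and fill in obj->maddr, obj->npages and obj->domid before uvm_map() makes
the object visible, so it's never mapped half-initialized.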

> 
> > 		/* force mapping to be loaded */
> > 		if ((error = uvm_map_pageable(vmm, va0,
> > 		    va0 + (pmb->num << PGSHIFT), false, 0)) != 0) {
> > 			printf("uvm_map_pageable error %d\n", error);
> > 			return(error);
> > 		}
> 
> is this uvm_map_pageable only for debug?

No, it's because we need to enter the mappings *now*, to be able to report
the mappings that failed back to the ioctl caller.

Another way of doing it would be to map the pages into some kernel space to
see if it works, unmap them and then enter them in the object. The main
issue I have with this is that UVM is not prepared to have privpgop_fault()
fail to enter a mapping.
Now that I think about it, if we can report the failed mappings to the
caller without going up to privpgop_fault(), qemu-dm may munmap() them
without ever touching them, and so privpgop_fault() would never be called
for such mappings.
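
For the MMAPBATCH case I'm thinking of something along these lines after
the forced uvm_map_pageable() (untested sketch; the error encoding and how
the array gets back to userland still need to be sorted out, and pmb->arr
and mfns are just placeholders for whatever the batch array ends up being
called):

	/* report per-page failures back to the caller's mfn array */
	for (i = 0; i < pmb->num; i++) {
		if (obj->maddr[i] & 0xF0000000000)
			mfns[i] |= 0xF0000000000;	/* mark page as failed */
	}
	error = copyout(mfns, pmb->arr, pmb->num * sizeof(mfns[0]));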

> 
> > 			/* XXX for proper ptp accountings */
> > 			pmap_remove(ufi->orig_map->pmap, vaddr, 
> > 			    vaddr + PAGE_SIZE);
> > 			pobj->maddr[maddr_i] |= 0xF0000000000;
> 
> does this error happen during normal operations?

Yes, quite a lot. And if it's not reported back to qemu-dm, it doesn't
work. I suspect it uses it as some kind of probe of the available resources.

> 
> > 			error = 0;
> > 		}
> > 	}
> > 	uvmfault_unlockall(ufi, ufi->entry->aref.ar_amap, uobj, NULL);
> > 	pmap_update(ufi->orig_map->pmap);
> > 	return (error);
> > }
> > 
> > static int
> > privcmd_map_obj(struct vm_map *map, vaddr_t start, off_t size,
> > 	struct privcmd_object **nobj)
> > {
> > 	struct vm_map_entry *dead_entries;
> > 	struct uvm_object *uobj;
> > 	int error;
> > 	uvm_flag_t uvmflag;
> > 	vaddr_t newstart = start;
> > 	vm_prot_t prot;
> > 
> > 	vm_map_lock(map);
> > 	/* get protections. This also checks that the mapping is valid */
> > 	if (uvm_map_checkprot(map, start, start + size - 1, VM_PROT_WRITE))
> > 		prot = VM_PROT_READ | VM_PROT_WRITE;
> > 	else if (uvm_map_checkprot(map, start, start + size - 1, VM_PROT_READ))
> > 		prot = VM_PROT_READ;
> > 	else {
> > 		printf("uvm_map_checkprot 0x%lx -> 0x%lx "
> > 		    "failed\n",
> > 		    start, start + size - 1);
> > 		vm_map_unlock(map);
> > 		return EINVAL;
> > 	}
> > 	/* remove current entries */
> > 	uvm_unmap_remove(map, start, start + size, &dead_entries, NULL, 0);
> > 	if (dead_entries != NULL)
> > 		uvm_unmap_detach(dead_entries, 0);
> > 
> > 	vm_map_unlock(map);
> 
> why do checkprot here?

To know whether the va range was mapped read-only or read-write. Is there
some other way to get it?
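
One thing I can think of is to look up the entry directly, something like
(assuming it's OK to peek at the entry from here):

	struct vm_map_entry *entry;

	vm_map_lock_read(map);
	if (!uvm_map_lookup_entry(map, start, &entry)) {
		vm_map_unlock_read(map);
		return EINVAL;
	}
	prot = entry->protection;
	vm_map_unlock_read(map);

but that only looks at the first entry, not at the whole range, so it
doesn't look better than uvm_map_checkprot() to me.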

> 
> please don't use uvm_unmap_remove and uvm_unmap_detach here.
> they are uvm-internal.

What would you suggest using instead here?

> besides, normally uvm_unmap_detach should not be used with the map locked.
> (it's the main reason why it's separated from uvm_unmap_remove.)

OK, that's easy :)
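
i.e. just move the detach after the unlock:

	vm_map_lock(map);
	/* ... checkprot ... */
	uvm_unmap_remove(map, start, start + size, &dead_entries, NULL, 0);
	vm_map_unlock(map);
	if (dead_entries != NULL)
		uvm_unmap_detach(dead_entries, 0);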

> 
> if the process is multithreaded (is xentools multithreaded?),

xentools is multithreaded, but I don't think this interface is going to
be used from multiple threads.

> you need to replace the mapping atomically.
> otherwise the va range can be taken by another thread while we unmap it.
> uvm_map_replace does something similar, but it doesn't deal with
> arbitrary existing mappings.
> 
> > 	uobj = malloc(sizeof(struct privcmd_object), M_TEMP, M_WAITOK);
> > 	if (uobj == NULL)
> > 		return ENOMEM;
> 
> please use kmem_alloc unless you have a reason.

uvm_device.c uses MALLOC() for its objects, but OK.
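
So something like this instead (and with KM_SLEEP the NULL check and the
ENOMEM case can go away):

	#include <sys/kmem.h>

	uobj = kmem_alloc(sizeof(struct privcmd_object), KM_SLEEP);

with a matching kmem_free(uobj, sizeof(struct privcmd_object)) in the
detach path.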

> 
> > 	UVM_OBJ_INIT(uobj, &privpgops, 1);
> > 	uvmflag = UVM_MAPFLAG(prot, prot, UVM_INH_NONE, UVM_ADV_NORMAL,
> > 	    UVM_FLAG_FIXED | UVM_FLAG_NOMERGE);
> > 	error = uvm_map(map, &newstart, size, uobj, 0, 0, uvmflag);
> > 
> > 	if (error) {
> > 		if (uobj)
> > 			uobj->pgops->pgo_detach(uobj);
> > 		return error;
> > 	}
> > 	if (newstart != start) {
> > 		printf("uvm_map didn't give us back our vm space\n");
> > 		return EINVAL;
> > 	}
> 
> it should be KASSERT given UVM_FLAG_FIXED.

I think so, but I wasn't sure at the time. As this server needs close to
60s to reboot, I try to avoid panics and KASSERTs while developing :)
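
i.e. just:

	KASSERT(newstart == start);

instead of the printf/EINVAL dance, once I'm more confident about it.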

thanks for the comments !

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference