Source-Changes-D archive


Re: [stos, again] Re: CVS commit: src/sys/arch/amd64



On 03/06/2020 at 01:49, Andrew Doran wrote:
On Tue, Jun 02, 2020 at 08:41:53AM +0200, Maxime Villard wrote:

On 02/06/2020 at 00:58, Andrew Doran wrote:
Module Name:	src
Committed By:	ad
Date:		Mon Jun  1 22:58:06 UTC 2020

Modified Files:
	src/sys/arch/amd64/amd64: cpufunc.S
	src/sys/arch/amd64/include: frameasm.h

Log Message:
Reported-by: syzbot+6dd5a230d19f0cbc7814@syzkaller.appspotmail.com

Instrument STOS/MOVS for KMSAN to unbreak it.


To generate a diff of this commit:
cvs rdiff -u -r1.58 -r1.59 src/sys/arch/amd64/amd64/cpufunc.S
cvs rdiff -u -r1.49 -r1.50 src/sys/arch/amd64/include/frameasm.h

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.

Can you just stop ignoring the remarks that are made?

That's up to you Maxime.  If you habitually make it difficult for people to
come to a reasonable compromise with you, then you're asking to not be taken
seriously and will find yourself being ignored.

You are confused. I asked for KMSAN to be unbroken, and proposed an alternative,
which is atomic_load_relaxed. You were free to explain why my point was a bad
idea or why it didn't matter, but you refused, and just added a hack on top of
another. I'm afraid that's up to you.

But whatever, it doesn't matter. Thanks for reverting the pmap.c changes; the
code is more correct now than before.

I said explicitly
that adding manual instrumentation here is _wrong_.

I don't share your assessment in the general sense.  It should be possible
to instrument assembly code because KMSAN is useful AND we can't get by
without assembly - there are some things that C just can't do (or do well
enough).
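
For what it's worth, this is roughly what manual annotation around an assembly
helper can look like from C.  This is only a sketch: asm_zero() is a
hypothetical routine, and I'm quoting kmsan_mark()/KMSAN_STATE_INITED from
memory rather than from the commit, so treat the names as assumptions:

	#include <sys/types.h>
	#include <sys/kmsan.h>

	void	asm_zero(void *, size_t);	/* hypothetical assembly routine */

	void
	zero_block(void *p, size_t len)
	{
		asm_zero(p, len);
	#ifdef KMSAN
		/* tell KMSAN the bytes the assembly just wrote are initialized */
		kmsan_mark(p, len, KMSAN_STATE_INITED);
	#endif
	}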

On the assembly thing recall that recently you expressed a desire to remove
all of the amd64 assembly string functions from libc because of sanitizers -
I invested my time to do up a little demo to try and show you why that's not
a good idea:

	http://mail-index.netbsd.org/port-amd64/2020/04/19/msg003226.html

I saw, yes. I answered explaining that a conversation with Ryo Shimizu had
changed my mind a bit, and, seeing your results (which, as far as I can tell,
do not show a performance improvement large enough to stand out from the
noise), I asked you whether it was really that relevant. You didn't follow
up, though.

The system is a balancing act.  There are lots of factors to be taken into
account: maintainability, tooling like KMSAN, users' varying needs, the
capabilities of different machines, performance, feature set and so on, and
dare I say it even the enjoyment of the people working on the project is
important too.  Myopic focus on one factor only, to the detriment of others,
is no good.

I am well aware of that.

The kMSan ASM fixups
are limited to args/returns, and that is part of a sensible policy that
_should not_ be changed without a good reason.

x86_movs/x86_stos have strictly _no reason_ to exist. Of the 18 conversions
you made to them in pmap.c, not one is justified. memcpy/memset were all
correct.

I introduced these functions as a compromise because you were unhappy with
use of memcpy() to copy PDEs.  See:

	http://mail-index.netbsd.org/port-amd64/2020/05/23/msg003280.html

You wrote:

	In the [XXX] window, the PTEs could be used by userland.  If you
	copied them using memcpy(), some parts of the bytes could contain
	stale values.

Sure, but I was explicitly referring to SVS, which accesses PTEs in an
unusual way (unlike pmap) and therefore needs special atomic care that other
places do not.

Live PDEs are also copied in pmap.c so I made a change there too.  After
that I decided "why not" and used the new functions everywhere PDEs/PTEs or
pages are block zeroed / copied.  But I'm also happy with memcpy()/memset().
Either way will work.  In fairness I do work too fast sometimes.
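
To make the discussion concrete, here is roughly what an x86_stos-style
helper boils down to.  This is a C-level sketch of mine, not the committed
cpufunc.S code:

	#include <stdint.h>
	#include <stddef.h>

	static inline void
	stos_sketch(uint64_t *dst, uint64_t val, size_t nquads)
	{
		/* REP STOSQ: store RAX into RCX consecutive quadwords at RDI */
		asm volatile("rep stosq"
		    : "+D" (dst), "+c" (nquads)
		    : "a" (val)
		    : "memory");
	}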

The only reason you made these big unneeded changes is for SVS not to take
the bus lock,

There is no bus lock on x86 (not since the 1990s anyway).

but as was said already, atomic_load_relaxed will do what
you want without the need for these functions.

Please revert _both changes now_, this one and the previous one which
introduced both functions, and let's use atomic_load_relaxed.

You're focusing on only one factor.  I'll explain in detail.  Here is the
original code I replaced:

	static inline pt_entry_t
	svs_pte_atomic_read(struct pmap *pmap, size_t idx)
	{
		/*
		 * XXX: We don't have a basic atomic_fetch_64 function?
		 */
		return atomic_cas_64(&pmap->pm_pdir[idx], 666, 666);
	}
...
	/* User slots. */
	for (i = 0; i < PDIR_SLOT_USERLIM; i++) {
		pte = svs_pte_atomic_read(pmap, i);
		ci->ci_svs_updir[i] = pte;
	}

There's no need for an atomic op there because naturally aligned fetches on
x86 are atomic by definition, and the atomic op does nothing to alter the
memory ordering in this case.  The atomic op does have side effects, though:
it's serializing and always generates an unbuffered writeback, even in the
failure case.  So the source is being copied into itself as well as into the
destination, and the CPU's store buffer is rendered ineffective.
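
For reference, on x86 a relaxed 64-bit load is just a plain aligned MOV; a
minimal sketch of what such a load amounts to, using my own name rather than
the real atomic_load_relaxed() implementation:

	#include <stdint.h>

	static inline uint64_t
	load_relaxed_64(const volatile uint64_t *p)
	{
		/* a single naturally aligned 64-bit read; atomic on x86 */
		return *p;
	}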

A cut-and-paste replacement using the relaxed functions predictably ties the
compiler in knots, and the generated code is bad.

	/* User slots. */
	for (i = 0; i < PDIR_SLOT_USERLIM; i++) {
		pte = atomic_load_relaxed(&pmap->pm_pdir[i]);
		atomic_store_relaxed(&ci->ci_svs_updir[i], pte);

Note that atomic_store_relaxed is not needed here.  These are the per-CPU
page tables, and they cannot be accessed by anyone else at the same time.  All
we care about is fetching the pmap page tables in quads.

	}

0000000000400c9f <copy_relaxed>:
   400c9f:       48 8b 06                mov    (%rsi),%rax
   400ca2:       48 8b 17                mov    (%rdi),%rdx
   400ca5:       48 8d b0 00 00 40 06    lea    0x6400000(%rax),%rsi
   400cac:       48 8b 08                mov    (%rax),%rcx
   400caf:       48 89 4c 24 f8          mov    %rcx,-0x8(%rsp)
   400cb4:       48 8b 4c 24 f8          mov    -0x8(%rsp),%rcx
   400cb9:       48 89 0a                mov    %rcx,(%rdx)
   400cbc:       48 83 c0 08             add    $0x8,%rax
   400cc0:       48 83 c2 08             add    $0x8,%rdx
   400cc4:       48 39 f0                cmp    %rsi,%rax
   400cc7:       75 e3                   jne    400cac <copy_relaxed+0xd>
   400cc9:       c3                      retq

To get the relaxed functions working well I reckon you'd need to copy the
pointers and unroll the loop and even then it might still be bad, and
looks daft to boot:

	src = pmap->pm_pdir;
	dst = ci->ci_svs_updir;
	for (i = 0; i + 8 <= PDIR_SLOT_USERLIM; i += 8) {
		a = atomic_load_relaxed(&src[i+0]);
		b = atomic_load_relaxed(&src[i+1]);
		c = atomic_load_relaxed(&src[i+2]);
		d = atomic_load_relaxed(&src[i+3]);
		e = atomic_load_relaxed(&src[i+4]);
		f = atomic_load_relaxed(&src[i+5]);
		g = atomic_load_relaxed(&src[i+6]);
		h = atomic_load_relaxed(&src[i+7]);
		atomic_store_relaxed(&dst[i+0], a);
		atomic_store_relaxed(&dst[i+1], b);
		atomic_store_relaxed(&dst[i+2], c);
		atomic_store_relaxed(&dst[i+3], d);
		atomic_store_relaxed(&dst[i+4], e);
		atomic_store_relaxed(&dst[i+5], f);
		atomic_store_relaxed(&dst[i+6], g);
		atomic_store_relaxed(&dst[i+7], h);
	}
	for (; i < PDIR_SLOT_USERLIM; i++) {
		pte = atomic_load_relaxed(&src[i]);
		atomic_store_relaxed(&dst[i], pte);
	}

Again, since atomic_store_relaxed is not needed, the code generated without
it is probably better than what the disassembly above suggests.

A way to satisfy all requirements is to use the instruction built into the
processor explicitly for block copies (REP MOVSQ) and tell the sanitizers
about it.  memcpy() (or, if you're worried about byte stores, an explicit bit
of assembly) gives you access to that.  Here's a test I wrote to show how
these variants look in practice (minus the silly unrolled variant):

	http://www.netbsd.org/~ad/2020/copy.c
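
For readers following along, the REP MOVSQ variant being talked about amounts
to something like the following sketch (my own helper, not the contents of
copy.c):

	#include <stdint.h>
	#include <stddef.h>

	static inline void
	copy_movsq(uint64_t *dst, const uint64_t *src, size_t nquads)
	{
		/* REP MOVSQ: copy RCX quadwords from (RSI) to (RDI) */
		asm volatile("rep movsq"
		    : "+D" (dst), "+S" (src), "+c" (nquads)
		    :
		    : "memory");
	}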

On a Xeon:

	$ ./a.out
	alloc
	fault
	run
	atomic  707.262056MB per second
	relaxed 3054.676292MB per second
	memcpy  3494.007480MB per second

On a low-power AMD CPU:

	$ ./a.out
	alloc
	fault
	run
	atomic  253.765483MB per second
	relaxed 788.343882MB per second
	memcpy  935.265225MB per second

Thanks for the measurements; I have a few remarks here:

 - You're doing copy tests with 104857600 bytes, but the kernel code in
   question only copies 2048 bytes, so how precisely can the two be compared?
   It's been a while since I last checked the Intel performance manual, but
   last I checked there were odd behaviors around loops: up to a certain
   iteration count the CPU holds back its internal optimizations and takes a
   slower path.  So if your benchmark copies much more than the real code
   does, it may well exercise CPU paths that are never triggered in the real
   situation (see the sketch after these remarks).

 - As I said, atomic_store_relaxed() is not needed, so the copy_relaxed
   figure in this benchmark is likely lower than it would be in the real
   situation.

 - You didn't include movsq? Since you decided to go with movsq, it
   would have been relevant to compare it.
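
Here is a rough sketch of the kind of measurement I mean for the first
remark: copying 2048 bytes at a time, which is what the SVS code actually
does.  The sizes, names and iteration count here are mine, not copy.c's:

	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>
	#include <time.h>

	#define NQUADS	256		/* 2048 bytes, as in the SVS copy */
	#define ITERS	20000000L

	static uint64_t src[NQUADS], dst[NQUADS];

	int
	main(void)
	{
		struct timespec t0, t1;
		double secs;
		long i;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < ITERS; i++) {
			memcpy(dst, src, sizeof(src));
			/* keep the copy from being elided by the compiler */
			asm volatile("" ::: "memory");
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		secs = (double)(t1.tv_sec - t0.tv_sec) +
		    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("memcpy  %fMB per second\n",
		    (double)ITERS * sizeof(src) / secs / (1024 * 1024));
		return 0;
	}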

Thanks,
Maxime

