NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

bin/50638: Extreme slowness on loading gzipped kernels on old CPUs



>Number:         50638
>Category:       bin
>Synopsis:       Extreme slowness on loading gzipped kernels on old CPUs
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Sun Jan 10 17:40:00 +0000 2016
>Originator:     Izumi Tsutsui
>Release:        NetBSD 7.0
>Organization:
>Environment:
System: NetBSD 7.0
Architecture: all, but especially m68k and vax etc.
Machine: ditto
>Description:
As mentioned in the following post, loading gzipped kernels
via libsa based bootloader much slower since NetBSD 5.x
than 4.x and prior:
 http://mail-index.netbsd.org/port-vax/2014/05/25/msg002214.html

This post is about vax, but loading gzipped kernel on x68k is
also extremely slower than gunzipped kernels.

This seems caused by crc32() function changes in the following commits,
which added modified crc32() was added into libkern based on zlib one:
 http://mail-index.netbsd.org/source-changes/2009/03/25/msg218803.html
 http://mail-index.netbsd.org/source-changes/2009/03/25/msg218817.html

In the 4.x and prior, _STANDALONE programs used 32 bit tables
with "DYNAMIC_CRC_TABLE" option (which generated CRC tables runtime)
for crc32() calculations:
 http://nxr.netbsd.org/xref/src/common/dist/zlib/crc32.c?r=1.4#84

However, after the above "crc32() in libkern" changes,
_STANDALONE programs were changed to use "dumb" CRC functions
based on its definition:
 http://cvsweb.netbsd.org/cgi-bin/cvsweb.cgi/src/sys/lib/libsa/cread.c.diff?r1=1.22&r2=1.23&f=u

As the comment claims this was smaller than the table version,
but actually it's also extremely slower on ancient CPUs
(requires additional several minutes).

>How-To-Repeat:
Boot gzipped kernels from any CD or HDD etc.

>Fix:

* Approach:

There are three approach to handle this issue:

(1) leave as is and use the dumb crc32() function (for small binary)
(2) pull "DYNAMIC_CRC_TABLE" implementation from zlib into libkern/crc32.c
(3) completely disable gunzip crc32() calculation in cread() function

* Pros. and cons:

(1) lowest risk? (slow but still works)
(2) better compromise?
(3) it looks CRC is not used in most case;
    the CRC can be confirmed when the whole file contents are read,
    but our libsa loadfile() finction loads only text, data, bss,
    and symbol table, which could also be padded by ldscript.

* Sample implementation:

I have confirmed the following dumb change for these three cases.

- "-DSMALL_CRC32" specifies (1)
- the default (no additional macro) specifies (2)
  (note: actual CRC calculation results are not confirmed, nor kernels)
- "-DCREAD_NOCRC" specifies (3)

The size of boot binaries and time of loading the netbsd-INSTALL.gz
kernel on LUNA-II with NetBSD/luna68k 7.0 are:

(1) -DSMALL_CRC32 (dumb crc32(), as current implementation)

size:
section      size      addr
.text       69384   7340032
.data        2432   7416896
.bss        17952   7419328

time:
9 min 22 sec

(2) with DYNAMIC_CRC_TABLE crc32() (as past default)

size:
section      size      addr
.text       70416   7340032
.data        2436   7417936
.bss        26144   7420376

time:
4 min 21 sec

(3) -DCREAD_NOCRC

size:
section      size      addr
.text       69288   7340032
.data        2432   7416800
.bss        17952   7419232

time:
3 min 59 sec

---

Index: sys/arch/luna68k/stand/boot/Makefile
===================================================================
RCS file: /cvsroot/src/sys/arch/luna68k/stand/boot/Makefile,v
retrieving revision 1.11
diff -u -p -d -r1.11 Makefile
--- sys/arch/luna68k/stand/boot/Makefile	16 Jan 2014 01:15:34 -0000	1.11
+++ sys/arch/luna68k/stand/boot/Makefile	10 Jan 2016 14:53:17 -0000
@@ -19,6 +19,8 @@ CPPFLAGS+=	-DSUPPORT_DHCP -DSUPPORT_BOOT
 #CPPFLAGS+=	-DRPC_DEBUG -DRARP_DEBUG -DNET_DEBUG -DDEBUG -DPARANOID
 CPPFLAGS+=	-DLIBSA_ENABLE_LS_OP
 CPPFLAGS+=	-DLIBSA_PRINTF_WIDTH_SUPPORT
+CPPFLAGS+=	-DCREAD_NOCRC32
+#CPPFLAGS+=	-DSMALL_CRC32
 
 CFLAGS=		-Os -msoft-float
 CFLAGS+=	-ffreestanding
Index: sys/lib/libkern/crc32.c
===================================================================
RCS file: /cvsroot/src/sys/lib/libkern/crc32.c,v
retrieving revision 1.4
diff -u -p -d -r1.4 crc32.c
--- sys/lib/libkern/crc32.c	26 Mar 2009 22:18:14 -0000	1.4
+++ sys/lib/libkern/crc32.c	10 Jan 2016 14:53:24 -0000
@@ -16,6 +16,10 @@
 
 /* @(#) Id */
 
+#if defined(_STANDALONE)
+#define DYNAMIC_CRC_TABLE
+#endif
+
 #include <sys/param.h>
 #include <machine/endian.h>
 
@@ -29,7 +33,91 @@ typedef uint32_t u4;
  * Tables of CRC-32s of all single-byte values, made by make_crc_table().
  */
 #include <lib/libkern/libkern.h>
+
+#ifdef DYNAMIC_CRC_TABLE
+
+static volatile bool crc_table_empty = true;
+static uint32_t crc_table[8][256];
+static void make_crc_table(void);
+
+/*
+  Generate tables for a byte-wise 32-bit CRC calculation on the polynomial:
+  x^32+x^26+x^23+x^22+x^16+x^12+x^11+x^10+x^8+x^7+x^5+x^4+x^2+x+1.
+
+  Polynomials over GF(2) are represented in binary, one bit per coefficient,
+  with the lowest powers in the most significant bit.  Then adding polynomials
+  is just exclusive-or, and multiplying a polynomial by x is a right shift by
+  one.  If we call the above polynomial p, and represent a byte as the
+  polynomial q, also with the lowest power in the most significant bit (so the
+  byte 0xb1 is the polynomial x^7+x^3+x+1), then the CRC is (q*x^32) mod p,
+  where a mod b means the remainder after dividing a by b.
+
+  This calculation is done using the shift-register method of multiplying and
+  taking the remainder.  The register is initialized to zero, and for each
+  incoming bit, x^32 is added mod p to the register if the bit is a one (where
+  x^32 mod p is p+x^32 = x^26+...+1), and the register is multiplied mod p by
+  x (which is shifting right by one and adding x^32 mod p if the bit shifted
+  out is a one).  We start with the highest power (least significant bit) of
+  q and repeat for all eight bits of q.
+
+  The first table is simply the CRC of all possible eight bit values.  This is
+  all the information needed to generate CRCs on data a byte at a time for all
+  combinations of CRC register values and incoming bytes.  The remaining tables
+  allow for word-at-a-time CRC calculation for both big-endian and little-
+  endian machines, where a word is four bytes.
+*/
+static void make_crc_table(void)
+{
+    uint32_t c;
+    int n, k;
+    uint32_t poly;                      /* polynomial exclusive-or pattern */
+    /* terms of polynomial defining this crc (except x^32): */
+    static volatile bool first = true;  /* flag to limit concurrent making */
+    static const unsigned char p[] = {0,1,2,4,5,7,8,10,11,12,16,22,23,26};
+
+    /* See if another task is already doing this (not thread-safe, but better
+       than nothing -- significantly reduces duration of vulnerability in
+       case the advice about DYNAMIC_CRC_TABLE is ignored) */
+    if (first) {
+        first = false;
+
+        /* make exclusive-or pattern from polynomial (0xedb88320UL) */
+        poly = 0;
+        for (n = 0; n < sizeof(p)/sizeof(unsigned char); n++)
+            poly |= 1 << (31 - p[n]);
+
+        /* generate a crc for every 8-bit value */
+        for (n = 0; n < 256; n++) {
+            c = (uint32_t)n;
+            for (k = 0; k < 8; k++)
+                c = c & 1 ? poly ^ (c >> 1) : c >> 1;
+            crc_table[0][n] = c;
+        }
+
+        /* generate crc for each value followed by one, two, and three zeros,
+           and then the byte reversal of those as well as the first table */
+        for (n = 0; n < 256; n++) {
+            c = crc_table[0][n];
+            crc_table[4][n] = REV(c);
+            for (k = 1; k < 4; k++) {
+                c = crc_table[0][c & 0xff] ^ (c >> 8);
+                crc_table[k][n] = c;
+                crc_table[k + 4][n] = REV(c);
+            }
+        }
+
+        crc_table_empty = false;
+    }
+    else {      /* not first */
+        /* wait for the other guy to finish (not efficient, but rare) */
+        while (crc_table_empty)
+            ;
+    }
+}
+
+#else /* !DYNAMIC_CRC_TABLE */
 #include "crc32.h"
+#endif
 
 #if BYTE_ORDER == LITTLE_ENDIAN
 /* ========================================================================= */
@@ -46,6 +134,11 @@ uint32_t crc32(uint32_t crc, const uint8
 
     if (buf == NULL) return 0UL;
 
+#ifdef DYNAMIC_CRC_TABLE
+    if (crc_table_empty)
+        make_crc_table();
+#endif /* DYNAMIC_CRC_TABLE */
+
     c = (u4)crc;
     c = ~c;
     while (len && ((uintptr_t)buf & 3)) {
@@ -87,6 +180,11 @@ uint32_t crc32(uint32_t crc, const uint8
 
     if (buf == NULL) return 0UL;
 
+#ifdef DYNAMIC_CRC_TABLE
+    if (crc_table_empty)
+        make_crc_table();
+#endif /* DYNAMIC_CRC_TABLE */
+
     c = REV((u4)crc);
     c = ~c;
     while (len && ((uintptr_t)buf & 3)) {
Index: sys/lib/libsa/cread.c
===================================================================
RCS file: /cvsroot/src/sys/lib/libsa/cread.c,v
retrieving revision 1.26
diff -u -p -d -r1.26 cread.c
--- sys/lib/libsa/cread.c	13 Oct 2013 20:09:02 -0000	1.26
+++ sys/lib/libsa/cread.c	10 Jan 2016 14:53:24 -0000
@@ -86,11 +86,11 @@ void	*zcalloc(void *, unsigned int, unsi
 void	zcfree(void *, void *);
 void	zmemcpy(unsigned char *, unsigned char *, unsigned int);
 
+#if defined(SMALL_CRC32) || defined(CREAD_NOCRC32)
 /*
- * The libkern version of this function uses an 8K set of tables.
  * This is the double-loop version of LE CRC32 from if_ethersubr,
- * lightly modified -- it is 200 bytes smaller than the version using
- * a 4-bit table and at least 8K smaller than the libkern version.
+ * lightly modified -- it is ~1KB smaller than libkern version with
+ * DYNAMIC_CRC_TABLE but too much slower especially on ancient poor CPUs.
  */
 #ifndef ETHER_CRC_POLY_LE
 #define ETHER_CRC_POLY_LE	0xedb88320
@@ -98,6 +98,7 @@ void	zmemcpy(unsigned char *, unsigned c
 uint32_t
 crc32(uint32_t crc, const uint8_t *const buf, size_t len)
 {
+#if !defined(CREAD_NOCRC)
 	uint32_t c, carry;
 	size_t i, j;
 
@@ -114,7 +115,9 @@ crc32(uint32_t crc, const uint8_t *const
 	    }
 	}
 	return (crc ^ 0xffffffffU);
+#endif /* defined(CREAD_NOCRC) */
 }
+#endif /* defined(SMALL_CRC32) || defined(CREAD_NOCRC) */
 
 /*
  * compression utilities
@@ -317,7 +320,9 @@ ssize_t
 read(int fd, void *buf, size_t len)
 {
 	struct sd *s;
+#if !defined(CREAD_NOCRC32)
 	unsigned char *start = buf; /* starting point for crc computation */
+#endif
 
 	s = ss[fd];
 
@@ -373,13 +378,24 @@ read(int fd, void *buf, size_t len)
 		s->z_err = inflate(&(s->stream), Z_NO_FLUSH);
 
 		if (s->z_err == Z_STREAM_END) {
+			uint32_t total_out;
+#if !defined(CREAD_NOCRC32)
+			uint32_t crc;
 			/* Check CRC and original size */
 			s->crc = crc32(s->crc, start, (unsigned int)
 					(s->stream.next_out - start));
 			start = s->stream.next_out;
+			crc = getLong(s);
+#else
+			(void)getLong(s);
+#endif
+			total_out = getLong(s);
 
-			if (getLong(s) != s->crc ||
-			    getLong(s) != s->stream.total_out) {
+			if (total_out != s->stream.total_out
+#if !defined(CREAD_NOCRC32)
+			    || crc != s->crc
+#endif
+			    ) {
 
 				s->z_err = Z_DATA_ERROR;
 			} else {
@@ -387,7 +403,9 @@ read(int fd, void *buf, size_t len)
 				check_header(s);
 				if (s->z_err == Z_OK) {
 					inflateReset(&(s->stream));
+#if !defined(CREAD_NOCRC32)
 					s->crc = crc32(0L, Z_NULL, 0);
+#endif
 				}
 			}
 		}
@@ -395,8 +413,10 @@ read(int fd, void *buf, size_t len)
 			break;
 	}
 
+#if !defined(CREAD_NOCRC32)
 	s->crc = crc32(s->crc, start,
 	               (unsigned int)(s->stream.next_out - start));
+#endif
 
 	return (int)(len - s->stream.avail_out);
 }

---

Comments?

---
Izumi Tsutsui



Home | Main Index | Thread Index | Old Index