NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
bin/50638: Extreme slowness on loading gzipped kernels on old CPUs
>Number: 50638
>Category: bin
>Synopsis: Extreme slowness on loading gzipped kernels on old CPUs
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: bin-bug-people
>State: open
>Class: change-request
>Submitter-Id: net
>Arrival-Date: Sun Jan 10 17:40:00 +0000 2016
>Originator: Izumi Tsutsui
>Release: NetBSD 7.0
>Organization:
>Environment:
System: NetBSD 7.0
Architecture: all, but especially m68k and vax etc.
Machine: ditto
>Description:
As mentioned in the following post, loading gzipped kernels
via libsa based bootloader much slower since NetBSD 5.x
than 4.x and prior:
http://mail-index.netbsd.org/port-vax/2014/05/25/msg002214.html
This post is about vax, but loading gzipped kernel on x68k is
also extremely slower than gunzipped kernels.
This seems caused by crc32() function changes in the following commits,
which added modified crc32() was added into libkern based on zlib one:
http://mail-index.netbsd.org/source-changes/2009/03/25/msg218803.html
http://mail-index.netbsd.org/source-changes/2009/03/25/msg218817.html
In the 4.x and prior, _STANDALONE programs used 32 bit tables
with "DYNAMIC_CRC_TABLE" option (which generated CRC tables runtime)
for crc32() calculations:
http://nxr.netbsd.org/xref/src/common/dist/zlib/crc32.c?r=1.4#84
However, after the above "crc32() in libkern" changes,
_STANDALONE programs were changed to use "dumb" CRC functions
based on its definition:
http://cvsweb.netbsd.org/cgi-bin/cvsweb.cgi/src/sys/lib/libsa/cread.c.diff?r1=1.22&r2=1.23&f=u
As the comment claims this was smaller than the table version,
but actually it's also extremely slower on ancient CPUs
(requires additional several minutes).
>How-To-Repeat:
Boot gzipped kernels from any CD or HDD etc.
>Fix:
* Approach:
There are three approach to handle this issue:
(1) leave as is and use the dumb crc32() function (for small binary)
(2) pull "DYNAMIC_CRC_TABLE" implementation from zlib into libkern/crc32.c
(3) completely disable gunzip crc32() calculation in cread() function
* Pros. and cons:
(1) lowest risk? (slow but still works)
(2) better compromise?
(3) it looks CRC is not used in most case;
the CRC can be confirmed when the whole file contents are read,
but our libsa loadfile() finction loads only text, data, bss,
and symbol table, which could also be padded by ldscript.
* Sample implementation:
I have confirmed the following dumb change for these three cases.
- "-DSMALL_CRC32" specifies (1)
- the default (no additional macro) specifies (2)
(note: actual CRC calculation results are not confirmed, nor kernels)
- "-DCREAD_NOCRC" specifies (3)
The size of boot binaries and time of loading the netbsd-INSTALL.gz
kernel on LUNA-II with NetBSD/luna68k 7.0 are:
(1) -DSMALL_CRC32 (dumb crc32(), as current implementation)
size:
section size addr
.text 69384 7340032
.data 2432 7416896
.bss 17952 7419328
time:
9 min 22 sec
(2) with DYNAMIC_CRC_TABLE crc32() (as past default)
size:
section size addr
.text 70416 7340032
.data 2436 7417936
.bss 26144 7420376
time:
4 min 21 sec
(3) -DCREAD_NOCRC
size:
section size addr
.text 69288 7340032
.data 2432 7416800
.bss 17952 7419232
time:
3 min 59 sec
---
Index: sys/arch/luna68k/stand/boot/Makefile
===================================================================
RCS file: /cvsroot/src/sys/arch/luna68k/stand/boot/Makefile,v
retrieving revision 1.11
diff -u -p -d -r1.11 Makefile
--- sys/arch/luna68k/stand/boot/Makefile 16 Jan 2014 01:15:34 -0000 1.11
+++ sys/arch/luna68k/stand/boot/Makefile 10 Jan 2016 14:53:17 -0000
@@ -19,6 +19,8 @@ CPPFLAGS+= -DSUPPORT_DHCP -DSUPPORT_BOOT
#CPPFLAGS+= -DRPC_DEBUG -DRARP_DEBUG -DNET_DEBUG -DDEBUG -DPARANOID
CPPFLAGS+= -DLIBSA_ENABLE_LS_OP
CPPFLAGS+= -DLIBSA_PRINTF_WIDTH_SUPPORT
+CPPFLAGS+= -DCREAD_NOCRC32
+#CPPFLAGS+= -DSMALL_CRC32
CFLAGS= -Os -msoft-float
CFLAGS+= -ffreestanding
Index: sys/lib/libkern/crc32.c
===================================================================
RCS file: /cvsroot/src/sys/lib/libkern/crc32.c,v
retrieving revision 1.4
diff -u -p -d -r1.4 crc32.c
--- sys/lib/libkern/crc32.c 26 Mar 2009 22:18:14 -0000 1.4
+++ sys/lib/libkern/crc32.c 10 Jan 2016 14:53:24 -0000
@@ -16,6 +16,10 @@
/* @(#) Id */
+#if defined(_STANDALONE)
+#define DYNAMIC_CRC_TABLE
+#endif
+
#include <sys/param.h>
#include <machine/endian.h>
@@ -29,7 +33,91 @@ typedef uint32_t u4;
* Tables of CRC-32s of all single-byte values, made by make_crc_table().
*/
#include <lib/libkern/libkern.h>
+
+#ifdef DYNAMIC_CRC_TABLE
+
+static volatile bool crc_table_empty = true;
+static uint32_t crc_table[8][256];
+static void make_crc_table(void);
+
+/*
+ Generate tables for a byte-wise 32-bit CRC calculation on the polynomial:
+ x^32+x^26+x^23+x^22+x^16+x^12+x^11+x^10+x^8+x^7+x^5+x^4+x^2+x+1.
+
+ Polynomials over GF(2) are represented in binary, one bit per coefficient,
+ with the lowest powers in the most significant bit. Then adding polynomials
+ is just exclusive-or, and multiplying a polynomial by x is a right shift by
+ one. If we call the above polynomial p, and represent a byte as the
+ polynomial q, also with the lowest power in the most significant bit (so the
+ byte 0xb1 is the polynomial x^7+x^3+x+1), then the CRC is (q*x^32) mod p,
+ where a mod b means the remainder after dividing a by b.
+
+ This calculation is done using the shift-register method of multiplying and
+ taking the remainder. The register is initialized to zero, and for each
+ incoming bit, x^32 is added mod p to the register if the bit is a one (where
+ x^32 mod p is p+x^32 = x^26+...+1), and the register is multiplied mod p by
+ x (which is shifting right by one and adding x^32 mod p if the bit shifted
+ out is a one). We start with the highest power (least significant bit) of
+ q and repeat for all eight bits of q.
+
+ The first table is simply the CRC of all possible eight bit values. This is
+ all the information needed to generate CRCs on data a byte at a time for all
+ combinations of CRC register values and incoming bytes. The remaining tables
+ allow for word-at-a-time CRC calculation for both big-endian and little-
+ endian machines, where a word is four bytes.
+*/
+static void make_crc_table(void)
+{
+ uint32_t c;
+ int n, k;
+ uint32_t poly; /* polynomial exclusive-or pattern */
+ /* terms of polynomial defining this crc (except x^32): */
+ static volatile bool first = true; /* flag to limit concurrent making */
+ static const unsigned char p[] = {0,1,2,4,5,7,8,10,11,12,16,22,23,26};
+
+ /* See if another task is already doing this (not thread-safe, but better
+ than nothing -- significantly reduces duration of vulnerability in
+ case the advice about DYNAMIC_CRC_TABLE is ignored) */
+ if (first) {
+ first = false;
+
+ /* make exclusive-or pattern from polynomial (0xedb88320UL) */
+ poly = 0;
+ for (n = 0; n < sizeof(p)/sizeof(unsigned char); n++)
+ poly |= 1 << (31 - p[n]);
+
+ /* generate a crc for every 8-bit value */
+ for (n = 0; n < 256; n++) {
+ c = (uint32_t)n;
+ for (k = 0; k < 8; k++)
+ c = c & 1 ? poly ^ (c >> 1) : c >> 1;
+ crc_table[0][n] = c;
+ }
+
+ /* generate crc for each value followed by one, two, and three zeros,
+ and then the byte reversal of those as well as the first table */
+ for (n = 0; n < 256; n++) {
+ c = crc_table[0][n];
+ crc_table[4][n] = REV(c);
+ for (k = 1; k < 4; k++) {
+ c = crc_table[0][c & 0xff] ^ (c >> 8);
+ crc_table[k][n] = c;
+ crc_table[k + 4][n] = REV(c);
+ }
+ }
+
+ crc_table_empty = false;
+ }
+ else { /* not first */
+ /* wait for the other guy to finish (not efficient, but rare) */
+ while (crc_table_empty)
+ ;
+ }
+}
+
+#else /* !DYNAMIC_CRC_TABLE */
#include "crc32.h"
+#endif
#if BYTE_ORDER == LITTLE_ENDIAN
/* ========================================================================= */
@@ -46,6 +134,11 @@ uint32_t crc32(uint32_t crc, const uint8
if (buf == NULL) return 0UL;
+#ifdef DYNAMIC_CRC_TABLE
+ if (crc_table_empty)
+ make_crc_table();
+#endif /* DYNAMIC_CRC_TABLE */
+
c = (u4)crc;
c = ~c;
while (len && ((uintptr_t)buf & 3)) {
@@ -87,6 +180,11 @@ uint32_t crc32(uint32_t crc, const uint8
if (buf == NULL) return 0UL;
+#ifdef DYNAMIC_CRC_TABLE
+ if (crc_table_empty)
+ make_crc_table();
+#endif /* DYNAMIC_CRC_TABLE */
+
c = REV((u4)crc);
c = ~c;
while (len && ((uintptr_t)buf & 3)) {
Index: sys/lib/libsa/cread.c
===================================================================
RCS file: /cvsroot/src/sys/lib/libsa/cread.c,v
retrieving revision 1.26
diff -u -p -d -r1.26 cread.c
--- sys/lib/libsa/cread.c 13 Oct 2013 20:09:02 -0000 1.26
+++ sys/lib/libsa/cread.c 10 Jan 2016 14:53:24 -0000
@@ -86,11 +86,11 @@ void *zcalloc(void *, unsigned int, unsi
void zcfree(void *, void *);
void zmemcpy(unsigned char *, unsigned char *, unsigned int);
+#if defined(SMALL_CRC32) || defined(CREAD_NOCRC32)
/*
- * The libkern version of this function uses an 8K set of tables.
* This is the double-loop version of LE CRC32 from if_ethersubr,
- * lightly modified -- it is 200 bytes smaller than the version using
- * a 4-bit table and at least 8K smaller than the libkern version.
+ * lightly modified -- it is ~1KB smaller than libkern version with
+ * DYNAMIC_CRC_TABLE but too much slower especially on ancient poor CPUs.
*/
#ifndef ETHER_CRC_POLY_LE
#define ETHER_CRC_POLY_LE 0xedb88320
@@ -98,6 +98,7 @@ void zmemcpy(unsigned char *, unsigned c
uint32_t
crc32(uint32_t crc, const uint8_t *const buf, size_t len)
{
+#if !defined(CREAD_NOCRC)
uint32_t c, carry;
size_t i, j;
@@ -114,7 +115,9 @@ crc32(uint32_t crc, const uint8_t *const
}
}
return (crc ^ 0xffffffffU);
+#endif /* defined(CREAD_NOCRC) */
}
+#endif /* defined(SMALL_CRC32) || defined(CREAD_NOCRC) */
/*
* compression utilities
@@ -317,7 +320,9 @@ ssize_t
read(int fd, void *buf, size_t len)
{
struct sd *s;
+#if !defined(CREAD_NOCRC32)
unsigned char *start = buf; /* starting point for crc computation */
+#endif
s = ss[fd];
@@ -373,13 +378,24 @@ read(int fd, void *buf, size_t len)
s->z_err = inflate(&(s->stream), Z_NO_FLUSH);
if (s->z_err == Z_STREAM_END) {
+ uint32_t total_out;
+#if !defined(CREAD_NOCRC32)
+ uint32_t crc;
/* Check CRC and original size */
s->crc = crc32(s->crc, start, (unsigned int)
(s->stream.next_out - start));
start = s->stream.next_out;
+ crc = getLong(s);
+#else
+ (void)getLong(s);
+#endif
+ total_out = getLong(s);
- if (getLong(s) != s->crc ||
- getLong(s) != s->stream.total_out) {
+ if (total_out != s->stream.total_out
+#if !defined(CREAD_NOCRC32)
+ || crc != s->crc
+#endif
+ ) {
s->z_err = Z_DATA_ERROR;
} else {
@@ -387,7 +403,9 @@ read(int fd, void *buf, size_t len)
check_header(s);
if (s->z_err == Z_OK) {
inflateReset(&(s->stream));
+#if !defined(CREAD_NOCRC32)
s->crc = crc32(0L, Z_NULL, 0);
+#endif
}
}
}
@@ -395,8 +413,10 @@ read(int fd, void *buf, size_t len)
break;
}
+#if !defined(CREAD_NOCRC32)
s->crc = crc32(s->crc, start,
(unsigned int)(s->stream.next_out - start));
+#endif
return (int)(len - s->stream.avail_out);
}
---
Comments?
---
Izumi Tsutsui
Home |
Main Index |
Thread Index |
Old Index