tech-userlevel archive


split(1): auto-extend suffix length



Hello,

When running split(1) such that it would require more
files than the default suffix-length would allow, it
currently errors out with 'too many files'.

For example, if I have a file with 1 million lines and
run

split file

then split(1) will generate 676 files 'xaa' through
'xzz' and then error out.  That is, I need to know in
advance that the default suffix isn't enough.
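
To put numbers on that: a fixed suffix of n letters allows
26^n names, so the default of two letters gives 676.  The
sketch below is only an illustration and not part of
split(1); suffix_capacity() and files_needed() are made-up
helper names.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of distinct names available with a fixed
 * suffix of sfxlen letters in the range a-z, i.e. 26^sfxlen.
 * Returns 0 if the result does not fit in an unsigned long long.
 */
static unsigned long long
suffix_capacity(size_t sfxlen)
{
	unsigned long long cap = 1;

	while (sfxlen-- > 0) {
		if (cap > ULLONG_MAX / 26)
			return 0;
		cap *= 26;
	}
	return cap;
}

/*
 * Hypothetical helper: number of output files needed to hold 'lines'
 * input lines at 'per_file' lines each (rounded up).
 * Returns 0 if per_file is 0.
 */
static unsigned long long
files_needed(unsigned long long lines, unsigned long long per_file)
{
	if (per_file == 0)
		return 0;
	return lines / per_file + (lines % per_file != 0);
}
```

With the default of 1000 lines per file, files_needed(1000000, 1000)
is 1000, which exceeds suffix_capacity(2) == 676, hence the error.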

Attached is a patch to have split(1) automatically
extend the suffix length as needed, generating longer
file names to accommodate any input length.  Each time
the current names are about to run out, it lengthens
the file name by two characters while keeping the
pattern such that names sort lexically.

I.e., if I split a 1M-line file into (default) 1K-line
files, it generates output into

xaa
xab
...
xyy
xyz
xzaaa
xzaab
...
xzanl

If I split the 1M-line file into 10-line files, it
would generate output into

xaa
xab
...
xyy
xyz
xzaaa
xzaab
...
xzyzy
xzyzz
xzzaaaa
xzzaaab
...
xzzerzd

and if I split the 1M-line file into 1-line files, the
output files generated would be named

xaa
xab
...
xyy
xyz
xzaaa
xzaab
...
xzyzy
xzyzz
xzzaaaa
xzzaaab
...
xzzyzzy
xzzyzzz
xzzzaaaaa
xzzzaaaab
...
xzzzbexin
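
The naming pattern in the listings above can be reproduced
with a small standalone function.  This is a sketch under
stated assumptions, not code from the patch: split_name() is
a made-up name, and it assumes the default prefix "x", the
default initial suffix length of 2, and no '-a'.  Each stage
uses first letters 'a' through 'y' only, so it holds
25 * 26^(k-1) names for a suffix of k letters; the next stage
adds one 'z' to the prefix and one letter to the suffix.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: write the name of the
 * n-th output file (counting from 0) into buf.
 * Returns 0 on success, -1 if buf is too small or n is out of range.
 */
static int
split_name(unsigned long long n, char *buf, size_t buflen)
{
	size_t sfxlen = 2;	/* letters that vary in the current stage */
	size_t nz = 0;		/* 'z' characters absorbed into the prefix */
	unsigned long long cap = 25ULL * 26;	/* names in the current stage */
	size_t i, len;

	/* Skip whole stages until n falls inside one. */
	while (n >= cap) {
		if (cap > ULLONG_MAX / 26)
			return -1;
		n -= cap;
		cap *= 26;
		sfxlen++;
		nz++;
	}

	len = 1 + nz + sfxlen;
	if (len + 1 > buflen)
		return -1;

	buf[0] = 'x';
	memset(buf + 1, 'z', nz);
	/* Write n in base 26 as letters, filling from the last position. */
	for (i = sfxlen; i-- > 0; ) {
		buf[1 + nz + i] = (char)('a' + n % 26);
		n /= 26;
	}
	buf[len] = '\0';
	return 0;
}
```

Called with n = 999, 99999 and 999999 (the last file in each
of the three examples), this yields "xzanl", "xzzerzd" and
"xzzzbexin", and strcmp(3) orders consecutive names the same
way 'cat *' would see them.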


If the user specifies a suffix length via '-a', then
this auto-extending behavior is disabled, and running
out of file names will again result in the
'too many files' error.


This behavior matches what GNU split(1) has done since
coreutils version 8.16.

Does anybody have any objections to this change?

-Jan
Index: split.1
===================================================================
RCS file: /cvsroot/src/usr.bin/split/split.1,v
retrieving revision 1.15
diff -u -p -r1.15 split.1
--- split.1	31 May 2007 01:35:35 -0000	1.15
+++ split.1	29 Jan 2023 22:33:19 -0000
@@ -29,7 +29,7 @@
 .\"
 .\"	@(#)split.1	8.3 (Berkeley) 4/16/94
 .\"
-.Dd May 28, 2007
+.Dd January 28, 2023
 .Dt SPLIT 1
 .Os
 .Sh NAME
@@ -99,7 +99,12 @@ characters in the range
 .Dq Li a-z .
 If
 .Fl a
-is not specified, two letters are used as the suffix.
+is not specified, two letters are used as the initial
+suffix.
+If the output does not fit into the resulting number
+of files, then the suffix length is automatically
+extended as needed such that all output files continue
+to sort in lexical order.
 .Pp
 If the
 .Ar name
Index: split.c
===================================================================
RCS file: /cvsroot/src/usr.bin/split/split.c,v
retrieving revision 1.28
diff -u -p -r1.28 split.c
--- split.c	27 Jan 2023 19:39:04 -0000	1.28
+++ split.c	29 Jan 2023 22:33:19 -0000
@@ -60,6 +60,7 @@ static int file_open;		/* If a file is o
 static int ifd = STDIN_FILENO, ofd = -1; /* Input/output file descriptors. */
 static char *fname;		/* File name prefix. */
 static size_t sfxlen = 2;	/* Suffix length. */
+static int autosfx = 1;		/* Whether to auto-extend the suffix length. */
 
 static void newfile(void);
 static void split1(off_t, int) __dead;
@@ -120,6 +121,7 @@ main(int argc, char *argv[])
 			    (sfxlen = (size_t)strtoul(optarg, &ep, 10)) == 0 ||
 			    *ep != '\0')
 				errx(1, "%s: illegal suffix length.", optarg);
+			autosfx = 0;
 			break;
 		case 'n':		/* Chunks. */
 			if (!isdigit((unsigned char)optarg[0]) ||
@@ -323,6 +325,48 @@ newfile(void)
 		err(1, "%s", fname);
 
 	quot = fnum;
+
+	/* If '-a' is not specified, then we automatically expand the
+	 * suffix length to accommodate splitting all input.  We do this
+	 * by moving the suffix pointer (fpnt) forward and incrementing
+	 * sfxlen by one, thereby yielding an additional two characters
+	 * and allowing all output files to sort such that 'cat *' yields
+	 * the input in order.  I.e., the order is '... xyy xyz xzaaa
+	 * xzaab ... xzyzy, xzyzz, xzzaaaa, xzzaaab' and so on. */
+	if (autosfx) {
+		int allz = 1;
+		for (i = sfxlen - 1; i > 0; i--) {
+			if (fpnt[i] != 'z') {
+				allz = 0;
+				break;
+			}
+		}
+
+		if (allz && (fpnt[0] == 'y')) {
+			if ((fname = realloc(fname, strlen(fname) + sfxlen + 2 + 1)) == NULL)
+				err(EXIT_FAILURE, NULL);
+				/* NOTREACHED */
+
+			fpnt = fname + strlen(fname) - sfxlen;
+			fpnt[sfxlen + 2] = '\0';
+
+			fpnt[0] = 'z';
+			fpnt[1] = 'a';
+
+			/*  Basename | Suffix
+			 *  before:
+			 *  x        | yz
+			 *  after:
+			 *  xz       | a.. */
+			fpnt++;
+			sfxlen++;
+
+			/* Reset so we start back at all 'a's in our extended suffix. */
+			quot = 0;
+			fnum = 0;
+		}
+	}
+
 	for (i = sfxlen - 1; i >= 0; i--) {
 		fpnt[i] = quot % 26 + 'a';
 		quot = quot / 26;

