NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
bin/60307: diff(1) pathalogical case with mostly similar inputs
>Number: 60307
>Category: bin
>Synopsis: diff(1) pathalogical case with mostly similar inputs
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Jun 02 13:15:00 +0000 2026
>Originator: Simon Burge
>Release: NetBSD 11.99.5 (-current after Dec 2025)
>Organization:
Disorganised
>Environment:
System: NetBSD thoreau.thistledown.com.au 11.99.5 NetBSD 11.99.5 (THOREAU.git) #79: Sun Apr 19 00:38:42 AEST 2026 simonb%thoreau.thistledown.com.au@localhost:/NetBSD/netbsd-zfsboot-git/sys/arch/amd64/compile/THOREAU amd64
Architecture: x86_64
Machine: amd64
>Description:
diff(1) has a pathalogical O(n^2) case with input on mostly
similar files where there are some differences on lines
"far" apart. First observed comparing the changelog between
xorg-server-21.1.22 and xorg-server-21.1.23:
thoreau 10> time diff xorg-server-21.1.2?/ChangeLog > /dev/null
40.355u 0.023s 0:40.41 99.9% 0+0k 0+1io 0pf+0w
thoreau 11> time ./diff-netbsd-11-rc4 xorg-server-21.1.2?/ChangeLog > /dev/null
0.036u 0.016s 0:00.05 80.0% 19+6k 0+1io 0pf+0w
A number of the differences between these two files are lines
where there is a change in the line length. For example:
193681c194166
< Merge: caffac384 3930da3f6
---
> Merge: caffac384320 3930da3f6209
193758c194243
< Merge: c5b3aa120 9fa73721f
---
> Merge: c5b3aa120bf9 9fa73721f0c3
This can be reduced down to a simple test case with a file with
a couple of thousand single character lines where a copy of the
file has an extra character at the end of the first and last
lines. See How-To-Repeat below for this.
Compiling diff with -O0 -pg shows that the number of calls
to diffreg.c:search() with the test cases below increase
quadratically. For a given number of input lines here
are the number of times search() is called:
input % cumulative self self total
lines time seconds seconds calls Ts/call Ts/call name
102 0.00 0.00 0.00 5250 0.00 0.00 search
202 0.00 0.00 0.00 20500 0.00 0.00 search
402 100.00 0.00 0.00 81000 0.00 0.00 search
802 85.71 0.02 0.02 322000 0.00 0.00 search
1602 80.00 0.08 0.08 1284000 0.00 0.00 search
3202 79.86 0.22 0.22 5128000 0.00 0.00 search
There appears to be some effect on "similar" lines in the
inputs. Using the test case below but with "seq 5000" for
example where each line is different results in a fast diff
instead of a slow diff.
>How-To-Repeat:
thoreau 10> seq -f a 5000 > /tmp/a
thoreau 11> sed -e '1s/$/a/' -e '$s/$/a/' /tmp/a > /tmp/b
thoreau 12> time diff /tmp/[ab] > /dev/null
0.348u 0.000s 0:00.34 100.0% 0+0k 0+1io 0pf+0w
thoreau 13> time /tmp/diff-netbsd-11-rc4 /tmp/[ab] > /dev/null
0.001u 0.000s 0:00.00 0.0% 0+0k 0+1io 0pf+0w
thoreau 14> seq -f a 10000 > /tmp/a
thoreau 15> sed -e '1s/$/a/' -e '$s/$/a/' /tmp/a > /tmp/b
thoreau 16> time diff /tmp/[ab] > /dev/null
1.520u 0.007s 0:01.52 100.0% 0+0k 0+1io 0pf+0w
thoreau 17> time /tmp/diff-netbsd-11-rc4 /tmp/[ab] > /dev/null
0.000u 0.000s 0:00.00 0.0% 0+0k 0+1io 0pf+0w
thoreau 18> seq -f a 20000 > /tmp/a
thoreau 19> sed -e '1s/$/a/' -e '$s/$/a/' /tmp/a > /tmp/b
thoreau 20> time diff /tmp/[ab] > /dev/null
6.162u 0.003s 0:06.18 99.6% 0+0k 0+1io 0pf+0w
thoreau 21> time /tmp/diff-netbsd-11-rc4 /tmp/[ab] > /dev/null
0.000u 0.003s 0:00.00 0.0% 0+0k 0+1io 0pf+0w
>Fix:
Yes, please.
Home |
Main Index |
Thread Index |
Old Index