NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

bin/60307: diff(1) pathalogical case with mostly similar inputs



>Number:         60307
>Category:       bin
>Synopsis:       diff(1) pathalogical case with mostly similar inputs
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jun 02 13:15:00 +0000 2026
>Originator:     Simon Burge
>Release:        NetBSD 11.99.5 (-current after Dec 2025)
>Organization:
Disorganised
>Environment:
System: NetBSD thoreau.thistledown.com.au 11.99.5 NetBSD 11.99.5 (THOREAU.git) #79: Sun Apr 19 00:38:42 AEST 2026 simonb%thoreau.thistledown.com.au@localhost:/NetBSD/netbsd-zfsboot-git/sys/arch/amd64/compile/THOREAU amd64
Architecture: x86_64
Machine: amd64
>Description:
        diff(1) has a pathalogical O(n^2) case with input on mostly
        similar files where there are some differences on lines
        "far" apart.  First observed comparing the changelog between
        xorg-server-21.1.22 and xorg-server-21.1.23:

	   thoreau 10> time diff xorg-server-21.1.2?/ChangeLog > /dev/null
	   40.355u 0.023s 0:40.41 99.9%	0+0k 0+1io 0pf+0w
	   thoreau 11> time ./diff-netbsd-11-rc4 xorg-server-21.1.2?/ChangeLog > /dev/null
	   0.036u 0.016s 0:00.05 80.0%	19+6k 0+1io 0pf+0w

	A number of the differences between these two files are lines
	where there is a change in the line length.  For example:

	   193681c194166
	   < Merge: caffac384 3930da3f6
	   ---
	   > Merge: caffac384320 3930da3f6209
	   193758c194243
	   < Merge: c5b3aa120 9fa73721f
	   ---
	   > Merge: c5b3aa120bf9 9fa73721f0c3

	This can be reduced down to a simple test case with a file with
	a couple of thousand single character lines where a copy of the
	file has an extra character at the end of the first and last
	lines.  See How-To-Repeat below for this.

	Compiling diff with -O0 -pg shows that the number of calls
	to diffreg.c:search() with the test cases below increase
	quadratically.  For a given number of input lines here
	are the number of times search() is called:

	input    %   cumulative   self              self     total
	lines   time   seconds   seconds    calls  Ts/call  Ts/call  name
	  102    0.00      0.00     0.00     5250     0.00     0.00  search
	  202    0.00      0.00     0.00    20500     0.00     0.00  search
	  402  100.00      0.00     0.00    81000     0.00     0.00  search
	  802   85.71      0.02     0.02   322000     0.00     0.00  search
	 1602   80.00      0.08     0.08  1284000     0.00     0.00  search
	 3202   79.86      0.22     0.22  5128000     0.00     0.00  search

	There appears to be some effect on "similar" lines in the
	inputs.  Using the test case below but with "seq 5000" for
	example where each line is different results in a fast diff
	instead of a slow diff.

>How-To-Repeat:
	thoreau 10> seq -f a 5000 > /tmp/a
	thoreau 11> sed -e '1s/$/a/' -e '$s/$/a/' /tmp/a > /tmp/b
	thoreau 12> time diff /tmp/[ab] > /dev/null
	0.348u 0.000s 0:00.34 100.0%	0+0k 0+1io 0pf+0w
	thoreau 13> time /tmp/diff-netbsd-11-rc4 /tmp/[ab] > /dev/null
	0.001u 0.000s 0:00.00 0.0%	0+0k 0+1io 0pf+0w

	thoreau 14> seq -f a 10000 > /tmp/a
	thoreau 15> sed -e '1s/$/a/' -e '$s/$/a/' /tmp/a > /tmp/b
	thoreau 16> time diff /tmp/[ab] > /dev/null
	1.520u 0.007s 0:01.52 100.0%	0+0k 0+1io 0pf+0w
	thoreau 17> time /tmp/diff-netbsd-11-rc4 /tmp/[ab] > /dev/null
	0.000u 0.000s 0:00.00 0.0%	0+0k 0+1io 0pf+0w

	thoreau 18> seq -f a 20000 > /tmp/a
	thoreau 19> sed -e '1s/$/a/' -e '$s/$/a/' /tmp/a > /tmp/b
	thoreau 20> time diff /tmp/[ab] > /dev/null
	6.162u 0.003s 0:06.18 99.6%	0+0k 0+1io 0pf+0w
	thoreau 21> time /tmp/diff-netbsd-11-rc4 /tmp/[ab] > /dev/null
	0.000u 0.003s 0:00.00 0.0%	0+0k 0+1io 0pf+0w

>Fix:
	Yes, please.




Home | Main Index | Thread Index | Old Index