NetBSD-Users archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: Semetic/fuzzy-logic code comparison tool ?



On Tue, Dec 13, 2016 at 03:08:02PM -0700, Swift Griggs wrote:
> Let's say one wants to make general statement that "This code is 30%
> the same as that code!" Another example would be someone wants to
> make the statement that "XX% of this code really came from project
> X." In my case I'm only interested in "honest" code, not trying to
> catch someone stealing/permuting existing code. Oh, and everything I
> care about is in C.
> 
> My questions are:
> 
> * Are there tools that already do this?

I don't know if there is a tool that does what you want---please let
us know what you find.

You might have a look at 'spdiff', a program that infers "semantic
patches" that can be applied with the Coccinelle program, spatch.  I'm
not sure how you would assign a "sameness" to the result.  Shortness of
the patch, maybe?

You might look at my ARFE tools, which I have put in NetBSD's othersrc
repo.  ARFE uses a dynamic programming algorithm to align one text with
another, seeking to match like characters or lexical items ("tokens")
with like while minimizing the amount of unmatched text ("residue").
Imperfect matches and residue are added up to produce a "score" for the
alignment.  The algorithm, a variant of Hirschberg's algorithm, seeks to
minimize the score.  ARFE understands some common tokens like numbers,
whitespace, and C-like identifiers.

ARFE does not "understand" nested structures, yet.  Also, it does not
favor an alignment where every instance of a token x in the first text
is replaced by the token y in a second text, over an alignment where x
has assorted replacements.  I cannot make up my mind whether it would be
more difficult to make ARFE understand nested structures, or to favor
alignments where one token always replaced another.  Since you are
concerned with comparing C programs, you would want to do both, and you
would want to respect the scope rules.

Speaking of nested structures, there are algorithms for aligning trees
rather than strings.  You could conceivably compare the abstract syntax
trees produced by two C programs, and judge their sameness that way.

> I know that this is essentially an AI problem and thus can get
> complex in a hurry.

I wouldn't call it an AI problem, myself.  It's an optimization problem.
Or maybe that is just the way I choose to think of it. :-)

I suspect that it is easier to produce a tool that produces useful
results on many (but not all) texts consisting of tokens and nested
structures that are common on the web, than to produce a tool that in
produces a perfect result on, say, every compilable C program.

Dave

-- 
David Young
dyoung%pobox.com@localhost    Urbana, IL    (217) 721-9981


Home | Main Index | Thread Index | Old Index