NetBSD-Users archive
Semantic/fuzzy-logic code comparison tool?
Let's say one wants to make a general statement that "This code is 30% the
same as that code!" Another example would be someone who wants to make the
statement that "XX% of this code really came from project X." In my case
I'm only interested in "honest" code; I'm not trying to catch someone
stealing/permuting existing code. Oh, and everything I care about is in C.
My questions are:
* Are there tools that already do this?
* What do you do about whitespace, simple variable permutation, and
formatting issues? I.e., times when a tiny thing changes the "checksum"
of your content but it's essentially still the same code.
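One common way to get a "checksum" that survives reformatting and renaming is to hash a normalized token stream instead of the raw text. Below is a minimal sketch of that idea, not a finished tool. The names `normalize`, `fingerprint`, `TOKEN_RE`, and `COMMENT_RE` are made up for illustration, the tokenizer is a crude regex rather than a real C lexer, and the comment stripper will misfire on comment markers inside string literals. Every non-keyword identifier (including function names such as `printf`) is replaced by a placeholder numbered by order of first appearance.

```python
import hashlib
import re

# The 32 keywords of C89; later standards add more (inline, restrict, _Bool, ...).
C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}

# Crude comment matcher: /* ... */ and // ... (ignores string-literal edge cases).
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

# Crude tokenizer: whitespace is simply never matched, so it disappears.
TOKEN_RE = re.compile(r"""
    [A-Za-z_]\w*            # identifiers and keywords
  | \d[\w.]*                # numeric literals (rough approximation)
  | "(?:\\.|[^"\\])*"       # string literals
  | '(?:\\.|[^'\\])*'       # character literals
  | \S                      # any other single non-space character
""", re.VERBOSE)


def normalize(source):
    """Return the source as space-separated tokens with identifiers renamed."""
    source = COMMENT_RE.sub(" ", source)
    names = {}   # original identifier -> placeholder such as ID0, ID1, ...
    out = []
    for tok in TOKEN_RE.findall(source):
        if (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS:
            tok = names.setdefault(tok, "ID%d" % len(names))
        out.append(tok)
    return " ".join(out)


def fingerprint(source):
    """SHA-256 of the normalized token stream."""
    return hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()


a = "int sum(int a, int b) { return a + b; }"
b = "int add( int x,\n         int y )\n{\n    /* add them */\n    return x + y;\n}\n"

print(normalize(a))   # int ID0 ( int ID1 , int ID2 ) { return ID1 + ID2 ; }
print(fingerprint(a) == fingerprint(b))   # True
```

A whole-file hash like this is all-or-nothing, so it only answers "is this the same code?", not "how much is the same?". It is probably more useful applied per function or per block.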
I know that this is essentially an AI problem and thus can get complex in
a hurry. I was writing some scripts to take a swing at some kind of
prototype (and I even made some early progress), but then I thought "surely
someone's already done this, genius."
Anyone know of any place to start, here? I know it's awfully arbitrary and
subjective. However, as long as the algorithm isn't partisan and generates
reproducible and at least somewhat defensible results, I can live with the
subjectivity.
-Swift
Now, for those who might be somewhat interested, this is what I started
with on tissue paper (just notes). Feel free to critique if you have ideas
or know of preexisting stuff I should look at. I'd rather not invent this
wheel.
* Replace every run of whitespace with a single space, yeah, for sure.
Forget about line-ending characters, too (CR, LF, etc.).
* Possibly use something like Soundex on variables? Hmm, how to detect
when the same variable is used under a new name? Leading/trailing
characters?
* Count braces and nesting levels? Does this generate a unique enough
pattern? Add it to an overall heuristic score, Bayesian-style?
* How to solve the problem of old code with a new location? Also when it's
slightly permuted?
* What will I use for quanta/units to analyze? Going by lines is dumb
since it implies whitespace (which is ignored). By function? By sets of
braces or parens? By scope? Multiple types of quanta? Hmmmm....
* I'll start with multiple scripts. Each one builds its own score based
on a different technique. Then we aggregate the scores and see which
ones are most useful/accurate for my use cases. Then see if any track
together or diverge in different cases.
* What about old K&R code that's simply been updated with a newer function
declaration and C99 or C11 stuff? Should be able to use a regex to detect
this?
* Probably better to write the tool in a scripting language, too much
string handling to dork with it in C.
* If one file is 100k and another is 50k, make sure that the tools never
assert a difference of less than 50%? What if file B is just 2x a bunch
of code still found in file A? Grrr... think...
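
Several of the notes above (code that has moved, the choice of quanta, and the 100k vs. 50k problem) can be approached with overlapping k-token windows ("shingles") compared as sets. This is a rough sketch under stated assumptions, not a recommendation of any particular tool: the function names and the choice of k=4 are arbitrary, and the tokenizer is a simple regex that would ideally run on text whose identifiers and whitespace have already been normalized. Two scores are computed. Jaccard similarity is symmetric and cannot exceed the ratio of the smaller shingle set to the larger one. Containment is asymmetric and answers "what fraction of this file also appears in that one?"

```python
import re


def tokens(text):
    """Split into words/numbers and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", text)


def shingles(text, k=4):
    """Set of all runs of k consecutive tokens; position in the file is lost."""
    toks = tokens(text)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def jaccard(a, b):
    """Symmetric similarity: shared shingles over all distinct shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a, b):
    """Asymmetric: fraction of a's shingles that also occur in b."""
    return len(a & b) / len(a) if a else 1.0


f = "int f(void) { return 1; }"
g = "int g(void) { return 2; }"

file_a = f + "\n" + g      # g sits at the end of file A
file_b = g                 # the same g, relocated to the top of a smaller file B

sa, sb = shingles(file_a), shingles(file_b)
print("jaccard(A, B)     =", jaccard(sa, sb))
print("containment(B, A) =", containment(sb, sa))   # 1.0: all of B is in A
print("containment(A, B) =", containment(sa, sb))
```

Because the shingles are kept in a set, repeating the same block several times in one file adds few or no new shingles, and relocating a block only disturbs the windows that straddle its edges. Reporting both numbers seems more defensible than a single percentage: "all of B appears in A, but B covers less than half of A" is a statement that can be reproduced and checked.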
Those were just rough notes with my ideas.