NetBSD-Users archive


Semantic/fuzzy-logic code comparison tool?




Let's say one wants to make a general statement like "This code is 30% the same as that code!" Another example would be someone wanting to state that "XX% of this code really came from project X." In my case I'm only interested in "honest" code; I'm not trying to catch someone stealing or permuting existing code. Oh, and everything I care about is in C.

My questions are:

* Are there tools that already do this?

* What do you do about whitespace, simple variable permutation, and
  formatting issues? I.e., times when a tiny change alters the "checksum"
  of your content but it's essentially still the same code.
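
To make the second question concrete, here is a minimal sketch in Python of one possible normalization pass. It is only an illustration, not an existing tool: the names normalize, C_KEYWORDS, TOKEN_RE and COMMENT_RE are made up, the keyword set is deliberately incomplete, and the regex tokenizer does not treat string literals or the preprocessor specially. Identifiers are replaced by placeholders in order of first appearance, so a consistent rename does not change the result. Note that function names get renamed too, which may or may not be what you want.

```python
import re

# Hypothetical, deliberately incomplete keyword set; a real tool would
# list every C keyword (and probably common typedef names as well).
C_KEYWORDS = {
    "int", "char", "void", "return", "if", "else", "for", "while",
    "struct", "static", "const", "unsigned", "long", "sizeof",
}

# Identifiers/keywords, runs of digits, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
# Block comments and line comments (naive: ignores string literals).
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def normalize(source):
    """Return a token list with comments, whitespace and names removed."""
    source = COMMENT_RE.sub(" ", source)
    names = {}  # original identifier -> positional placeholder
    tokens = []
    for tok in TOKEN_RE.findall(source):
        is_word = tok[0].isalpha() or tok[0] == "_"
        if is_word and tok not in C_KEYWORDS:
            # Same identifier always maps to the same placeholder.
            tok = names.setdefault(tok, "ID%d" % len(names))
        tokens.append(tok)
    return tokens


a = "int sum(int a, int b) { return a + b; }"
b = "int add( int x,\n    int y )\n{\n    /* add them */\n    return x+y;\n}"
same = normalize(a) == normalize(b)  # reformatted and renamed, same tokens
```

A checksum taken over the normalized token list, rather than over the raw bytes, would then survive reformatting and simple renames.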

I know that this is essentially an AI problem and thus can get complex in a hurry. I was writing some scripts to take a swing at some kind of prototype (and I even made some early progress), but then I thought, "surely someone's already done this, genius."

Anyone know of any place to start, here? I know it's awfully arbitrary and subjective. However, as long as the algorithm isn't partisan and generates reproducible and at least somewhat defensible results, I can live with the subjectivity.

-Swift


Now, for those who might be somewhat interested, this is what I started with on tissue paper (just notes). Feel free to critique if you have ideas or know of preexisting stuff I should look at. I'd rather not reinvent this wheel.

* Replace every run of whitespace with a single space, yeah, for sure.
  Forget about line-ending characters, too (CR, LF, etc.).

* Possibly use something like soundex on variables? Hmm, how to detect
  when the same variable is used under a new name? Leading/trailing
  characters?

* Count braces and nesting levels? Does this generate a unique enough
  pattern? Add it to an overall heuristic score, Bayesian style?

* How to solve the problem of old code with a new location? Also when it's
  slightly permuted?

* What will I use for quanta/units to analyze? Going by lines is dumb
  since it depends on whitespace (which is ignored). By function? By sets
  of braces or parens? By scope? Multiple types of quanta? Hmmmm....

* I'll start with multiple scripts. Each one builds its own score based
  on a different technique. Then we aggregate the scores and see which
  ones are most useful/accurate for my use cases. Then see if any track
  together or diverge in different cases.

* What about old K&R code that's simply been updated with newer function
  declarations and C99 or C11 stuff? Should be able to detect this with a
  regex?

* Probably better to write the tool in a scripting language; too much
  string handling to dork with it in C.

* If one file is 100k and another 50k make sure that the tools never
  assert a difference of less than 50%? What if file B is just 2x a bunch
  of code still found in file A? Grrr... think...
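
For the "old code in a new location" and file-size worries, one candidate technique is to compare sets of overlapping k-token windows (often called shingles) instead of lines. The sketch below is an assumption about how that could look, not something I have validated: shingles, jaccard and containment are made-up names, k=5 is an arbitrary choice, and the tokenizer is the same naive regex idea as before.

```python
import re

# Identifiers, runs of digits, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(source, k=5):
    """Set of all k-token windows in the source (whitespace is ignored)."""
    tokens = TOKEN_RE.findall(source)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Symmetric score: shared shingles over all distinct shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a, b):
    """Asymmetric score: fraction of a's shingles that also occur in b."""
    return len(a & b) / len(a) if a else 1.0


f = "int f(int x) { return x * 2; }\n"
g = "int g(int y) { return y + 1; }\n"
original = shingles(f + g)
reordered = shingles(g + f)    # same functions, different location
doubled = shingles(f + g + f + g)  # the same code pasted twice

moved_score = containment(original, reordered)
doubled_score = containment(original, doubled)
```

Because the windows are collected into sets, position in the file does not matter, and code that is merely repeated counts once. Only the windows that straddle the boundary between the two functions change when they are reordered. The containment score is asymmetric on purpose: it answers "how much of A shows up in B?", which seems closer to the "XX% of this code came from project X" statement than a symmetric score would be.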

Those were just rough notes with my ideas.
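
The brace-counting idea from the notes can be sketched like this, assuming Python's standard difflib module does the sequence comparison. The function names nesting_profile and profile_similarity are hypothetical, and braces inside comments or string literals are counted like any others, so this is a rough signal at best.

```python
from difflib import SequenceMatcher


def nesting_profile(source):
    """Depth at each brace: +depth for '{', -depth for the matching '}'."""
    depth = 0
    profile = []
    for ch in source:
        if ch == "{":
            depth += 1
            profile.append(depth)
        elif ch == "}":
            profile.append(-depth)
            depth -= 1
    return profile


def profile_similarity(a, b):
    """Ratio in [0, 1] of how well the two nesting profiles line up."""
    matcher = SequenceMatcher(None, nesting_profile(a), nesting_profile(b))
    return matcher.ratio()


old = "int f(int n) { while (n) { if (n & 1) { n--; } n /= 2; } return n; }"
new = "long g(long k)\n{\n  while (k) { if (k & 1) { k--; } k /= 2; }\n  return k;\n}"
score = profile_similarity(old, new)
```

Two functions with the same block structure score 1.0 here regardless of names or layout, which is exactly why this is probably not unique enough on its own and would only be one input to an aggregate score.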

