Semetic/fuzzy-logic code comparison tool ?

To: netbsd-users%netbsd.org@localhost
Subject: Semetic/fuzzy-logic code comparison tool ?
From: Swift Griggs <swiftgriggs%gmail.com@localhost>
Date: Tue, 13 Dec 2016 15:08:02 -0700 (MST)

Let's say one wants to make a general statement that "This code is 30% the same as that code!" Another example would be someone who wants to make the statement that "XX% of this code really came from project X." In my case I'm only interested in "honest" code, not trying to catch someone stealing/permuting existing code. Oh, and everything I care about is in C.


My questions are:

* Are there tools that already do this?

* What do you do about whitespace, simple variable permutation, and
  formatting issues? I.e., times when a tiny change alters the "checksum"
  of your content but it's essentially still the same code.
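
One possible angle on the second question, as a rough sketch and not a finished tool: strip comments, ignore whitespace, and replace every non-keyword identifier with a placeholder numbered by first appearance before taking the checksum. Python is an assumption here, and `normalize` and `fingerprint` are names made up for illustration. The keyword list is deliberately incomplete, and string literals, the preprocessor, and comment markers inside strings are not handled.

```python
import hashlib
import re

# Deliberately incomplete keyword set (an assumption of this sketch); any
# other identifier-shaped token is treated as a renameable name.
C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}

COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source):
    """Return the token list with comments, whitespace and names abstracted."""
    source = COMMENT_RE.sub(" ", source)
    names = {}
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS:
            # Same name -> same placeholder, numbered by first appearance.
            tok = names.setdefault(tok, "ID%d" % len(names))
        tokens.append(tok)
    return tokens


def fingerprint(source):
    """SHA-256 checksum of the normalized token stream."""
    return hashlib.sha256(" ".join(normalize(source)).encode()).hexdigest()
```

With this, `int add(int a, int b) { return a + b; }` and a reformatted copy with every variable renamed produce the same fingerprint, while changing `+` to `-` does not. It only answers "identical after normalization, yes or no"; it gives no percentage.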

I know that this is essentially an AI problem and thus can get complex in a hurry. I was writing some scripts to take a swing at some kind of prototype (and I even made some early progress), but then I thought "surely someone's already done this, genius."

Anyone know of any place to start, here? I know it's awfully arbitrary and subjective. However, as long as the algorithm isn't partisan and generates reproducible and at least somewhat defensible results, I can live with the subjectivity.


-Swift

Now for those that might be somewhat interested, this is what I started with on tissue paper (just notes). Feel free to critique if you have ideas or know of preexisting stuff I should look at. I'd rather not reinvent this wheel.


* Collapse every run of whitespace to a single space, yeah, for sure.
  Forget about line-ending characters, too (CR, LF, etc.).

* Possibly use something like soundex on variables? Hmm, how to detect
  when the same variable is used under a new name? Leading/trailing
  characters?

* Count braces and nesting levels? Does this generate a unique enough
  pattern? Add it to an overall heuristic score, a la Bayesian scoring?

* How to solve the problem of old code with a new location? Also when it's
  slightly permuted?

* What will I use for quanta/units to analyze? Going by lines is dumb
  since it implies whitespace (which is ignored). By function? By sets of
  braces or parens? By scope? Multiple types of quanta? Hmmmm....

* I'll start with multiple scripts. Each one builds its own score based
  on a different technique. Then we aggregate the scores and see which
  ones are most useful/accurate for my use cases. Then see if any track
  together or diverge in different cases.

* What about old K&R code that's simply been updated with a newer function
  declaration and C99 or C11 stuff? Should be able to use a regex to
  detect this?

* Probably better to write the tool in a scripting language; too much
  string handling to dork with it in C.

* If one file is 100k and another 50k make sure that the tools never
  assert a difference of less than 50%? What if file B is just 2x a bunch
  of code still found in file A? Grrr... think...
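
Several of the notes above (what unit to compare, code that moved, files of very different sizes) could be prototyped with token "shingles": every run of k consecutive tokens becomes one unit, and the two files are compared as sets. This is a minimal sketch under assumptions, not an established tool: the function names are hypothetical, k=5 is an arbitrary guess, and no identifier renaming or comment stripping is done here.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(source, k=5):
    """Set of every run of k consecutive tokens in the source."""
    tokens = TOKEN_RE.findall(source)
    if len(tokens) < k:
        # Too short for a full shingle: treat the whole thing as one unit.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def resemblance(a, b, k=5):
    """Jaccard index: shared shingles over all shingles in either file."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def containment(a, b, k=5):
    """Fraction of a's shingles that also occur somewhere in b."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa) if sa else 1.0
```

Because the shingles are kept in sets, a function that moved to a new place in the file still matches, and code pasted twice adds only the few shingles that straddle the seam. Reporting `containment(a, b)` and `containment(b, a)` separately is one way around the size question: a small file that lives entirely inside a big one scores 1.0 in one direction and something lower in the other, and neither number needs a hard-coded floor.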

Those were just rough notes with my ideas.
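
The brace-counting note can be tried out cheaply. The sketch below (hypothetical names, Python assumed) records the nesting depth at every `{` and `}` and compares two such profiles with `difflib.SequenceMatcher` from the standard library. Braces inside string literals, character constants, and comments are miscounted, which a real version would need to handle.

```python
import difflib


def nesting_profile(source):
    """Sequence of brace depths: +depth at each '{', -depth at each '}'."""
    depth = 0
    profile = []
    for ch in source:
        if ch == "{":
            depth += 1
            profile.append(depth)
        elif ch == "}":
            profile.append(-depth)
            depth = max(depth - 1, 0)  # tolerate unbalanced input
    return profile


def profile_similarity(a, b):
    """Similarity ratio in [0, 1] between the two nesting profiles."""
    matcher = difflib.SequenceMatcher(
        None, nesting_profile(a), nesting_profile(b), autojunk=False
    )
    return matcher.ratio()
```

Two unrelated functions that each wrap a single block in a single function body get a ratio of 1.0, so on its own the pattern does not look unique enough. It seems more plausible as one weak signal among several than as a score by itself.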

Follow-Ups:
- Re: Semetic/fuzzy-logic code comparison tool ?
  - From: David Young
