Thursday, 4 March 2010

Dot Plot

-->

Dot Plot

Pascal

See PascalDotPlot.

C#

New versions will be in C# because I want to learn that language and a real application is the best way to learn. Slightly disillusioned now, so on to PascalDotPlot (was JavaDotPlot).

Parsing

The original tokeniser splits on whitespace; this means that strings such as:

  using System.Windows.Forms;

appear as two tokens instead of the expected four (or five if you count the semicolon, seven if you count the to dots). This can obscure some similarities but is so much faster and so much simpler to implement than a general parsing algorithm that it is probably the best compromise. The real problem comes in analysing files created in general purpose text editors. Such files do not always include spaces between tokens such as operators. For instance the MS VB IDE editor adds spaces around infix operators, suppresses them before opening brackets, adds them before end of line comments and so on; this has the effect of ensuring that operators are seen as tokens, however, such spacing is not required by the compiler.

For instance the following line is perfectly legal VB:

  A=B+C

but if you enter it using the VB IDE it will become

  A = B + C    

The first will be seen as one token and the second as five tokens. This is a serious limitation if we think in terms of detecting plagiarism because then we must assume that the programmer of at least one of the files is hostile to the process, on the other hand in professional contexts we are generally looking for sloppy programming that involves verbatim copying without any intent to conceal, only an intent to get the job done as quickly as possible. This means that copied source will tend to have the same spacing as the original even if the programmer has altered it a little to fit its new home.

The problems associated with the original tokenisation method are easily avoided with the .NET framework regular expression class because it has a Split method that allows a regular expression to be used to identify the delimiters.

Implementation Notes

  • Make the xGetToken function responsible for the maintenance of the current character position variable
  • Connect token axis values directly to token positions in the source documents.
  • Drop the current file strings and read the file directly into the text boxes. This should work properly now that I am using C# because the regular expression Split method returns the delimiters as well as the tokens.

New Features

Speed is better after the change to the way the tokens were stored. Now it is time to add more features. The first question is which ones. Possible features include:

  • Allow user to specify that files in sub-directories should be processed.
  • Allow user to specify a regular expression to match filenames.
  • Split on non-word characters using a regular expression.
  • Provide flags to ignore delimiters or treat them as tokens
  • Allow user to supply a regular expression that identifies delimiters or one that identifies tokens.
  • Supply sample regular expressions intended for specific languages
  • When clicking in the plot to select text in the files make sure that the selection in the text boxes covers all of the tokens that are represented by the selected pixel. For large texts one screen pixel represents many tokens, perhaps even hundreds of lines.
  • All files in a directory,
  • Selected files
  • Files in a VBP
  • Files mentioned in another file such as a make file, SharpDevelop combine, Visual Studio solution, etc.
  • zoom
  • synchronise selection in text box with selection in picture box so that selecting in text moves a cross hair in the plot.
  • use cursor keys in plot
  • draw boundary lines between files on the plot
  • save plots as jpg, png etc.
  • save parsed files and plots so that a session can be restored without reparsing the files or replotting. The reason for this is that large files or sets of files will be very slow.

Improvements to Algorithms

  • Instead of parsing line by line just Split the whole file on spaces. This should be faster but how do we connect a token number to the position in the file? One way is to accumulate the offset by adding the token lengths together, must take care of the delimiting spaces. Then we must use the token offset instead of the line offset because we will not know where the line boundaries are.
  • Read the whole file and split on end of line characters.

Dotplot in and for VB

Dotplots are two dimensional charts. To produce such a chart two text files are tokenised. The values along the axes are the tokens; imagine the whole text of the file written along the axis.

For example:

         The quick brown fox

The       X            
dirty          
brown                X
dog

Dots are placed at each intersection point where the tokens are the same. The same technique can look for repeated sequences within a file.

Example with case ignored:

         The quick brown fox and the dirty brown dog
The       X                       X                
quick          X                                 
brown                X                       X           
fox                       X                         
and                           X                         
the       X                       X               
dirty                                  X         
brown                X                       X
dog                                               X

Notice that the diagonal line is completely filled and that there are diagonals in the upper right and lower left quadrants.

The purpose of such a plot is to make detection of duplicated code easier. Many academic citations of this method, other than those in bilogical sciences, tend to couple it with plagiarism detection. While it can of course be used to detect stolen code in software development course and in detecting stolen passages from novels and scientific reports my feeling is that the best use for the technique is to aid in software maintenance by making it easy to detect code that has simply been copied and edited in order to get a program up and running quickly. There is nothing wrong with cut and paste development so long as the developer stays awake and refactors the code when the duplication reaches a certain level (one copy is alright, two should make you make a mental note, three means it's time to refactor)

Quick and Dirty Implementation

  • Create a simple VB project.
  • Read two files
  • Convert all to lower case
  • Tokenise each line with the Split function using space as the delimiter.
  • Discard empty tokens
  • Scale a picture box so that the width and height correspond to the number of tokens in each file.
  • Iterate over the tokens and place a dot at each point in the picture box that corresponds to identical tokens.

Dotplot versions

DotPlotVersionThree
Token streaming from file list. Multiple files.
DotPlotVersionTwo
Improved data structures that speed up plotting by not requiring string comparisons.
DotPlotVersionOne
Simple version that plots the correspondence between two files and has a slow plotter.

Links to dotplot resources on the web

Google search for dotplot
finds altogether too many hits. Mostly from the molecular biology field, interesting but only occasionally directly relevant.
Google search for: dotplot source code
Much better.
Jonathan Helfman and Ken Church
This has to be the best, most compact and most easily digested introduction to the concepts of dotplots.
Visual Detection of Duplicated Code
Rieger and Ducasse at Berne.
Identifying Duplication in Program Traces
Interesting suggestion for using dotplots to examine not the static structure but the behaviour of a program.
Mi-Nyeong Hwang, Jeong-Hyeon Choi, Do-Hoon Lee, and Hwan-Gue Cho
An overview paper in M$ Word format (yuck) giving a little more background and some cross references to biological uses for dot plots.
Genetics Computer Group
The manual for a real commercial dot plotting programming used for DNA analysis.
Peter J. Balafas
A dissertation on plagiarism detection from Loughborough University. I've only scanned it briefly, largely because I am allergic to the whole idea of automated plagiarism detection. Just in case you care you can find my reasoning on the IrrelevantAsides page
Paul Clough - Dotplot: Visually exploring patterns in text
Another presentation of the Meter project that is the subject of Balafas' paper, this time from Sheffield University

Links to related pages

Real artificial life: Where we may be
Very interesting paper on software as a living system. Makes passing reference to dotplots in Software genetics.

No comments:

Post a Comment

Blog Archive

Followers