diff options
Diffstat (limited to 'textproc/p5-Algorithm-RabinKarp/pkg-descr')
-rw-r--r-- | textproc/p5-Algorithm-RabinKarp/pkg-descr | 18 |
1 files changed, 18 insertions, 0 deletions
diff --git a/textproc/p5-Algorithm-RabinKarp/pkg-descr b/textproc/p5-Algorithm-RabinKarp/pkg-descr new file mode 100644 index 000000000000..40d892395f70 --- /dev/null +++ b/textproc/p5-Algorithm-RabinKarp/pkg-descr @@ -0,0 +1,18 @@ +This is an implementation of Rabin and Karp's streaming hash, as described +in "Winnowing: Local Algorithms for Document Fingerprinting" by Schleimer, +Wilkerson, and Aiken. Following the suggestion of Schleimer, I am using +their second equation: + + $H[ $c[2..$k + 1] ] = (( $H[ $c[1..$k] ] - $c[1] ** $k ) + $c[$k+1] ) * $k + +The results of this hash encodes information about the next k values in +the stream (hense k-gram.) This means for any given stream of length n +integer values (or characters), you will get back n - k + 1 hash values. + +For best results, you will want to create a code generator that filters +your data to remove all unnecessary information. For example, in a large +english document, you should probably remove all white space, as well as +removing all capitalization. + +WWW: http://search.cpan.org/dist/Algorithm-RabinKarp +Author: Norman Nunley <nnunley@gmail.com> |