P5-algorithm-rabinkarp

Jul 20, 2023

Rabin-Karp streaming hash

This is an implementation of Rabin and Karp’s streaming hash, as described in “Winnowing Local Algorithms for Document Fingerprinting” by Schleimer, Wilkerson, and Aiken. Following the suggestion of Schleimer, I am using their second equation

$H[ $c[2..$k + 1] ] = $H[ $c[1..$k] ] - $c[1] ** $k + $c[$k+1] * $k

The results of this hash encodes information about the next k values in the stream hense k-gram. This means for any given stream of length n integer values or characters, you will get back n - k + 1 hash values.

For best results, you will want to create a code generator that filters your data to remove all unnecessary information. For example, in a large english document, you should probably remove all white space, as well as removing all capitalization.



Checkout these related ports:
  • Zxing-cpp - ZXing C++ Library for QR code recognition
  • Zu-hunspell - Zulu hunspell dictionaries
  • Zu-aspell - Aspell Zulu dictionary
  • Zq - Easier and faster alternative to jq
  • Zorba - General purpose C++ XQuery processor
  • Zenxml - Simple C++ XML Processing
  • Zed - Command-line tool to manage and query Zed data lakes
  • Yq - Command-line YAML and XML processor, jq wrapper for YAML/XML documents
  • Yould - Pronounceable word generator
  • Yodl - Easy to use but powerful document formatting/preparation language
  • Yi-hunspell - Yiddish hunspell dictionaries
  • Yi-aspell - Aspell Yiddish dictionary
  • Yelp-xsl - DocBook XSLT stylesheets for yelp
  • Yelp-tools - Utilities to help manage documentation for Yelp and the web
  • Ydiff - Diff readability enhancer for color terminals