Tuesday, 3 July 2012

BloomFilter Crash Course

BloomFilters are essentially bit vectors. At a high level BloomFilters work in the following manner:
1.        Add the element to the filter.
2.        Hash it a few times, than set the bits to 1 where the index matches the results of the hash.
When testing if an element is in the set, you follow the same hashing procedure and check if the bits are set to 1 or 0. This process is how a BloomFilter can guarantee an element does not exist. If the bits aren’t set, it’s simply impossible for the element to be in the set. However, a positive answer means the element is in the set or a hashing collision occurred.
Google uses BloomFilters in BigTable to avoid disk lookups for non-existent items.
And Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free.


With Example: http://llimllib.github.com/bloomfilter-tutorial/


Sample Code

public class TestHashingString {

   static String testStringArray[] = { "Raja", "Vimal", "Sundar", "KaniKannan" };

   static String resultArray[] = new String[10];

   static Funnel<String> personFunnel = new Funnel<String>() {
      public void funnel(String abc, Sink into) {

   public static void main(String... args) {
      BloomFilter<String> bloomFilter = BloomFilter.create(personFunnel, 1000);
      int resultCount = 0;

      String holder;
      for (int i = 0; i < 1000; i++) {
         for (int count = 0; count < testStringArray.length; count++) {
            holder = testStringArray[count];
            if (!bloomFilter.mightContain(holder)) {
               resultArray[resultCount++] = holder;