SXEmacs Lisp Reference Manual: Bloom Filters

The concept of a Bloom filter was introduces by Burton H. Bloom in 1970. It is a constant space complexity, probabilistic data structure that is used to do so-called membership decisions on anonymous sets. That is, you can easily decide whether a given element is a member of a certain set, whereas you cannot select elements from it (or even traverse through all elements). Moreover, like hash-tables the time complexity of Bloom filters for adding and removing elements, as well as membership-decision is in O(1) – it is in O(k) indeed where k is the degree of the Bloom filter.

Probabilistic, however, means that false positives are possible, but false negatives are not. The probability of false positives grows subexponentially with the number of added elements.

Another interesting property of Bloom filters of equal order and degree is the ability to perform set union and intersection operations within O(1) space and time complexity!

SXEmacs’ Bloom filters have their own lisp-type, but they do not have a special input syntax.

For reasons of convenience, we also provide a constructor for a complete Bloom filter. That is a bloom filter which owns any possible element.

There are two further auxiliary functions to determine the bloom filter parameters:

(setq bl (make-bloom))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(bloomp bl)
  ⇒ t

(bloom-add bl 12)
  ⇒ 12
(bloom-add bl 21)
  ⇒ 21
(bloom-add bl 0)
  ⇒ 0

(bloom-owns-p bl 12)
  ⇒ t
(bloom-owns-p bl 21)
  ⇒ t
(bloom-owns-p bl 0)
  ⇒ t
(bloom-owns-p bl 13)
  ⇒ nil
(bloom-owns-p bl 17)
  ⇒ nil

(bloom-remove bl 12)
  ⇒ 12
(bloom-remove bl 0)
  ⇒ 12

(bloom-owns-p bl 12)
  ⇒ nil
(bloom-owns-p bl 21)
  ⇒ t
(bloom-owns-p bl 0)
  ⇒ nil

Of course, Bloom filters work with any lisp type (not just integers). Internally, they use an object’s hash value and scatter it over a fixed-length array.

(bloom-owns-p bl "horse")
  ⇒ nil
(bloom-add bl "horse")
  ⇒ "horse"
(bloom-owns-p bl "horse")
  ⇒ t
(bloom-owns-p bl "snake")
  ⇒ nil

As remarked above, union and intersection of Bloom-filters of equal order and degree is possible in constant time regardless of the size.

(setq bl1 (make-bloom))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(bloom-add bl1 2)
  ⇒ 2
(bloom-add bl1 4)
  ⇒ 4
(bloom-add bl1 'foobar)
  ⇒ foobar

(setq bl2 (make-bloom))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(bloom-add bl2 "horse")
  ⇒ "horse"
(bloom-add bl2 "snail")
  ⇒ "snail"
(bloom-add bl2 'foobar)
  ⇒ foobar

(setq blu (bloom-union bl1 bl2))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 6>
(bloom-owns-p blu 2)
  ⇒ t
(bloom-owns-p blu 4)
  ⇒ t
(bloom-owns-p blu "horse")
  ⇒ t
(bloom-owns-p blu "snail")
  ⇒ t
(bloom-owns-p blu 'foobar)
  ⇒ t

(setq bli (bloom-intersection bl1 bl2))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(bloom-owns-p bli 2)
  ⇒ nil
(bloom-owns-p bli 4)
  ⇒ nil
(bloom-owns-p bli "horse")
  ⇒ nil
(bloom-owns-p bli "snail")
  ⇒ nil
(bloom-owns-p bli 'foobar)
  ⇒ t

Now we want to illustrate the extreme performance gain over lists, and compare to the performance of hash-tables. Therefore, we use a setup routine which fills a data container with 10000 distinct elements. Then we do 50000 membership tests. Let’s start with ordinary lisp lists. For each of these both tests we measure the time.

(defun test-element ()
  "Return a random test element."
  (incf cnt))
  ⇒ test-element

(setq tlist (list))
  ⇒ nil
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 10000)
    (setq tlist (cons (test-element) tlist)))
  (- (current-btime) start))
  ⇒ 140733

;; 50000 membership tests on tlist … take a coffee break
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 50000)
    (member (test-element) tlist))
  (- (current-btime) start))
  ⇒ 154817323

(setq thash (make-hash-table))
  ⇒ #<hash-table size 0/29 0x1619>
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 10000)
    (puthash (test-element) 'some-value thash))
  (- (current-btime) start))
  ⇒ 173210

;; 50000 membership tests on thash
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 50000)
    (gethash (test-element) thash))
  (- (current-btime) start))
  ⇒ 723030

(setq tbfil (make-bloom))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 10000)
    (bloom-add tbfil (test-element)))
  (- (current-btime) start))
  ⇒ 128565

;; 50000 membership tests on tbfil
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 50000)
    (bloom-owns-p tbfil (test-element)))
  (- (current-btime) start))
  ⇒ 757945

Evaluating all these results, we can state that hash-tables and Bloom-filters provide equal performance. For extremely large sets, Bloom-filters may profit from the absence of resizing the underlying array.

;; 400000 insertions into a bloom-filter
(setq tbfil (make-bloom))
  ⇒ #<bloom-filter :order 256 :degree 8 :size 0>
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 400000)
    (bloom-add tbfil (test-element)))
  (- (current-btime) start))
  ⇒ 5464420

;; 400000 insertions into a hash-table
(setq thash (make-hash-table))
  ⇒ #<hash-table size 0/29 0x1066>
(setq cnt 0)
  ⇒ 0
(let ((start (current-btime)))
  (dotimes (i 400000)
    (puthash (test-element) 'some-value thash))
  (- (current-btime) start))
  ⇒ 6392474

Furthermore, Bloom filters are able to track how often equal elements have been inserted, for hash-tables the value field may be used to implement something similar.

However, bloom filters tend to report more and more false positives the more elements you add. False negatives are never possible. That means with a sufficiently small bloom filter and a sufficiently large set of elements it is just a matter of probability when this filter will turn into a universe.

(setq b4 (make-bloom 256 64))
  ⇒ #<bloom-filter :order 256 :degree 64 :size 0>

;; now add 800000 numbers to it
(dotimes (i 800000)
  (bloom-add b4 i))
  ⇒ nil

;; now check this
(bloom-owns-p b4 'a-symbol-I-never-added)
  ⇒ t
(bloom-owns-p b4 "a string I never added")
  ⇒ t
(bloom-owns-p b4 [a vector I never added])
  ⇒ t

We see that b4 has turned into a quasi-universe. There might be objects which are not yet in b4, but the probability to find one is extremely small.

Given a bloom filter of degree d and size s, and given the bloom filter already contains n elements, the exact probability to get a false positive (an element which has not been added, but bloom-owns-p reports so) is:

So for our example the probability is veeery close to one, to be precise, it is:

11.13 Bloom Filters