The concept of a Bloom filter was introduces by Burton H. Bloom in 1970. It is a constant space complexity, probabilistic data structure that is used to do so-called membership decisions on anonymous sets. That is, you can easily decide whether a given element is a member of a certain set, whereas you cannot select elements from it (or even traverse through all elements). Moreover, like hash-tables the time complexity of Bloom filters for adding and removing elements, as well as membership-decision is in O(1) – it is in O(k) indeed where k is the degree of the Bloom filter.
Probabilistic, however, means that false positives are possible, but false negatives are not. The probability of false positives grows subexponentially with the number of added elements.
Another interesting property of Bloom filters of equal order and degree is the ability to perform set union and intersection operations within O(1) space and time complexity!
SXEmacs’ Bloom filters have their own lisp-type, but they do not have a special input syntax.
Return an empty bloom-filter.
Optional argument order (a positive integer) specifies the “length” of the internal filter vector, defaults to 256.
Optional argument degree (a positive integer) specifies the number of hash slices, defaults to 8.
For reasons of convenience, we also provide a constructor for a complete Bloom filter. That is a bloom filter which owns any possible element.
Return a complete bloom-filter.
Optional argument order (a positive integer) specifies the “length” of the internal filter vector, defaults to 256.
Optional argument degree (a positive integer) specifies the number of hash slices, defaults to 8.
Return non-nil
if object is a bloom filter,
nil
otherwise.
There are two further auxiliary functions to determine the bloom filter parameters:
Return the order of the bloom-filter bloom.
Return the degree of the bloom-filter bloom.
Adding/removing elements to/from a bloom filter always works by side-effect.
Add element to the bloom-filter bloom.
Remove element from the bloom-filter bloom.
The membership decision is done with bloom-owns-p
.
Return non-nil
if element is in the bloom-filter
bloom.
(setq bl (make-bloom)) ⇒ #<bloom-filter :order 256 :degree 8 :size 0> (bloomp bl) ⇒ t
Now we want to add three integers and test some memberships.
(bloom-add bl 12) ⇒ 12 (bloom-add bl 21) ⇒ 21 (bloom-add bl 0) ⇒ 0
(bloom-owns-p bl 12) ⇒ t (bloom-owns-p bl 21) ⇒ t (bloom-owns-p bl 0) ⇒ t (bloom-owns-p bl 13) ⇒ nil (bloom-owns-p bl 17) ⇒ nil
Now let us remove some elements
(bloom-remove bl 12) ⇒ 12 (bloom-remove bl 0) ⇒ 12
(bloom-owns-p bl 12) ⇒ nil (bloom-owns-p bl 21) ⇒ t (bloom-owns-p bl 0) ⇒ nil
Of course, Bloom filters work with any lisp type (not just integers). Internally, they use an object’s hash value and scatter it over a fixed-length array.
(bloom-owns-p bl "horse") ⇒ nil (bloom-add bl "horse") ⇒ "horse" (bloom-owns-p bl "horse") ⇒ t (bloom-owns-p bl "snake") ⇒ nil
As remarked above, union and intersection of Bloom-filters of equal order and degree is possible in constant time regardless of the size.
Return the union Bloom filter of all arguments.
Return the intersection Bloom filter of all arguments.
(setq bl1 (make-bloom)) ⇒ #<bloom-filter :order 256 :degree 8 :size 0> (bloom-add bl1 2) ⇒ 2 (bloom-add bl1 4) ⇒ 4 (bloom-add bl1 'foobar) ⇒ foobar
(setq bl2 (make-bloom)) ⇒ #<bloom-filter :order 256 :degree 8 :size 0> (bloom-add bl2 "horse") ⇒ "horse" (bloom-add bl2 "snail") ⇒ "snail" (bloom-add bl2 'foobar) ⇒ foobar
(setq blu (bloom-union bl1 bl2)) ⇒ #<bloom-filter :order 256 :degree 8 :size 6> (bloom-owns-p blu 2) ⇒ t (bloom-owns-p blu 4) ⇒ t (bloom-owns-p blu "horse") ⇒ t (bloom-owns-p blu "snail") ⇒ t (bloom-owns-p blu 'foobar) ⇒ t
(setq bli (bloom-intersection bl1 bl2)) ⇒ #<bloom-filter :order 256 :degree 8 :size 0> (bloom-owns-p bli 2) ⇒ nil (bloom-owns-p bli 4) ⇒ nil (bloom-owns-p bli "horse") ⇒ nil (bloom-owns-p bli "snail") ⇒ nil (bloom-owns-p bli 'foobar) ⇒ t
Now we want to illustrate the extreme performance gain over lists, and compare to the performance of hash-tables. Therefore, we use a setup routine which fills a data container with 10000 distinct elements. Then we do 50000 membership tests. Let’s start with ordinary lisp lists. For each of these both tests we measure the time.
(defun test-element () "Return a random test element." (incf cnt)) ⇒ test-element
(setq tlist (list)) ⇒ nil (setq cnt 0) ⇒ 0 (let ((start (current-btime))) (dotimes (i 10000) (setq tlist (cons (test-element) tlist))) (- (current-btime) start)) ⇒ 140733
;; 50000 membership tests on tlist … take a coffee break (setq cnt 0) ⇒ 0 (let ((start (current-btime))) (dotimes (i 50000) (member (test-element) tlist)) (- (current-btime) start)) ⇒ 154817323
Let’s consider hash-tables:
(setq thash (make-hash-table)) ⇒ #<hash-table size 0/29 0x1619> (setq cnt 0) ⇒ 0 (let ((start (current-btime))) (dotimes (i 10000) (puthash (test-element) 'some-value thash)) (- (current-btime) start)) ⇒ 173210
;; 50000 membership tests on thash (setq cnt 0) ⇒ 0 (let ((start (current-btime))) (dotimes (i 50000) (gethash (test-element) thash)) (- (current-btime) start)) ⇒ 723030
Let’s finish our considerations with Bloom filters:
(setq tbfil (make-bloom)) ⇒ #<bloom-filter :order 256 :degree 8 :size 0> (setq cnt 0) ⇒ 0 (let ((start (current-btime))) (dotimes (i 10000) (bloom-add tbfil (test-element))) (- (current-btime) start)) ⇒ 128565
;; 50000 membership tests on tbfil (setq cnt 0) ⇒ 0 (let ((start (current-btime))) (dotimes (i 50000) (bloom-owns-p tbfil (test-element))) (- (current-btime) start)) ⇒ 757945
Evaluating all these results, we can state that hash-tables and Bloom-filters provide equal performance. For extremely large sets, Bloom-filters may profit from the absence of resizing the underlying array.
;; 400000 insertions into a bloom-filter (setq tbfil (make-bloom)) ⇒ #<bloom-filter :order 256 :degree 8 :size 0> (setq cnt 0) ⇒ 0 (let ((start (current-btime))) (dotimes (i 400000) (bloom-add tbfil (test-element))) (- (current-btime) start)) ⇒ 5464420
;; 400000 insertions into a hash-table (setq thash (make-hash-table)) ⇒ #<hash-table size 0/29 0x1066> (setq cnt 0) ⇒ 0 (let ((start (current-btime))) (dotimes (i 400000) (puthash (test-element) 'some-value thash)) (- (current-btime) start)) ⇒ 6392474
Furthermore, Bloom filters are able to track how often equal elements have been inserted, for hash-tables the value field may be used to implement something similar.
However, bloom filters tend to report more and more false positives the more elements you add. False negatives are never possible. That means with a sufficiently small bloom filter and a sufficiently large set of elements it is just a matter of probability when this filter will turn into a universe.
(setq b4 (make-bloom 256 64)) ⇒ #<bloom-filter :order 256 :degree 64 :size 0> ;; now add 800000 numbers to it (dotimes (i 800000) (bloom-add b4 i)) ⇒ nil ;; now check this (bloom-owns-p b4 'a-symbol-I-never-added) ⇒ t (bloom-owns-p b4 "a string I never added") ⇒ t (bloom-owns-p b4 [a vector I never added]) ⇒ t
We see that b4
has turned into a quasi-universe. There might
be objects which are not yet in b4
, but the probability to find
one is extremely small.
Given a bloom filter of degree d and size s, and given
the bloom filter already contains n elements, the exact
probability to get a false positive (an element which has not been
added, but bloom-owns-p
reports so) is:
So for our example the probability is veeery close to one, to be precise, it is:
0.9999...99993386493554255105254275560... ^^^^^^^^^^^ 87027 times