Every time you navigate to a new URL, your browser checks to see if it is in a list of millions of malicious sites. Normally, searching through such a list would take minutes. Anything more than a few milliseconds would make it unusable. So how do they pull it off? Bloom filters!
A bloom filter tells you that something is definitely not in a set, or that it may be in a set, with some defined false positive rate (for instance, 0.1%). This makes it useful at deferring expensive disk accesses, when set membership drives some sort of action.
How does it work?
Bloom filters take a bit array of length , and different hash functions that map to . A false positive rate is used to determine and .
To add an item, hash the item and enable the corresponding bit on . To query, hash the query and check to see if the bit at that position is on.
Example: Common password checking
For this example, I will walk through the process of creating code that verifies that a password is not on a list of known passwords. If you’re interested in this task, check out haveibeenpwned.
First, download the rockyou.txt password list and install pybloom. Then do the necessary imports.
import random, time, sys from pybloom import BloomFilter
Next, populate the passwords into a list. There are unicode errors with this file because of course there are.
passwords =  for password in open('rockyou.txt','rb'): try: passwords.append(password.decode('utf8').strip()) except UnicodeDecodeError: pass
Next, let’s add these to a bloom filter.
bloom_filter = BloomFilter(capacity=len(passwords), error_rate=0.001) for password in passwords: bloom_filter.add(password)
Using the bloom filter
We can create an interface for our bloom filter as follows:
while True: password = input('Enter password: ') if password: if password in bloom_filter: print('Please choose a new password.') else: print('Password not in list!') continue break
A full search typically follows a hit from the bloom filter. In this case, the cost of a false positive is low, so there’s nothing wrong with choosing a different password. You should be using a password manager anyway if you’re not already. This makes sufficiently long passwords (>24 characters) easy to use.
Running sys.getsizeof() reveals that the password list is 116082096 bytes. The bloom filter is just 56 bytes!
Searching for a random password in the list took an average of 68 ms, compared to 0.043 ms for the bloom filter. An even larger list would typically be used, so the difference would be even larger.
Cuckoo filters (2014) are better in practice than bloom filters. They are more space efficient, and allow for element deletion without rebuilding the whole array.