# Bloom filters in Python

Every time you navigate to a new URL, your browser checks to see if it is in a list of millions of malicious sites. Normally, searching through such a list would take minutes. Anything more than a few milliseconds would make it unusable. So how do they pull it off? Bloom filters!

A bloom filter tells you that something is definitely not in a set, or that it may be in a set, with some defined false positive rate (for instance, 0.1%). This makes it useful at deferring expensive disk accesses, when set membership drives some sort of action.

## How does it work?

Bloom filters take a bit array $A$ of length $m$, and $k$ different hash functions that map to $A$. A false positive rate is used to determine $m$ and $k$.

To add an item, hash the item and enable the corresponding bit on $A$. To query, hash the query and check to see if the bit at that position is on.

For this example, I will walk through the process of creating code that verifies that a password is not on a list of known passwords. If you’re interested in this task, check out haveibeenpwned.

### Setup

import random, time, sys
from pybloom import BloomFilter

Next, populate the passwords into a list. There are unicode errors with this file because of course there are.

try:
except UnicodeDecodeError:
pass

Next, let’s add these to a bloom filter.

### Using the bloom filter

We can create an interface for our bloom filter as follows:

while True:
else:
continue
break

A full search typically follows a hit from the bloom filter. In this case, the cost of a false positive is low, so there’s nothing wrong with choosing a different password. You should be using a password manager anyway if you’re not already. This makes sufficiently long passwords (>24 characters) easy to use.

### Performance

Running sys.getsizeof() reveals that the password list is 116082096 bytes. The bloom filter is just 56 bytes!

Searching for a random password in the list took an average of 68 ms, compared to 0.043 ms for the bloom filter. An even larger list would typically be used, so the difference would be even larger.

## Further improvements

Cuckoo filters (2014) are better in practice than bloom filters. They are more space efficient, and allow for element deletion without rebuilding the whole array.

Hi, I'm Nathan. I'm an electrical engineer in the Los Angeles area. Keep an eye out for more content being posted soon.

Are the $k$ hash functions the same at each query, or is there a larger set that is sampled from? In the case of checking URL’s, wouldn’t using the same functions each time harm a website that always triggers false positives by always increasing its load time?
1. They are the same for each query, all performed in parallel and the resulting bits are all flipped. A lower $k$ leads to a higher capacity but more false positives. The functions should be independent and uniformly distributed, but need not be cryptographic, which slows things down.