Check if an item exists in a large set in Python

Every time you navigate to a new URL, your browser checks to see if it is in a list of millions of malicious sites. Normally, searching through such a list would take minutes. Anything more than a few milliseconds would make it unusable. So how do they pull it off? Bloom filters!

A bloom filter tells you that something is definitely not in a set, or that it may be in a set, with some defined false positive rate (for instance, 0.1%). This makes it useful at deferring expensive disk accesses, when set membership drives some sort of action.

How does it work?

Source: Commons

Bloom filters take a bit array A of length m, and k different hash functions that map to A. A false positive rate is used to determine m and k.

To add an item, hash the item and enable the corresponding bit on A. To query, hash the query and check to see if the bit at that position is on.

Example: Common password checking

For this example, I will walk through the process of creating code that verifies that a password is not on a list of known passwords. If you’re interested in this task, check out haveibeenpwned.

Setup

First, download the rockyou.txt password list and install pybloom. Then do the necessary imports.

import random, time, sys
from pybloom import BloomFilter

Next, populate the passwords into a list. There are unicode errors with this file because of course there are.

passwords = []
for password in open('rockyou.txt','rb'):
    try:
        passwords.append(password.decode('utf8').strip())
    except UnicodeDecodeError:
        pass

Next, let’s add these to a bloom filter.

bloom_filter = BloomFilter(capacity=len(passwords), error_rate=0.001)
for password in passwords:
    bloom_filter.add(password)

Using the bloom filter

We can create an interface for our bloom filter as follows:

while True:
    password = input('Enter password: ')
    if password:
        if password in bloom_filter:
            print('Please choose a new password.')
        else:
            print('Password not in list!')
        continue
    break

A full search typically follows a hit from the bloom filter. In this case, the cost of a false positive is low, so there’s nothing wrong with choosing a different password. You should be using a password manager anyway if you’re not already. This makes sufficiently long passwords (>24 characters) easy to use.

Performance

Running sys.getsizeof() reveals that the password list is 116082096 bytes. The bloom filter is just 56 bytes!

Searching for a random password in the list took an average of 68 ms, compared to 0.043 ms for the bloom filter. An even larger list would typically be used, so the difference would be even larger.

Further improvements

Cuckoo filters (2014) are better in practice than bloom filters. They are more space efficient, and allow for element deletion without rebuilding the whole array.


About the author



Hi, I'm Nathan. I'm an electrical engineer in the Los Angeles area. Keep an eye out for more content being posted soon.


Leave a Reply

Your email address will not be published. Required fields are marked *