Modeling Hawkes Processes in Python

This article has been updated to reflect changes suggested by one of our readers. Thanks Andrew! Full code is at the bottom of the post.

News released in an instant can synchronize the behavior of millions of people, and the effect grows with the impact of the information and the suddenness of its release.

During such events, Wikipedia editors converge on the related pages, shrinking the time between edits and driving up the edit rate.
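A Hawkes process captures exactly this kind of self-excitation: each event temporarily raises the probability of further events. The conditional intensity with an exponential kernel, \lambda(t) = \mu + \sum_{t_i < t} \alpha \beta e^{-\beta (t - t_i)}, can be sketched in a few lines of NumPy (the function name and parameter values here are illustrative, not taken from the code below):

```python
import numpy as np

def hawkes_intensity(t, events, mu=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity of a univariate Hawkes process with an
    exponential kernel: a baseline mu plus a decaying bump per past event."""
    past = events[events < t]
    return mu + np.sum(alpha * beta * np.exp(-beta * (t - past)))

events = np.array([1.0, 1.1, 1.2])   # a burst of three events
# The intensity spikes just after the burst, then decays back toward mu.
```

Before the first event the intensity equals the baseline mu; right after a burst it jumps, making further edits more likely, which is why a fitted intensity traces spikes at breaking news.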

Celebrity deaths

On June 25, 2009, the news of Michael Jackson’s death caused Google, Twitter, Wikipedia, and other websites to slow to a crawl or go down entirely.

The estimated intensity \lambda(t) of a Hawkes process fit to Michael Jackson edit events.

Here is a zoomed-in view of the event. The edits started coming in about 3 hours after he was pronounced dead.

This is the first edit to mention Michael Jackson having medical issues. It cited a news article posted 61 minutes before the Los Angeles Times confirmed his death.

Here is where that edit falls in the revision list. Note that prior edits arrived roughly every day or two; after the event, they all arrive within the same hour.
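That shift from daily to minute-scale edits is easy to quantify from the revision timestamps with pandas. A small sketch using made-up timestamps (the real script pulls them from the MediaWiki API in the full code below):

```python
import pandas as pd

# Hypothetical revision timestamps: sparse before the event, dense after
timestamps = pd.to_datetime([
    '2009-06-23 10:00', '2009-06-24 14:00',   # roughly daily edits
    '2009-06-25 22:30', '2009-06-25 22:34',   # burst after the news breaks
    '2009-06-25 22:35',
])
gaps = pd.Series(timestamps).diff().dropna()
print(gaps.min(), gaps.max())   # inter-edit gaps collapse from days to minutes
```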

Running the script on a collection of celebrities who died in 2014 yields similar results; notably, edit activity falls off sharply once the post-death burst subsides.

Dates and holidays

For year pages, there is a peak during New Year’s and a smaller one at year-end.

This works for other scheduled events as well.

Terrorism and crime

Mass shootings typically have their own Wikipedia pages within a few hours of the start of the event, and the delay has decreased every year.

Year   Event                                    Article creation
2017   Las Vegas shooting                       84 minutes
2017   Sutherland Springs church shooting       158 minutes
2016   Orlando nightclub shooting               170 minutes
2012   Sandy Hook Elementary School shooting    210 minutes
2007   Virginia Tech shooting                   241 minutes
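The downward trend in those creation delays can be checked numerically with a quick pandas sketch over the table's values:

```python
import pandas as pd

delays = pd.DataFrame({
    'year':    [2007, 2012, 2016, 2017, 2017],
    'minutes': [241, 210, 170, 158, 84],
})
# Mean article-creation delay per year, which decreases monotonically
mean_delay = delays.groupby('year')['minutes'].mean()
print(mean_delay)
```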

Conclusion

In conjunction with Wikipedia’s detailed pageview data, article edit rates provide a high-quality, near-real-time signal of breaking events.
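Per-article pageview counts are available from the Wikimedia REST API (per-article granularity is daily; hourly counts come from the raw pageview dumps instead). The endpoint below is real, but the helper function is my own sketch:

```python
def pageviews_url(article, start, end, project='en.wikipedia'):
    """Build a Wikimedia REST API URL for daily per-article pageviews.
    start/end are YYYYMMDD strings; data is available from mid-2015 onward."""
    slug = article.replace(' ', '_')
    return ('https://wikimedia.org/api/rest_v1/metrics/pageviews/'
            'per-article/{0}/all-access/all-agents/{1}/daily/{2}/{3}'
            .format(project, slug, start, end))

# e.g. requests.get(pageviews_url('Michael Jackson', '20160101', '20160131')).json()
```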

Full Code

import os
import time

import requests

import pandas as pd
import numpy as np
import scipy.signal as signal
import matplotlib.pyplot as plt
from tick.hawkes import SimuHawkesSumExpKernels, HawkesSumExpKern

session = requests.Session()

def fetch_revisions(title):

    file_name = './revisions/{0}_revisions.csv'.format(title.replace(' ','_'))
    if not os.path.isfile(file_name):
        parameters = {
            'action': 'query',
            'format': 'json',
            'continue': '',
            'titles': title,
            'prop': 'revisions',
            'rvprop': 'ids|flags|timestamp|user|userid|size|comment',
            'rvlimit': 'max'
        }

        print('Fetching revisions for {0}...'.format(title))
        revisions = []
        while True:
            print('\tFetched {0} revisions'.format(len(revisions)))
            wp_call = session.get('https://en.wikipedia.org/w/api.php', params=parameters)
            response = wp_call.json()
            for page_id in response['query']['pages']:
                revisions += response['query']['pages'][page_id]['revisions']
            if 'continue' in response:
                parameters['continue'] = response['continue']['continue']
                parameters['rvcontinue'] = response['continue']['rvcontinue']
            else:
                break
            time.sleep(3)

        # Format revisions as pandas dataframe
        revisions = pd.DataFrame(revisions).set_index('timestamp')
        revisions.index = pd.to_datetime(revisions.index)
        revisions = revisions.sort_index()
        # The API marks 'anon'/'minor' with an empty string when the flag is set
        # and NaN otherwise; inverting the truthiness gives True for flagged revisions
        if 'anon' in revisions.columns:
            revisions['anon'] = ~revisions['anon'].astype(bool)
        if 'minor' in revisions.columns:
            revisions['minor'] = ~revisions['minor'].astype(bool)

        # Remove revisions if the size is the same before and after edit (remove edit-wars)
        revisions = revisions.loc[revisions['size'].shift(1)!=revisions['size'].shift(-1)]

        # Remove minor revisions
        if 'minor' in revisions.columns:
            revisions = revisions.loc[~revisions['minor']]

        # Remove edits by bots
        revisions = revisions.loc[~revisions['user'].str.contains(r'(?:Bot|bot|BOT)(?:[^A-Za-z]|$)', na=False)]

        # Group revisions by same user. Date is of first revision & size is of last revision
        revisions.loc[revisions.user != revisions['user'].shift(1),'size'] = revisions.loc[revisions['user'] != revisions['user'].shift(-1)]['size'].values
        revisions = revisions.loc[revisions['user'] != revisions['user'].shift(1)]

        # Remove edits that don't change >=10 bytes
        revisions = revisions.loc[revisions['size'].diff().fillna(np.inf).abs()>=10]

        # Remove "page blank" edits
        revisions = revisions.loc[revisions['size'].pct_change().fillna(0).abs()<=0.9]

        revisions.to_csv(file_name)
    else:
        revisions = pd.read_csv(file_name,index_col=0,parse_dates=True)
    return revisions

def estimate_intensity(event_timestamps,event_titles):

    start_time = min([min(timestamps) for timestamps in event_timestamps])
    end_time = max([max(timestamps) for timestamps in event_timestamps])
    # Convert nanosecond offsets since start_time into days (86400e9 ns per day)
    event_timestamps = [(timestamps - start_time).astype(int) / 86400e9 for timestamps in event_timestamps]

    decays = [0.02, 0.01, 0.5]   # sum-of-exponentials decay rates, in 1/days
    learner = HawkesSumExpKern(decays, penalty='elasticnet', elastic_net_ratio=0.8)
    learner.fit(event_timestamps)

    tracked_intensity, intensity_tracked_times = learner.estimated_intensity(event_timestamps,intensity_track_step=1/24.)
    estimated_intensity_index = start_time + pd.to_timedelta(intensity_tracked_times,unit='D')
    estimated_intensity = pd.DataFrame(np.vstack(tracked_intensity).T,index=estimated_intensity_index,columns=event_titles)

    return estimated_intensity

def main():

    titles = ['Michael Jackson']
    event_timestamps = []
    for title in titles:
        event_timestamps.append(fetch_revisions(title=title).index.values)
    estimated_intensity = estimate_intensity(event_timestamps,titles).resample('1H').mean()

    print('Filtering...')
    order = 2
    Wn = 0.001
    B,A = signal.butter(order, Wn, output='ba')
    for column in estimated_intensity.columns:
        estimated_intensity[column] -= signal.filtfilt(B, A, estimated_intensity[column].values)

    print('Plotting...')
    ax = estimated_intensity.plot(linewidth=0.7)#,logy=True)
    ax.set_ylim([0,100])
    plt.savefig('output.png', dpi=300)

if __name__ == '__main__':
    main()

About the author

Hi, I'm Nathan. I'm an electrical engineer in the Los Angeles area. Keep an eye out for more content being posted soon.


Comments

  1. This domain strikes me as a really good use case for the Hawkes process – basically a generalization of the Poisson process where arrivals cause the intensity of the process to increase (or decrease). The model even has a clean multivariate extension to multiple dependent processes.

    The general definition (http://mathworld.wolfram.com/HawkesProcess.html) is too broad to make sense, but the relevant formulation is shown in https://hawkeslib.readthedocs.io/en/latest/tutorial.html

    1. Wow, that’s exactly what this is. That’s a great point, definitely better than taking some sort of moving average and then thresholding. I wish the likelihood estimation was non-trivial, but I’ll make a post about this since I was already going to post about similar subjects. Thanks for the suggestion!

      Edit:
      Updated!
