# Real-time event discovery using Wikipedia edits

New information, released in an instant, can synchronize the behavior of millions of people. The effect is magnified by the impact of the information and the suddenness of its release.

During such events, Wikipedia editors converge on the related pages, shrinking the time between successive edits or, equivalently, raising the edit rate.
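
Both views of the same phenomenon are easy to compute. Here is a rough illustration with pandas, using made-up timestamps; the real ones come from the Wikipedia API, as in the full code at the end of this post:

```python
import pandas as pd

# Made-up revision timestamps for illustration only; in practice these
# come from the Wikipedia API (see fetch_revisions in the Full Code).
timestamps = pd.to_datetime([
    '2009-06-20 14:02', '2009-06-22 09:41', '2009-06-24 18:30',
    '2009-06-25 21:55', '2009-06-25 22:01', '2009-06-25 22:04',
])
edits = pd.Series(1, index=timestamps)

# Time between consecutive edits collapses during an event...
print(edits.index.to_series().diff())

# ...or, equivalently, the hourly edit rate spikes.
print(edits.resample('1H').sum().loc[lambda s: s > 0])
```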

## Celebrity deaths

On June 25, 2009, the news of Michael Jackson’s death overwhelmed Google, Twitter, Wikipedia, and other websites, knocking several of them offline.

*Figure: the estimated intensity $\lambda(t)$ of a Hawkes process fit to the Michael Jackson edit events.*
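
For reference, the sum-of-exponentials Hawkes model used here (tick’s HawkesSumExpKern) gives the conditional intensity the form

$$\lambda(t) = \mu + \sum_{t_i < t} \sum_{j} \alpha_j \beta_j \, e^{-\beta_j (t - t_i)},$$

where $\mu$ is the learned baseline rate, the $\beta_j$ are the fixed decays passed to the learner, and the $\alpha_j$ are learned excitation weights. Every past edit at time $t_i$ bumps the intensity, which then decays back toward the baseline; this self-excitation produces exactly the clustering of edits seen during breaking news.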

Here is a zoomed-in view of the event. The edits started coming in about 3 hours after he was pronounced dead.

This is the first edit to mention Michael Jackson having medical issues. It cited a news article posted 61 minutes before the Los Angeles Times confirmed his death.

Here is where that edit appears in the revision list. Note that prior edits arrived roughly twice a day; after the event, they all arrive within the same hour.

Running the script on a collection of celebrities who died in 2014 gives similar results: a sharp spike in edit intensity at each death, with a notable lack of events afterward.

## Dates and holidays

For year pages, edit intensity peaks around New Year’s, with a smaller peak at the end of the year.

This works for other scheduled events as well.
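
Since year pages and other scheduled-event pages are ordinary articles, the same pipeline applies unchanged. A hypothetical run using the functions defined in the Full Code section below:

```python
# Hypothetical usage of fetch_revisions/estimate_intensity from the
# Full Code section; any article titles work, including year pages.
titles = ['2014', '2015', '2016']
event_timestamps = [fetch_revisions(title).index.values for title in titles]
estimated_intensity = estimate_intensity(event_timestamps, titles)
print(estimated_intensity.resample('1D').mean().head())
```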

## Terrorism and crime

Mass shootings typically get their own Wikipedia articles within roughly 150 minutes of the start of the event, and that delay has been decreasing year over year:

| Year | Event | Time to article creation |
| ---- | ----- | ------------------------ |
| 2007 | Virginia Tech shooting | 241 minutes |
| 2012 | Sandy Hook Elementary School shooting | 210 minutes |
| 2016 | Orlando nightclub shooting | 170 minutes |
| 2017 | Sutherland Springs church shooting | 158 minutes |
| 2017 | Las Vegas shooting | 84 minutes |
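
Turning intensity spikes like these into live alerts can be as simple as thresholding the detrended intensity produced by the script below. A minimal sketch; the threshold and merge window here are hypothetical, not values tuned in this post:

```python
import pandas as pd

def detect_onsets(intensity, threshold=10.0, min_gap='6H'):
    """Return the times at which `intensity` (a pandas Series indexed by
    time) first rises above `threshold`, treating crossings separated by
    less than `min_gap` as part of the same event."""
    gap = pd.Timedelta(min_gap)
    onsets, last_above = [], None
    for t in intensity[intensity > threshold].index:
        if last_above is None or t - last_above > gap:
            onsets.append(t)
        last_above = t
    return onsets

# e.g. detect_onsets(estimated_intensity['Michael Jackson'])
```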

## Conclusion

In conjunction with Wikipedia’s detailed hourly pageview statistics, article edit rates yield high-quality, live information about real-world events.
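
Per-article pageview counts are available from the Wikimedia REST API at daily granularity (the raw dumps additionally provide hourly counts). A minimal sketch with an arbitrary date range; error handling and rate limiting are omitted:

```python
import requests
import pandas as pd

def fetch_pageviews(title, start='20170901', end='20171031'):
    """Fetch daily pageviews for an English Wikipedia article from the
    Wikimedia REST API (data is available from mid-2015 onward)."""
    url = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
           'en.wikipedia/all-access/all-agents/{0}/daily/{1}/{2}').format(
               title.replace(' ', '_'), start, end)
    items = requests.get(url, headers={'User-Agent': 'edit-rate-demo'}).json()['items']
    views = pd.Series({item['timestamp']: item['views'] for item in items})
    views.index = pd.to_datetime(views.index, format='%Y%m%d%H')
    return views

# e.g. views = fetch_pageviews('Michael Jackson')
```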

## Full Code

```python
import os
import time

import numpy as np
import pandas as pd
import requests
import scipy.signal as signal
import matplotlib.pyplot as plt
from tick.hawkes import HawkesSumExpKern

session = requests.Session()


def fetch_revisions(title):
    file_name = './revisions/{0}_revisions.csv'.format(title.replace(' ', '_'))
    if not os.path.isfile(file_name):
        parameters = {
            'action': 'query',
            'format': 'json',
            'continue': '',
            'titles': title,
            'prop': 'revisions',
            'rvprop': 'ids|flags|timestamp|user|userid|size|comment',
            'rvlimit': 'max'
        }

        print('Fetching revisions for {0}...'.format(title))
        revisions = []
        while True:
            print('\tFetched {0} revisions'.format(len(revisions)))
            wp_call = session.get('https://en.wikipedia.org/w/api.php', params=parameters)
            response = wp_call.json()
            for page_id in response['query']['pages']:
                revisions += response['query']['pages'][page_id]['revisions']
            if 'continue' in response:
                parameters['continue'] = response['continue']['continue']
                parameters['rvcontinue'] = response['continue']['rvcontinue']
            else:
                break
            time.sleep(3)

        # Format revisions as a pandas DataFrame indexed by timestamp
        revisions = pd.DataFrame(revisions).set_index('timestamp')
        revisions.index = pd.to_datetime(revisions.index)
        revisions = revisions.sort_index()
        # The API marks flags with an empty string when set and omits them
        # otherwise (NaN here), so invert the truthiness to get booleans
        if 'anon' in revisions.columns:
            revisions['anon'] = ~revisions['anon'].astype(bool)
        if 'minor' in revisions.columns:
            revisions['minor'] = ~revisions['minor'].astype(bool)

        # Remove revisions if the size is the same before and after the edit
        # (removes revert pairs from edit-wars)
        revisions = revisions.loc[revisions['size'].shift(1) != revisions['size'].shift(-1)]

        # Remove minor revisions
        if 'minor' in revisions.columns:
            revisions = revisions.loc[~revisions['minor']]

        # Remove edits by bots
        revisions = revisions.loc[~revisions['user'].str.contains(
            r'(?:Bot|bot|BOT)(?:[^A-Za-z]|$)', na=False)]

        # Group consecutive revisions by the same user: keep the first
        # revision of each run & give it the size of the run's last revision
        revisions.loc[revisions['user'] != revisions['user'].shift(1), 'size'] = \
            revisions.loc[revisions['user'] != revisions['user'].shift(-1)]['size'].values
        revisions = revisions.loc[revisions['user'] != revisions['user'].shift(1)]

        # Remove edits that don't change >= 10 bytes
        revisions = revisions.loc[revisions['size'].diff().fillna(np.inf).abs() >= 10]

        # Remove "page blank" edits
        revisions = revisions.loc[revisions['size'].pct_change().fillna(0).abs() <= 0.9]

        os.makedirs('./revisions', exist_ok=True)
        revisions.to_csv(file_name)
    else:
        # Reuse the cached revision history
        revisions = pd.read_csv(file_name, index_col='timestamp', parse_dates=True)
    return revisions


def estimate_intensity(event_timestamps, event_titles):
    # Convert nanosecond datetimes to fractional days since the first event
    start_time = min([min(timestamps) for timestamps in event_timestamps])
    event_timestamps = [(timestamps - start_time).astype(int) / 86400e9
                        for timestamps in event_timestamps]

    # Fit a sum-of-exponentials Hawkes model with fixed decays (per day)
    decays = [0.02, 0.01, 0.5]
    learner = HawkesSumExpKern(decays, penalty='elasticnet', elastic_net_ratio=0.8)
    learner.fit(event_timestamps)

    # Track the estimated intensity on an hourly grid
    tracked_intensity, intensity_tracked_times = learner.estimated_intensity(
        event_timestamps, intensity_track_step=1 / 24.)
    estimated_intensity_index = start_time + pd.to_timedelta(intensity_tracked_times, unit='D')
    estimated_intensity = pd.DataFrame(np.vstack(tracked_intensity).T,
                                       index=estimated_intensity_index,
                                       columns=event_titles)

    return estimated_intensity


def main():
    titles = ['Michael Jackson']
    event_timestamps = []
    for title in titles:
        event_timestamps.append(fetch_revisions(title=title).index.values)
    estimated_intensity = estimate_intensity(event_timestamps, titles).resample('1H').mean()

    # Remove the slow-moving baseline with a zero-phase low-pass filter
    print('Filtering...')
    order = 2
    Wn = 0.001
    B, A = signal.butter(order, Wn, output='ba')
    for column in estimated_intensity.columns:
        estimated_intensity[column] -= signal.filtfilt(B, A, estimated_intensity[column].values)

    print('Plotting...')
    ax = estimated_intensity.plot(linewidth=0.7)  # logy=True for a log scale
    ax.set_ylim([0, 100])
    plt.savefig('output.png', dpi=300)


if __name__ == '__main__':
    main()
```
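
To run the script, install the dependencies imported at the top (requests, pandas, numpy, scipy, matplotlib, and tick); fetched revision histories are cached as CSV files in a ./revisions/ directory, which the script creates on first use.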



## Comments

1. **Andrew Benton** says:

   This domain strikes me as a really good use case for the Hawkes process – basically a generalization of the Poisson process in which arrivals cause the intensity of the process to increase (or decrease). The model even has a clean multivariate extension to multiple dependent processes.

   The general definition (http://mathworld.wolfram.com/HawkesProcess.html) is too broad to be immediately useful, but the relevant formulation is shown in https://hawkeslib.readthedocs.io/en/latest/tutorial.html.

   1. Wow, that’s exactly what this is. That’s a great point, definitely better than taking some sort of moving average and then thresholding. I wish the likelihood estimation were trivial, but I’ll make a post about it, since I was already going to post about similar subjects. Thanks for the suggestion!

      Edit: Updated!