Delia Rusu¶

PhD: Natural Language Processing and Machine Learning

delia@knowsis.com

https://github.com/deliarusu

Mattia Boni Sforza¶

MSc: Statistics

mattia@knowsis.com

https://github.com/mattiabs

Knowsis¶

Knowsis is a social-intelligence company providing social media analytics for finance.

Knowsis¶

Topics of research:

• named entity recognition, disambiguation and linking
• event detection
• sentiment analysis
• social network analysis
• anomaly detection
• novelty detection

A: Detect Novel Anomalies in Twitter¶

System which promptly flags the user when there is a novel anomaly in social activity

• Novel?
• Not previously encountered
• Novel Anomaly?
• Something that deviates from what is standard, normal, or expected
• Novel Anomaly in social activity?
• Unexpected amount of tweets in short time which creates a spike in tweets volume

Main Challenges¶

• Identify unusual activity promptly
• Avoid false anomalies

In [1]:
from IPython.core.display import Image, display
display(Image('images/VW_chart.jpg', unconfined=True))


What you need¶

1. Define a variable to measure
• what is the main characteristic that makes the data anomalous?
2. Establish if the observed variable is unexpected
• how to quantify the unexpected nature of this quantity?

Approach¶

1. Variable to measure is the count of tweets given a fixed time window (aka rate)
2. Consists of two steps:
• Predict the tweets count given some predictor variables (characteristics of the data known in advance)
• Use the prediction to establish how (un)likely it is to observe such a tweets volume

Model the tweets rate¶

• Static volume
• Dinamic amount to account for seasonality
• Include short time memory to only detect spikes

Natural way to model counts of events happening in time is to use a Poisson process $N(t)$ with parameter $\lambda$, the rate at which the events happen

$$\mathbb{E}[N(t)] = \textbf{expected number of events occurred until time } t = \lambda t$$

A Poisson regression models the logarithm of the expected number of events as a linear combination of predictor variables

$$\log(\mathbb{E}[N]) = log(\lambda) = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$

with $\beta_i$ coefficients of the model estimated based on volume history and $x_i$ observable quantities of the datapoint.

Predictor variables for $\lambda$¶

Consider seasonality at different scales

• day of the week $x_w$
• hour of the day $x_h$
• month (?)
• ...

Consider recent behaviour

• count in the previous time window $x_{t-1}$
• count 2 time windows before $x_{t-2}$
• ...
• count for the day so far or yesterday (?)

Other variables?

• ...

Detect the anomaly¶

With the expected rate $\lambda_{pred}$, how (un)likely is that we see a number of tweets $N$ higher than the one we have so far?

If this event has probability lower than $\alpha$, it is anomalous!

$$\mathbb{P}(N \geq n_{obs} | \lambda_{pred}) < \alpha$$

Pros¶

• unsupervised (no need for annotations)
• easily adapt to different time scales
• easy to extend and include more predictors
• applicable to discrete data of different nature
• Generalised Linear Model applicable to different distribution families

...and Cons¶

• needs other algorithms to determine if the anomaly is geneuine (spam driven, somehow expected, ...)
• higly sensitive (try Zero inflated Poisson)
• variance and expected value are the same (try Negative Binomial)

Novelty Detection¶

Given an anomaly period, identify which tweets are novel

Observation: tweets tend to cluster in stories

Given an anomaly period, identify the tweets which are novel not part of old stories

Challenges¶

• streaming setting - each tweet needs to be processed in bounded space and time
• bounded space - constant amount of stories in memory
• bounded time - Locality Sensitive Hashing (Indyk and Motwani, 1998)
• how much history is required to decide if a tweet is novel
• stories can be reoccurring

Locality Sensitive Hashing¶

• hashing each tweet into buckets to maximize the probability of collision for tweets which are close
• requires a measure of distance for tweets

• LSH works well if tweet is close to its nearest neighbor

• LSH fails to return the nearest neighbor for distant tweets
• Solution: if the LSH doesn't return a neighbor, check novelty against most recent tweets

Conclusions¶

• general anomaly detection algorithm which is:

• applicable to different problems
• easily understandable
• fast training and detection
• works well in a streaming pipeline
• two-step novelty detection

• fast lookup using LSH
• additional lookup in inverted index for the most recent tweets