#!/usr/bin/env python
# coding: utf-8

# # Fast Lomb-Scargle Periodograms in Python
# 
# The Lomb-Scargle Periodogram is a well-known method of finding periodicity in irregularly-sampled time-series data.
# The common implementation of the periodogram is relatively slow: for $N$ data points, a frequency grid of $\sim N$ frequencies is required and the computation scales as $O[N^2]$.
# In a 1989 paper, Press and Rybicki presented a faster technique which makes use of fast Fourier transforms to reduce this cost to $O[N\log N]$ on a regular frequency grid.
# The ``gatspy`` package implement this in the ``LombScargleFast`` object, which we'll explore below.
# 
# But first, we'll motivate *why* this algorithm is needed at all.
# We'll start this notebook with some standard imports:

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt

# use seaborn's default plotting styles for matplotlib
import seaborn; seaborn.set()


# To begin, let's make a function which will create $N$ noisy, irregularly-spaced data points containing a periodic signal, and plot one realization of that data:

# In[2]:


def create_data(N, period=2.5, err=0.1, rseed=0):
    rng = np.random.RandomState(rseed)
    t = np.arange(N, dtype=float) + 0.3 * rng.randn(N)
    y = np.sin(2 * np.pi * t / period) + err * rng.randn(N)
    return t, y, err

t, y, dy = create_data(100, period=20)
plt.errorbar(t, y, dy, fmt='o');


# From this, our algorithm should be able to identify any periodicity that is present.

# ## Choosing the Frequency Grid
# The Lomb-Scargle Periodogram works by evaluating a power for a set of candidate frequencies $f$. So the first question is, how many candidate frequencies should we choose?
# It turns out that this question is *very* important. If you choose the frequency spacing poorly, it may lead you to miss strong periodic signal in the data!

# ### Frequency spacing
# First, let's think about the frequency spacing we need in our grid. If you're asking about a candidate frequency $f$, then data with range $T$ contains $T \cdot f$ complete cycles. If our error in frequency is $\delta f$, then $T\cdot\delta f$ is the error in number of cycles between the endpoints of the data.
# If this error is a significant fraction of a cycle, this will cause problems. This givs us the criterion
# $$
# T\cdot\delta f \ll 1
# $$
# Commonly, we'll choose some oversampling factor around 5 and use $\delta f = (5T)^{-1}$ as our frequency grid spacing.

# ### Frequency limits
# Next, we need to choose the limits of the frequency grid. On the low end, $f=0$ is suitable, but causes some problems – we'll go one step away and use $\delta f$ as our minimum frequency.
# But on the high end, we need to make a choice: what's the highest frequency we'd trust our data to be sensitive to?
# At this point, many people are tempted to mis-apply the Nyquist-Shannon sampling theorem, and choose some version of the Nyquist limit for the data.
# But this is entirely wrong! The Nyquist frequency applies for regularly-sampled data, but irregularly-sampled data can be sensitive to much, much higher frequencies, and the upper limit should be determined based on what kind of signals you are looking for.
# 
# Still, a common (if dubious) rule-of-thumb is that the high frequency is some multiple of what Press & Rybicki call the "average" Nyquist frequency,
# $$
# \hat{f}_{Ny} = \frac{N}{2T}
# $$
# With this in mind, we'll use the following function to determine a suitable frequency grid:

# In[3]:


def freq_grid(t, oversampling=5, nyquist_factor=3):
    T = t.max() - t.min()
    N = len(t)
    
    df = 1. / (oversampling * T)
    fmax = 0.5 * nyquist_factor * N / T
    N = int(fmax // df)
    return df + df * np.arange(N)


# Now let's use the ``gatspy`` tools to plot the periodogram:

# In[4]:


t, y, dy = create_data(100, period=2.5)
freq = freq_grid(t)
print(len(freq))


# In[5]:


from gatspy.periodic import LombScargle
model = LombScargle().fit(t, y, dy)
period = 1. / freq
power = model.periodogram(period)
plt.plot(period, power)
plt.xlim(0, 5);


# The algorithm finds a strong signal at a period of 2.5.
# 
# To demonstrate explicitly that the Nyquist rate doesn't apply in irregularly-sampled data, let's use a period below the averaged sampling rate and show that we can find it:

# In[6]:


t, y, dy = create_data(100, period=0.3)
period = 1. / freq_grid(t, nyquist_factor=10)

model = LombScargle().fit(t, y, dy)
power = model.periodogram(period)
plt.plot(period, power)
plt.xlim(0, 1);


# With a data sampling rate of approximately $1$ time unit, we easily find a period of $0.3$ time units. The averaged Nyquist limit clearly does not apply for irregularly-spaced data!
# Nevertheless, short of a full analysis of the temporal window function, it remains a useful milepost in estimating the upper limit of frequency.

# ### Scaling with $N$
# With these rules in mind, we see that the size of the frequency grid is approximately
# $$
# N_f = \frac{f_{max}}{\delta f} \propto \frac{N/(2T)}{1/T} \propto N
# $$
# So for $N$ data points, we will require some multiple of $N$ frequencies (with a constant of proportionality typically on order 10) to suitably explore the frequency space.
# This is the source of the $N^2$ scaling of the typical periodogram: finding periods in $N$ datapoints requires a grid of $\sim 10N$ frequencies, and $O[N^2]$ operations.
# When $N$ gets very, very large, this becomes a problem.

# ## Fast Periodograms with ``LombScargleFast``
# Finally we get to the meat of this discussion.
# 
# In a [1989 paper](http://adsabs.harvard.edu/full/1989ApJ...338..277P), Press and Rybicki proposed a clever method whereby a Fast Fourier Transform is used on a grid *extirpolated* from the original data, such that this problem can be solved in $O[N\log N]$ time. The ``gatspy`` package contains a pure-Python implementation of this algorithm, and we'll explore it here.
# If you're interested in seeing how the algorithm works in Python, check out the code in [the gatspy source](https://github.com/astroML/gatspy/blob/master/gatspy/periodic/lomb_scargle_fast.py).
# It's far more readible and understandable than the Fortran source presented in Press *et al.*
# 
# For convenience, the implementation has a ``periodogram_auto`` method which automatically selects a frequency/period range based on an oversampling factor and a nyquist factor:

# In[7]:


from gatspy.periodic import LombScargleFast
help(LombScargleFast.periodogram_auto)


# In[8]:


from gatspy.periodic import LombScargleFast

t, y, dy = create_data(100)
model = LombScargleFast().fit(t, y, dy)
period, power = model.periodogram_auto()
plt.plot(period, power)
plt.xlim(0, 5);


# Here, to illustrate the different computational scalings, we'll evaluate the computational time for a number of inputs, using ``LombScargleAstroML`` (a fast implementation of the $O[N^2]$ algorithm) and ``LombScargleFast``, which is the fast FFT-based implementation:

# In[9]:


from time import time
from gatspy.periodic import LombScargleAstroML, LombScargleFast
            

def get_time(N, Model):
    t, y, dy = create_data(N)
        
    model = Model().fit(t, y, dy)
    t0 = time()
    model.periodogram_auto()
    t1 = time()
    result = t1 - t0
    
    # for fast operations, we should do several and take the median
    if result < 0.1:
        N = min(50, 0.5 / result)
        times = []
        for i in range(5):
            t0 = time()
            model.periodogram_auto()
            t1 = time()
            times.append(t1 - t0)
        result = np.median(times)
    return result

N_obs = list(map(int, 10 ** np.linspace(1, 4, 5)))
times1 = [get_time(N, LombScargleAstroML) for N in N_obs]
times2 = [get_time(N, LombScargleFast) for N in N_obs]


# In[10]:


plt.loglog(N_obs, times1, label='Naive Implmentation')
plt.loglog(N_obs, times2, label='FFT Implementation')
plt.xlabel('N observations')
plt.ylabel('t (sec)')
plt.legend(loc='upper left');


# For fewer than 100 observations, the naive implementation wins out, but as the number of points grows, we observe the clear trends in scaling: $O[N^2]$ for the Naive method, and $O[N\log N]$ for the fast method. We could push this plot higher, but the trends are already clear: for $10^5$ points, while the FFT method would complete in a couple seconds, the Naive method would take nearly two hours! Who's got the time for that plot?