In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Image
import pandas as pd
import scipy.stats as stats
```

- Learn how to deal with bivariate data (fitting lines, curves).
- Apply line fitting to determine the age of the Universe. Cool.
- A brief introduction to **scikit-learn**

Until now, we've worked with data that do not depend on other data - they are *univariate*. But there are many examples in the Earth Sciences where we are interested in the dependence of one set of data on another (*bivariate data*), or even *multivariate data*, which we will get to in later lectures. This lecture focusses on *bivariate data*. For examples of bivariate data in Earth Science, consider these cases: the distance from the ridge crest versus age gives you a spreading rate; the depth in a sediment core versus age gives you a sediment accumulation rate; the ratio of the radioactive form of carbon, $^{14}$C, to a stable form, $^{12}$C, is a function of the age of the material being dated. And we already saw that the difference in arrival times of the $P$ and $S$ seismic waves is related to the distance from the source to the receiver. These examples rely on bivariate statistics to get at the desired relationships.

My favorite example of bivariate data in Earth (and Space) Science is the use of retreat velocity of galaxies as a function of their distance to determine the age of the universe (how cool is that!). This is the basic relationship that underlies what has come to be known as "Hubble's Law" (same Hubble as for the Hubble telescope).

Hubble published his results in 1929 [Hubble, E. P. (1929) Proc. Natl. Acad. Sci., 15, 168–173.] At the time, it was unclear whether the universe was static, expanding, or collapsing. Hubble hypothesized that if the universe were expanding, then everything in it would be moving away from us. The greater the distance between the Earth and the galaxy, the faster it must be moving. So all he had to do was measure the distance and velocity of distant galaxies. Easy-peasy - right?

To measure velocity, Hubble made use of the Doppler shift. To understand how this works, recall that the pitch you hear from an ambulance changes as it passes: high as it approaches, low as it recedes. The pitch changes because the relative frequency of the sound waves changes - it increases as the ambulance approaches, leading to a higher pitch, and decreases as it moves away, resulting in a lower pitch.
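As a toy illustration (the siren frequency and speed here are made-up numbers), the classical Doppler formula for a source moving directly toward or away from a stationary observer can be sketched like this:

```python
# Classical Doppler shift for a moving sound source (toy example)
def doppler_frequency(f_source, v_source, v_sound=343.0):
    """Observed frequency (Hz) for a source moving at v_source m/s
    (positive = approaching the observer, negative = receding)."""
    return f_source * v_sound / (v_sound - v_source)

# A hypothetical 700 Hz siren moving at 30 m/s:
approaching = doppler_frequency(700, 30)   # ~767 Hz - higher pitch
receding = doppler_frequency(700, -30)     # ~644 Hz - lower pitch
```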

In [2]:

```
Image(filename='Figures/doppler.jpg',width=400)
```

Out[2]:

*Figure from https://www.physicsclassroom.com/class/waves/Lesson-3/The-Doppler-Effect*

The same principle applies to light, but rather than hear a change in frequency, we observe a shift in the wavelength (the color) detected from the galaxy. But how can we measure this shift? Recall from our class on solar elemental abundances that stars are made primarily of hydrogen. Hydrogen, like all atoms, absorbs energy at discrete wavelengths associated with the excitation of electrons. So if light passes through hydrogen, it will lose energy at those specific wavelengths. If a star or galaxy is moving away from us, these distinct absorption bands (detected as black bands in the spectrum of light) are shifted towards longer wavelengths - the red end of the visible spectrum. The faster the star or galaxy travels away from the observer, the greater the shift will be to the red:

In [3]:

```
Image(filename='Figures/dopp-redshift.jpg',width=400)
```

Out[3]:

*Figure from http://www.a-levelphysicstutor.com/wav-doppler.php*

So, Hubble measured the red shift of different galaxies and converted them to velocities.
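The conversion is straightforward for small shifts: the fractional change in wavelength equals the velocity as a fraction of the speed of light, $v/c \approx \Delta\lambda/\lambda_{rest}$ (a relativistic version appears later in this lecture). A minimal sketch, using the real rest wavelength of the hydrogen-alpha line (656.3 nm) but a hypothetical observed wavelength:

```python
# Recession velocity from a redshifted absorption line
# (non-relativistic approximation, valid for small shifts)
c = 3e5  # speed of light in km/s

def redshift_velocity(lambda_obs, lambda_rest):
    """Velocity (km/s) from observed and rest wavelengths (same units)."""
    return c * (lambda_obs - lambda_rest) / lambda_rest

# hypothetical example: H-alpha (rest 656.3 nm) observed at 660 nm
v = redshift_velocity(660.0, 656.3)  # ~1690 km/s, moving away from us
```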

He then estimated the distance to these objects, which is harder to do. There are many ways to do this and you can look them up, if you like. In any case, we can use this dataset

http://keyah.asu.edu/lessons/AgeOfUniverse/Exercises.html

which contains distances to different galaxies and their velocities.

Before we read in the data and plot them, let's learn about styling text in **matplotlib**. To create nice labels with superscripts, we can use LaTeX styling, the same as in a markdown cell. For a superscript, first we need to encase the text in dollar signs and then use the ^ symbol to make the following text a superscript. If there is more than one character in the superscript, you must enclose what you want as the superscript in curly braces.
For example, to print $10^3$, we use:

\$10^3\$

and for 'per second' (s$^{-1}$), we use:

s\$^{-1}\$.

For a subscript, replace the ^ symbol with an underscore (_) symbol.

Okay, now let's read in the data and take a look.

In [4]:

```
data=pd.read_csv('Datasets/redShiftDistance/d_v.csv',header=1)
print (data.head()) # take a peek
```

We can plot the data as little red stars ('r*'), using the fancy LaTeX styling tricks we just learned.

In [5]:

```
plt.plot(data['d (10^3 Mpc)'],data['v (10^3km/s)'],'r*') # plot the points as red stars
plt.ylabel('v (10$^3$km s$^{-1}$)') # notice the latex style labels
plt.xlabel('Distance (10$^3$ Mpc)');
```

So, we have distance on the X axis in Megaparsecs and velocity on the Y axis in 10$^3$km/s.

To calculate the age of the universe, we can use Hubble's logic:

Here is "Hubble's Law": $v = H_o d$, where $v$ is velocity, $d$ is distance, and $H_o$ is "Hubble's constant".

This looks a lot like the equation for a line through the data ($y=mx + b$) where $m$ is the slope and $b$ is the y-intercept. In this case, the y-intercept should be 0 or nearly so, and $m$ is $H_o$.

We can calculate the coefficients $m$ and $b$ in **NumPy** fairly easily using a function called **np.polyfit( )**. (A line is just a first degree polynomial after all).

In [6]:

```
help(np.polyfit)
```

**np.polyfit( )** can be used to calculate best fit lines (setting the degree (**deg**) to 1), or higher order curves (setting degree to 2 or higher).

All we want is a best fit line for now, so let's see how well it does:

In [7]:

```
print (np.polyfit(data['d (10^3 Mpc)'],data['v (10^3km/s)'],1))
```

Aha! So $H_o$, the slope of the best-fit line, is about 70.3 (in some weird units).

Before going on, let's plot the best fit line on our graph.

We can assign the best fitting slope and y-intercept from **np.polyfit( )** to a variable (**m_b**) and feed that to another function **np.polyval( )** which will calculate new Y values using the model of a linear fit:

In [8]:

```
help(np.polyval)
```

In [9]:

```
# save the statistics in an array called m_b
m_b= np.polyfit(data['d (10^3 Mpc)'],data['v (10^3km/s)'],1)
print (m_b) # see if that worked
```

**m_b** seems to be an array of coefficients, where the first is the slope and the second is the y-intercept.

Let's feed **m_b** into **np.polyval( )**, along with our X array to get a new set of Y values, assuming a linear fit. Then we can plot the model data as a black line along with the original Y values.

In [10]:

```
modelYs=np.polyval(m_b,data['d (10^3 Mpc)'])
# now plot the data and the best-fit line:
plt.plot(data['d (10^3 Mpc)'],data['v (10^3km/s)'],'r*')
plt.plot(data['d (10^3 Mpc)'],modelYs,'k-') # plot as black line
plt.ylabel('v (10$^3$km s$^{-1}$)')
plt.xlabel('Distance (10$^3$ Mpc)');
```

It would be handy to also know how well our data fit the model (not everything is linear, after all).

To do that, we need a different function: **scipy.stats.linregress( )**.

In [30]:

```
help(stats.linregress)
```

And use it to get what is normally called the $R^2$ value, which, when it equals 1, represents perfect agreement. (**stats.linregress( )** returns the correlation coefficient $r$; we square it to get $R^2$.)

In [31]:

```
print (stats.linregress(data['d (10^3 Mpc)'],data['v (10^3km/s)']))
# and save the statistics as:
slope,intercept, r_value, p_value, std_err = stats.linregress(data['d (10^3 Mpc)'],data['v (10^3km/s)'])
print ('\n')
print ('slope: %7.3f, intercept: %4.1f, R^2: %5.3f'%\
(slope, intercept, r_value**2))
```

Not a bad fit! The Universe is expanding.

Now let's get back to Hubble's Law and the age of the universe.

We had $v=H_o d$ as Hubble's law, and we know that distance = velocity x time, or $d=vt$. Substituting this in for $d$ and dividing both sides by $v$, we get:

$1=H_o t$.

Solving for $t$ (the age of the universe), we get

$t=1/H_o$ [in some weird units.]

In [32]:

```
print (1./slope)
t=1./slope
```

But the units are weird (Mpc s km$^{-1}$, not years). To fix this, we need to know how many kilometers are in a megaparsec. As it happens, there are 3.09 x 10$^{19}$ km/Mpc. [I bet you didn't know that off-hand. You're welcome.]
So, we can calculate the age of the universe in seconds (**Age_sec**) by converting the megaparsecs to km:

Age (s) = $t\, \frac{\rm Mpc \cdot s}{\rm km} \times 3.09 \times 10^{19}\, \frac{\rm km}{\rm Mpc}$

In [33]:

```
Age_sec=t*3.09e19
print (Age_sec)
```

That's a lot of seconds! We should convert seconds to years. Here's another fun fact: there are approximately $\pi$ x 10$^7$ seconds in a year.

More precisely, there are 60(s/min) x 60(min/hr) x 24(hr/day) x 365.25 (days/yr)

In [34]:

```
s_yr=60*60*24*365.25
print ('%e'%(s_yr))
```

OK, so not exactly $\pi \times 10^7$, but close....

In [35]:

```
Age_yrs=Age_sec/s_yr
print (Age_yrs)
```

And now in billions of years please:

In [36]:

```
print ('Age of the universe: %5.2f'%(Age_yrs*1e-9), 'Ga')
```

There are many other curve fitting methods in **scipy** and **NumPy**. I encourage you to look around when the need arises.

As an example of a curved relationship, there are new data sets available for the classic Hubble problem. I found one published by Betoule et al. in 2014 http://dx.doi.org/10.1051/0004-6361/201423413. Cosmologists plot things differently now than they did in Hubble's time. They use the parameters $z$ and $\mu$, which are related to the red shift velocity and distance: $z$ is the fractional shift in the spectral wavelength, and $\mu$ is the distance modulus, a logarithmic measure of distance.

Here is a plot from the Betoule et al. paper:

In [37]:

```
Image(filename='Figures/betoule14.png',width=400)
```

Out[37]:

*[Figure modified from Betoule et al., 2014.] The different colors are different types of objects (galaxies, supernovae, etc.).*

Notice that they plotted the data on a log scale. (This hides some surprising things.)

To compare the old estimate of the age of the universe (~14 Ga) with that calculated from the new data set, we will need to convert $z$ and $\mu$ to distance and velocity.

According to http://hyperphysics.phy-astr.gsu.edu/hbase/Astro/hubble.html

velocity $v$ (as fraction of the speed of light, $c$) is given by

${v\over c}= \bigl({{(z+1)^2-1} \over {(z+1)^2+1}}\bigr)$

where $c=3 \times 10^8$m s$^{-1}$.

And according to the Betoule et al. (2014) paper, $\mu$ relates to distance in parsecs $d$ like this:

$\mu=5\log(d/10)$.
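As a quick sanity check of these two formulas (with assumed test values only): an object at rest ($z=0$) should have $v=0$, $\mu=0$ should correspond to $d=10$ pc by definition of the distance modulus, and for small $z$ the velocity ratio $v/c$ should be close to $z$ itself:

```python
import numpy as np

c = 3e8  # speed of light in m/s

def v_from_z(z):
    """Relativistic recession velocity (m/s) from redshift z."""
    return c * ((z + 1.)**2 - 1.) / ((z + 1.)**2 + 1.)

def d_from_mu(mu):
    """Distance in parsecs from the distance modulus mu."""
    return 10. * 10.**(mu / 5.)

v_from_z(0.)       # 0: no redshift, no motion
d_from_mu(0.)      # 10 pc by definition
v_from_z(0.01)/c   # ~0.00995, close to z for small z
```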

Let's read in the data (available from this website: http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/568/A22#sRM2.2), which are averages of the data shown in the figure above, and take a peek.

In [38]:

```
new_data=pd.read_csv('Datasets/redShiftDistance/mu_z.csv',header=1)
print (new_data.columns)
```

Now we can plot them the same way as the cosmologists do, using $\mu$ and $\log z$:

In [39]:

```
plt.semilogx(new_data.z,new_data.mu,'bo')
plt.xlabel('log $z$')
plt.ylabel('$\mu$');
```

To compare the new data with the 'old' data, we must do the following:

- Transform $z$ to velocity
- Transform $\mu$ to distance using the equations provided.
- Truncate the new dataset, which goes to much farther distances than the 'old' data set
- Write a nice **lambda** function to recalculate the age of the universe

Let's begin by writing a lambda function:

In [40]:

```
# function returns age in Ga for Ho
age_from_Ho= lambda Ho : 1e-9*3.09e19/(Ho*np.pi*1e7)
```
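To check that the unit bookkeeping in this function is right, feeding it the slope we got from the 'old' data (about 70.3) should return roughly the ~14 Ga we calculated before (it uses the $\pi \times 10^7$ s/yr approximation):

```python
import numpy as np

# same lambda as above: Ho in km/s/Mpc -> age in Ga
age_from_Ho = lambda Ho: 1e-9 * 3.09e19 / (Ho * np.pi * 1e7)

age_from_Ho(70.3)  # ~14.0 Ga, matching our earlier estimate
```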

Now we convert, truncate, and recalculate H$_o$:

In [41]:

```
# convert z to velocity in 10^3 km/s (or 10^6 m/s)
c=3e8 # speed of light in m/s
new_data['velocity']=1e-6*c* \
(((new_data.z+1.)**2-1.)/((new_data.z+1.)**2+1.)) # the formula for v from z (and c)
# convert mu to distance in 10^3 Mpc (a Gpc):
new_data['distance']=10.*(10.**((new_data['mu'])/5.))*1e-9 # convert mu to Gpc
# and filter for the closest objects
close_data=new_data[new_data.distance<0.7]
close_fit= np.polyfit(close_data['distance'],close_data['velocity'],1) # calculate the coefficients
close_modelYs=np.polyval(close_fit,close_data['distance']) # get the model values
age=age_from_Ho(close_fit[0]) # get a new age
print (age)
```

Wow - that is different!

Now we can plot them and see what happened!

In [42]:

```
# plot the old data (in red dots) with the new (in blue dots)
plt.plot(data['d (10^3 Mpc)'],data['v (10^3km/s)'],'r*',label='Old data') # as red stars
plt.plot(data['d (10^3 Mpc)'],modelYs,'r--',label='Old linear fit') # plot old fit
plt.plot(close_data.distance,close_data.velocity,'bd',label='New data') # as blue diamonds
plt.plot(close_data.distance,close_modelYs,'b-',label='New Linear fit') # plot old fit
plt.xlabel('Distance (10$^3$ Mpc)')
plt.ylabel('v (10$^3$km s$^{-1}$)')
plt.text(0.4,15,'Old Age: 13.9 Ga',color='red') # put on a note with the old age
plt.text(0.4,10,'New Age: %4.1f'%age+' Ga',color='blue') # put on a note with the new age
plt.legend(numpoints=1,loc=2); # and a legend
```

It looks like the velocity estimates in the new data are slower than the velocities estimated in the old data. The age of the universe just got a lot older! But don't get too excited yet!

Although the two data sets look superficially similar, if you read the paper, you will see there are many adjustments required to calculate the distances (Hubble was in fact WAY off, almost by a factor of 10!).

We filtered the data, so we only worked with 'near-by' objects, closer than 0.7 x 10$^3$ Mpc (0.7 Gpc). But that only accounts for a small portion of the data set. Let's take a look at the entire dataset.

In [43]:

```
plt.plot(new_data.distance,new_data.velocity,'bd',label='new data')
plt.ylabel('v (10$^3$km s$^{-1}$)')
plt.xlabel('Distance (10$^3$ Mpc)');
```

That doesn't look linear at all! The cosmologists believe that the expansion of the universe is accelerating. Why? Ask your physics friends about dark energy. Or read Neil deGrasse Tyson's "Astrophysics for People in a Hurry". I loved it.

Given that the data are hardly linear, we should calculate the fit with a higher order polynomial instead of with a linear fit.

In [44]:

```
colors = ['m','m','r','g','k'] # make a list of colors for plotting (indexed by degree)
plt.plot(new_data.distance,new_data.velocity,'bd',label='new data')
# try higher degree polynomials
for deg in range(1,5): # degree of polynomial
    fit=np.polyfit(new_data['distance'],new_data['velocity'],deg) # fit a polynomial of this degree
    Y=np.polyval(fit,new_data['distance']) # get the model values
    plt.plot(new_data.distance,Y,colors[deg],label="%i degree"%deg)
plt.ylabel('v (10$^3$km s$^{-1}$)')
plt.xlabel('Distance (10$^3$ Mpc)')
plt.legend(numpoints=1,loc=2);
```

So, it appears that a 3rd degree polynomial does the trick. But this is an active area of research. Read this article in the New York Times - https://nyti.ms/2loeFoX (also saved in Background/expandingUniverse.pdf). The apparent acceleration may be controlled by *dark energy*, and values for H$_o$ hover between 67 (~14.7 Ga) and 73 (~13.5 Ga).

Normally, using a high order polynomial is a bad idea, because of overfitting. Of course, *underfitting* is also a bad idea. So how do you choose?

With overfitting, you can fit the data really well, but with a model that doesn't make much sense. To illustrate this, let's make a sine wave with some added noise and try to fit the data. First we'll try with a degree 6 polynomial.

In [45]:

```
x=np.linspace(0,np.pi,20)
sine_wave=np.sin(3*x) # make a sine wave
normal_distribution=stats.norm(0,0.15) # create some 'noise' to add to the sine wave
y=sine_wave+normal_distribution.rvs(len(x)) # add the noise to the sine wave
good_fit=np.polyfit(x,y,6) # fit with a degree 6 polynomial
interpolated_x=np.linspace(0,np.pi,100) # make an array of x values for a continuous line
predicted_y=np.polyval(good_fit,interpolated_x) # evaluate the fit along the continuous line
plt.plot(x,y,'ko',label='Data') # plot the noisy data
plt.plot(interpolated_x,predicted_y,'g',label='Degree 6 Polynomial') # plot the polynomial fit
plt.plot(interpolated_x,np.sin(3*interpolated_x),'k',label='Sine Wave') # plot the original sine wave
plt.legend();
```

As we can see, a degree 6 polynomial approximates the sine wave relatively well, but not perfectly. Now let's try with a degree 13 polynomial:

In [46]:

```
over_fit=np.polyfit(x,y,13)
predicted_y=np.polyval(over_fit,interpolated_x)
plt.plot(x,y,'ko',label='Data')
plt.plot(interpolated_x,predicted_y,'r',label='Degree 13 Polynomial')
plt.plot(interpolated_x,np.sin(3*interpolated_x),'k',label='Sine Wave');
```

The degree 13 polynomial captured more of the noise than we wanted, and swings to odd values between the data points in order to fit it. This is known as overfitting. How do we deal with it? There are a few ways.
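One simple check (a sketch using only **NumPy**, separate from the **scikit-learn** approach below) is to hold out some of the noisy sine-wave points, fit polynomials of different degrees to the rest, and compare the prediction error on the held-out points; an overfit polynomial tends to predict the held-out data worse:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, np.pi, 40)
y = np.sin(3 * x) + rng.normal(0, 0.15, len(x))

# hold out every fourth point; fit on the rest, score on the held-out points
test = np.arange(len(x)) % 4 == 0
errs = {}
for deg in (6, 13):
    fit = np.polyfit(x[~test], y[~test], deg)
    errs[deg] = np.sqrt(np.mean((np.polyval(fit, x[test]) - y[test])**2))
    print('degree %2i holdout RMS error: %.3f' % (deg, errs[deg]))
```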

**Scikit-learn** is a machine learning package which we will learn a lot about in later lectures. One of the things that **scikit-learn** does well is regression; it offers many different types of regression, all contained within the **sklearn.linear_model** module.

To use **Scikit-learn**, we need to understand a few things. First, it treats data slightly differently than, for example, **np.polyfit( )**. It expects each data point to be described by an array of *features* (think of these as 'coordinates' for each data point). In our case the *features* are $x$, $x^2$, $x^3$, etc. Fortunately, **scikit-learn** can generate these automatically using **sklearn.preprocessing.PolynomialFeatures**.

So, to start, let's transform our data into something **scikit-learn** can understand. Let's make a data container and take a look.

In [47]:

```
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(degree=13) # we're going to make a degree 13 fit to start with
x_transform=poly.fit_transform(x[:,np.newaxis]) # np.newaxis makes x a column vector; fit_transform builds the polynomial features
print(x_transform[:5])
```

This has given us an array of [$x^0$,$x^1$,$x^2$,$x^3$,...,$x^{13}$] for each data point.
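As a quick sanity check, the transformed array should have one row per data point and 14 columns ($x^0$ through $x^{13}$). A self-contained version of the cell above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(0, np.pi, 20)
poly = PolynomialFeatures(degree=13)
x_transform = poly.fit_transform(x[:, np.newaxis])

print(x_transform.shape)  # (20, 14): 20 points, features x^0 through x^13
```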

**Scikit-learn** has a regression algorithm called **Ridge**. This penalizes large polynomial coefficients, shrinking them towards zero to get a less unrealistic result. Let's try this on our degree 13 polynomial fit.

In [48]:

```
from sklearn.linear_model import Ridge
interp_x_transform=poly.fit_transform(np.reshape((interpolated_x),(len(interpolated_x),1))) #Transform interpolated points
reg=Ridge(alpha=1e-2) #reg is the function we're going to use to transform our data
fit=reg.fit(x_transform,y); #fit data
y_reg=fit.predict(interp_x_transform) #Predict the curve for our interpolated points
plt.plot(interpolated_x,y_reg,'b',label='Ridge fit') #Plot the new fit
plt.plot(x,y,'ko',label='Data') #Plot the original data
#plt.plot(interpolated_x,predicted_y,'r')
plt.plot(interpolated_x,np.sin(3*interpolated_x),'k',label='Sine Wave') #Plot the original sine wave
plt.legend();
```

As we can see, the Ridge fit with a degree 13 polynomial isn't doing quite as well as our original degree 6 polynomial fit, but it's a big improvement over our original (unregularized) degree 13 polynomial fit.
This may seem confusing for now. We will return to **scikit-learn** to see some more uses for the package in some later lectures.
