#!/usr/bin/env python
# coding: utf-8
# # Tutorial Step 2: What's in a GWOSC Data File?
# In this tutorial, we will use python to read the file downloaded in [Step 1](<./01 - Download GWOSC Data.ipynb>).
# This step presents some details about the file format use in GWOSC data files.
# [Step 5](<./05 - GWpy examples.pynb>) will present a higher-level API to read those files.
#
#
⚠ Warning
# Uncomment the following cell if running in Google Colab.
# If you use other running methods, it should not be necessary.
#
# In[1]:
#! wget http://gwosc.org/archive/data/O3b_4KHZ_R1/1263534080/H-H1_GWOSC_O3b_4KHZ_R1-1264312320-4096.hdf5
# ## The GWOSC HDF5 file format
# The data file uses the standard [HDF5 file format](https://www.hdfgroup.org/solutions/hdf5).
# This format is used to store large amounts of scientific data and their meta-data (called attributes) in a hierarchical structure and process them efficiently.
# It is a common format in the scientific world and there are many tools and libraries to work with such files.
# A first option to view the structure of an HDF5 file is the [HDFView tool](https://www.hdfgroup.org/downloads/hdfview/), a free software which can be downloaded from the HDF Group web site.
# This will give you an interactive display of the content inside GWOSC data file.
# ## Programmatic access with python
# ### A first glimpse
# [h5py](http://www.h5py.org/) is a python package to work with HDF5 files.
# With it, you can access the content using a dictionary interface.
# In[2]:
#----------------------
# Import needed modules
#----------------------
import numpy as np
import matplotlib.pylab as plt
import h5py
# In[3]:
#--------------
# Open the File
#--------------
filename = 'H-H1_GWOSC_O3b_4KHZ_R1-1264312320-4096.hdf5'
dataFile = h5py.File(filename, 'r')
# In[4]:
#-----------------
# Explore the file
#-----------------
for key in dataFile.keys():
print(key)
# You can now access the data as in a dictionary.
# ### Plot a time series
# Let's continue our exploration to make a plot of a few seconds of data. To store the strain data in a convenient place, use the code below:
# In[5]:
#--------------------
# Read in strain data
#--------------------
strain = dataFile['strain']['Strain']
ts = dataFile['strain']['Strain'].attrs['Xspacing']
print(f"ts = {ts}s, sample rate = {1/ts}Hz")
# The code above accesses the `'Strain'` data object that lives inside the group `'strain'` - we store this as `strain`.
# The "attribute" `'Xspacing'` tells how much time there is between each sample, and we store this as `ts`.
# To see all the structure of a GWOSC data file, see the end of this tutorial.
# Now, let's use the meta-data to make a vector that will label the time stamp of each sample. In the same way that we indexed `dataFile` as a python dictionary, we can also index `dataFile['meta']`. To see what meta-data we have to work with, use the code below:
# In[6]:
#-------------------------
# Print out some meta data
#-------------------------
metaKeys = dataFile['meta'].keys()
meta = dataFile['meta']
for key in metaKeys:
print(key, meta[key])
# You should see that the GPS start time and the duration are both stored as meta-data.
# To calculate how much time passes between samples, we can divide the total duration by the number of samples:
# In[7]:
#---------------------
# Create a time vector
#---------------------
gpsStart = meta['GPSstart'][()]
duration = meta['Duration'][()]
gpsEnd = gpsStart + duration
time = np.arange(gpsStart, gpsEnd, ts)
# The numpy command `arange` creates an array (think a column of numbers) that starts at `gpsStart`, ends at `gpsEnd` (non-inclusive), and has an increment of `ts` between each element.
# We can now plot the data:
# In[8]:
#---------------------
# Plot the time series
#---------------------
plt.plot(time, strain[()])
plt.xlabel('GPS Time (s)')
plt.ylabel('H1 Strain')
plt.show()
# Finally, let's use all of this to plot a few seconds worth of data.
# Since this data is sampled at 4096 Hz, 10,000 samples corresponds to about 2.4 s.
# We will start at time 1256779566.0.
# In[9]:
#------------------------
# Zoom in the time series
#------------------------
numsamples = 10000
startTime = 1264316116.0
startIndex = np.min(np.nonzero(startTime < time))
time_seg = time[startIndex:(startIndex+numsamples)]
strain_seg = strain[startIndex:(startIndex+numsamples)]
plt.plot(time_seg, strain_seg)
plt.xlabel('GPS Time (s)')
plt.ylabel('H1 Strain')
plt.show()
# You may be surprised that the data looks smooth and curvy, rather than jagged and jumpy as you might expect for [white noise](https://en.wikipedia.org/wiki/White_noise).
# That's because white noise has roughly equal power at all frequencies, which GW data does not.
# Rather, GW data includes noise that is a strong function of frequency - we often say the noise is "colored" to distinguish it from white noise.
# The wiggles you see in the plot above are at the low end of the LVK band (around 20 Hz).
# In general, GW noise is dominated by these low frequencies.
# To learn more about GW noise as a function of frequency, take a look at the [Step 4 of this tutorial](<04 - Working in Frequency Domain.ipynb>).
# ### A slightly more advanced script
# Let's write a short python script for viewing the _entire_ structure of a GWOSC data file.
# It is based on the [visititems](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.visititems) method.
# In[10]:
def dump_info(name, obj):
print("{0} :".format(name))
try:
print(" value: {0}".format(str(obj[()])))
for key in obj.attrs.keys():
print(" attrs[{0}]: {1}".format(key, obj.attrs[key]))
except:
pass
file = h5py.File(filename)
file.visititems(dump_info)
# ## What's next?
# In the [next step](<03 - Working with Data Quality.ipynb>), you will learn how to work with data quality.