Environmental Data Analytics | John Fay and Luana Lima | Developed by Kateri Salk
Spring 2023
A handy link: https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html
In many situations in data analytics, you may be expected to work from multiple computers or share projects among multiple users. A few general best practices will avoid common pitfalls related to collaborative work.
Jupyter notebooks can use absolute or relative paths, but relative paths are more robust and should be used where possible. Relative paths will be relative to where the Jupyter notebook lives and OS commands can navigate up or down the directory structure.
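As a sketch of how relative paths resolve against the notebook's working directory, the snippet below builds one with the os module (the "data" folder and file name here are just assumptions for illustration):

```python
import os

#Current working directory (where the notebook lives)
notebook_dir = os.getcwd()

#Build an absolute path to a hypothetical "data" subfolder
data_dir = os.path.join(notebook_dir, 'data')
print(data_dir)

#A relative path like 'data/file.csv' resolves against the working
#directory, so the project can move between machines without edits
relative = os.path.join('data', 'file.csv')
print(relative)
```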
OS-specific commands can be called within Jupyter by preceding them with a "!". For example, in Windows you can list the contents of the folder containing the script you are running using "!dir". On unix machines, this would be "!ls".
#OS specific command for showing the current working directory
!pwd #for mac/linux based machines (!cd #for PCs)
#List the contents of the current directory ("!ls" also works)
!dir
#List the contents of the "data" sub directory
!dir data
#List the contents of the directory containing the current notebook
!dir ..
The os module
#Import the os module
import os
#Create a variable holding the current working directory
projectDir = os.getcwd()
#Display the current working directory
projectDir
#Change the directory to the data folder
os.chdir('data')
os.getcwd()
#Go back to the current working directory (stored in the "projectDir" variable above)
os.chdir(projectDir)
os.listdir()
As in R, packages should be loaded early in the script.
import pandas as pd #Import pandas, referring to it as "pd"
A note on installing packages
Most packages you'll need are already installed in this containerized environment. However, if you need to install a package, you can use "pip". For example, to install pandas, you'd issue the command:
!pip install pandas
To install other packages, just replace pandas with the package you want to install.
The easiest way to import CSV data for data analysis is using Pandas' read_csv() function, which reads data directly into a Pandas dataframe object.
As in R, we supply the path to the CSV file, using relative path conventions.
df_USGS = pd.read_csv('./data/Raw/USGS_Site02085000_Flow_Raw.csv')
Take a moment to read through the README file associated with the USGS dataset on discharge at the Eno River. Where can you find this file? How does the placement and information found in this file relate to the best practices for reproducible data analysis?
#View all records
df_USGS
#Confirm the data type -- R: class(df_USGS)
type(df_USGS)
#Display the column names -- R: colnames(df_USGS)
df_USGS.columns
#Rename columns -- R: colnames(df_USGS) <- c(...)
df_USGS.columns = ("agency_cd", "site_no", "datetime",
"discharge_max", "discharge_max_approval",
"discharge_min", "discharge_min_approval",
"discharge_mean", "discharge_mean_approval",
"gage_height_max", "gage_height_max_approval",
"gage_height_min", "gage_height_min_approval",
"gage_height_mean", "gage_height_mean_approval")
#Display the structure of the dataframe -- R: str(df_USGS)
df_USGS.info()
#Display the dimensions
df_USGS.shape
df_USGS.size
#View the head (first 5 records) of the dataset
df_USGS.head()
#Alternatively, view the first 9 records
df_USGS.head(9)
#Or 6 records, selected at random
df_USGS.sample(6)
#Or, the last 3 records
df_USGS.tail(3)
#View records 30000 to 30004, columns 3, 8, and 14
df_USGS.iloc[29999:30004,[2,7,13]]
#Show the count of values in the discharge_max_approval category
df_USGS['discharge_max_approval'].value_counts()
#Show the data type of the 'datetime' column
df_USGS['datetime'].dtype
#Show the data type of all columns
df_USGS.dtypes
#Summary of all data
df_USGS.describe()
#Summary of a specific column
df_USGS['discharge_mean'].describe()
Yep, as in R, dates can be a pain. By default they are imported as generic, non-numeric "objects" (hence the dtype of "O" above). The Pandas to_datetime function works like R's as.Date function to help convert dates, in various formats, into actual date objects (called "timestamps" in Pandas lingo).
#Create an example of a date, in string format
exampleDate = "2021-04-14"
#Convert to a Pandas "timestamp" object
dateObj = pd.to_datetime(exampleDate)
dateObj
If the date is in a non-standard format, we tell the command what format...
#Create a date string in a non-standard format
exampleDate2 = "Wednesday, 14 Apr. 2021"
dateObj2 = pd.to_datetime(exampleDate2,format = '%A, %d %b. %Y')
dateObj2
Timestamp objects can be displayed in various other date formats using the strftime function. See http://strftime.org/ for all the formatting options and try a few yourself.
#Display the timestamp objects in various formats using "strftime"
print(dateObj.strftime('%m/%d/%Y'))
print(dateObj.strftime('%A, %b. %d, %Y'))
print(dateObj.strftime('%j'))
#Try a format or two of your own here...
Converting datetime values to timestamps
We can apply the pd.to_datetime() function to our datetime column.
#Update the datetime column to be dates, not strings
df_USGS['datetime'] = pd.to_datetime(df_USGS['datetime'])
#Display a few samples
df_USGS.head()
As in our R example, the 2-digit dates in our raw data file are mistakenly assumed to be in the 21st century. We need to convert back to the 20th century. As we did in R, we'll apply a function to find and fix these dates...
df_USGS.iloc[-1,2] > pd.to_datetime('2019-01-10')
#Create a function called "fixDate" that corrects date values
def fixDate(d):
    if d > pd.to_datetime('2019-01-10'):
        return d - pd.DateOffset(years=100)
    else:
        return d
#Apply the function to the datetime values
df_USGS['datetime'] = df_USGS['datetime'].apply(fixDate)
#View the result
df_USGS.head()
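Applying a Python function row by row works, but Pandas can also fix the dates in one vectorized step with a boolean mask. The sketch below demonstrates the idea on a small example series (the sample dates are invented; the cutoff matches the one used above):

```python
import pandas as pd

#Example dates: the middle one is a mis-parsed 20th-century date
dates = pd.Series(pd.to_datetime(['1995-03-01', '2067-03-01', '2018-12-31']))
cutoff = pd.to_datetime('2019-01-10')

#Subtract 100 years only where the date exceeds the cutoff
mask = dates > cutoff
dates.loc[mask] = dates.loc[mask] - pd.DateOffset(years=100)
print(dates)
```

On a large dataframe, the masked assignment avoids calling a Python function once per row, which is typically faster than apply().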
Notice in our dataset that our discharge and gage height observations have many NAs, meaning no measurement was recorded for a specific day. In some cases, it might be in our best interest to remove NAs from a dataset. Removing NAs or not will depend on your research question.
#Count the missing values in each column (summing down the rows)
df_USGS.isna().sum()
#Show NAs in just one variable
df_USGS['discharge_mean'].isna().sum()
#Drop rows that have missing data in any column -- R: na.omit()
df_USGS_cleaned = df_USGS.dropna()
df_USGS_cleaned.shape
We just edited our raw dataset into a processed form. We may want to return to this processed dataset later, which will be easier to do if we save it as a spreadsheet.
#Save the file
df_USGS_cleaned.to_csv("./data/Processed/USGS_Site02085000_Flow_Processed.csv", index=False)
Files should be saved as .csv or .txt for easy import into Pandas. Note that complex formatting, including Excel formulas, is not saved when spreadsheets are converted to comma-separated or text formats (i.e., only the values are saved).
The first row is reserved for column headers.
A secondary row of column headers (e.g., units) should not be used if data are being imported into Pandas. Incorporate units into the first-row column headers if necessary.
Short names are preferred for column headers, to the extent they are informative. Additional information can be stored in comments within Python scripts and/or in README files.
Spaces in column names are allowed in Pandas, but should be replaced with underscores ("_") to avoid issues.
Avoid symbols in column headers. This can cause issues when importing into Pandas.
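The guidelines above can be enforced programmatically. The sketch below replaces spaces with underscores and strips symbols from column names using Pandas string methods (the example column names are invented for illustration):

```python
import pandas as pd

#Example dataframe with column names that break the guidelines
df = pd.DataFrame({'Gage Height (ft)': [1.2], 'Discharge, max': [30.0]})

#Replace spaces with underscores, then drop any character that is
#not a letter, digit, or underscore
df.columns = (df.columns
                .str.replace(' ', '_')
                .str.replace(r'[^0-9a-zA-Z_]', '', regex=True))
print(df.columns.tolist())
```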