from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
An array is a data structure that holds a sequence of values of the same type: for example, a sequence of numbers or a sequence of strings.
We can use the make_array function from the datascience package to create ndarray objects, which are arrays implemented by the NumPy package. Operations on these arrays, such as arithmetic applied to every element at once, run very efficiently.
Range functions create arrays of evenly spaced, ordered sequences of numbers. We can use the np.arange() function to create such NumPy ndarrays.
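The three common ways of calling np.arange() can be sketched as follows. The variable names are illustrative; note that the stop value is always excluded.

```python
import numpy as np

# One argument: count from 0 up to (but not including) the stop value.
first_five = np.arange(5)

# Two or three arguments: start, stop (excluded), and an optional step.
evens = np.arange(2, 11, 2)

# The step does not have to be an integer.
halves = np.arange(0, 2, 0.5)

print(first_five, evens, halves)
```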
Tables store structured data. We can use the datascience package to create Table objects, on which we can perform data manipulation operations (the Table object is a simplified version of a Pandas DataFrame).
Some methods we can call on Table objects are:
- tb.show(k): show the first k rows of the table
- tb.select('col1', 'col2'): select col1 and col2 from the table
- tb.drop('col'): remove col from the table
- tb.sort('col'): sort the rows in the table based on the values in col
- tb.where('col', value): reduce the table to rows where col is equal to value
Apart from show, which only displays rows, each of these methods returns a new Table that reflects the operation; the original table is not modified.
Let's look at data on ice cream cones that is described in the class textbook.
# Load the ice cream data. Each row represents one ice cream cone.
cones = Table.read_table('cones.csv')
cones
Flavor | Color | Price |
---|---|---|
strawberry | pink | 3.55 |
chocolate | light brown | 4.75 |
chocolate | dark brown | 5.25 |
strawberry | pink | 5.25 |
chocolate | dark brown | 5.25 |
bubblegum | pink | 4.75 |
# Show the first 2 rows of the data
# select only the Flavor column
# the original cones Table is not modified
# select the Flavor and Price columns
# remove the Color column
# sort by price
# sort by price, highest to lowest
# select only the chocolate cones
# We can combine multiple method calls. Let's drop the color and then sort by price
Let's look at basketball (NBA) salaries from the 2015-2016 season. The data is originally from https://www.statcrunch.com/app/index.php?dataid=1843341
# NBA players, 2015-2016 season
nba = Table.read_table('nba_salaries.csv').relabeled(3, 'SALARY')
nba
PLAYER | POSITION | TEAM | SALARY |
---|---|---|---|
Paul Millsap | PF | Atlanta Hawks | 18.6717 |
Al Horford | C | Atlanta Hawks | 12 |
Tiago Splitter | C | Atlanta Hawks | 9.75625 |
Jeff Teague | PG | Atlanta Hawks | 8 |
Kyle Korver | SG | Atlanta Hawks | 5.74648 |
Thabo Sefolosha | SF | Atlanta Hawks | 4 |
Mike Scott | PF | Atlanta Hawks | 3.33333 |
Kent Bazemore | SF | Atlanta Hawks | 2 |
Dennis Schroder | PG | Atlanta Hawks | 1.7634 |
Tim Hardaway Jr. | SG | Atlanta Hawks | 1.30452 |
... (407 rows omitted)
# Let's get Stephen Curry's data
# Let's get data from the New York Knicks
We can extract columns from a Table as either:
- a Table with fewer columns, using tb.select()
- an ndarray, using tb.column()
# extract a column from a Table as a Table
# extract a column from a Table as an ndarray
We can also create tables from scratch by calling Table() to make an empty table and then adding columns with the tb.with_column("col_name", ndarray) method.