In [36]:

from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Review and continuation of functions

Functions, which we invoke using call expressions, take input values (called arguments) and return an output value.

A few example functions are:

abs(): Returns the absolute value of a number
min(x, y, ..., z): Returns the minimum of x, y, ..., z
round(): Rounds a number, optionally to a given number of decimal places

In [38]:

abs(-5)   # take the absolute value

Out[38]:

5

In [39]:

# take the absolute value of a difference of numbers
a = 5
b = 7
abs(a - b)   

Out[39]:

2

In [40]:

# get the minimum value
min(8, 3)

Out[40]:

3

In [41]:

# round a number
round(123.56789)

Out[41]:

124

In [42]:

# round a number to 2 decimal places
round(123.56789, 2)

Out[42]:

123.57

Review and continuation of numbers

Python (like most programming languages) has different ways to represent numbers. Two important numerical representations are:

ints: These represent integers (whole numbers)
floats: These approximate real numbers with finite precision
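
As a small illustration of the difference between the two representations, the sketch below checks the type of a few values and shows one consequence of finite float precision. The variable names are made up for this example.

```python
# ints are exact whole numbers; floats are approximations with limited precision
whole = 10 + 3          # int + int gives an int
ratio = 10 / 2          # true division gives a float, even when the result is 5.0
print(type(whole))      # <class 'int'>
print(type(ratio))      # <class 'float'>

# 0.1 and 0.2 cannot be stored exactly in binary, so their sum is not exactly 0.3
total = 0.1 + 0.2
print(total == 0.3)               # False
print(abs(total - 0.3) < 1e-9)    # True: compare floats with a small tolerance instead
```

Comparing floats with a tolerance, rather than with ==, is the usual way to avoid surprises from the last few decimal places.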

In [43]:

2  # int

Out[43]:

2

In [ ]:

10 + 3   # int 

In [45]:

1.7 + 4 # float

Out[45]:

5.7

In [47]:

2. + 3  # float

Out[47]:

5.0

In [48]:

10 / 3   # float  

Out[48]:

3.3333333333333335

In [49]:

10 / 2    # still a float

Out[49]:

5.0

In [50]:

123 ** 4   

Out[50]:

228886641

In [53]:

30 / 400

Out[53]:

0.075

In [54]:

30 / 4000000000        # output in scientific notation 

Out[54]:

7.5e-09

In [55]:

9 ** 0.5

Out[55]:

3.0

In [ ]:

.12345678901234567890123456789     # limited precision

In [56]:

13 ** 0.5

Out[56]:

3.605551275463989

In [57]:

(13 ** 0.5) ** 2         # After arithmetic, the final few decimal places can be wrong

Out[57]:

12.999999999999998

In [58]:

float(3)

Out[58]:

3.0

In [59]:

int(6.75)

Out[59]:

6

Strings

A string is a piece of text data. A number of operations (functions) can be applied to strings, as shown below. We can also convert numbers (ints and floats) into string representations, and strings back into number representations.
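
Before going through the individual cells, here is a short sketch that puts the conversions together. The value of price and the variable names are hypothetical, chosen only for illustration.

```python
# Convert a number to a string so it can be concatenated with other text
price = 4.75                         # a float (hypothetical value)
label = 'Price: ' + str(price)       # str() is needed before joining with +
print(label)                         # Price: 4.75

# Convert a string back to numbers
text = '2.3'
as_float = float(text)               # '2.3' becomes 2.3
as_int = int(as_float)               # int() drops the fractional part: 2.3 becomes 2
print(as_float, as_int)              # 2.3 2
```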

In [60]:

'Flavor'

Out[60]:

'Flavor'

In [61]:

'any snippet of text'

Out[61]:

'any snippet of text'

In [62]:

'2' + 'x'    # concatenation

Out[62]:

'2x'

In [63]:

'straw' + 'berry'

Out[63]:

'strawberry'

In [64]:

'two ' + 'words' # notice the space after the 'two '

Out[64]:

'two words'

In [65]:

'ha' * 5

Out[65]:

'hahahahaha'

In [66]:

str(2)

Out[66]:

'2'

In [67]:

int('2')

Out[67]:

2

In [68]:

int('2.3')   # raises a ValueError: int() cannot parse a string with a decimal point

In [69]:

float('2.3')

Out[69]:

2.3

In [70]:

int(float('2.3'))

Out[70]:

2

In [ ]:

str('3', '2') # raises a TypeError; to concatenate strings, use + instead

In [72]:

'3'+'2'

Out[72]:

'32'

In [ ]:

2 + 'x'   # raises a TypeError: an int and a string cannot be added

In [ ]:

# double quotes allow an apostrophe inside the string
"I'm a data scientist!"

In [ ]:

# SyntaxError: with single quotes, the apostrophe ends the string early
'I'm a data scientist!'

Types

As we have seen, there are different types of data, e.g., ints, floats, and strings. Different functions operate on particular types of data. For example, one can't take the absolute value of a string (try it and you will get a TypeError).

We can use the type() function to check what type of value a given name is holding.
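
The following sketch shows both ideas together: type() reporting the type behind a name, and a function rejecting a value of the wrong type. The try/except is not part of the original notebook; it is used here only so the example runs to completion instead of stopping at the error.

```python
# type() reports the type of the value a name currently holds
a = 5.7
print(type(a))          # <class 'float'>
a = 'abcd'              # the same name can later hold a value of another type
print(type(a))          # <class 'str'>

# abs() is defined for numbers, not strings, so calling it on a string fails
error_name = None
try:
    abs('abcd')
except TypeError as err:
    error_name = type(err).__name__
print(error_name)       # TypeError
```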

In [ ]:

type(2.3)

In [ ]:

type(100)

In [ ]:

type('abcd')

In [ ]:

a = 5.7
type(a)

Arrays

An array is a data structure that holds a sequence of values of the same type, for example a sequence of numbers or a sequence of strings.

We can use the make_array function from the datascience package to create an ndarray, the array type implemented by the NumPy package. A range of operations can be performed on these arrays very efficiently.
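
A minimal sketch of what "operations on arrays" means in practice. It assumes NumPy is installed and calls np.array directly as a stand-in for make_array, so that it runs without the datascience package; the variable names are made up for this example.

```python
import numpy as np

prices = np.array([5, 6, 7, 8])                  # stand-in for make_array(5, 6, 7, 8)
doubled = prices * 2                             # arithmetic applies to every element
shifted = prices + np.array([20, 30, 40, 50])    # equal-length arrays add element by element
print(doubled.tolist())              # [10, 12, 14, 16]
print(shifted.tolist())              # [25, 36, 47, 58]
print(sum(prices) / len(prices))     # 6.5, the mean of the array

# All values in an array share one type: mixing an int and a float gives floats
mixed = np.array([1, 2.5])
print(mixed.dtype)                   # float64
```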

In [ ]:

make_array(1, 2, 3, 4)

In [ ]:

my_array = make_array(5, 6, 7, 8)

In [ ]:

my_array

In [ ]:

len(my_array)

In [ ]:

sum(my_array)

In [ ]:

sum(my_array) / len(my_array)

In [ ]:

my_array

In [ ]:

my_array * 2

In [ ]:

another_one = make_array(20, 30, 40, 50)

In [ ]:

my_array + another_one

In [ ]:

yet_another = make_array(1, 2, 3, 4, 5, 6)
my_array + yet_another   # raises a ValueError: the arrays have different lengths

In [ ]:

my_array

In [ ]:

my_array.item(0)

Ranges

We can use NumPy's np.arange() function to create ndarrays that contain useful sequences of numbers.
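
As a quick summary of how the arguments behave, the sketch below (assuming NumPy is installed; the variable names are made up) shows that the start value is included and the stop value is excluded.

```python
import numpy as np

# np.arange(start, stop, step): start is included, stop is excluded
first_four = np.arange(4)               # start defaults to 0 and step to 1
by_tens = np.arange(5, 25, 10)          # 25 is the stop value, so it is left out
by_tens_incl = np.arange(5, 26, 10)     # a stop of 26 lets 25 in
print(first_four.tolist())      # [0, 1, 2, 3]
print(by_tens.tolist())         # [5, 15]
print(by_tens_incl.tolist())    # [5, 15, 25]
```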

In [88]:

np.arange(4)

Out[88]:

array([0, 1, 2, 3])

In [ ]:

np.arange(5, 25)

In [ ]:

np.arange(5, 25, 10)

In [ ]:

np.arange(5, 26, 10)

Tables

Tables store structured data. We can use the datascience package to create Table objects on which we can perform data manipulation operations (the Table object is a simplified version of a Pandas DataFrame).

Some methods we can perform on Table objects are:

tb.show(k): show the first k rows of the table
tb.select('col1', 'col2'): select col1 and col2 from the table
tb.drop('col'): remove col from the table
tb.sort('col'): sort the rows in the table based on the values in col
tb.where('col', value): reduce the table to rows where col is equal to value

Apart from show, which only displays rows, these methods return a new Table object reflecting the operation; the original Table is left unchanged.
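
To make the idea of methods that return a new table concrete without needing cones.csv or the datascience package, here is a rough plain-Python analogy that stores rows as a list of dictionaries. The helper names select, where and sort_by are hypothetical and only mimic the idea behind the Table methods; they are not how the library is implemented. The three rows are taken from the cones data used in this notebook.

```python
# Each row is a dictionary; the "table" is a list of rows
cones_rows = [
    {'Flavor': 'strawberry', 'Color': 'pink', 'Price': 3.55},
    {'Flavor': 'chocolate', 'Color': 'light brown', 'Price': 4.75},
    {'Flavor': 'chocolate', 'Color': 'dark brown', 'Price': 5.25},
]

def select(rows, *cols):
    """Return new rows that keep only the named columns."""
    return [{c: row[c] for c in cols} for row in rows]

def where(rows, col, value):
    """Return a new list with only the rows whose col equals value."""
    return [row for row in rows if row[col] == value]

def sort_by(rows, col, descending=False):
    """Return a new list with the rows ordered by col."""
    return sorted(rows, key=lambda row: row[col], reverse=descending)

# Calls can be chained because each helper returns a new list of rows
chocolate = where(cones_rows, 'Flavor', 'chocolate')
priciest_first = sort_by(select(cones_rows, 'Flavor', 'Price'), 'Price', descending=True)
print(len(chocolate))          # 2
print(priciest_first[0])       # {'Flavor': 'chocolate', 'Price': 5.25}
print(len(cones_rows[0]))      # 3: the original rows still have all three columns
```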

Let's look at data on ice cream cones that is described in the class textbook.

In [ ]:

# Load the ice cream data. Each row represents one ice cream cone.
cones = Table.read_table('cones.csv')
cones

In [ ]:

type(cones)

In [ ]:

# Show the first 2 rows of the data
cones.show(2)

In [ ]:

# select only the Flavor column
only_flavor = cones.select('Flavor')
only_flavor

In [ ]:

# the original cones Table is not modified
cones

In [ ]:

# select the Flavor and Price columns
cones.select('Flavor', 'Price')

In [74]:

# remove the Color column
no_color = cones.drop('Color')
no_color

Out[74]:

Flavor	Price
strawberry	3.55
chocolate	4.75
chocolate	5.25
strawberry	5.25
chocolate	5.25
bubblegum	4.75

In [75]:

# sort by price
cones.sort('Price')

Out[75]:

Flavor	Color	Price
strawberry	pink	3.55
chocolate	light brown	4.75
bubblegum	pink	4.75
chocolate	dark brown	5.25
strawberry	pink	5.25
chocolate	dark brown	5.25

In [76]:

# sort by price, highest to lowest
cones.sort('Price', descending=True)

Out[76]:

Flavor	Color	Price
chocolate	dark brown	5.25
strawberry	pink	5.25
chocolate	dark brown	5.25
bubblegum	pink	4.75
chocolate	light brown	4.75
strawberry	pink	3.55

In [77]:

# select only the chocolate cones
cones.where('Flavor', 'chocolate')

Out[77]:

Flavor	Color	Price
chocolate	light brown	4.75
chocolate	dark brown	5.25
chocolate	dark brown	5.25

In [78]:

# We can chain multiple method calls. Let's drop the color and then sort by price
cones.drop('Color').sort('Price', descending=True)

Out[78]:

Flavor	Price
chocolate	5.25
strawberry	5.25
chocolate	5.25
bubblegum	4.75
chocolate	4.75
strawberry	3.55

Columns of Tables are Arrays

We can extract columns from a Table as either:

A new Table with fewer columns using tb.select()
An ndarray using tb.column()
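
One reason to extract a column as an array is that array operations then apply to it directly. The sketch below assumes NumPy is installed and types in by hand the Price values that cones.column('Price') returns in this section, so it runs without cones.csv.

```python
import numpy as np

# The Price column of the cones table, entered by hand for this example
prices = np.array([3.55, 4.75, 5.25, 5.25, 5.25, 4.75])

print(len(prices))                 # 6 cones
print(prices.max())                # 5.25, the most expensive cone
print(round(prices.mean(), 2))     # 4.8, the average price
```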

In [79]:

cones.select('Price')  # still a table

Out[79]:

Price
3.55
4.75
5.25
5.25
5.25
4.75

In [80]:

type(cones.select('Price'))

Out[80]:

datascience.tables.Table

In [81]:

cones.column('Price') # an array

Out[81]:

array([3.55, 4.75, 5.25, 5.25, 5.25, 4.75])

In [82]:

type(cones.column('Price'))

Out[82]:

numpy.ndarray