# First let's import the key data science libraries
!pip install numpy pandas matplotlib seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
To begin, here is a very quick example of generating a dataset and plotting it. Do not worry if you do not understand the commands being used; you soon will. For now, this just shows how quickly we can do this in Python.
# Only the first two rows of each array are plotted, so generate just two rows
a = np.random.normal(1.0, 0.3, [2, 1000])
b = np.random.normal(1.8, 0.2, [2, 100])
plt.scatter(a[0], a[1])
plt.scatter(b[0], b[1])
number = 1
text = 'hello_everyone'
l = [1,2,3,4,5]
d = {'name':'bob', 'value':100}
Basic variables in Python include numerical values (such as integers and floats) and string values (text). Basic data structures include lists and dictionaries.
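As a quick check, the built-in `type()` function reports which of these types a value has. The sketch below reuses the four example variables from above and adds a float (`2.5`, chosen only for illustration).

```python
# The same four example values as above
number = 1
text = 'hello_everyone'
l = [1, 2, 3, 4, 5]
d = {'name': 'bob', 'value': 100}

# type() returns the class of each value; __name__ gives its short name
for value in (number, 2.5, text, l, d):
    print(type(value).__name__, '->', value)
```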
First, let's look at some basics of Python and get used to manipulating data using the built-in variable types.
# First some simple variables
an_integer = 12
a_floating_point_number = 18.4732
# We can do some simple maths on these variables and see the output
an_integer + a_floating_point_number
30.4732
# We can also write functions to do simple maths
def multiply(number1, number2):
    return number1 * number2
multiply(an_integer, a_floating_point_number)
221.67839999999998
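The trailing digits in `221.67839999999998` are not a bug in `multiply`: floats are stored in binary, so most decimal fractions are only approximated. A minimal sketch of two common ways to handle this, using the standard library's `math.isclose` and the built-in `round`:

```python
import math

def multiply(number1, number2):
    return number1 * number2

product = multiply(12, 18.4732)

# Compare floats with a tolerance rather than with ==
print(math.isclose(product, 221.6784))

# Round only when displaying the value; keep full precision for calculations
print(round(product, 4))
print(f'{product:.2f}')
```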
# We can also create text variables just as easily
a_string = 'Hello there!'
print (a_string)
Hello there!
# We can do some simple manipulation of text
my_name = 'Phil'
message = a_string + ' My name is ' + my_name
print (message)
# Including splitting sentences to corrupt a message
imposter_name = 'Dave'
s_m = message.split(" ")
new_message = ' '.join(s_m[:-1]) + ' ' + imposter_name
print (new_message)
Hello there! My name is Phil
Hello there! My name is Dave
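The same result can be reached a little more directly. This sketch, offered as an alternative rather than a correction, uses an f-string to build the message and `str.rsplit` with a split limit of 1 to separate only the final word:

```python
a_string = 'Hello there!'
my_name = 'Phil'
imposter_name = 'Dave'

# An f-string substitutes variables directly into the text
message = f'{a_string} My name is {my_name}'

# rsplit(' ', 1) splits once, starting from the right-hand end
start, last_word = message.rsplit(' ', 1)
new_message = f'{start} {imposter_name}'

print(message)
print(new_message)
```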
# We also have lists of data that can store variables
fruits = ['apple','banana','orange','lemon']
print (fruits)
# We can access subsets of the list using slices
print (fruits[0:2])
print (fruits[:-1])
# We can append items to the list, and remove items from the list
fruits.append('mango')
print (fruits)
fruits.remove('banana')
print (fruits)
['apple', 'banana', 'orange', 'lemon']
['apple', 'banana']
['apple', 'banana', 'orange']
['apple', 'banana', 'orange', 'lemon', 'mango']
['apple', 'orange', 'lemon', 'mango']
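A few more list operations are worth knowing at this stage. The sketch below starts from the original four fruits; `'kiwi'` is a made-up item used only to show how to guard a removal.

```python
fruits = ['apple', 'banana', 'orange', 'lemon']

# Negative indexes count from the end of the list
print(fruits[-1])

# A slice [start:stop] includes start but excludes stop
print(fruits[1:3])

# remove() raises ValueError if the item is missing, so check first
if 'kiwi' in fruits:
    fruits.remove('kiwi')

# sorted() returns a new list and leaves the original unchanged
print(sorted(fruits, reverse=True))
print(fruits)
```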
# We can also create dictionary objects
# This is helpful for storing related variables about an object
person = {}
person['name'] = 'bob'
person['age'] = 23
person['height'] = 185
person['email'] = 'bob@bobmail.com'
print (person)
{'name': 'bob', 'age': 23, 'height': 185, 'email': 'bob@bobmail.com'}
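Reading values back out of a dictionary works by key. A short sketch using the same `person` record; the `'phone'` key is hypothetical and deliberately absent, to show how `get()` handles a missing key:

```python
person = {'name': 'bob', 'age': 23, 'height': 185, 'email': 'bob@bobmail.com'}

# Look up a value by its key
print(person['age'])

# get() returns a default instead of raising KeyError for a missing key
print(person.get('phone', 'unknown'))

# Loop over key/value pairs
for key, value in person.items():
    print(f'{key}: {value}')
```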
# Like earlier, we could use a function to create 'person' objects
people = []
def create_person(name, age, height, email):
    global people
    new_person = {'name':name,
                  'age':age,
                  'height':height,
                  'email':email}
    people.append(new_person)
create_person('bob', 23, 177, 'bob@bobmail.com')
create_person('john', 41, 185, 'john@johnmail.com')
create_person('sophie', 31, 157, 'sophie@sophiemail.com')
create_person('wendy', 19, 174, 'wendy@wendymail.com')
# Here we store our person objects in our people list
# to make a group of 'persons' - a.k.a. people!
print (people)
[{'name': 'bob', 'age': 23, 'height': 177, 'email': 'bob@bobmail.com'}, {'name': 'john', 'age': 41, 'height': 185, 'email': 'john@johnmail.com'}, {'name': 'sophie', 'age': 31, 'height': 157, 'email': 'sophie@sophiemail.com'}, {'name': 'wendy', 'age': 19, 'height': 174, 'email': 'wendy@wendymail.com'}]
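A variation on the function above, offered as a sketch rather than a replacement: returning the new dictionary instead of appending to a global list keeps the function self-contained. `make_person` is a hypothetical name; the records are the same four people.

```python
def make_person(name, age, height, email):
    # Build and return one record; the caller decides where to store it
    return {'name': name, 'age': age, 'height': height, 'email': email}

people = [
    make_person('bob', 23, 177, 'bob@bobmail.com'),
    make_person('john', 41, 185, 'john@johnmail.com'),
    make_person('sophie', 31, 157, 'sophie@sophiemail.com'),
    make_person('wendy', 19, 174, 'wendy@wendymail.com'),
]

# A list comprehension pulls one field out of every record
names = [p['name'] for p in people]
print(names)

# max() with a key function finds the record with the largest height
tallest = max(people, key=lambda p: p['height'])
print(tallest['name'])
```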
We have covered a lot very quickly here. You have now used Python's main built-in types for storing numerical and text data, along with data structures such as lists (ordered, array-like sequences) and dictionaries (key-value mappings, similar to objects in some other languages). Let's now explore this further by introducing some of the data science libraries.
# We can import libraries using the following
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Our list of people dictionaries is difficult for us to read clearly
# Pandas DataFrames help manipulate tabular data like this very easily
data = pd.DataFrame(people)
data
| | name | age | height | email |
|---|---|---|---|---|
| 0 | bob | 23 | 177 | bob@bobmail.com |
| 1 | john | 41 | 185 | john@johnmail.com |
| 2 | sophie | 31 | 157 | sophie@sophiemail.com |
| 3 | wendy | 19 | 174 | wendy@wendymail.com |
# We can access individual columns of the data now
data['age']
0    23
1    41
2    31
3    19
Name: age, dtype: int64
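Column access extends naturally to several columns and to simple summaries. A minimal sketch, rebuilding a smaller version of the table (without the email column) so that it runs on its own:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['bob', 'john', 'sophie', 'wendy'],
    'age': [23, 41, 31, 19],
    'height': [177, 185, 157, 174],
})

# A list of column names selects several columns at once
subset = data[['name', 'age']]
print(subset)

# Series methods summarise a single column
print(data['age'].mean())
print(data['age'].max())
```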
# Who is the tallest of the users? Let's find out
data[data['height'] == np.max(data['height'])]
| | name | age | height | email |
|---|---|---|---|---|
| 1 | john | 41 | 185 | john@johnmail.com |
# Who is the shortest of the users? Let's find out
data[data['height'] == np.min(data['height'])]
| | name | age | height | email |
|---|---|---|---|---|
| 2 | sophie | 31 | 157 | sophie@sophiemail.com |
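The boolean-mask filters above work well; pandas also offers a few shortcuts for the same questions. The sketch below rebuilds a reduced table so it is self-contained, and uses `Series.max`, `idxmax`, `idxmin` and `sort_values`:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['bob', 'john', 'sophie', 'wendy'],
    'age': [23, 41, 31, 19],
    'height': [177, 185, 157, 174],
})

# Series.max() can stand in for np.max(...) in the filter
tallest = data[data['height'] == data['height'].max()]
print(tallest)

# idxmax() / idxmin() return the index label of the extreme value
tallest_name = data.loc[data['height'].idxmax(), 'name']
shortest_name = data.loc[data['height'].idxmin(), 'name']
print(tallest_name, shortest_name)

# sort_values() orders the whole table by a column
by_height = data.sort_values('height', ascending=False)
print(by_height)
```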
# What if we want to plot this data quickly?
data.plot()
plt.show()
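Called with no arguments, `data.plot()` draws each numeric column as a line against the row index, which is not always the most readable view. One possible alternative, sketched here with axis labels and a title that are my own choices, is a bar chart with the names on the x axis:

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'name': ['bob', 'john', 'sophie', 'wendy'],
    'age': [23, 41, 31, 19],
    'height': [177, 185, 157, 174],
})

# Use the name column for the x axis and draw one bar per numeric column
ax = data.plot(x='name', y=['age', 'height'], kind='bar')
ax.set_xlabel('Name')
ax.set_ylabel('Value')
ax.set_title('Age and height per person')
plt.tight_layout()
plt.show()
```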
Spend some time exploring Pandas, Matplotlib, and NumPy: these are core to manipulating numerical and tabular data and visualizing the results. There are many great getting-started examples online, such as the free e-book 'A Whirlwind Tour of Python': http://www.oreilly.com/programming/free/a-whirlwind-tour-of-python.csp