We will learn the basic lingo around data and specifically talk about two types of data: quantitative and qualitative.
We then go over how to succinctly describe both types of data with summary statistics.
While basic, this lesson will give us a common vocabulary that we will use for the rest of the class.
We begin the journey into data science with a single assumption: we have data. While this may sound like a given, it is not always the case. Sometimes you will begin without any data and you will be asked to rectify that problem in any number of ways: collecting it, purchasing it, scraping it, etc.
So while the first stage of data science might actually be collecting data, this is not where we will start. First we will get a feel for data: try to understand what we can do with it and what makes a 'good dataset'. Then hopefully that first problem will work itself out.
So as you will come to expect, let's start off by listing the assumptions we have made (and as you will further come to expect we will do it in graph form):
```python
import networkx as nx
from nxpd import draw
from nxpd import nxpdParams

# render graphs inline in the notebook
nxpdParams['show'] = 'ipynb'

G = nx.DiGraph()
G.add_node('Inputs (x_1, x_2, ..., x_n)')
draw(G)
```
Just one node/bubble (we will get more, don't worry). And as you will see in a moment, there is something we can do by relying only on this assumption, but not much.
I will call this node the input node; it's basically the dataset. Notice that our dataset contains multiple points (x_1, x_2, ...).
We start off by assuming we have a series of data points. We will focus on describing these data points today.
So that we are not just in the realm of the theoretical, let's "get" some actual data.
```python
import pandas as pd

data = {
    'quantitative1': [1, 4, 6, 10, 15, 16],
    'quantitative2': [4, 6, 12, 13, 16, 19],
    'qualitative': ['red', 'blue', 'red', 'blue', 'yellow', 'red']
}

# and now let's put that data in a DataFrame (python's Excel spreadsheet)
pd.DataFrame(data)
```
|   | qualitative | quantitative1 | quantitative2 |
|---|---|---|---|
| 0 | red | 1 | 4 |
| 1 | blue | 4 | 6 |
| 2 | red | 6 | 12 |
| 3 | blue | 10 | 13 |
| 4 | yellow | 15 | 16 |
| 5 | red | 16 | 19 |
There we have it. Notice that the data has multiple columns (quantitative and qualitative) and multiple rows (6 rows, to be exact). The columns sometimes go by other names: dimensions, features, covariates, m, etc., and the rows can go by other names too: samples, data points, n, etc.
I will generally call the columns features and the rows data points.
And notice that they come in two flavors: quantitative (numbers) and qualitative (categories).
You will also notice that the line between the two flavors is very fine and can be crossed pretty easily. But perhaps the best question you can ask about a particular feature to determine whether it is quantitative or qualitative is: can I average the feature?
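One quick (though imperfect) way to see which flavor pandas has assigned to each column is to inspect the dtypes. This is only a heuristic (a numeric column could still encode categories, like zip codes), but it is a useful first check:

```python
import pandas as pd

df = pd.DataFrame({
    'quantitative1': [1, 4, 6, 10, 15, 16],
    'quantitative2': [4, 6, 12, 13, 16, 19],
    'qualitative': ['red', 'blue', 'red', 'blue', 'yellow', 'red']
})

# numeric dtypes (int64/float64) are quantitative candidates;
# object/string dtypes are usually qualitative
print(df.dtypes)
print(df.select_dtypes('number').columns.tolist())
```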
In this respect we can see how things that you classically thought were qualitative can become quantitative. For example, this might seem plausible:

    ave(depressed, content) = melancholy

Then maybe instead of using qualitative descriptions, you should be using a happiness score/index. But then you run into puzzling questions like:

    ave(confused, giddy) = ?

Perhaps you should just stick with a qualitative feature at that point.
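To make the happiness-score idea concrete, here is a sketch that maps moods onto an invented 0-10 scale (the scores are hypothetical, purely for illustration) and averages them:

```python
# hypothetical happiness scores on a 0-10 scale (invented for illustration)
happiness = {'depressed': 1, 'melancholy': 3, 'content': 5, 'giddy': 9}

def ave_mood(a, b):
    """Average two moods numerically, then return the closest named mood."""
    target = (happiness[a] + happiness[b]) / 2
    return min(happiness, key=lambda mood: abs(happiness[mood] - target))

print(ave_mood('depressed', 'content'))  # -> melancholy
```

Note that 'confused' has no obvious spot on a one-dimensional happiness axis, which is exactly why ave(confused, giddy) has no good answer.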
But now we are well on our way to describing data.
But there is so much more that we can do!
The classic way that we can describe features (other than what we alluded to above) is with summary statistics: descriptions of the data using a single number. An example would be the average of a particular column.
A summary statistic is innately quantitative (it is a number). Because qualitative features are not themselves numbers, there are only a few summary statistics that make sense for them.
One of them is the count of each term. We can do this using the python library pandas below:
```python
df = pd.DataFrame(data)
df.groupby('qualitative').qualitative.count()
```

```
qualitative
blue      2
red       3
yellow    1
Name: qualitative, dtype: int64
```
In such a way we can get a sense of what types of data we have and how many of each. We can easily get the mode (or the most common type) from this analysis as well.
As a final note, some qualitative data can be ordered, and in this case you will be able to find the median (the center) of that data as well.
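pandas also has built-in helpers that give the same counts, and the mode, directly; a minimal sketch:

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'blue', 'yellow', 'red'])

print(colors.value_counts())  # counts per category, most common first
print(colors.mode()[0])       # the most common value: 'red'
```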
Now we are getting into the good stuff. We are looking to describe quantitative features with quantitative statistics. Remember that we are doing the impossible. We are taking what could be millions of data points and summarizing them with a single number, so these had better be pretty important numbers! And it turns out they are. They fall into four categories:
These are the most important summary statistics and if you know them about your data, you can generally infer a ton about it. But don't be fooled! You might know less than you think! Just knowing the mean of a data set does not tell you too much about it. The prime example is a bimodal distribution (google it).
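To see why a lone mean can mislead, here is a tiny sketch of two datasets with identical means but very different shapes (the second is bimodal: two clumps at the extremes):

```python
unimodal = [9, 10, 10, 10, 11]  # clustered around 10
bimodal = [0, 0, 10, 20, 20]    # two clumps at the extremes

def mean(xs):
    return sum(xs) / len(xs)

print(mean(unimodal))  # 10.0
print(mean(bimodal))   # 10.0 -- same mean, very different data
```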
A question to ask yourself is: why do we care about any of these measures?
Below is how we get the center and spread statistics:

```python
print(df.quantitative1.mean())

# standard deviation
print(df.quantitative1.std())

print(df.quantitative1.median())
```

```
8.66666666667
6.05530070819
8.0
```
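As a sanity check, the standard library's statistics module reproduces these numbers (pandas' std is the sample standard deviation, which statistics.stdev also computes):

```python
import statistics

q1 = [1, 4, 6, 10, 15, 16]

print(statistics.mean(q1))    # 8.666...
print(statistics.stdev(q1))   # ~6.0553 (sample standard deviation)
print(statistics.median(q1))  # 8.0
```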
```python
df.describe()
```

|  | quantitative1 | quantitative2 |
|---|---|---|
| count | 6.000000 | 6.000000 |
| mean | 8.666667 | 11.666667 |
| std | 6.055301 | 5.750362 |
| min | 1.000000 | 4.000000 |
| 25% | 4.500000 | 7.500000 |
| 50% | 8.000000 | 12.500000 |
| 75% | 13.750000 | 15.250000 |
| max | 16.000000 | 19.000000 |
Let me briefly explain what these terms mean: count is the number of data points, mean is the average, std is the standard deviation (a measure of spread), min and max are the smallest and largest values, and 25%/50%/75% are the quartiles (the value below which that fraction of the data falls; the 50% quartile is the median).
During our next lesson (when we make visualizations) we will get a better feel for these quantities and what they look like.
For reference, here is the original data again. Does it make sense that the mean of quantitative1 is lower than the mean of quantitative2?
```python
df
```

|   | qualitative | quantitative1 | quantitative2 |
|---|---|---|---|
| 0 | red | 1 | 4 |
| 1 | blue | 4 | 6 |
| 2 | red | 6 | 12 |
| 3 | blue | 10 | 13 |
| 4 | yellow | 15 | 16 |
| 5 | red | 16 | 19 |
And below is how you would get the dependence summary statistic that we will be using:

```python
df.corr()
```

|  | quantitative1 | quantitative2 |
|---|---|---|
| quantitative1 | 1.000000 | 0.961129 |
| quantitative2 | 0.961129 | 1.000000 |
The correlation tells us how related two features are. It lies between -1 and 1. If it is close to 1, the two quantities are strongly related (when one is above its average, the other tends to be too); if it is near 0 they seem unrelated; and if it is close to -1 they are inversely related (when one is high, the other tends to be low). We will explore this more later in the course.
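To demystify that number, here is a from-scratch Pearson correlation (the same quantity df.corr() computes by default) applied to our two columns:

```python
import math

q1 = [1, 4, 6, 10, 15, 16]
q2 = [4, 6, 12, 13, 16, 19]

def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson(q1, q2))  # ~0.9611, matching df.corr()
```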
So now you know some stuff.
Well who cares?
Well once you get more of a feel for what these summary statistics mean (what data that exhibit these statistics looks like) it turns out that you can communicate a lot by just using them. So instead of walking through pages of data during a presentation, you can just go through a couple numbers.
Then a couple of lectures down the line we will show that we can use them to infer larger truths!
Because remember, if you only make the above assumption ("we have data"), you can't say too much.
For example, if you collect a dataset of people's heights and you find the mean is 6 feet, you might be tempted to make assumptions about the height of people in general, BUT with the above assumption you can only say: the mean of my data is 6 feet. Or: the mean height of the people I sampled is 6 feet. YOU CANNOT SAY ANYTHING IN GENERAL!!
But don't worry, we will get there in time.