Dplyr is one of the most popular r-packages and also part of tidyverse that's been developed by Hadley Wickham. The mere fact that dplyr package is very famous means, it's one of the most frequently used. Being a data scientist is not always about creating sophisticated models but Data Analysis (Manipulation) and Data Visualization play a very important role in BAU of many us - in fact, a very important part before any modeling exercise since Feature Engineering and EDA are the most important differentiating factors of your model and someone else's.
Hence, this notebook aims to bring out some well known and not-so-well-known applications of dplyr so that any data analyst could leverage its potential
# This R environment comes with all of CRAN preinstalled, as well as many other helpful packages
# The environment is defined by the kaggle/rstats docker image: https://github.com/kaggle/docker-rstats
# For example, here's several helpful packages to load in
library(dplyr) # Loading Dplyr package
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
system("ls ../input")
# Any results you write to the current directory are saved as output.
Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union
Let us start by reading the input training file using the base r function read.csv
train <- read.csv('../input/train.csv',stringsAsFactors = F, header = T)
train #just to see if it's been loaded
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | S | |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | S | |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | Q | |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.0750 | S | |
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | S | |
10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | C | |
11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58 | 0 | 0 | 113783 | 26.5500 | C103 | S |
13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20 | 0 | 0 | A/5. 2151 | 8.0500 | S | |
14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39 | 1 | 5 | 347082 | 31.2750 | S | |
15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14 | 0 | 0 | 350406 | 7.8542 | S | |
16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55 | 0 | 0 | 248706 | 16.0000 | S | |
17 | 0 | 3 | Rice, Master. Eugene | male | 2 | 4 | 1 | 382652 | 29.1250 | Q | |
18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NA | 0 | 0 | 244373 | 13.0000 | S | |
19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) | female | 31 | 1 | 0 | 345763 | 18.0000 | S | |
20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NA | 0 | 0 | 2649 | 7.2250 | C | |
21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35 | 0 | 0 | 239865 | 26.0000 | S | |
22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34 | 0 | 0 | 248698 | 13.0000 | D56 | S |
23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15 | 0 | 0 | 330923 | 8.0292 | Q | |
24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28 | 0 | 0 | 113788 | 35.5000 | A6 | S |
25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8 | 3 | 1 | 349909 | 21.0750 | S | |
26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) | female | 38 | 1 | 5 | 347077 | 31.3875 | S | |
27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NA | 0 | 0 | 2631 | 7.2250 | C | |
28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NA | 0 | 0 | 330959 | 7.8792 | Q | |
30 | 0 | 3 | Todoroff, Mr. Lalio | male | NA | 0 | 0 | 349216 | 7.8958 | S | |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
862 | 0 | 2 | Giles, Mr. Frederick Edward | male | 21 | 1 | 0 | 28134 | 11.5000 | S | |
863 | 1 | 1 | Swift, Mrs. Frederick Joel (Margaret Welles Barron) | female | 48 | 0 | 0 | 17466 | 25.9292 | D17 | S |
864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NA | 8 | 2 | CA. 2343 | 69.5500 | S | |
865 | 0 | 2 | Gill, Mr. John William | male | 24 | 0 | 0 | 233866 | 13.0000 | S | |
866 | 1 | 2 | Bystrom, Mrs. (Karolina) | female | 42 | 0 | 0 | 236852 | 13.0000 | S | |
867 | 1 | 2 | Duran y More, Miss. Asuncion | female | 27 | 1 | 0 | SC/PARIS 2149 | 13.8583 | C | |
868 | 0 | 1 | Roebling, Mr. Washington Augustus II | male | 31 | 0 | 0 | PC 17590 | 50.4958 | A24 | S |
869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NA | 0 | 0 | 345777 | 9.5000 | S | |
870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4 | 1 | 1 | 347742 | 11.1333 | S | |
871 | 0 | 3 | Balkic, Mr. Cerin | male | 26 | 0 | 0 | 349248 | 7.8958 | S | |
872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47 | 1 | 1 | 11751 | 52.5542 | D35 | S |
873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47 | 0 | 0 | 345765 | 9.0000 | S | |
875 | 1 | 2 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | 28 | 1 | 0 | P/PP 3381 | 24.0000 | C | |
876 | 1 | 3 | Najib, Miss. Adele Kiamie "Jane" | female | 15 | 0 | 0 | 2667 | 7.2250 | C | |
877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20 | 0 | 0 | 7534 | 9.8458 | S | |
878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19 | 0 | 0 | 349212 | 7.8958 | S | |
879 | 0 | 3 | Laleff, Mr. Kristo | male | NA | 0 | 0 | 349217 | 7.8958 | S | |
880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56 | 0 | 1 | 11767 | 83.1583 | C50 | C |
881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25 | 0 | 1 | 230433 | 26.0000 | S | |
882 | 0 | 3 | Markun, Mr. Johann | male | 33 | 0 | 0 | 349257 | 7.8958 | S | |
883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22 | 0 | 0 | 7552 | 10.5167 | S | |
884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | S | |
885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | S | |
886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39 | 0 | 5 | 382652 | 29.1250 | Q | |
887 | 0 | 2 | Montvila, Rev. Juozas | male | 27 | 0 | 0 | 211536 | 13.0000 | S | |
888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19 | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NA | 1 | 2 | W./C. 6607 | 23.4500 | S | |
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26 | 0 | 0 | 111369 | 30.0000 | C148 | C |
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32 | 0 | 0 | 370376 | 7.7500 | Q |
Getting the total number of rows in the given dataframe (even though it's been very straight forward with nrow() in base-r, this being a dplyr starter-kit, we'll start with that.
train %>% count()
n |
---|
891 |
The above code just gives the row count of the dataframe that's been passed with the pipe %>% operator. The pipe operator works very similar to the | (pipe) operator in Unix environment where the ouput of the current operation is fed as the input of the following operation. Similary in dplyr or any other package that supports pipe operator, the functions in it will always take only dataframe as the first arugment hence the function can be called in two ways like below:
count(train) #Without pipe, passing the df as the first argument
train %>% count() #with pipe, more convient and more readability
n |
---|
891 |
n |
---|
891 |
But dplyr's real flavor starts with the following 5 functions (or as most people call, verbs of dplyr):
And let us see what every one of these does!
select() as the name suggests selects the columns that are required from a given dataframe and if multiple columns are required or not required, then one_of()
could be used within select.
select(train,Age) #without pipe
Age |
---|
22 |
38 |
26 |
35 |
35 |
NA |
54 |
2 |
27 |
14 |
4 |
58 |
20 |
39 |
14 |
55 |
2 |
NA |
31 |
NA |
35 |
34 |
15 |
28 |
8 |
38 |
NA |
19 |
NA |
NA |
⋮ |
21 |
48 |
NA |
24 |
42 |
27 |
31 |
NA |
4 |
26 |
47 |
33 |
47 |
28 |
15 |
20 |
19 |
NA |
56 |
25 |
33 |
22 |
28 |
25 |
39 |
27 |
19 |
NA |
26 |
32 |
#multicolumn selection
train %>% select(one_of('Sex','Age'))
Sex | Age |
---|---|
male | 22 |
female | 38 |
female | 26 |
female | 35 |
male | 35 |
male | NA |
male | 54 |
male | 2 |
female | 27 |
female | 14 |
female | 4 |
female | 58 |
male | 20 |
male | 39 |
female | 14 |
female | 55 |
male | 2 |
male | NA |
female | 31 |
female | NA |
male | 35 |
male | 34 |
female | 15 |
male | 28 |
female | 8 |
female | 38 |
male | NA |
male | 19 |
female | NA |
male | NA |
⋮ | ⋮ |
male | 21 |
female | 48 |
female | NA |
male | 24 |
female | 42 |
female | 27 |
male | 31 |
male | NA |
male | 4 |
male | 26 |
female | 47 |
male | 33 |
male | 47 |
female | 28 |
female | 15 |
male | 20 |
male | 19 |
male | NA |
female | 56 |
female | 25 |
male | 33 |
female | 22 |
male | 28 |
male | 25 |
female | 39 |
male | 27 |
female | 19 |
female | NA |
male | 26 |
male | 32 |
#multicolumn rejection
train %>% select(-one_of('Age','Sex'))
PassengerId | Survived | Pclass | Name | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 0 | A/5 21171 | 7.2500 | S | |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | 0 | 0 | 373450 | 8.0500 | S | |
6 | 0 | 3 | Moran, Mr. James | 0 | 0 | 330877 | 8.4583 | Q | |
7 | 0 | 1 | McCarthy, Mr. Timothy J | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, Master. Gosta Leonard | 3 | 1 | 349909 | 21.0750 | S | |
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 0 | 2 | 347742 | 11.1333 | S | |
10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | 1 | 0 | 237736 | 30.0708 | C | |
11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
12 | 1 | 1 | Bonnell, Miss. Elizabeth | 0 | 0 | 113783 | 26.5500 | C103 | S |
13 | 0 | 3 | Saundercock, Mr. William Henry | 0 | 0 | A/5. 2151 | 8.0500 | S | |
14 | 0 | 3 | Andersson, Mr. Anders Johan | 1 | 5 | 347082 | 31.2750 | S | |
15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | 0 | 0 | 350406 | 7.8542 | S | |
16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | 0 | 0 | 248706 | 16.0000 | S | |
17 | 0 | 3 | Rice, Master. Eugene | 4 | 1 | 382652 | 29.1250 | Q | |
18 | 1 | 2 | Williams, Mr. Charles Eugene | 0 | 0 | 244373 | 13.0000 | S | |
19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) | 1 | 0 | 345763 | 18.0000 | S | |
20 | 1 | 3 | Masselmani, Mrs. Fatima | 0 | 0 | 2649 | 7.2250 | C | |
21 | 0 | 2 | Fynney, Mr. Joseph J | 0 | 0 | 239865 | 26.0000 | S | |
22 | 1 | 2 | Beesley, Mr. Lawrence | 0 | 0 | 248698 | 13.0000 | D56 | S |
23 | 1 | 3 | McGowan, Miss. Anna "Annie" | 0 | 0 | 330923 | 8.0292 | Q | |
24 | 1 | 1 | Sloper, Mr. William Thompson | 0 | 0 | 113788 | 35.5000 | A6 | S |
25 | 0 | 3 | Palsson, Miss. Torborg Danira | 3 | 1 | 349909 | 21.0750 | S | |
26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) | 1 | 5 | 347077 | 31.3875 | S | |
27 | 0 | 3 | Emir, Mr. Farred Chehab | 0 | 0 | 2631 | 7.2250 | C | |
28 | 0 | 1 | Fortune, Mr. Charles Alexander | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | 0 | 0 | 330959 | 7.8792 | Q | |
30 | 0 | 3 | Todoroff, Mr. Lalio | 0 | 0 | 349216 | 7.8958 | S | |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
862 | 0 | 2 | Giles, Mr. Frederick Edward | 1 | 0 | 28134 | 11.5000 | S | |
863 | 1 | 1 | Swift, Mrs. Frederick Joel (Margaret Welles Barron) | 0 | 0 | 17466 | 25.9292 | D17 | S |
864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | 8 | 2 | CA. 2343 | 69.5500 | S | |
865 | 0 | 2 | Gill, Mr. John William | 0 | 0 | 233866 | 13.0000 | S | |
866 | 1 | 2 | Bystrom, Mrs. (Karolina) | 0 | 0 | 236852 | 13.0000 | S | |
867 | 1 | 2 | Duran y More, Miss. Asuncion | 1 | 0 | SC/PARIS 2149 | 13.8583 | C | |
868 | 0 | 1 | Roebling, Mr. Washington Augustus II | 0 | 0 | PC 17590 | 50.4958 | A24 | S |
869 | 0 | 3 | van Melkebeke, Mr. Philemon | 0 | 0 | 345777 | 9.5000 | S | |
870 | 1 | 3 | Johnson, Master. Harold Theodor | 1 | 1 | 347742 | 11.1333 | S | |
871 | 0 | 3 | Balkic, Mr. Cerin | 0 | 0 | 349248 | 7.8958 | S | |
872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | 1 | 1 | 11751 | 52.5542 | D35 | S |
873 | 0 | 1 | Carlsson, Mr. Frans Olof | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
874 | 0 | 3 | Vander Cruyssen, Mr. Victor | 0 | 0 | 345765 | 9.0000 | S | |
875 | 1 | 2 | Abelson, Mrs. Samuel (Hannah Wizosky) | 1 | 0 | P/PP 3381 | 24.0000 | C | |
876 | 1 | 3 | Najib, Miss. Adele Kiamie "Jane" | 0 | 0 | 2667 | 7.2250 | C | |
877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | 0 | 0 | 7534 | 9.8458 | S | |
878 | 0 | 3 | Petroff, Mr. Nedelio | 0 | 0 | 349212 | 7.8958 | S | |
879 | 0 | 3 | Laleff, Mr. Kristo | 0 | 0 | 349217 | 7.8958 | S | |
880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | 0 | 1 | 11767 | 83.1583 | C50 | C |
881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | 0 | 1 | 230433 | 26.0000 | S | |
882 | 0 | 3 | Markun, Mr. Johann | 0 | 0 | 349257 | 7.8958 | S | |
883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | 0 | 0 | 7552 | 10.5167 | S | |
884 | 0 | 2 | Banfield, Mr. Frederick James | 0 | 0 | C.A./SOTON 34068 | 10.5000 | S | |
885 | 0 | 3 | Sutehall, Mr. Henry Jr | 0 | 0 | SOTON/OQ 392076 | 7.0500 | S | |
886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | 0 | 5 | 382652 | 29.1250 | Q | |
887 | 0 | 2 | Montvila, Rev. Juozas | 0 | 0 | 211536 | 13.0000 | S | |
888 | 1 | 1 | Graham, Miss. Margaret Edith | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | 1 | 2 | W./C. 6607 | 23.4500 | S | |
890 | 1 | 1 | Behr, Mr. Karl Howell | 0 | 0 | 111369 | 30.0000 | C148 | C |
891 | 0 | 3 | Dooley, Mr. Patrick | 0 | 0 | 370376 | 7.7500 | Q |
Like selecting a column with entire column name (or multiple column names with one_of()), select could also be used with a few more string ops.
train %>% select(starts_with('P'))
PassengerId | Pclass | Parch |
---|---|---|
1 | 3 | 0 |
2 | 1 | 0 |
3 | 3 | 0 |
4 | 1 | 0 |
5 | 3 | 0 |
6 | 3 | 0 |
7 | 1 | 0 |
8 | 3 | 1 |
9 | 3 | 2 |
10 | 2 | 0 |
11 | 3 | 1 |
12 | 1 | 0 |
13 | 3 | 0 |
14 | 3 | 5 |
15 | 3 | 0 |
16 | 2 | 0 |
17 | 3 | 1 |
18 | 2 | 0 |
19 | 3 | 0 |
20 | 3 | 0 |
21 | 2 | 0 |
22 | 2 | 0 |
23 | 3 | 0 |
24 | 1 | 0 |
25 | 3 | 1 |
26 | 3 | 5 |
27 | 3 | 0 |
28 | 1 | 2 |
29 | 3 | 0 |
30 | 3 | 0 |
⋮ | ⋮ | ⋮ |
862 | 2 | 0 |
863 | 1 | 0 |
864 | 3 | 2 |
865 | 2 | 0 |
866 | 2 | 0 |
867 | 2 | 0 |
868 | 1 | 0 |
869 | 3 | 0 |
870 | 3 | 1 |
871 | 3 | 0 |
872 | 1 | 1 |
873 | 1 | 0 |
874 | 3 | 0 |
875 | 2 | 0 |
876 | 3 | 0 |
877 | 3 | 0 |
878 | 3 | 0 |
879 | 3 | 0 |
880 | 1 | 1 |
881 | 2 | 1 |
882 | 3 | 0 |
883 | 3 | 0 |
884 | 2 | 0 |
885 | 3 | 0 |
886 | 3 | 5 |
887 | 2 | 0 |
888 | 1 | 0 |
889 | 3 | 2 |
890 | 1 | 0 |
891 | 3 | 0 |
train %>% select(ends_with('e'))
Name | Age | Fare |
---|---|---|
Braund, Mr. Owen Harris | 22 | 7.2500 |
Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 38 | 71.2833 |
Heikkinen, Miss. Laina | 26 | 7.9250 |
Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35 | 53.1000 |
Allen, Mr. William Henry | 35 | 8.0500 |
Moran, Mr. James | NA | 8.4583 |
McCarthy, Mr. Timothy J | 54 | 51.8625 |
Palsson, Master. Gosta Leonard | 2 | 21.0750 |
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 27 | 11.1333 |
Nasser, Mrs. Nicholas (Adele Achem) | 14 | 30.0708 |
Sandstrom, Miss. Marguerite Rut | 4 | 16.7000 |
Bonnell, Miss. Elizabeth | 58 | 26.5500 |
Saundercock, Mr. William Henry | 20 | 8.0500 |
Andersson, Mr. Anders Johan | 39 | 31.2750 |
Vestrom, Miss. Hulda Amanda Adolfina | 14 | 7.8542 |
Hewlett, Mrs. (Mary D Kingcome) | 55 | 16.0000 |
Rice, Master. Eugene | 2 | 29.1250 |
Williams, Mr. Charles Eugene | NA | 13.0000 |
Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) | 31 | 18.0000 |
Masselmani, Mrs. Fatima | NA | 7.2250 |
Fynney, Mr. Joseph J | 35 | 26.0000 |
Beesley, Mr. Lawrence | 34 | 13.0000 |
McGowan, Miss. Anna "Annie" | 15 | 8.0292 |
Sloper, Mr. William Thompson | 28 | 35.5000 |
Palsson, Miss. Torborg Danira | 8 | 21.0750 |
Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) | 38 | 31.3875 |
Emir, Mr. Farred Chehab | NA | 7.2250 |
Fortune, Mr. Charles Alexander | 19 | 263.0000 |
O'Dwyer, Miss. Ellen "Nellie" | NA | 7.8792 |
Todoroff, Mr. Lalio | NA | 7.8958 |
⋮ | ⋮ | ⋮ |
Giles, Mr. Frederick Edward | 21 | 11.5000 |
Swift, Mrs. Frederick Joel (Margaret Welles Barron) | 48 | 25.9292 |
Sage, Miss. Dorothy Edith "Dolly" | NA | 69.5500 |
Gill, Mr. John William | 24 | 13.0000 |
Bystrom, Mrs. (Karolina) | 42 | 13.0000 |
Duran y More, Miss. Asuncion | 27 | 13.8583 |
Roebling, Mr. Washington Augustus II | 31 | 50.4958 |
van Melkebeke, Mr. Philemon | NA | 9.5000 |
Johnson, Master. Harold Theodor | 4 | 11.1333 |
Balkic, Mr. Cerin | 26 | 7.8958 |
Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | 47 | 52.5542 |
Carlsson, Mr. Frans Olof | 33 | 5.0000 |
Vander Cruyssen, Mr. Victor | 47 | 9.0000 |
Abelson, Mrs. Samuel (Hannah Wizosky) | 28 | 24.0000 |
Najib, Miss. Adele Kiamie "Jane" | 15 | 7.2250 |
Gustafsson, Mr. Alfred Ossian | 20 | 9.8458 |
Petroff, Mr. Nedelio | 19 | 7.8958 |
Laleff, Mr. Kristo | NA | 7.8958 |
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | 56 | 83.1583 |
Shelley, Mrs. William (Imanita Parrish Hall) | 25 | 26.0000 |
Markun, Mr. Johann | 33 | 7.8958 |
Dahlberg, Miss. Gerda Ulrika | 22 | 10.5167 |
Banfield, Mr. Frederick James | 28 | 10.5000 |
Sutehall, Mr. Henry Jr | 25 | 7.0500 |
Rice, Mrs. William (Margaret Norton) | 39 | 29.1250 |
Montvila, Rev. Juozas | 27 | 13.0000 |
Graham, Miss. Margaret Edith | 19 | 30.0000 |
Johnston, Miss. Catherine Helen "Carrie" | NA | 23.4500 |
Behr, Mr. Karl Howell | 26 | 30.0000 |
Dooley, Mr. Patrick | 32 | 7.7500 |
Group_by is a lot similar to SQL Group by but more versatile. It is related to concept of “split-apply-combine”. Let us understand group_by with a starter example of finding out number of male and number of female - which logically could be the count of each Sex Type (once grouped by Sex).
train %>% group_by(Sex) %>% count()
Sex | n |
---|---|
female | 314 |
male | 577 |
Aha! That seems simple and now let us do a two level grouping to understand how many of survived of each gender.
train %>% group_by(Survived, Sex) %>% count()
train %>% group_by(Sex, Survived) %>% count()
Survived | Sex | n |
---|---|---|
0 | female | 81 |
0 | male | 468 |
1 | female | 233 |
1 | male | 109 |
Sex | Survived | n |
---|---|---|
female | 0 | 81 |
female | 1 | 233 |
male | 0 | 468 |
male | 1 | 109 |
That's minimally group_by, but the true power of group_by is unveiled only when it is coupled with mutate and summarise functions.
Mutate function adds a new column based on the given expression while summarise function summarises the dataset based on the given function and let us see the difference in action with the following example.
Let us get the average age of all survivors (and non-survivors): so this must be group_by -ed based on Survived while summarised by Age so that we will get a summarised mean value.for two groups.
train %>% group_by(Survived) %>% summarise(mean(Age))
#Remember we have got NAs, so mean() wouldn't work and to bypass NAs, na.rm = T must be passed.
train %>% group_by(Survived) %>% summarise(average_age = mean(Age,na.rm=T))
Survived | mean(Age) |
---|---|
0 | NA |
1 | NA |
Survived | average_age |
---|---|
0 | 30.62618 |
1 | 28.34369 |
That's summarise() giving us the summary of the dataframe. If we need to create a new column, values filled for all 891 datapoints, that's where mutate plays its role. Let us create a new column, Age_Bracket
containing value Minor
if Age is less than 18 else Major
train %>% mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>% select(starts_with('Age'))
#In fact this can be coupled with Survivor list to see the impact of this Age_bracket
train %>%
mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>%
group_by(Survived,Age_Bracket) %>%
summarise(pnt = (n()/nrow(train))*100)
Age | Age_Bracket |
---|---|
22 | Major |
38 | Major |
26 | Major |
35 | Major |
35 | Major |
NA | NA |
54 | Major |
2 | Minor |
27 | Major |
14 | Minor |
4 | Minor |
58 | Major |
20 | Major |
39 | Major |
14 | Minor |
55 | Major |
2 | Minor |
NA | NA |
31 | Major |
NA | NA |
35 | Major |
34 | Major |
15 | Minor |
28 | Major |
8 | Minor |
38 | Major |
NA | NA |
19 | Major |
NA | NA |
NA | NA |
⋮ | ⋮ |
21 | Major |
48 | Major |
NA | NA |
24 | Major |
42 | Major |
27 | Major |
31 | Major |
NA | NA |
4 | Minor |
26 | Major |
47 | Major |
33 | Major |
47 | Major |
28 | Major |
15 | Minor |
20 | Major |
19 | Major |
NA | NA |
56 | Major |
25 | Major |
33 | Major |
22 | Major |
28 | Major |
25 | Major |
39 | Major |
27 | Major |
19 | Major |
NA | NA |
26 | Major |
32 | Major |
Survived | Age_Bracket | pnt |
---|---|---|
0 | Major | 41.750842 |
0 | Minor | 5.836139 |
0 | NA | 14.029181 |
1 | Major | 25.701459 |
1 | Minor | 6.846240 |
1 | NA | 5.836139 |
That's how dplyr can get more powerful with group_by coupled with mutate or summarise for feautre engineering and for better data visualization. But this doesn't stop here, because one of the most important function a dataanalyst would require is sorting and that's what arrange()
does.
train %>% arrange(Fare) %>% tail(22) #Extracting last 22 results after sorting the fare in asending order
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
870 | 319 | 1 | 1 | Wick, Miss. Mary Natalie | female | 31 | 0 | 2 | 36928 | 164.8667 | C7 | S |
871 | 857 | 1 | 1 | Wick, Mrs. George Dennick (Mary Hitchcock) | female | 45 | 1 | 1 | 36928 | 164.8667 | S | |
872 | 690 | 1 | 1 | Madill, Miss. Georgette Alexandra | female | 15 | 0 | 1 | 24160 | 211.3375 | B5 | S |
873 | 731 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29 | 0 | 0 | 24160 | 211.3375 | B5 | S |
874 | 780 | 1 | 1 | Robert, Mrs. Edward Scott (Elisabeth Walton McMillan) | female | 43 | 0 | 1 | 24160 | 211.3375 | B3 | S |
875 | 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27 | 0 | 2 | 113503 | 211.5000 | C82 | C |
876 | 528 | 0 | 1 | Farthing, Mr. John | male | NA | 0 | 0 | PC 17483 | 221.7792 | C95 | S |
877 | 381 | 1 | 1 | Bidois, Miss. Rosalie | female | 42 | 0 | 0 | PC 17757 | 227.5250 | C | |
878 | 558 | 0 | 1 | Robbins, Mr. Victor | male | NA | 0 | 0 | PC 17757 | 227.5250 | C | |
879 | 701 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C |
880 | 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38 | 0 | 0 | PC 17757 | 227.5250 | C45 | C |
881 | 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
882 | 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
883 | 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
884 | 743 | 1 | 1 | Ryerson, Miss. Susan Parker "Suzette" | female | 21 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
885 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
886 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
887 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
888 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
889 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35 | 0 | 0 | PC 17755 | 512.3292 | C | |
890 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
891 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
## Arrange in descending order
train %>% arrange(desc(Age)) %>% head(10)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
631 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0 | 0 | 0 | 27042 | 30.0000 | A23 | S |
852 | 0 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.7750 | S | |
97 | 0 | 1 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
494 | 0 | 1 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | C | |
117 | 0 | 3 | Connors, Mr. Patrick | male | 70.5 | 0 | 0 | 370369 | 7.7500 | Q | |
673 | 0 | 2 | Mitchell, Mr. Henry Michael | male | 70.0 | 0 | 0 | C.A. 24580 | 10.5000 | S | |
746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0000 | B22 | S |
34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5000 | S | |
55 | 0 | 1 | Ostby, Mr. Engelhart Cornelius | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C |
281 | 0 | 3 | Duane, Mr. Frank | male | 65.0 | 0 | 0 | 336439 | 7.7500 | Q |
Filter does row_wise filter ( similar to what select did with columns). filter() takes a logical expression and evaluates them and results the only_true datapoints. So to be clear, all that matters to filter() function is if the expression evaluates to TRUE.
Let us start with filtering (extracting) only male and getting their Embarked station count.
train %>%
filter(Sex == 'male') %>%
group_by(Embarked) %>%
count()
Embarked | n |
---|---|
C | 95 |
Q | 41 |
S | 441 |
#Getting the count of everyone whose age is lesser than 18
train %>% filter(Age < 18) %>% count()
n |
---|
113 |
Coupling Filter with Regex to perform simple string manipulation and detection.
train %>% filter(grepl('wick',train$Name))
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
344 | 0 | 2 | Sedgwick, Mr. Charles Frederick Waddington | male | 25 | 0 | 0 | 244361 | 13 | S |
*And this is dplyr in a nut shell and hope you get a decent start with this notebook if you are a beginner. Please share your thoughts in comments and suggestions!