In [1]:
%useLatestDescriptors
%use dataframe, lets-plot

import java.util.*
In [2]:
var dfRaw = DataFrame.readCSV(
    fileOrUrl = "../../idea-examples/titanic/src/main/resources/titanic.csv",
    delimiter = ';',
    parserOptions = ParserOptions(locale = Locale.FRENCH)
)

dfRaw.head()
Out[2]:

DataFrame [5 x 14]

The dataset uses an alternative format for decimal numbers, which is why the French locale is passed to the parser in this example.
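As a minimal illustration of why the locale matters, the sketch below parses a decimal written with a comma using the standard `java.text.NumberFormat`. The helper name `parseDecimal` and the sample value are hypothetical, and this is not necessarily how the DataFrame parser is implemented internally.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper: parses a decimal string using the conventions of the given locale.
fun parseDecimal(text: String, locale: Locale): Double =
    NumberFormat.getInstance(locale).parse(text).toDouble()

// In the French locale the comma is the decimal separator, so "7,25" is read as 7.25.
val sampleFare = parseDecimal("7,25", Locale.FRENCH)
```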

But before converting the data, we should handle null values.

In [3]:
dfRaw.describe()
Out[3]:

DataFrame [14 x 12]

In [4]:
// The first column name starts with a byte order mark (\uFEFF); rename the column to strip it
val df = dfRaw.rename("\uFEFFpclass").into("pclass")

Imputing null values

Let's make the columns we need non-nullable: numeric columns are imputed with their mean values, and sex and embarked are filled with a fixed value.

In [5]:
val df1 = df
    // imputing
    .fillNulls { sibsp and parch and age and fare }.perCol { mean() }
    .fillNulls { sex }.withValue("female")
    .fillNulls { embarked }.withValue("S")
    .convert { sibsp and parch and age and fare }.toDouble()

df1.head()
Out[5]:

DataFrame [5 x 14]
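The mean imputation applied to the numeric columns can be sketched in plain Kotlin, without the DataFrame API. The helper `fillNullsWithMean` and the sample values are hypothetical and only illustrate the idea.

```kotlin
// Hypothetical helper: replaces nulls with the mean of the non-null values.
fun fillNullsWithMean(values: List<Double?>): List<Double> {
    val present = values.filterNotNull()
    require(present.isNotEmpty()) { "cannot compute a mean without any non-null value" }
    val mean = present.average()
    return values.map { it ?: mean }
}

// Made-up sample: the mean of 22.0, 38.0 and 30.0 is 30.0, so both nulls become 30.0.
val sampleAges = listOf(22.0, null, 38.0, null, 30.0)
val imputedAges = fillNullsWithMean(sampleAges)
```

Note that mean imputation keeps the column mean unchanged but reduces its variance, which can affect correlations computed afterwards.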

In [6]:
df1.schema()
Out[6]:
pclass: Int
survived: Int
name: String
sex: String
age: Double
sibsp: Double
parch: Double
ticket: String
fare: Double
cabin: String?
embarked: String
boat: String?
body: Int?
homedest: String?
In [7]:
df1.corr()
Out[7]:

DataFrame [6 x 7]

In [8]:
val correlations = df1.corr { all() }.with { survived }
    .sortBy { survived }
correlations
Out[8]:

DataFrame [6 x 2]

Great, at this point we have 5 numerical features available for numerical analysis: pclass, age, sibsp, parch, fare.
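For reference, here is a minimal sketch of a correlation coefficient between two numeric columns, assuming the Pearson definition. The function `pearson` is a hypothetical helper written for illustration, not the library's implementation.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper: Pearson correlation coefficient of two equally sized samples.
fun pearson(xs: List<Double>, ys: List<Double>): Double {
    require(xs.size == ys.size && xs.size > 1) { "need two samples of equal size > 1" }
    val meanX = xs.average()
    val meanY = ys.average()
    var covariance = 0.0
    var varianceX = 0.0
    var varianceY = 0.0
    for (i in xs.indices) {
        val dx = xs[i] - meanX
        val dy = ys[i] - meanY
        covariance += dx * dy
        varianceX += dx * dx
        varianceY += dy * dy
    }
    // The result is NaN when either sample is constant (zero variance).
    return covariance / sqrt(varianceX * varianceY)
}
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates no linear relationship.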

Analyze by pivoting features

To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other. At this stage we can only do so for features that have no empty values. It also makes sense to do so only for features that are categorical (Sex), ordinal (Pclass), or discrete (SibSp, Parch).

  • Pclass: We observe a significant correlation (>0.5) between Pclass=1 and Survived.

  • Sex: We confirm the observation made during problem definition that Sex=female had a very high survival rate of 74%.

  • SibSp and Parch: These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features.
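Because survived is encoded as 0 or 1, the mean of survived within a group is the survival rate of that group. The sketch below shows the same group-and-average idea in plain Kotlin; the `Passenger` class, the helper name, and the sample rows are hypothetical.

```kotlin
// Hypothetical record with only the two fields needed for the illustration.
data class Passenger(val pclass: Int, val survived: Int)

// Groups passengers by class and averages the 0/1 survived flag within each group.
fun survivalRateByClass(passengers: List<Passenger>): Map<Int, Double> =
    passengers
        .groupBy { it.pclass }
        .mapValues { (_, group) -> group.map { it.survived }.average() }
        .toSortedMap()

// Made-up sample rows, not taken from the dataset.
val sampleRates = survivalRateByClass(
    listOf(Passenger(1, 1), Passenger(1, 0), Passenger(3, 0), Passenger(3, 0))
)
```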

In [9]:
df1.groupBy { pclass }.mean { survived }.sortBy { pclass }
Out[9]:

DataFrame [3 x 2]

In [10]:
df1.groupBy { sex }.mean { survived }.sortBy { survived }
Out[10]:

DataFrame [2 x 2]

In [11]:
df1.groupBy { sibsp }.mean { survived }.sortBy { sibsp }
Out[11]:

DataFrame [7 x 2]

In [12]:
df1.groupBy { parch }.mean { survived }.sortBy { parch }
Out[12]:

DataFrame [8 x 2]

Analyze the importance of the Age feature

It is interesting to compare the age distributions of passengers who survived and passengers who did not.

In [13]:
val byAge = df1.valueCounts { age }.sortBy { age }
byAge
Out[13]:

... showing only top 20 of 99 rows

DataFrame [99 x 2]

In [11]:
// JetBrains color palette
val colors = mapOf("light_orange" to "#ffb59e", "orange" to "#ff6632", "light_grey" to "#a6a6a6", "dark_grey" to "#4c4c4c")
In [14]:
letsPlot(byAge.toMap()) { x = "age"; y = "count" } + 
    geomPoint(size = 5, color = colors["dark_grey"]) +
    ggsize(850, 500)
Out[14]:
In [16]:
val age = df.select { age }.dropNulls().sortBy { age }

letsPlot(age.toMap()) { x = "age" } + geomHistogram(binWidth=5, fill = colors["orange"]) + ggsize(850, 500)
Out[16]:
In [17]:
df1.groupBy { age }.pivotCounts { survived }.sortBy { age }
Out[17]:

... showing only top 20 of 99 rows

DataFrame [99 x 2]

In [16]:
val survivedByAge = df1.select { survived and age }.sortBy { age }
survivedByAge
Out[16]:

... showing only top 20 of 1309 rows

DataFrame [1309 x 2]

In [17]:
val plot = letsPlot(survivedByAge.convert { survived }.with { if (it == 1) "Survived" else "Died" }.toMap())

plot +
    geomHistogram(binWidth = 5, alpha = 0.7, position = Pos.dodge) { x = "age"; fill = "survived" } +
    scaleFillManual(listOf(colors["dark_grey"]!!, colors["orange"]!!)) +
    ggsize(850, 500)
Out[17]:
In [18]:
// Density plot
plot +
    geomDensity { x="age"; color="survived" } +
    scaleColorManual(listOf(colors["dark_grey"]!!, colors["orange"]!!)) +
    ggsize(850, 250)
Out[18]:
In [19]:
// A basic box plot
plot +
    geomBoxplot { x="survived"; y="age"; fill = "survived" } +
    scaleFillManual(listOf(colors["dark_grey"]!!, colors["orange"]!!)) +
    ggsize(500, 400)
Out[19]:

It seems that the age distribution is roughly the same for passengers who survived and those who did not.

Categorical features with One Hot Encoding

To prepare data for ML algorithms, we should replace all String values in categorical features with numbers. There are several ways to preprocess categorical features, and One Hot Encoding is one of them. We will use the pivotMatches operation to convert each categorical column into a set of nested Boolean columns, one per unique value.

In [8]:
val pivoted = df1.pivotMatches { pclass and sex and embarked }
pivoted.head()
Out[8]:

DataFrame [5 x 14]
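The idea behind One Hot Encoding can be sketched in plain Kotlin as follows. The helper `oneHot` is hypothetical and only mirrors the concept: one Boolean column per distinct value, true where the row holds that value.

```kotlin
// Hypothetical helper: maps each distinct category to a Boolean column
// that is true exactly in the rows holding that category.
fun oneHot(values: List<String>): Map<String, List<Boolean>> =
    values.distinct().associateWith { category -> values.map { it == category } }

// Made-up sample of embarkation ports.
val encodedPorts = oneHot(listOf("S", "C", "S"))
```

Each row has exactly one true value across the generated columns, so the columns are linearly dependent; some models require dropping one of them.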

In [20]:
val df2 = pivoted
    // feature extraction
    .select { survived and pclass and sibsp and parch and age and fare and sex and embarked }
    .convert { allDfs() }.toDouble()

df2.head()
Out[20]:

DataFrame [5 x 8]

In [22]:
val titanicData = df2.flatten().toMap()

gggrid(
    listOf(
        CorrPlot(titanicData, "Tiles").tiles()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build(),
        CorrPlot(titanicData, "Points").points()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build(), 
        CorrPlot(titanicData, "Tiles and labels").tiles().labels()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build(),
        CorrPlot(titanicData, "Tiles, points and labels").points().labels().tiles()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build()
    ), 1, 700, 600)
Out[22]:

Creation of new features

We suggest combining the sibsp and parch features into a single new feature named familyNumber, computed as the simple sum of sibsp and parch.

In [25]:
val familyDF = df1.add("familyNumber") { sibsp + parch }

familyDF.head()
Out[25]:

DataFrame [5 x 15]

In [26]:
familyDF.corr { familyNumber }.with { survived }
Out[26]:

DataFrame [1 x 2]

In [27]:
familyDF.corr { familyNumber }.with { age }
Out[27]:

DataFrame [1 x 2]

It looks like the new feature has no influence on the survived column, but it has a strong negative correlation with age.

Titles

Let's try to extract something from the names. Many strings in the name column contain titles, such as Don, Mr, Mrs, and so on.

In [28]:
val titledDF = df.select { survived and name }.add ("title") { name.split(".")[0].split(",")[1].trim() }
titledDF.head(100)
Out[28]:

... showing only top 20 of 100 rows

DataFrame [100 x 3]
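The string expression used to extract the title can be checked in isolation. The sketch below wraps it in a hypothetical helper `extractTitle`; it assumes names follow the "Surname, Title. Given names" pattern, and the sample names are given for illustration only.

```kotlin
// Hypothetical helper: takes the text before the first '.', then the part after the comma.
// Throws IndexOutOfBoundsException if the name contains no comma before the first '.'.
fun extractTitle(name: String): String =
    name.split(".")[0].split(",")[1].trim()

val simpleTitle = extractTitle("Braund, Mr. Owen Harris")
val compoundTitle = extractTitle("Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")
```

Multi-word titles such as "the Countess" are kept whole because only the first period and the comma act as separators.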

In [29]:
titledDF.valueCounts { title }
Out[29]:

DataFrame [18 x 2]

The new title column contains some rare titles and some variant forms of common ones (for example Mlle, Ms, and Mme). Let's clean the data and merge the rare titles into one category.

In [30]:
val rareTitles = listOf("Dona", "Lady", "the Countess", "Capt", "Col", "Don", 
                "Dr", "Major", "Rev", "Sir", "Jonkheer")

val cleanedTitledDF = titledDF.update { title }.with { 
                            when {
                                it == "Mlle" -> "Miss"
                                it == "Ms" -> "Miss"
                                it == "Mme" -> "Mrs"
                                it in rareTitles -> "Rare Title"
                                else -> it
                            }
                        }
In [31]:
cleanedTitledDF.valueCounts { title }
Out[31]:

DataFrame [5 x 2]

Now we have only 5 distinct titles and can see how each of them correlates with survival.

In [32]:
val correlations = cleanedTitledDF
                    .pivotMatches { title }
                    .corr { title }.with { survived }
correlations
Out[32]:

DataFrame [5 x 2]

In [33]:
correlations.update { title }.with { it.substringAfter('_') }.filter { title != "survived" }
Out[33]:

DataFrame [5 x 2]

Women with the titles Miss and Mrs have similar chances of survival, but the same is not true for men: if your title is Mr, your odds on the Titanic are bad.

The Rare Title category is indeed rare and does not play a big role.

In [34]:
val groupedCleanedTitledDF = cleanedTitledDF.valueCounts { title and survived }.sortBy { title and survived }
groupedCleanedTitledDF
Out[34]:

DataFrame [10 x 3]

Surname analysis

It would be interesting to dig deeper into families and home destinations. We can start this analysis with surnames, which are easy to extract from the name feature.

In [35]:
val surnameDF = df1.select { survived and name }.add ("surname") { name.split(".")[0].split(",")[0].trim() }
surnameDF.head()
Out[35]:

DataFrame [5 x 3]

In [36]:
surnameDF.valueCounts { surname }
Out[36]:

... showing only top 20 of 875 rows

DataFrame [875 x 2]
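A plain Kotlin sketch of the same surname extraction and counting is shown below. The helpers `extractSurname` and `surnameCounts` are hypothetical, and the sample names are given for illustration only.

```kotlin
// Hypothetical helper: same expression as in the notebook, keeping the part before the comma.
fun extractSurname(name: String): String =
    name.split(".")[0].split(",")[0].trim()

// Counts how many passengers share each surname.
fun surnameCounts(names: List<String>): Map<String, Int> =
    names.groupingBy { extractSurname(it) }.eachCount()

val sampleSurnameCounts = surnameCounts(
    listOf(
        "Allison, Miss. Helen Loraine",
        "Allison, Master. Hudson Trevor",
        "Braund, Mr. Owen Harris"
    )
)
```

A shared surname does not guarantee that passengers belong to the same family, so such counts are only a rough proxy for family groups.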

In [37]:
surnameDF.surname.countDistinct()
Out[37]:
875
In [39]:
val firstSymbol by column<String>()

df1
    .add(firstSymbol) { name.split(".")[0].split(",")[0].trim().first().toString() }
    .pivotMatches(firstSymbol)
    .corr { firstSymbol }.with { survived }
Out[39]:

... showing only top 20 of 27 rows

DataFrame [27 x 2]