import $ivy.`org.apache.spark::spark-sql:3.1.1`
import $ivy.`org.typelevel::cats-core:2.3.0`
import $ivy.`com.lihaoyi::sourcecode:0.2.6`
import $ivy.`org.hablapps::doric:0.0.1`
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.{functions => f}
import doric._
val spark = org.apache.spark.sql.SparkSession.builder().appName("test").master("local").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
Spark column expressions are subject to the same problems that we explored previously: lack of static typing, errors that go unreported at DataFrame construction time, unsolicited implicit casts, etc. For instance, given the following DataFrames:
val leftdf = List((1, 1, "hi"), (2, 2, "bye"), (3, 3, "3")).toDF("id-left", "id", "value-left")
val rightdf = List((1, 1, "hi"), (2, 2, "bye"), (3, 3, "3")).toDF("id-right", "id", "value-right")
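Note that Spark infers id-left, id and id-right as integer columns, and value-left and value-right as string columns; we can double-check this by printing the schemas:
leftdf.printSchema
rightdf.printSchema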
Now, the following equi-join expressions, which use different mechanisms to refer to the corresponding columns, compile and run without problems:
leftdf.join(rightdf, f.col("id-left") === f.col("id-right"))
leftdf.join(rightdf, leftdf("id") === rightdf("id"))
leftdf.alias("left").join(rightdf.alias("right"), f.col("left.id") === f.col("right.id"))
However, if some column name or type is wrong, errors will be reported too late, or implicit type casts will be silently applied. For instance, the following join compiles and runs in Spark, but produces garbage results:
val dfjoin = leftdf.join(rightdf, leftdf("id") === rightdf("value-right"))
dfjoin.show
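The culprit is an implicit cast: if we inspect the query plan, we can see that Spark silently inserted a cast into the join condition so that the Int and String columns could be compared at all:
dfjoin.explain(true)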
Using doric, errors can be detected at (Scala) compile-time:
// Scala will prevent this from compiling: an Int column cannot be compared with a String column
def dfjoin = leftdf.join(rightdf, LeftDF.colInt("id") === RightDF.colString("value-right"), "inner")
or at DataFrame-construction time:
val dfjoin = leftdf.join(rightdf, LeftDF.colInt("id-left") === RightDF.colInt("value-right"), "inner")
val dfjoin = leftdf.join(rightdf, LeftDF.colInt("id-left1") === RightDF.colInt("value-right"), "inner")
As you can see, doric join expressions also enjoy all the goodies concerning error location that we saw previously :D
If everything is well-typed, then the resulting DataFrame will be exactly the same as the one obtained using conventional column expressions:
leftdf.join(rightdf, LeftDF.colInt("id") === RightDF.colInt("id"), "inner")
As all these examples show, doric join column expressions refer to the left and right DataFrames through the special objects LeftDF and RightDF, respectively. We can also refer to arbitrarily complex column expressions within the context of the left and right DataFrames by enclosing the expression in parentheses. For instance:
val leftCol = LeftDF.colString("id2")
val rightCol = RightDF(/* complex column expression here*/ colInt("id").cast[String])
leftdf.join(rightdf, leftCol === rightCol, "inner").show
And, as this example also shows, we can decompose doric join expressions into separate values and functions so as to obtain more modular designs.
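For instance, the join condition itself can be packaged as a reusable definition (the name joinById is just illustrative):
def joinById = LeftDF.colInt("id") === RightDF.colInt("id")
leftdf.join(rightdf, joinById, "inner").show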