// loading dependencies for the SQL examples
// this needs to be called before importing dataframe itself
USE {
dependencies {
implementation("com.h2database:h2:2.3.232")
implementation("com.mysql:mysql-connector-j:9.1.0")
}
}
%useLatestDescriptors
// you can enable the new experimental modules (in notebooks) in the following way:
%use dataframe(v=0.15.0, enableExperimentalCsv=true, enableExperimentalGeo=true)
Enabling experimental CSV module: dataframe-csv
Enabling experimental Geo module: dataframe-geo
DataFrame's CSV parsing has been based on Apache Commons CSV from the beginning. While this has been sufficient for most applications, it had some issues: running out of memory on large files, limited performance, and an API lacking in clarity, documentation, and completeness. (Related issue #827)
For DataFrame 0.15, we introduce a new, separate package org.jetbrains.kotlinx:dataframe-csv which tries to solve all these issues at once. It's based on Deephaven CSV, which makes it faster and more memory-efficient. And since we built it from the ground up, we made sure the API is complete, predictable, and carefully documented.
To try it yourself, explicitly add the dependency org.jetbrains.kotlinx:dataframe-csv to your project. In notebooks, you can add enableExperimentalCsv=true to the %use-magic, as seen above.
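For a Gradle project, that boils down to one extra dependency; here's a minimal sketch in the Gradle Kotlin DSL (the 0.15.0 version string is an assumption, use whatever release you're on):
// build.gradle.kts (sketch)
dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:0.15.0") // core DataFrame
    implementation("org.jetbrains.kotlinx:dataframe-csv:0.15.0") // experimental CSV module
}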
Given a large CSV file, such as the one below, the chances of running out of memory are now lower (though still possible):
// Old csv function:
DataFrame.readCSV(
"../../../../dataframe-csv/src/test/resources/largeCsv.csv.gz",
)
java.lang.OutOfMemoryError: Ran out of memory reading this CSV-like file. You can try our new experimental CSV reader by adding the dependency "org.jetbrains.kotlinx:dataframe-csv:{VERSION}" and using `DataFrame.readCsv()` instead of `DataFrame.readCSV()`.
    at org.jetbrains.kotlinx.dataframe.io.CsvKt.readDelim(csv.kt:367)
    at org.jetbrains.kotlinx.dataframe.io.CsvKt.readDelim(csv.kt:285)
    at org.jetbrains.kotlinx.dataframe.io.CsvKt.readCSV(csv.kt:151)
    at Cell In[6], line 2
    ...
// New csv function:
DataFrame.readCsv(
"../../../../dataframe-csv/src/test/resources/largeCsv.csv.gz",
)
Year | Age | Ethnic | Sex | Area | count |
---|---|---|---|---|---|
2018 | 0 | 1 | 1 | 01 | 795 |
2018 | 0 | 1 | 1 | 02 | 5067 |
2018 | 0 | 1 | 1 | 03 | 2229 |
2018 | 0 | 1 | 1 | 04 | 1356 |
2018 | 0 | 1 | 1 | 05 | 180 |
2018 | 0 | 1 | 1 | 06 | 738 |
2018 | 0 | 1 | 1 | 07 | 630 |
2018 | 0 | 1 | 1 | 08 | 1188 |
2018 | 0 | 1 | 1 | 09 | 2157 |
2018 | 0 | 1 | 1 | 12 | 177 |
2018 | 0 | 1 | 1 | 13 | 2823 |
2018 | 0 | 1 | 1 | 14 | 1020 |
2018 | 0 | 1 | 1 | 15 | 516 |
2018 | 0 | 1 | 1 | 16 | 222 |
2018 | 0 | 1 | 1 | 17 | 219 |
2018 | 0 | 1 | 1 | 18 | 234 |
2018 | 0 | 1 | 1 | 99 | ..C |
2018 | 0 | 1 | 1 | 9999 | 19563 |
2018 | 0 | 1 | 1 | 77 | 19557 |
2018 | 0 | 1 | 2 | 01 | 768 |
40 million rows! Not bad, right?
The fact that we can now read this file is due to Deephaven CSV's ability to parse columns directly to their target type, like Int or Double, instead of reading and storing everything as a String first and parsing it afterwards. This saves both memory and running time. Deephaven wrote a blog post if you're curious about the specifics.
DataFrame still reads everything into (boxed) memory, so there are limits to the size of the file you can read, but the CSV reader is no longer the limiting factor. (Check the "Max heap size" setting if you're running this notebook in IntelliJ and you're still running out of memory for large files.)
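If you're not sure how much heap the notebook's JVM actually has, a quick check (not part of the original example) is:
// prints the maximum heap size available to this JVM, in MiB
println("Max heap: ${Runtime.getRuntime().maxMemory() / (1024 * 1024)} MiB")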
Switching to the new API is, in most cases, as easy as swapping readCSV with readCsv (and readTSV with readTsv, etc.). However, there are a few differences in the API, so be sure to check the KDocs of the new functions.
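For a typical migration that's the whole change; a minimal sketch (the file path is made up):
// old, Apache Commons CSV-based reader
val dfOld = DataFrame.readCSV("data.csv")
// new, Deephaven-based reader from dataframe-csv
val dfNew = DataFrame.readCsv("data.csv")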
Here's a small demonstration of the new API:
import java.util.Locale
DataFrame.readCsv(
"../../../../dataframe-csv/src/test/resources/irisDataset.csv",
delimiter = ',',
// overwriting the given header
header = listOf("sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"),
// skipping the first line in the file with old header
skipLines = 1,
// reading only 50 lines
readLines = 50,
// manually specifying the types of the columns, will be inferred otherwise
colTypes = mapOf(
"species" to ColType.String, // setting the type of the species column to String
ColType.DEFAULT to ColType.Double, // setting type of all other columns to Double
),
// manually specifying some parser options
// Will be read from the global parser options `DataFrame.parser` otherwise
parserOptions = ParserOptions(
// setting the locale to US, uses `DataFrame.parser.locale` or `Locale.getDefault()` otherwise
locale = Locale.US,
// overriding null strings
nullStrings = DEFAULT_DELIM_NULL_STRINGS + "nothing",
// using the new faster double parser, true by default for readCsv
useFastDoubleParser = true,
),
// new! specifying the quote character
quote = '\"',
// specifying whether to ignore empty lines in between rows in the file, and plenty more options...
ignoreEmptyLines = false,
allowMissingColumns = true,
ignoreSurroundingSpaces = true,
trimInsideQuoted = false,
parseParallel = true,
)
sepalLength | sepalWidth | petalLength | petalWidth | species |
---|---|---|---|---|
5.100000 | 3.500000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.000000 | 1.400000 | 0.200000 | Setosa |
4.700000 | 3.200000 | 1.300000 | 0.200000 | Setosa |
4.600000 | 3.100000 | 1.500000 | 0.200000 | Setosa |
5.000000 | 3.600000 | 1.400000 | 0.200000 | Setosa |
5.400000 | 3.900000 | 1.700000 | 0.400000 | Setosa |
4.600000 | 3.400000 | 1.400000 | 0.300000 | Setosa |
5.000000 | 3.400000 | 1.500000 | 0.200000 | Setosa |
4.400000 | 2.900000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.100000 | 1.500000 | 0.100000 | Setosa |
5.400000 | 3.700000 | 1.500000 | 0.200000 | Setosa |
4.800000 | 3.400000 | 1.600000 | 0.200000 | Setosa |
4.800000 | 3.000000 | 1.400000 | 0.100000 | Setosa |
4.300000 | 3.000000 | 1.100000 | 0.100000 | Setosa |
5.800000 | 4.000000 | 1.200000 | 0.200000 | Setosa |
5.700000 | 4.400000 | 1.500000 | 0.400000 | Setosa |
5.400000 | 3.900000 | 1.300000 | 0.400000 | Setosa |
5.100000 | 3.500000 | 1.400000 | 0.300000 | Setosa |
5.700000 | 3.800000 | 1.700000 | 0.300000 | Setosa |
5.100000 | 3.800000 | 1.500000 | 0.300000 | Setosa |
Since Deephaven CSV supports it, we can now also read multi-space-separated files, like logs (Relevant issue #746):
DataFrame.readDelimStr(
"""
NAME STATUS AGE LABELS
argo-events Active 2y77d app.kubernetes.io/instance=argo-events,kubernetes.io/metadata.name=argo-events
argo-workflows Active 2y77d app.kubernetes.io/instance=argo-workflows,kubernetes.io/metadata.name=argo-workflows
argocd Active 5y18d kubernetes.io/metadata.name=argocd
beta Active 4y235d kubernetes.io/metadata.name=beta
""".trimIndent(),
hasFixedWidthColumns = true,
)
NAME | STATUS | AGE | LABELS |
---|---|---|---|
argo-events | Active | 2y77d | app.kubernetes.io/instance=argo-event... |
argo-workflows | Active | 2y77d | app.kubernetes.io/instance=argo-workf... |
argocd | Active | 5y18d | kubernetes.io/metadata.name=argocd |
We provide a single overload (taking an InputStream) which exposes the underlying implementation for when ours is not sufficient for your needs (Relevant issue #787).
import io.deephaven.csv.containers.ByteSlice
import io.deephaven.csv.tokenization.Tokenizer
import java.io.File
import java.io.InputStream
DataFrame.readCsv(
inputStream = File("../../../../dataframe-csv/src/test/resources/irisDataset.csv").inputStream(),
adjustCsvSpecs = {
this
.headerLegalizer {
it.map { it.lowercase().replace('.', '_') }.toTypedArray()
}
.customDoubleParser(object : Tokenizer.CustomDoubleParser {
override fun parse(bs: ByteSlice?): Double = TODO("Not yet implemented")
override fun parse(cs: CharSequence?): Double = TODO("Not yet implemented")
})
// etc..
},
)
sepal_length | sepal_width | petal_length | petal_width | variety |
---|---|---|---|---|
5.100000 | 3.500000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.000000 | 1.400000 | 0.200000 | Setosa |
4.700000 | 3.200000 | 1.300000 | 0.200000 | Setosa |
4.600000 | 3.100000 | 1.500000 | 0.200000 | Setosa |
5.000000 | 3.600000 | 1.400000 | 0.200000 | Setosa |
5.400000 | 3.900000 | 1.700000 | 0.400000 | Setosa |
4.600000 | 3.400000 | 1.400000 | 0.300000 | Setosa |
5.000000 | 3.400000 | 1.500000 | 0.200000 | Setosa |
4.400000 | 2.900000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.100000 | 1.500000 | 0.100000 | Setosa |
5.400000 | 3.700000 | 1.500000 | 0.200000 | Setosa |
4.800000 | 3.400000 | 1.600000 | 0.200000 | Setosa |
4.800000 | 3.000000 | 1.400000 | 0.100000 | Setosa |
4.300000 | 3.000000 | 1.100000 | 0.100000 | Setosa |
5.800000 | 4.000000 | 1.200000 | 0.200000 | Setosa |
5.700000 | 4.400000 | 1.500000 | 0.400000 | Setosa |
5.400000 | 3.900000 | 1.300000 | 0.400000 | Setosa |
5.100000 | 3.500000 | 1.400000 | 0.300000 | Setosa |
5.700000 | 3.800000 | 1.700000 | 0.300000 | Setosa |
5.100000 | 3.800000 | 1.500000 | 0.300000 | Setosa |
Finally, we now support reading from ZIP files directly, along with GZIP (already demonstrated above) and custom compression formats (Relevant issue #469):
DataFrame.readCsv(
"../../../../dataframe-csv/src/test/resources/testCSV.zip",
// this can be manually specified, but is inferred automatically from the file extension
// compression = Compression.Zip,
)
untitled | user_id | name | duplicate | username | duplicate1 | duplicate11 | double | number | time | empty |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | George | null | abc | a | null | 1203.000000 | 599.213000 | 2021-01-07T15:12:32 | null |
1 | 5 | Paul | null | paul | null | null | null | 214.211000 | 2021-01-14T14:36:19 | null |
2 | 8 | Johnny | null | qwerty | b | null | 20.000000 | 412.214000 | 2021-02-23T19:47 | null |
3 | 10 | Jack | null | buk | null | null | 2414.000000 | 1.010000 | 2021-03-08T23:38:52 | null |
4 | 12 | Samuel | null | qwerty | null | null | Infinity | 0.000000 | 2021-04-01T02:30:22 | null |
USE { dependencies("org.tukaani:xz:1.10", "org.apache.commons:commons-compress:1.27.1") }
import org.apache.commons.compress.archivers.tar.TarFile
import org.apache.commons.io.IOUtils
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel
// custom compression format by specifying how to convert a compressed InputStream to a normal one
val tarCompression = Compression.Custom({ tarInputStream ->
val tar = TarFile(SeekableInMemoryByteChannel(IOUtils.toByteArray(tarInputStream)))
tar.getInputStream(tar.entries.first())
})
DataFrame.readCsv("irisDataset.tar", compression = tarCompression)
sepal.length | sepal.width | petal.length | petal.width | variety |
---|---|---|---|---|
5.100000 | 3.500000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.000000 | 1.400000 | 0.200000 | Setosa |
4.700000 | 3.200000 | 1.300000 | 0.200000 | Setosa |
4.600000 | 3.100000 | 1.500000 | 0.200000 | Setosa |
5.000000 | 3.600000 | 1.400000 | 0.200000 | Setosa |
5.400000 | 3.900000 | 1.700000 | 0.400000 | Setosa |
4.600000 | 3.400000 | 1.400000 | 0.300000 | Setosa |
5.000000 | 3.400000 | 1.500000 | 0.200000 | Setosa |
4.400000 | 2.900000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.100000 | 1.500000 | 0.100000 | Setosa |
5.400000 | 3.700000 | 1.500000 | 0.200000 | Setosa |
4.800000 | 3.400000 | 1.600000 | 0.200000 | Setosa |
4.800000 | 3.000000 | 1.400000 | 0.100000 | Setosa |
4.300000 | 3.000000 | 1.100000 | 0.100000 | Setosa |
5.800000 | 4.000000 | 1.200000 | 0.200000 | Setosa |
5.700000 | 4.400000 | 1.500000 | 0.400000 | Setosa |
5.400000 | 3.900000 | 1.300000 | 0.400000 | Setosa |
5.100000 | 3.500000 | 1.400000 | 0.300000 | Setosa |
5.700000 | 3.800000 | 1.700000 | 0.300000 | Setosa |
5.100000 | 3.800000 | 1.500000 | 0.300000 | Setosa |
Writing is also supported; it still uses Apache Commons CSV under the hood. The API is similar to the reading API:
val irisDf = DataFrame.readCsv("../../../../dataframe-csv/src/test/resources/irisDataset.csv")
irisDf.writeCsv("irisDataset.csv")
Some options can be specified:
irisDf.writeDelim(
path = "irisDataset.csv",
delimiter = ';',
includeHeader = false,
quoteMode = QuoteMode.ALL,
escapeChar = '\\',
commentChar = '#',
headerComments = listOf("This is a comment", "This is another comment"),
recordSeparator = "\n",
)
And similarly, we have a single overload which exposes the underlying implementation:
irisDf.writeCsv(
writer = File("irisDataset.csv").writer(),
adjustCsvFormat = {
this
.setSkipHeaderRecord(true)
.setHeader("sepalLength", "sepalWidth", "petalLength", "petalWidth", "species")
.setTrailingData(true)
.setNullString("null")
// etc..
},
)
Kandy v0.8 introduces geo-plotting, which allows you to visualize geospatial/geographical data using the awesome Kandy DSL. To make working with this geographical data (from GeoJson/Shapefile) easier, we happily accepted the GeoDataFrame PR from the Kandy team (Relevant issue #875).
To try it yourself, explicitly add the dependency org.jetbrains.kotlinx:dataframe-geo to your project (with the repository maven("https://repo.osgeo.org/repository/release")), or add enableExperimentalGeo=true to the %use-magic, as seen at the start of the notebook.
Then use GeoDataFrame.readGeoJson() or GeoDataFrame.readShapeFile() to get started!
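For a Gradle project, a minimal sketch of that setup in the Kotlin DSL (the 0.15.0 version string is an assumption; the extra repository is needed as mentioned above):
// build.gradle.kts (sketch)
repositories {
    mavenCentral()
    maven("https://repo.osgeo.org/repository/release")
}
dependencies {
    implementation("org.jetbrains.kotlinx:dataframe-geo:0.15.0")
}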
USE {
repositories("https://repo.osgeo.org/repository/release")
dependencies("org.jetbrains.kotlinx:kandy-geo:0.8.0-dev-57")
}
Here's a small demonstration of the new API, reading and plotting a GeoJson file:
val usaGeo = GeoDataFrame.readGeoJson("https://echarts.apache.org/examples/data/asset/geo/USA.json")
usaGeo.df
name | geometry |
---|---|
Alabama | POLYGON ((-87.359296 35.00118, -85.60... |
Alaska | MULTIPOLYGON (((-131.602021 55.117982... |
Arizona | POLYGON ((-109.042503 37.000263, -109... |
Arkansas | POLYGON ((-94.473842 36.501861, -90.1... |
California | POLYGON ((-123.233256 42.006186, -122... |
Colorado | POLYGON ((-107.919731 41.003906, -105... |
Connecticut | POLYGON ((-73.053528 42.039048, -71.7... |
Delaware | POLYGON ((-75.414089 39.804456, -75.5... |
District of Columbia | POLYGON ((-77.035264 38.993869, -76.9... |
Florida | POLYGON ((-85.497137 30.997536, -85.0... |
Georgia | POLYGON ((-83.109191 35.00118, -83.32... |
Hawaii | MULTIPOLYGON (((-155.634835 18.948267... |
Idaho | POLYGON ((-116.04751 49.000239, -116.... |
Illinois | POLYGON ((-90.639984 42.510065, -88.7... |
Indiana | POLYGON ((-85.990061 41.759724, -84.8... |
Iowa | POLYGON ((-91.368417 43.501391, -91.2... |
Kansas | POLYGON ((-101.90605 40.001626, -95.3... |
Kentucky | POLYGON ((-83.903347 38.769315, -83.6... |
Louisiana | POLYGON ((-93.608485 33.018527, -91.1... |
Maine | POLYGON ((-70.703921 43.057759, -70.8... |
usaGeo.plot { geoMap() }
Let's modify the GeoDataFrame a bit by adding some population data and plotting that too:
val usPopByState = DataFrame.readCsv("us_pop_by_state.csv")
usPopByState
rank | state | state_code | 2020_census | percent_of_total |
---|---|---|---|---|
1.000000 | California | CA | 39538223 | 0.119100 |
2.000000 | Texas | TX | 29145505 | 0.087400 |
3.000000 | Florida | FL | 21538187 | 0.064700 |
4.000000 | New York | NY | 20201249 | 0.058600 |
5.000000 | Pennsylvania | PA | 13002700 | 0.038600 |
6.000000 | Illinois | IL | 12801989 | 0.038200 |
7.000000 | Ohio | OH | 11799448 | 0.035200 |
8.000000 | Georgia | GA | 10711908 | 0.032000 |
9.000000 | North Carolina | NC | 10439388 | 0.031600 |
10.000000 | Michigan | MI | 10077331 | 0.030100 |
11.000000 | New Jersey | NJ | 9288994 | 0.026800 |
12.000000 | Virginia | VA | 8631393 | 0.025700 |
13.000000 | Washington | WA | 7705281 | 0.022900 |
14.000000 | Arizona | AZ | 7151502 | 0.021900 |
15.000000 | Massachusetts | MA | 7029917 | 0.020900 |
16.000000 | Tennessee | TN | 6910840 | 0.020600 |
17.000000 | Indiana | IN | 6785528 | 0.020300 |
18.000000 | Maryland | MD | 6177224 | 0.018500 |
19.000000 | Missouri | MO | 6154913 | 0.018200 |
20.000000 | Wisconsin | WI | 5893718 | 0.017500 |
val usaGeoPopulation = usaGeo.modify {
this.join(
usPopByState.select { state and `2020_census`.named("population") },
) { name match right.state }
}
usaGeoPopulation.df
name | geometry | population |
---|---|---|
Alabama | POLYGON ((-87.359296 35.00118, -85.60... | 5024279 |
Alaska | MULTIPOLYGON (((-131.602021 55.117982... | 733391 |
Arizona | POLYGON ((-109.042503 37.000263, -109... | 7151502 |
Arkansas | POLYGON ((-94.473842 36.501861, -90.1... | 3011524 |
California | POLYGON ((-123.233256 42.006186, -122... | 39538223 |
Colorado | POLYGON ((-107.919731 41.003906, -105... | 5773714 |
Connecticut | POLYGON ((-73.053528 42.039048, -71.7... | 3605944 |
Delaware | POLYGON ((-75.414089 39.804456, -75.5... | 989948 |
Florida | POLYGON ((-85.497137 30.997536, -85.0... | 21538187 |
Georgia | POLYGON ((-83.109191 35.00118, -83.32... | 10711908 |
Hawaii | MULTIPOLYGON (((-155.634835 18.948267... | 1455271 |
Idaho | POLYGON ((-116.04751 49.000239, -116.... | 1839106 |
Illinois | POLYGON ((-90.639984 42.510065, -88.7... | 12801989 |
Indiana | POLYGON ((-85.990061 41.759724, -84.8... | 6785528 |
Iowa | POLYGON ((-91.368417 43.501391, -91.2... | 3271616 |
Kansas | POLYGON ((-101.90605 40.001626, -95.3... | 2937880 |
Kentucky | POLYGON ((-83.903347 38.769315, -83.6... | 4505836 |
Louisiana | POLYGON ((-93.608485 33.018527, -91.1... | 4657757 |
Maine | POLYGON ((-70.703921 43.057759, -70.8... | 1362359 |
Maryland | MULTIPOLYGON (((-75.994645 37.95325, ... | 6177224 |
usaGeoPopulation.plot {
// crop out Alaska and Hawaii
x.axis.limits = -130..-65
y.axis.limits = 25..50
geoMap {
tooltips(name, population)
fillColor(population) {
scale = continuousColorViridis()
}
borderLine {
width = 0.1
color = Color.BLACK
}
alpha = 0.5
layout.style(Style.Void)
}
}
Our JDBC-based SQL integration for DataFrame has become extensible!
This means that if you have an SQL database that we currently don't support, you can create your own DbType instance and read from your database into a dataframe.
(Remember that we already support quite a few databases: MariaDB, PostgreSQL, MySQL, SQLite, MS SQL, and H2 (with dialects).)
To get started, we need a custom DbType. For the sake of example, we'll create a custom DbType based on the H2 database. Ordinarily, you'd extend DbType("jdbc name of your database").
import org.jetbrains.kotlinx.dataframe.io.db.*
import org.jetbrains.kotlinx.dataframe.schema.ColumnSchema
import java.sql.ResultSet
import kotlin.reflect.KType
object CustomDbType : H2(MySql) {
/**
* Represents the JDBC driver class name for a given database type.
* Something like "org.h2.Driver".
*/
override val driverClassName: String
get() = super.driverClassName
/**
* Here you define which KType you expect the column to be based on [tableColumnMetadata].
* This is mostly for special cases, as DataFrame can already infer most types
* from the database automatically.
*
* Return `null` to let DataFrame figure out the type.
*/
override fun convertSqlTypeToKType(tableColumnMetadata: TableColumnMetadata): KType? {
return super.convertSqlTypeToKType(tableColumnMetadata)
}
/**
* Similar to [convertSqlTypeToKType] but here you'll need to define a [ColumnSchema] for the column
* based on [tableColumnMetadata].
*
* Return `null` to let DataFrame figure out the schema.
*/
override fun convertSqlTypeToColumnSchemaValue(tableColumnMetadata: TableColumnMetadata): ColumnSchema? {
return super.convertSqlTypeToColumnSchemaValue(tableColumnMetadata)
}
/**
* Here you define where to get the table metadata for information about the database table,
* including its name, schema name, and catalogue name.
*/
override fun buildTableMetadata(tables: ResultSet): TableMetadata {
return super.buildTableMetadata(tables)
}
/**
* Return whether the table with metadata [tableMetadata] should be considered
* a system table or not.
*
* System tables are skipped when reading.
*/
override fun isSystemTable(tableMetadata: TableMetadata): Boolean {
return super.isSystemTable(tableMetadata)
}
/**
* Can be overridden to change how DataFrame limits queries for your specific database type.
*
* By default it executes: `"$sqlQuery LIMIT $limit"`
*/
override fun sqlQueryLimit(sqlQuery: String, limit: Int): String {
return super.sqlQueryLimit(sqlQuery, limit)
}
}
Now that we have a custom DbType, we can connect to our database, add some demo data, and retrieve it as a dataframe!
import org.intellij.lang.annotations.Language
import java.sql.DriverManager
val URL = "jdbc:h2:mem:test5;DB_CLOSE_DELAY=-1;MODE=MySQL;DATABASE_TO_UPPER=false"
val connection = DriverManager.getConnection(URL)
// insert some demo data
val statements = listOf(
"""
CREATE TABLE Customer (
id INT PRIMARY KEY,
name VARCHAR(50),
age INT
)
""".trimIndent(),
"""
CREATE TABLE Sale (
id INT PRIMARY KEY,
customerId INT,
amount DECIMAL(10, 2) NOT NULL
)
""".trimIndent(),
"INSERT INTO Customer (id, name, age) VALUES (1, 'John', 40)",
"INSERT INTO Customer (id, name, age) VALUES (2, 'Alice', 25)",
"INSERT INTO Customer (id, name, age) VALUES (3, 'Bob', 47)",
"INSERT INTO Customer (id, name, age) VALUES (4, NULL, NULL)",
"INSERT INTO Sale (id, customerId, amount) VALUES (1, 1, 100.50)",
"INSERT INTO Sale (id, customerId, amount) VALUES (2, 2, 50.00)",
"INSERT INTO Sale (id, customerId, amount) VALUES (3, 1, 75.25)",
"INSERT INTO Sale (id, customerId, amount) VALUES (4, 3, 35.15)",
)
statements.forEach { connection.createStatement().execute(it) }
// and read it :)
DataFrame.readSqlQuery(connection, "SELECT * FROM Customer")
id | name | age |
---|---|---|
1 | John | 40 |
2 | Alice | 25 |
3 | Bob | 47 |
4 | null | null |
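The H2 URL above is recognized automatically, but the extensible API also lets you hand your own DbType to the read functions explicitly. A hedged sketch (assuming the parameter is called dbType; check the KDocs of the read functions for the exact signature):
// pass the custom database type explicitly instead of relying on URL detection
DataFrame.readSqlQuery(connection, "SELECT * FROM Sale", dbType = CustomDbType)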
On the documentation website, you can find another example of supporting a custom database, this time using HSQLDB.
BigInteger support
Java has support for arbitrarily large decimal and integer values: BigDecimal and BigInteger. This is very helpful when working with huge numbers for which Double and Long are not big enough. Maybe Kotlin will even gain its own representation in the future!
DataFrame has supported BigDecimal for a while, but it lacked BigInteger support. DataFrame 0.15 fixes that.
Let's make a column with numbers so large that they can only be represented as String:
import java.math.BigInteger
import kotlin.math.abs
import kotlin.random.Random
import kotlin.random.nextLong
import kotlin.random.nextUInt
val largestLong = Long.MAX_VALUE.toString()
val giantNumberCol: DataColumn<String> by List(10) {
largestLong + abs(Random.nextLong()).toString()
}.toColumn()
giantNumberCol
giantNumberCol |
---|
92233720368547758076895131956927986339 |
92233720368547758078046379596815421908 |
92233720368547758074448343410727247115 |
92233720368547758076039595862559697533 |
92233720368547758075144108214797648735 |
92233720368547758078096095849008105978 |
92233720368547758074672787943104667710 |
92233720368547758074824236907230894613 |
92233720368547758072945460066985633410 |
92233720368547758078189171310245143715 |
We now have overloads to convert/parse this column to BigInteger, just like for the other conversions. This also allows us to perform mathematical operations with it!
val bigIntCol = giantNumberCol.convertToBigInteger()
DISPLAY(bigIntCol.type)
bigIntCol * -1.toBigInteger()
java.math.BigInteger
giantNumberCol |
---|
-92233720368547758076895131956927986339 |
-92233720368547758078046379596815421908 |
-92233720368547758074448343410727247115 |
-92233720368547758076039595862559697533 |
-92233720368547758075144108214797648735 |
-92233720368547758078096095849008105978 |
-92233720368547758074672787943104667710 |
-92233720368547758074824236907230894613 |
-92233720368547758072945460066985633410 |
-92233720368547758078189171310245143715 |
We also support conversions from/to BigInteger, both on the column itself and when the column is inside a dataframe:
val df = bigIntCol.toDataFrame()
.convert { bigIntCol }.toBigDecimal()
DISPLAY(df.schema())
df
giantNumberCol: java.math.BigDecimal
giantNumberCol |
---|
9223372036854775807689513195692798633... |
9223372036854775807804637959681542190... |
9223372036854775807444834341072724711... |
9223372036854775807603959586255969753... |
9223372036854775807514410821479764873... |
9223372036854775807809609584900810597... |
9223372036854775807467278794310466771... |
9223372036854775807482423690723089461... |
9223372036854775807294546006698563341... |
9223372036854775807818917131024514371... |
Finally, statistics also support BigInteger, as well as all other number types.
(.describe() now also works a bit better, supporting both BigInteger and columns with mixed number types; Relevant issue #558. We'll continue to improve the statistics functions in the next releases.)
import java.math.BigDecimal
val bigDecimalCol: DataColumn<BigDecimal> by bigIntCol.convertTo()
val mixedNumberCol: DataColumn<Number> by bigIntCol.map {
if (it % 2.toBigInteger() == 0.toBigInteger()) Random.nextDouble() else Random.nextInt()
}
dataFrameOf(
bigIntCol named "bigIntCol",
bigDecimalCol,
mixedNumberCol,
).describe()
name | type | count | unique | nulls | top | freq | mean | std | min | median | max |
---|---|---|---|---|---|---|---|---|---|---|---|
bigIntCol | java.math.BigInteger | 10 | 10 | 0 | 92233720368547758076895131956927986339 | 1 | 9223372036854776000000000000000000000... | 0.000000 | 92233720368547758072945460066985633410 | 92233720368547758075591852038678673134 | 92233720368547758078189171310245143715 |
bigDecimalCol | java.math.BigDecimal | 10 | 10 | 0 | 9223372036854775807689513195692798633... | 1 | 9223372036854776000000000000000000000... | 1817876800039198720.000000 | 9223372036854775807294546006698563341... | 9223372036854775807559185203867867313... | 9223372036854775807818917131024514371... |
mixedNumberCol | Number | 10 | 10 | 0 | -558408044 | 1 | -285953766.155901 | 1186872780.278754 | -2071877832.000000 | 0.089672 | 1949813924.000000 |
Parsing, in DataFrame, is a special case of convert. It can convert String columns to any other supported type by guessing the type from the values. This can be done manually, by calling .parse() on a dataframe, but it also happens automatically when reading from textual data, like CSV.
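As a tiny sketch of manual parsing (the column names and values are made up), parse() looks at each String column and guesses a better type:
// before parse(): both columns are typed as String
val raw = dataFrameOf("id", "date")(
    "1", "2021-01-01",
    "2", "2021-01-02",
)
// after parse(): "id" should be inferred as Int and "date" as a date type
val parsed = raw.parse()
parsed.schema()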
In DataFrame 0.15:
- String values can now be parsed to Char.
- The new double parser is based on FastDoubleParser and can be enabled by setting useFastDoubleParser = true in the parser options.
// enabling the fast double parser globally can be done like
DataFrame.parser.apply {
useFastDoubleParser = true
// you can also set other global parsing options here
}
// or you can choose to enable it per call
// Each function that parses strings should have the `parserOptions` argument:
DataFrame.readDelimStr(
text = """
numbers
0,12
100.456,23
1,00
""".trimIndent(),
delimiter = ';',
parserOptions = ParserOptions(
locale = Locale.GERMAN,
useFastDoubleParser = true,
)
)
numbers |
---|
0.120000 |
100456.230000 |
1.000000 |
Our implementation of FastDoubleParser is configured to be very forgiving, depending on the locale.
For instance, in French, numbers are often formatted like "100 512,123", which contains a non-breaking space character. If you come across files that use ordinary spaces instead, many double parsers would fail.
The same holds for the Estonian minus sign ("−"), which is expected if your locale is set to Estonian (Relevant issue #607).
We now try to catch these cases and save you some headaches :). Do not hesitate to provide feedback if you have a case that fails and you think it should work!
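For example, here's a small sketch (not from the original notebook) of parsing French-formatted numbers, where both the non-breaking space and the ordinary space should be accepted:
val frenchNumbers by listOf(
    "100\u00A0512,123", // non-breaking space as the grouping separator
    "100 512,123", // ordinary space; the forgiving parser should accept this too
    "512,123",
).toColumn()
frenchNumbers.parse(ParserOptions(locale = Locale.FRENCH, useFastDoubleParser = true))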
val estonianNumbers by listOf(
"12,45",
"−13,35", // note the different minus sign '−' vs '-'
"−204 235,23", // note the different minus sign '−' vs '-'
"100 123,35", // space instead of NBSP
"1,234e3",
"-345,122", // 'ordinary' minus sign
).toColumn()
estonianNumbers.parse(ParserOptions(locale = Locale.forLanguageTag("et-EE"), useFastDoubleParser = true))
estonianNumbers |
---|
12.450000 |
-13.350000 |
-204235.230000 |
100123.350000 |
1234.000000 |
-345.122000 |