// loading dependencies for the SQL examples
// this needs to be called before importing dataframe itself
USE {
dependencies {
implementation("com.h2database:h2:2.3.232")
implementation("com.mysql:mysql-connector-j:9.1.0")
}
}
%useLatestDescriptors
// you can enable the new experimental modules (in notebooks) in the following way:
%use dataframe(v=0.15.0, enableExperimentalCsv=true, enableExperimentalGeo=true)
Enabling experimental CSV module: dataframe-csv
Enabling experimental Geo module: dataframe-geo
DataFrame's CSV parsing has been based on Apache Commons CSV from the beginning. While this has been sufficient for most applications, it had some issues: running out of memory on large files, limited performance, and an API lacking in clarity, documentation, and completeness. (Related issue #827)
For DataFrame 0.15, we introduce a new, separate package org.jetbrains.kotlinx:dataframe-csv which tries to solve all these issues at once. It's based on Deephaven CSV, which makes it faster and more memory-efficient. And since we built it from the ground up, we made sure the API is complete, predictable, and carefully documented.
To try it yourself, explicitly add the dependency org.jetbrains.kotlinx:dataframe-csv to your project. In notebooks, you can add enableExperimentalCsv=true to the %use-magic, as seen above.
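For a Gradle project, that boils down to one extra dependency; here's a minimal sketch in the Gradle Kotlin DSL (the 0.15.0 version string is an assumption, use whatever release you're on):
// build.gradle.kts (sketch)
dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:0.15.0") // core DataFrame
    implementation("org.jetbrains.kotlinx:dataframe-csv:0.15.0") // experimental CSV module
}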
Given a large CSV file, such as the one below, the chances of running out of memory are now lower (though still possible):
// Old csv function:
DataFrame.readCSV(
"../../../../dataframe-csv/src/test/resources/largeCsv.csv.gz",
)
java.lang.OutOfMemoryError: Ran out of memory reading this CSV-like file. You can try our new experimental CSV reader by adding the dependency "org.jetbrains.kotlinx:dataframe-csv:{VERSION}" and using `DataFrame.readCsv()` instead of `DataFrame.readCSV()`.
    at org.jetbrains.kotlinx.dataframe.io.CsvKt.readDelim(csv.kt:367)
    at org.jetbrains.kotlinx.dataframe.io.CsvKt.readDelim(csv.kt:285)
    at org.jetbrains.kotlinx.dataframe.io.CsvKt.readCSV(csv.kt:151)
    at Cell In[6], line 2
    ...
// New csv function:
DataFrame.readCsv(
"../../../../dataframe-csv/src/test/resources/largeCsv.csv.gz",
)
Year | Age | Ethnic | Sex | Area | count |
---|---|---|---|---|---|
2018 | 0 | 1 | 1 | 01 | 795 |
2018 | 0 | 1 | 1 | 02 | 5067 |
2018 | 0 | 1 | 1 | 03 | 2229 |
2018 | 0 | 1 | 1 | 04 | 1356 |
2018 | 0 | 1 | 1 | 05 | 180 |
2018 | 0 | 1 | 1 | 06 | 738 |
2018 | 0 | 1 | 1 | 07 | 630 |
2018 | 0 | 1 | 1 | 08 | 1188 |
2018 | 0 | 1 | 1 | 09 | 2157 |
2018 | 0 | 1 | 1 | 12 | 177 |
2018 | 0 | 1 | 1 | 13 | 2823 |
2018 | 0 | 1 | 1 | 14 | 1020 |
2018 | 0 | 1 | 1 | 15 | 516 |
2018 | 0 | 1 | 1 | 16 | 222 |
2018 | 0 | 1 | 1 | 17 | 219 |
2018 | 0 | 1 | 1 | 18 | 234 |
2018 | 0 | 1 | 1 | 99 | ..C |
2018 | 0 | 1 | 1 | 9999 | 19563 |
2018 | 0 | 1 | 1 | 77 | 19557 |
2018 | 0 | 1 | 2 | 01 | 768 |
40 million rows! Not bad, right?
The fact that we can now read this file is due to Deephaven CSV's ability to parse columns directly to their target type, like Int or Double, instead of reading and storing everything as a String first and parsing it afterwards. This saves both memory and running time. Deephaven wrote a blog post if you're curious about the specifics.
DataFrame still reads everything into (boxed) memory, so there are limits to the size of the file you can read, but the CSV reader is no longer the limiting factor. (Check the "Max heap size" setting if you're running this notebook in IntelliJ and you're still running out of memory for large files.)
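If you're not sure how much heap the notebook's JVM actually has, a quick check (not part of the original example) is:
// prints the maximum heap size available to this JVM, in MiB
println("Max heap: ${Runtime.getRuntime().maxMemory() / (1024 * 1024)} MiB")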
Switching to the new API is, in most cases, as easy as swapping readCSV with readCsv (and readTSV with readTsv, etc.). However, there are a few differences in the API, so be sure to check the KDocs of the new functions.
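For a typical migration that's the whole change; a minimal sketch (the file path is made up):
// old, Apache Commons CSV-based reader
val dfOld = DataFrame.readCSV("data.csv")
// new, Deephaven-based reader from dataframe-csv
val dfNew = DataFrame.readCsv("data.csv")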
Here's a small demonstration of the new API:
import java.util.Locale
DataFrame.readCsv(
"../../../../dataframe-csv/src/test/resources/irisDataset.csv",
delimiter = ',',
// overwriting the given header
header = listOf("sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"),
// skipping the first line in the file with old header
skipLines = 1,
// reading only 50 lines
readLines = 50,
// manually specifying the types of the columns, will be inferred otherwise
colTypes = mapOf(
"species" to ColType.String, // setting the type of the species column to String
ColType.DEFAULT to ColType.Double, // setting type of all other columns to Double
),
// manually specifying some parser options
// Will be read from the global parser options `DataFrame.parser` otherwise
parserOptions = ParserOptions(
// setting the locale to US, uses `DataFrame.parser.locale` or `Locale.getDefault()` otherwise
locale = Locale.US,
// overriding null strings
nullStrings = DEFAULT_DELIM_NULL_STRINGS + "nothing",
// using the new faster double parser, true by default for readCsv
useFastDoubleParser = true,
),
// new! specifying the quote character
quote = '\"',
// specifying whether to ignore empty lines in between rows in the file, and plenty more options...
ignoreEmptyLines = false,
allowMissingColumns = true,
ignoreSurroundingSpaces = true,
trimInsideQuoted = false,
parseParallel = true,
)
sepalLength | sepalWidth | petalLength | petalWidth | species |
---|---|---|---|---|
5.100000 | 3.500000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.000000 | 1.400000 | 0.200000 | Setosa |
4.700000 | 3.200000 | 1.300000 | 0.200000 | Setosa |
4.600000 | 3.100000 | 1.500000 | 0.200000 | Setosa |
5.000000 | 3.600000 | 1.400000 | 0.200000 | Setosa |
5.400000 | 3.900000 | 1.700000 | 0.400000 | Setosa |
4.600000 | 3.400000 | 1.400000 | 0.300000 | Setosa |
5.000000 | 3.400000 | 1.500000 | 0.200000 | Setosa |
4.400000 | 2.900000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.100000 | 1.500000 | 0.100000 | Setosa |
5.400000 | 3.700000 | 1.500000 | 0.200000 | Setosa |
4.800000 | 3.400000 | 1.600000 | 0.200000 | Setosa |
4.800000 | 3.000000 | 1.400000 | 0.100000 | Setosa |
4.300000 | 3.000000 | 1.100000 | 0.100000 | Setosa |
5.800000 | 4.000000 | 1.200000 | 0.200000 | Setosa |
5.700000 | 4.400000 | 1.500000 | 0.400000 | Setosa |
5.400000 | 3.900000 | 1.300000 | 0.400000 | Setosa |
5.100000 | 3.500000 | 1.400000 | 0.300000 | Setosa |
5.700000 | 3.800000 | 1.700000 | 0.300000 | Setosa |
5.100000 | 3.800000 | 1.500000 | 0.300000 | Setosa |
Since Deephaven CSV supports it, we can now also read multi-space-separated files, like logs (Relevant issue #746):
DataFrame.readDelimStr(
"""
NAME STATUS AGE LABELS
argo-events Active 2y77d app.kubernetes.io/instance=argo-events,kubernetes.io/metadata.name=argo-events
argo-workflows Active 2y77d app.kubernetes.io/instance=argo-workflows,kubernetes.io/metadata.name=argo-workflows
argocd Active 5y18d kubernetes.io/metadata.name=argocd
beta Active 4y235d kubernetes.io/metadata.name=beta
""".trimIndent(),
hasFixedWidthColumns = true,
)
NAME | STATUS | AGE | LABELS |
---|---|---|---|
argo-events | Active | 2y77d | app.kubernetes.io/instance=argo-event... |
argo-workflows | Active | 2y77d | app.kubernetes.io/instance=argo-workf... |
argocd | Active | 5y18d | kubernetes.io/metadata.name=argocd |
We provide a single overload (taking an InputStream) which exposes the underlying implementation for when ours is not sufficient for your needs (Relevant issue #787).
import io.deephaven.csv.containers.ByteSlice
import io.deephaven.csv.tokenization.Tokenizer
import java.io.File
import java.io.InputStream
DataFrame.readCsv(
inputStream = File("../../../../dataframe-csv/src/test/resources/irisDataset.csv").inputStream(),
adjustCsvSpecs = {
this
.headerLegalizer {
it.map { it.lowercase().replace('.', '_') }.toTypedArray()
}
.customDoubleParser(object : Tokenizer.CustomDoubleParser {
override fun parse(bs: ByteSlice?): Double = TODO("Not yet implemented")
override fun parse(cs: CharSequence?): Double = TODO("Not yet implemented")
})
// etc..
},
)
sepal_length | sepal_width | petal_length | petal_width | variety |
---|---|---|---|---|
5.100000 | 3.500000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.000000 | 1.400000 | 0.200000 | Setosa |
4.700000 | 3.200000 | 1.300000 | 0.200000 | Setosa |
4.600000 | 3.100000 | 1.500000 | 0.200000 | Setosa |
5.000000 | 3.600000 | 1.400000 | 0.200000 | Setosa |
5.400000 | 3.900000 | 1.700000 | 0.400000 | Setosa |
4.600000 | 3.400000 | 1.400000 | 0.300000 | Setosa |
5.000000 | 3.400000 | 1.500000 | 0.200000 | Setosa |
4.400000 | 2.900000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.100000 | 1.500000 | 0.100000 | Setosa |
5.400000 | 3.700000 | 1.500000 | 0.200000 | Setosa |
4.800000 | 3.400000 | 1.600000 | 0.200000 | Setosa |
4.800000 | 3.000000 | 1.400000 | 0.100000 | Setosa |
4.300000 | 3.000000 | 1.100000 | 0.100000 | Setosa |
5.800000 | 4.000000 | 1.200000 | 0.200000 | Setosa |
5.700000 | 4.400000 | 1.500000 | 0.400000 | Setosa |
5.400000 | 3.900000 | 1.300000 | 0.400000 | Setosa |
5.100000 | 3.500000 | 1.400000 | 0.300000 | Setosa |
5.700000 | 3.800000 | 1.700000 | 0.300000 | Setosa |
5.100000 | 3.800000 | 1.500000 | 0.300000 | Setosa |
Finally, we now support reading from ZIP files directly, along with GZIP (already demonstrated above) and custom compression formats (Relevant issue #469):
DataFrame.readCsv(
"../../../../dataframe-csv/src/test/resources/testCSV.zip",
// this can be manually specified, but is inferred automatically from the file extension
// compression = Compression.Zip,
)
untitled | user_id | name | duplicate | username | duplicate1 | duplicate11 | double | number | time | empty |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | George | null | abc | a | null | 1203.000000 | 599.213000 | 2021-01-07T15:12:32 | null |
1 | 5 | Paul | null | paul | null | null | null | 214.211000 | 2021-01-14T14:36:19 | null |
2 | 8 | Johnny | null | qwerty | b | null | 20.000000 | 412.214000 | 2021-02-23T19:47 | null |
3 | 10 | Jack | null | buk | null | null | 2414.000000 | 1.010000 | 2021-03-08T23:38:52 | null |
4 | 12 | Samuel | null | qwerty | null | null | Infinity | 0.000000 | 2021-04-01T02:30:22 | null |
USE { dependencies("org.tukaani:xz:1.10", "org.apache.commons:commons-compress:1.27.1") }
import org.apache.commons.compress.archivers.tar.TarFile
import org.apache.commons.io.IOUtils
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel
// custom compression format by specifying how to convert a compressed InputStream to a normal one
val tarCompression = Compression.Custom({ tarInputStream ->
val tar = TarFile(SeekableInMemoryByteChannel(IOUtils.toByteArray(tarInputStream)))
tar.getInputStream(tar.entries.first())
})
DataFrame.readCsv("irisDataset.tar", compression = tarCompression)
sepal.length | sepal.width | petal.length | petal.width | variety |
---|---|---|---|---|
5.100000 | 3.500000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.000000 | 1.400000 | 0.200000 | Setosa |
4.700000 | 3.200000 | 1.300000 | 0.200000 | Setosa |
4.600000 | 3.100000 | 1.500000 | 0.200000 | Setosa |
5.000000 | 3.600000 | 1.400000 | 0.200000 | Setosa |
5.400000 | 3.900000 | 1.700000 | 0.400000 | Setosa |
4.600000 | 3.400000 | 1.400000 | 0.300000 | Setosa |
5.000000 | 3.400000 | 1.500000 | 0.200000 | Setosa |
4.400000 | 2.900000 | 1.400000 | 0.200000 | Setosa |
4.900000 | 3.100000 | 1.500000 | 0.100000 | Setosa |
5.400000 | 3.700000 | 1.500000 | 0.200000 | Setosa |
4.800000 | 3.400000 | 1.600000 | 0.200000 | Setosa |
4.800000 | 3.000000 | 1.400000 | 0.100000 | Setosa |
4.300000 | 3.000000 | 1.100000 | 0.100000 | Setosa |
5.800000 | 4.000000 | 1.200000 | 0.200000 | Setosa |
5.700000 | 4.400000 | 1.500000 | 0.400000 | Setosa |
5.400000 | 3.900000 | 1.300000 | 0.400000 | Setosa |
5.100000 | 3.500000 | 1.400000 | 0.300000 | Setosa |
5.700000 | 3.800000 | 1.700000 | 0.300000 | Setosa |
5.100000 | 3.800000 | 1.500000 | 0.300000 | Setosa |
Writing is also supported; it still uses Apache Commons CSV under the hood. The API is similar to the reading API:
val irisDf = DataFrame.readCsv("../../../../dataframe-csv/src/test/resources/irisDataset.csv")
irisDf.writeCsv("irisDataset.csv")
Some options can be specified:
irisDf.writeDelim(
path = "irisDataset.csv",
delimiter = ';',
includeHeader = false,
quoteMode = QuoteMode.ALL,
escapeChar = '\\',
commentChar = '#',
headerComments = listOf("This is a comment", "This is another comment"),
recordSeparator = "\n",
)
And similarly, we have a single overload which exposes the underlying implementation:
irisDf.writeCsv(
writer = File("irisDataset.csv").writer(),
adjustCsvFormat = {
this
.setSkipHeaderRecord(true)
.setHeader("sepalLength", "sepalWidth", "petalLength", "petalWidth", "species")
.setTrailingData(true)
.setNullString("null")
// etc..
},
)
Kandy v0.8 introduces geo-plotting, which allows you to visualize geospatial/geographical data using the awesome Kandy DSL. To make working with this geographical data (from GeoJson/Shapefile) easier, we happily accepted the GeoDataFrame PR from the Kandy team (Relevant issue #875).
To try it yourself, explicitly add the dependency org.jetbrains.kotlinx:dataframe-geo to your project (with the repository maven("https://repo.osgeo.org/repository/release")), or add enableExperimentalGeo=true to the %use-magic, as seen at the start of the notebook.
Then use GeoDataFrame.readGeoJson() or GeoDataFrame.readShapeFile() to get started!
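For a Gradle project, a minimal sketch of that setup in the Kotlin DSL (the 0.15.0 version string is an assumption; the extra repository is needed as mentioned above):
// build.gradle.kts (sketch)
repositories {
    mavenCentral()
    maven("https://repo.osgeo.org/repository/release")
}
dependencies {
    implementation("org.jetbrains.kotlinx:dataframe-geo:0.15.0")
}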
USE {
repositories("https://repo.osgeo.org/repository/release")
dependencies("org.jetbrains.kotlinx:kandy-geo:0.8.0-dev-57")
}
Here's a small demonstration of the new API, reading and plotting a GeoJson file:
val usaGeo = GeoDataFrame.readGeoJson("https://echarts.apache.org/examples/data/asset/geo/USA.json")
usaGeo.df
name | geometry |
---|---|
Alabama | POLYGON ((-87.359296 35.00118, -85.60... |
Alaska | MULTIPOLYGON (((-131.602021 55.117982... |
Arizona | POLYGON ((-109.042503 37.000263, -109... |
Arkansas | POLYGON ((-94.473842 36.501861, -90.1... |
California | POLYGON ((-123.233256 42.006186, -122... |
Colorado | POLYGON ((-107.919731 41.003906, -105... |
Connecticut | POLYGON ((-73.053528 42.039048, -71.7... |
Delaware | POLYGON ((-75.414089 39.804456, -75.5... |
District of Columbia | POLYGON ((-77.035264 38.993869, -76.9... |
Florida | POLYGON ((-85.497137 30.997536, -85.0... |
Georgia | POLYGON ((-83.109191 35.00118, -83.32... |
Hawaii | MULTIPOLYGON (((-155.634835 18.948267... |
Idaho | POLYGON ((-116.04751 49.000239, -116.... |
Illinois | POLYGON ((-90.639984 42.510065, -88.7... |
Indiana | POLYGON ((-85.990061 41.759724, -84.8... |
Iowa | POLYGON ((-91.368417 43.501391, -91.2... |
Kansas | POLYGON ((-101.90605 40.001626, -95.3... |
Kentucky | POLYGON ((-83.903347 38.769315, -83.6... |
Louisiana | POLYGON ((-93.608485 33.018527, -91.1... |
Maine | POLYGON ((-70.703921 43.057759, -70.8... |
usaGeo.plot { geoMap() }
Let's modify the GeoDataFrame a bit by adding some population data and plotting that too:
val usPopByState = DataFrame.readCsv("us_pop_by_state.csv")
usPopByState
rank | state | state_code | 2020_census | percent_of_total |
---|---|---|---|---|
1.000000 | California | CA | 39538223 | 0.119100 |
2.000000 | Texas | TX | 29145505 | 0.087400 |
3.000000 | Florida | FL | 21538187 | 0.064700 |
4.000000 | New York | NY | 20201249 | 0.058600 |
5.000000 | Pennsylvania | PA | 13002700 | 0.038600 |
6.000000 | Illinois | IL | 12801989 | 0.038200 |
7.000000 | Ohio | OH | 11799448 | 0.035200 |
8.000000 | Georgia | GA | 10711908 | 0.032000 |
9.000000 | North Carolina | NC | 10439388 | 0.031600 |
10.000000 | Michigan | MI | 10077331 | 0.030100 |
11.000000 | New Jersey | NJ | 9288994 | 0.026800 |
12.000000 | Virginia | VA | 8631393 | 0.025700 |
13.000000 | Washington | WA | 7705281 | 0.022900 |
14.000000 | Arizona | AZ | 7151502 | 0.021900 |
15.000000 | Massachusetts | MA | 7029917 | 0.020900 |
16.000000 | Tennessee | TN | 6910840 | 0.020600 |
17.000000 | Indiana | IN | 6785528 | 0.020300 |
18.000000 | Maryland | MD | 6177224 | 0.018500 |
19.000000 | Missouri | MO | 6154913 | 0.018200 |
20.000000 | Wisconsin | WI | 5893718 | 0.017500 |
val usaGeoPopulation = usaGeo.modify {
this.join(
usPopByState.select { state and `2020_census`.named("population") },
) { name match right.state }
}
usaGeoPopulation.df
name | geometry | population |
---|---|---|
Alabama | POLYGON ((-87.359296 35.00118, -85.60... | 5024279 |
Alaska | MULTIPOLYGON (((-131.602021 55.117982... | 733391 |
Arizona | POLYGON ((-109.042503 37.000263, -109... | 7151502 |
Arkansas | POLYGON ((-94.473842 36.501861, -90.1... | 3011524 |
California | POLYGON ((-123.233256 42.006186, -122... | 39538223 |
Colorado | POLYGON ((-107.919731 41.003906, -105... | 5773714 |
Connecticut | POLYGON ((-73.053528 42.039048, -71.7... | 3605944 |
Delaware | POLYGON ((-75.414089 39.804456, -75.5... | 989948 |
Florida | POLYGON ((-85.497137 30.997536, -85.0... | 21538187 |
Georgia | POLYGON ((-83.109191 35.00118, -83.32... | 10711908 |
Hawaii | MULTIPOLYGON (((-155.634835 18.948267... | 1455271 |
Idaho | POLYGON ((-116.04751 49.000239, -116.... | 1839106 |
Illinois | POLYGON ((-90.639984 42.510065, -88.7... | 12801989 |
Indiana | POLYGON ((-85.990061 41.759724, -84.8... | 6785528 |
Iowa | POLYGON ((-91.368417 43.501391, -91.2... | 3271616 |
Kansas | POLYGON ((-101.90605 40.001626, -95.3... | 2937880 |
Kentucky | POLYGON ((-83.903347 38.769315, -83.6... | 4505836 |
Louisiana | POLYGON ((-93.608485 33.018527, -91.1... | 4657757 |
Maine | POLYGON ((-70.703921 43.057759, -70.8... | 1362359 |
Maryland | MULTIPOLYGON (((-75.994645 37.95325, ... | 6177224 |
usaGeoPopulation.plot {
// crop out Alaska and Hawaii
x.axis.limits = -130..-65
y.axis.limits = 25..50
geoMap {
tooltips(name, population)
fillColor(population) {
scale = continuousColorViridis()
}
borderLine {
width = 0.1
color = Color.BLACK
}
alpha = 0.5
layout.style(Style.Void)
}
}
Our JDBC-based SQL integration for DataFrame has become extensible!
This means that if you have an SQL database that we currently don't support, you can create your own DbType instance and read from your database into a dataframe.
(Remember that we already support quite a few databases: MariaDB, PostgreSQL, MySQL, SQLite, MS SQL, and H2 (with dialects).)
To get started, we need a custom DbType. For the sake of example, we'll create a custom DbType based on the H2 database. Ordinarily, you'd extend DbType("jdbc name of your database").
import org.jetbrains.kotlinx.dataframe.io.db.*
import org.jetbrains.kotlinx.dataframe.schema.ColumnSchema
import java.sql.ResultSet
import kotlin.reflect.KType
object CustomDbType : H2(MySql) {
/**
* Represents the JDBC driver class name for a given database type.
* Something like "org.h2.Driver".
*/
override val driverClassName: String
get() = super.driverClassName
/**
* Here you define which KType you expect the column to be based on [tableColumnMetadata].
* This is mostly for special cases, as DataFrame can already infer most types
* from the database automatically.
*
* Return `null` to let DataFrame figure out the type.
*/
override fun convertSqlTypeToKType(tableColumnMetadata: TableColumnMetadata): KType? {
return super.convertSqlTypeToKType(tableColumnMetadata)
}
/**
* Similar to [convertSqlTypeToKType] but here you'll need to define a [ColumnSchema] for the column
* based on [tableColumnMetadata].
*
* Return `null` to let DataFrame figure out the schema.
*/
override fun convertSqlTypeToColumnSchemaValue(tableColumnMetadata: TableColumnMetadata): ColumnSchema? {
return super.convertSqlTypeToColumnSchemaValue(tableColumnMetadata)
}
/**
* Here you define where to get the table metadata for information about the database table,
* including its name, schema name, and catalogue name.
*/
override fun buildTableMetadata(tables: ResultSet): TableMetadata {
return super.buildTableMetadata(tables)
}
/**
* Return whether the table with metadata [tableMetadata] should be considered
* a system table or not.
*
* System tables are skipped when reading.
*/
override fun isSystemTable(tableMetadata: TableMetadata): Boolean {
return super.isSystemTable(tableMetadata)
}
/**
* Can be overridden to change how DataFrame limits queries for your specific database type.
*
* By default it executes: `"$sqlQuery LIMIT $limit"`
*/
override fun sqlQueryLimit(sqlQuery: String, limit: Int): String {
return super.sqlQueryLimit(sqlQuery, limit)
}
}
Now that we have a custom DbType, we can connect to our database, add some demo data, and retrieve it as a dataframe!
import org.intellij.lang.annotations.Language
import java.sql.DriverManager
val URL = "jdbc:h2:mem:test5;DB_CLOSE_DELAY=-1;MODE=MySQL;DATABASE_TO_UPPER=false"
val connection = DriverManager.getConnection(URL)
// insert some demo data
val statements = listOf(
"""
CREATE TABLE Customer (
id INT PRIMARY KEY,
name VARCHAR(50),
age INT
)
""".trimIndent(),
"""
CREATE TABLE Sale (
id INT PRIMARY KEY,
customerId INT,
amount DECIMAL(10, 2) NOT NULL
)
""".trimIndent(),
"INSERT INTO Customer (id, name, age) VALUES (1, 'John', 40)",
"INSERT INTO Customer (id, name, age) VALUES (2, 'Alice', 25)",
"INSERT INTO Customer (id, name, age) VALUES (3, 'Bob', 47)",
"INSERT INTO Customer (id, name, age) VALUES (4, NULL, NULL)",
"INSERT INTO Sale (id, customerId, amount) VALUES (1, 1, 100.50)",
"INSERT INTO Sale (id, customerId, amount) VALUES (2, 2, 50.00)",
"INSERT INTO Sale (id, customerId, amount) VALUES (3, 1, 75.25)",
"INSERT INTO Sale (id, customerId, amount) VALUES (4, 3, 35.15)",
)
statements.forEach { connection.createStatement().execute(it) }
// and read it :)
DataFrame.readSqlQuery(connection, "SELECT * FROM Customer")
id | name | age |
---|---|---|
1 | John | 40 |
2 | Alice | 25 |
3 | Bob | 47 |
4 | null | null |
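The H2 URL above is recognized automatically, but the extensible API also lets you hand your own DbType to the read functions explicitly. A hedged sketch (assuming the parameter is called dbType; check the KDocs of the read functions for the exact signature):
// pass the custom database type explicitly instead of relying on URL detection
DataFrame.readSqlQuery(connection, "SELECT * FROM Sale", dbType = CustomDbType)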
On the documentation website, you can find another example of supporting a custom database, this time using HSQLDB.
BigInteger support
Java has support for arbitrarily large decimal and integer values: BigDecimal and BigInteger. This is very helpful when working with huge numbers for which Double and Long are not big enough. Maybe Kotlin will even gain its own representation in the future!
DataFrame has supported BigDecimal for a while, but it lacked BigInteger support. DataFrame 0.15 fixes that.
Let's make a column with numbers so large that they can only be represented as String:
import java.math.BigInteger
import kotlin.math.abs
import kotlin.random.Random
import kotlin.random.nextLong
import kotlin.random.nextUInt
val largestLong = Long.MAX_VALUE.toString()
val giantNumberCol: DataColumn<String> by List(10) {
largestLong + abs(Random.nextLong()).toString()
}.toColumn()
giantNumberCol
giantNumberCol |
---|
92233720368547758076895131956927986339 |
92233720368547758078046379596815421908 |
92233720368547758074448343410727247115 |
92233720368547758076039595862559697533 |
92233720368547758075144108214797648735 |
92233720368547758078096095849008105978 |
92233720368547758074672787943104667710 |
92233720368547758074824236907230894613 |
92233720368547758072945460066985633410 |
92233720368547758078189171310245143715 |
We now have overloads to convert/parse this column to BigInteger, just like for the other conversions. This also allows us to perform mathematical operations with it!
val bigIntCol = giantNumberCol.convertToBigInteger()
DISPLAY(bigIntCol.type)
bigIntCol * -1.toBigInteger()
java.math.BigInteger
giantNumberCol |
---|
-92233720368547758076895131956927986339 |
-92233720368547758078046379596815421908 |
-92233720368547758074448343410727247115 |
-92233720368547758076039595862559697533 |
-92233720368547758075144108214797648735 |
-92233720368547758078096095849008105978 |
-92233720368547758074672787943104667710 |
-92233720368547758074824236907230894613 |
-92233720368547758072945460066985633410 |
-92233720368547758078189171310245143715 |
We also support conversions from/to BigInteger, both on the column itself and when the column is inside a dataframe:
val df = bigIntCol.toDataFrame()
.convert { bigIntCol }.toBigDecimal()
DISPLAY(df.schema())
df
giantNumberCol: java.math.BigDecimal
giantNumberCol |
---|
9223372036854775807689513195692798633... |
9223372036854775807804637959681542190... |
9223372036854775807444834341072724711... |
9223372036854775807603959586255969753... |
9223372036854775807514410821479764873... |
9223372036854775807809609584900810597... |
9223372036854775807467278794310466771... |
9223372036854775807482423690723089461... |
9223372036854775807294546006698563341... |
9223372036854775807818917131024514371... |
Finally, statistics also support BigInteger, as well as all other number types.
(.describe() now also works a bit better, supporting both BigInteger and columns with mixed number types; Relevant issue #558. We'll continue to improve the statistics functions in the next releases.)
import java.math.BigDecimal
val bigDecimalCol: DataColumn<BigDecimal> by bigIntCol.convertTo()
val mixedNumberCol: DataColumn<Number> by bigIntCol.map {
if (it % 2.toBigInteger() == 0.toBigInteger()) Random.nextDouble() else Random.nextInt()
}
dataFrameOf(
bigIntCol named "bigIntCol",
bigDecimalCol,
mixedNumberCol,
).describe()
name | type | count | unique | nulls | top | freq | mean | std | min | median | max |
---|---|---|---|---|---|---|---|---|---|---|---|
bigIntCol | java.math.BigInteger | 10 | 10 | 0 | 92233720368547758076895131956927986339 | 1 | 9223372036854776000000000000000000000... | 0.000000 | 92233720368547758072945460066985633410 | 92233720368547758075591852038678673134 | 92233720368547758078189171310245143715 |
bigDecimalCol | java.math.BigDecimal | 10 | 10 | 0 | 9223372036854775807689513195692798633... | 1 | 9223372036854776000000000000000000000... | 1817876800039198720.000000 | 9223372036854775807294546006698563341... | 9223372036854775807559185203867867313... | 9223372036854775807818917131024514371... |
mixedNumberCol | Number | 10 | 10 | 0 | -558408044 | 1 | -285953766.155901 | 1186872780.278754 | -2071877832.000000 | 0.089672 | 1949813924.000000 |
Parsing, in DataFrame, is a special case of convert. It can convert String columns to any other supported type by guessing the type from the values. This can be done manually, by calling .parse() on a dataframe, but it also happens automatically when reading from textual data, like CSV.
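As a tiny sketch of manual parsing (the column names and values are made up), parse() looks at each String column and guesses a better type:
// before parse(): both columns are typed as String
val raw = dataFrameOf("id", "date")(
    "1", "2021-01-01",
    "2", "2021-01-02",
)
// after parse(): "id" should be inferred as Int and "date" as a date type
val parsed = raw.parse()
parsed.schema()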
In DataFrame 0.15:
- String values can now be parsed to Char.
- The new double parser is based on FastDoubleParser and can be enabled by setting useFastDoubleParser = true in the parser options.
// enabling the fast double parser globally can be done like
DataFrame.parser.apply {
useFastDoubleParser = true
// you can also set other global parsing options here
}
// or you can choose to enable it per call
// Each function that parses strings should have the `parserOptions` argument:
DataFrame.readDelimStr(
text = """
numbers
0,12
100.456,23
1,00
""".trimIndent(),
delimiter = ';',
parserOptions = ParserOptions(
locale = Locale.GERMAN,
useFastDoubleParser = true,
)
)
numbers |
---|
0.120000 |
100456.230000 |
1.000000 |
Our implementation of FastDoubleParser is configured to be very forgiving, depending on the locale.
For instance, in French, numbers are often formatted like "100 512,123", which contains a non-breaking space character. If you come across files that use ordinary spaces instead, many double parsers would fail.
The same holds for the Estonian minus sign ("−"), which is expected if your locale is set to Estonian (Relevant issue #607).
We now try to catch these cases and save you some headaches :). Do not hesitate to provide feedback if you have a case that fails and you think it should work!
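For example, here's a small sketch (not from the original notebook) of parsing French-formatted numbers, where both the non-breaking space and the ordinary space should be accepted:
val frenchNumbers by listOf(
    "100\u00A0512,123", // non-breaking space as the grouping separator
    "100 512,123", // ordinary space; the forgiving parser should accept this too
    "512,123",
).toColumn()
frenchNumbers.parse(ParserOptions(locale = Locale.FRENCH, useFastDoubleParser = true))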
val estonianNumbers by listOf(
"12,45",
"−13,35", // note the different minus sign '−' vs '-'
"−204 235,23", // note the different minus sign '−' vs '-'
"100 123,35", // space instead of NBSP
"1,234e3",
"-345,122", // 'ordinary' minus sign
).toColumn()
estonianNumbers.parse(ParserOptions(locale = Locale.forLanguageTag("et-EE"), useFastDoubleParser = true))
estonianNumbers |
---|
12.450000 |
-13.350000 |
-204235.230000 |
100123.350000 |
1234.000000 |
-345.122000 |