This notebook runs a simple trend analysis on the timeseries shown in the time series viewer: https://proba-v-mep.esa.int/applications/time-series-viewer/app/app.html It demonstrates how Spark can be used on the MEP Hadoop cluster, from within a notebook.
This notebook is not scientifically validated, only meant as a technical demonstration on a real world use case.
So some context first about the data we are going to manipulate.
Our goal is pretty simple: we want to detect if there are any trends for each zone.
We will proceed as follows:
We will detail each step along the way.
If you run this script in the jupyterlab of terrascope, make sure to shut down all kernels before reconnecting and running this script, as the last cell will use a lot of memory.
import pyspark
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import datetime, time
%matplotlib inline
Let's read the parquet file containing all our data, as follows:
from pyspark.conf import SparkConf
conf = SparkConf()
conf.set('spark.yarn.executor.memoryOverhead', 1024)
conf.set('spark.executor.memory', '4g')
<pyspark.conf.SparkConf at 0x30b96d0>
The SparkContext 'sc' is our entry point to the Hadoop cluster. The sparkConf defines some parameters.
sc = pyspark.SparkContext(conf=conf)
sqlCtx = pyspark.SQLContext(sc)
Open a terminal, run kinit
and supply your password to request a kerboros ticket.
!hdfs dfs -ls /tapdata/tsviewer/combined-stats/stats.parquet
Found 169 items -rw-r--r-- 3 mep_tsviewer vito 0 2017-03-22 00:20 /tapdata/tsviewer/combined-stats/stats.parquet/_SUCCESS -rw-r--r-- 3 mep_tsviewer vito 25357414 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00000.parquet -rw-r--r-- 3 mep_tsviewer vito 25341813 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00001.parquet -rw-r--r-- 3 mep_tsviewer vito 25344785 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00002.parquet -rw-r--r-- 3 mep_tsviewer vito 25325074 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00003.parquet -rw-r--r-- 3 mep_tsviewer vito 25382644 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00004.parquet -rw-r--r-- 3 mep_tsviewer vito 25368452 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00005.parquet -rw-r--r-- 3 mep_tsviewer vito 25338382 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00006.parquet -rw-r--r-- 3 mep_tsviewer vito 25364026 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00007.parquet -rw-r--r-- 3 mep_tsviewer vito 25343301 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00008.parquet -rw-r--r-- 3 mep_tsviewer vito 25337037 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00009.parquet -rw-r--r-- 3 mep_tsviewer vito 25361096 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00010.parquet -rw-r--r-- 3 mep_tsviewer vito 25355121 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00011.parquet -rw-r--r-- 3 mep_tsviewer vito 25374168 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00012.parquet -rw-r--r-- 3 mep_tsviewer vito 25373881 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00013.parquet -rw-r--r-- 3 mep_tsviewer vito 25343740 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00014.parquet -rw-r--r-- 3 mep_tsviewer vito 25375775 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00015.parquet -rw-r--r-- 3 mep_tsviewer vito 25376607 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00016.parquet -rw-r--r-- 3 mep_tsviewer vito 25332092 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00017.parquet -rw-r--r-- 3 mep_tsviewer vito 25363406 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00018.parquet -rw-r--r-- 3 mep_tsviewer vito 25364663 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00019.parquet -rw-r--r-- 3 mep_tsviewer vito 25345982 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00020.parquet -rw-r--r-- 3 mep_tsviewer vito 25364568 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00021.parquet -rw-r--r-- 3 mep_tsviewer vito 25345751 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00022.parquet -rw-r--r-- 3 mep_tsviewer vito 25359454 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00023.parquet -rw-r--r-- 3 mep_tsviewer vito 25362565 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00024.parquet -rw-r--r-- 3 mep_tsviewer vito 25342732 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00025.parquet -rw-r--r-- 3 mep_tsviewer vito 25371436 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00026.parquet -rw-r--r-- 3 mep_tsviewer vito 25364058 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00027.parquet -rw-r--r-- 3 mep_tsviewer vito 25314740 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00028.parquet -rw-r--r-- 3 mep_tsviewer vito 25362923 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00029.parquet -rw-r--r-- 3 mep_tsviewer vito 25376056 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00030.parquet -rw-r--r-- 3 mep_tsviewer vito 25339711 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00031.parquet -rw-r--r-- 3 mep_tsviewer vito 25397722 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00032.parquet -rw-r--r-- 3 mep_tsviewer vito 25379909 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00033.parquet -rw-r--r-- 3 mep_tsviewer vito 25335473 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00034.parquet -rw-r--r-- 3 mep_tsviewer vito 25364742 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00035.parquet -rw-r--r-- 3 mep_tsviewer vito 25383452 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00036.parquet -rw-r--r-- 3 mep_tsviewer vito 25353719 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00037.parquet -rw-r--r-- 3 mep_tsviewer vito 25371553 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00038.parquet -rw-r--r-- 3 mep_tsviewer vito 25364836 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00039.parquet -rw-r--r-- 3 mep_tsviewer vito 25354000 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00040.parquet -rw-r--r-- 3 mep_tsviewer vito 25372447 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00041.parquet -rw-r--r-- 3 mep_tsviewer vito 25352126 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00042.parquet -rw-r--r-- 3 mep_tsviewer vito 25354362 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00043.parquet -rw-r--r-- 3 mep_tsviewer vito 25362814 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00044.parquet -rw-r--r-- 3 mep_tsviewer vito 25347834 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00045.parquet -rw-r--r-- 3 mep_tsviewer vito 25337259 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00046.parquet -rw-r--r-- 3 mep_tsviewer vito 25349579 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00047.parquet -rw-r--r-- 3 mep_tsviewer vito 25338758 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00048.parquet -rw-r--r-- 3 mep_tsviewer vito 25359736 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00049.parquet -rw-r--r-- 3 mep_tsviewer vito 25340555 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00050.parquet -rw-r--r-- 3 mep_tsviewer vito 25359822 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00051.parquet -rw-r--r-- 3 mep_tsviewer vito 25377064 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00052.parquet -rw-r--r-- 3 mep_tsviewer vito 25361970 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00053.parquet -rw-r--r-- 3 mep_tsviewer vito 25338795 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00054.parquet -rw-r--r-- 3 mep_tsviewer vito 25345861 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00055.parquet -rw-r--r-- 3 mep_tsviewer vito 25361802 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00056.parquet -rw-r--r-- 3 mep_tsviewer vito 25356596 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00057.parquet -rw-r--r-- 3 mep_tsviewer vito 25364383 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00058.parquet -rw-r--r-- 3 mep_tsviewer vito 25383945 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00059.parquet -rw-r--r-- 3 mep_tsviewer vito 25357862 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00060.parquet -rw-r--r-- 3 mep_tsviewer vito 25344097 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00061.parquet -rw-r--r-- 3 mep_tsviewer vito 25394550 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00062.parquet -rw-r--r-- 3 mep_tsviewer vito 25361621 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00063.parquet -rw-r--r-- 3 mep_tsviewer vito 25328248 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00064.parquet -rw-r--r-- 3 mep_tsviewer vito 25360018 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00065.parquet -rw-r--r-- 3 mep_tsviewer vito 25353412 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00066.parquet -rw-r--r-- 3 mep_tsviewer vito 25353366 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00067.parquet -rw-r--r-- 3 mep_tsviewer vito 25366552 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00068.parquet -rw-r--r-- 3 mep_tsviewer vito 25336160 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00069.parquet -rw-r--r-- 3 mep_tsviewer vito 25352581 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00070.parquet -rw-r--r-- 3 mep_tsviewer vito 25356921 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00071.parquet -rw-r--r-- 3 mep_tsviewer vito 25337459 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00072.parquet -rw-r--r-- 3 mep_tsviewer vito 25352658 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00073.parquet -rw-r--r-- 3 mep_tsviewer vito 25362806 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00074.parquet -rw-r--r-- 3 mep_tsviewer vito 25332933 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00075.parquet -rw-r--r-- 3 mep_tsviewer vito 25337050 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00076.parquet -rw-r--r-- 3 mep_tsviewer vito 25351615 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00077.parquet -rw-r--r-- 3 mep_tsviewer vito 25372093 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00078.parquet -rw-r--r-- 3 mep_tsviewer vito 25400785 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00079.parquet -rw-r--r-- 3 mep_tsviewer vito 25355679 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00080.parquet -rw-r--r-- 3 mep_tsviewer vito 25363894 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00081.parquet -rw-r--r-- 3 mep_tsviewer vito 25374263 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00082.parquet -rw-r--r-- 3 mep_tsviewer vito 25349140 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00083.parquet -rw-r--r-- 3 mep_tsviewer vito 25386733 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00084.parquet -rw-r--r-- 3 mep_tsviewer vito 25380380 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00085.parquet -rw-r--r-- 3 mep_tsviewer vito 25349083 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00086.parquet -rw-r--r-- 3 mep_tsviewer vito 25399829 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00087.parquet -rw-r--r-- 3 mep_tsviewer vito 25378706 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00088.parquet -rw-r--r-- 3 mep_tsviewer vito 25345343 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00089.parquet -rw-r--r-- 3 mep_tsviewer vito 25358393 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00090.parquet -rw-r--r-- 3 mep_tsviewer vito 25346919 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00091.parquet -rw-r--r-- 3 mep_tsviewer vito 25332395 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00092.parquet -rw-r--r-- 3 mep_tsviewer vito 25355408 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00093.parquet -rw-r--r-- 3 mep_tsviewer vito 25350583 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00094.parquet -rw-r--r-- 3 mep_tsviewer vito 25337139 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00095.parquet -rw-r--r-- 3 mep_tsviewer vito 25349278 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00096.parquet -rw-r--r-- 3 mep_tsviewer vito 25350053 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00097.parquet -rw-r--r-- 3 mep_tsviewer vito 25361480 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00098.parquet -rw-r--r-- 3 mep_tsviewer vito 25375630 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00099.parquet -rw-r--r-- 3 mep_tsviewer vito 25375661 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00100.parquet -rw-r--r-- 3 mep_tsviewer vito 25345687 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00101.parquet -rw-r--r-- 3 mep_tsviewer vito 25342268 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00102.parquet -rw-r--r-- 3 mep_tsviewer vito 25334411 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00103.parquet -rw-r--r-- 3 mep_tsviewer vito 25357199 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00104.parquet -rw-r--r-- 3 mep_tsviewer vito 25375453 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00105.parquet -rw-r--r-- 3 mep_tsviewer vito 25365719 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00106.parquet -rw-r--r-- 3 mep_tsviewer vito 25381858 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00107.parquet -rw-r--r-- 3 mep_tsviewer vito 25353112 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00108.parquet -rw-r--r-- 3 mep_tsviewer vito 25354219 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00109.parquet -rw-r--r-- 3 mep_tsviewer vito 25376862 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00110.parquet -rw-r--r-- 3 mep_tsviewer vito 25370664 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00111.parquet -rw-r--r-- 3 mep_tsviewer vito 25357291 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00112.parquet -rw-r--r-- 3 mep_tsviewer vito 25363815 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00113.parquet -rw-r--r-- 3 mep_tsviewer vito 25379000 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00114.parquet -rw-r--r-- 3 mep_tsviewer vito 25346606 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00115.parquet -rw-r--r-- 3 mep_tsviewer vito 25338025 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00116.parquet -rw-r--r-- 3 mep_tsviewer vito 25353108 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00117.parquet -rw-r--r-- 3 mep_tsviewer vito 25351054 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00118.parquet -rw-r--r-- 3 mep_tsviewer vito 25331492 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00119.parquet -rw-r--r-- 3 mep_tsviewer vito 25352670 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00120.parquet -rw-r--r-- 3 mep_tsviewer vito 25341291 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00121.parquet -rw-r--r-- 3 mep_tsviewer vito 25325189 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00122.parquet -rw-r--r-- 3 mep_tsviewer vito 25341454 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00123.parquet -rw-r--r-- 3 mep_tsviewer vito 25334407 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00124.parquet -rw-r--r-- 3 mep_tsviewer vito 25371710 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00125.parquet -rw-r--r-- 3 mep_tsviewer vito 25400455 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00126.parquet -rw-r--r-- 3 mep_tsviewer vito 25334676 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00127.parquet -rw-r--r-- 3 mep_tsviewer vito 25362616 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00128.parquet -rw-r--r-- 3 mep_tsviewer vito 25363886 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00129.parquet -rw-r--r-- 3 mep_tsviewer vito 25336678 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00130.parquet -rw-r--r-- 3 mep_tsviewer vito 25352216 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00131.parquet -rw-r--r-- 3 mep_tsviewer vito 25371067 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00132.parquet -rw-r--r-- 3 mep_tsviewer vito 25373034 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00133.parquet -rw-r--r-- 3 mep_tsviewer vito 25366880 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00134.parquet -rw-r--r-- 3 mep_tsviewer vito 25341113 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00135.parquet -rw-r--r-- 3 mep_tsviewer vito 25390356 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00136.parquet -rw-r--r-- 3 mep_tsviewer vito 25375039 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00137.parquet -rw-r--r-- 3 mep_tsviewer vito 25334315 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00138.parquet -rw-r--r-- 3 mep_tsviewer vito 25370581 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00139.parquet -rw-r--r-- 3 mep_tsviewer vito 25370598 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00140.parquet -rw-r--r-- 3 mep_tsviewer vito 25318438 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00141.parquet -rw-r--r-- 3 mep_tsviewer vito 25367727 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00142.parquet -rw-r--r-- 3 mep_tsviewer vito 25361599 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00143.parquet -rw-r--r-- 3 mep_tsviewer vito 25327646 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00144.parquet -rw-r--r-- 3 mep_tsviewer vito 25366143 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00145.parquet -rw-r--r-- 3 mep_tsviewer vito 25373734 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00146.parquet -rw-r--r-- 3 mep_tsviewer vito 25350089 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00147.parquet -rw-r--r-- 3 mep_tsviewer vito 25347975 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00148.parquet -rw-r--r-- 3 mep_tsviewer vito 25336055 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00149.parquet -rw-r--r-- 3 mep_tsviewer vito 25321908 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00150.parquet -rw-r--r-- 3 mep_tsviewer vito 25364109 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00151.parquet -rw-r--r-- 3 mep_tsviewer vito 25366125 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00152.parquet -rw-r--r-- 3 mep_tsviewer vito 25376436 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00153.parquet -rw-r--r-- 3 mep_tsviewer vito 25388034 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00154.parquet -rw-r--r-- 3 mep_tsviewer vito 25346212 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00155.parquet -rw-r--r-- 3 mep_tsviewer vito 25346212 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00156.parquet -rw-r--r-- 3 mep_tsviewer vito 25364542 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00157.parquet -rw-r--r-- 3 mep_tsviewer vito 25361136 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00158.parquet -rw-r--r-- 3 mep_tsviewer vito 25357436 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00159.parquet -rw-r--r-- 3 mep_tsviewer vito 25357944 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00160.parquet -rw-r--r-- 3 mep_tsviewer vito 25367168 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00161.parquet -rw-r--r-- 3 mep_tsviewer vito 25368636 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00162.parquet -rw-r--r-- 3 mep_tsviewer vito 25342537 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00163.parquet -rw-r--r-- 3 mep_tsviewer vito 25343544 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00164.parquet -rw-r--r-- 3 mep_tsviewer vito 25355247 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00165.parquet -rw-r--r-- 3 mep_tsviewer vito 25353852 2017-03-22 00:20 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00166.parquet -rw-r--r-- 3 mep_tsviewer vito 25326912 2017-03-22 00:20 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00167.parquet
data = sqlCtx.read.parquet("hdfs:/tapdata/tsviewer/combined-stats/stats.parquet")
data
DataFrame[landcover: string, zone: int, date: string, BIOPAR_ALB_BHV_V1: double, BIOPAR_ALB_BHV_V1_area: bigint, BIOPAR_ALB_BHV_V1_validArea: bigint, BIOPAR_ALB_DHV_V1: double, BIOPAR_ALB_DHV_V1_area: bigint, BIOPAR_ALB_DHV_V1_validArea: bigint, BIOPAR_BA_V1: double, BIOPAR_BA_V1_area: bigint, BIOPAR_BA_V1_validArea: bigint, BIOPAR_DMP: double, BIOPAR_DMP_area: bigint, BIOPAR_DMP_validArea: bigint, BIOPAR_FAPAR_V1: double, BIOPAR_FAPAR_V1_area: bigint, BIOPAR_FAPAR_V1_validArea: bigint, BIOPAR_FCOVER_V1: double, BIOPAR_FCOVER_V1_area: bigint, BIOPAR_FCOVER_V1_validArea: bigint, BIOPAR_LAI_V1: double, BIOPAR_LAI_V1_area: bigint, BIOPAR_LAI_V1_validArea: bigint, BIOPAR_NDVI_V1: double, BIOPAR_NDVI_V1_area: bigint, BIOPAR_NDVI_V1_validArea: bigint, BIOPAR_NDVI_V2: double, BIOPAR_NDVI_V2_area: bigint, BIOPAR_NDVI_V2_validArea: bigint, BIOPAR_SWI10_V3: double, BIOPAR_SWI10_V3_area: bigint, BIOPAR_SWI10_V3_validArea: bigint, BIOPAR_TOCR_B0: double, BIOPAR_TOCR_B0_area: bigint, BIOPAR_TOCR_B0_validArea: bigint, BIOPAR_TOCR_B2: double, BIOPAR_TOCR_B2_area: bigint, BIOPAR_TOCR_B2_validArea: bigint, BIOPAR_TOCR_B3: double, BIOPAR_TOCR_B3_area: bigint, BIOPAR_TOCR_B3_validArea: bigint, BIOPAR_TOCR_MIR: double, BIOPAR_TOCR_MIR_area: bigint, BIOPAR_TOCR_MIR_validArea: bigint, BIOPAR_VCI: double, BIOPAR_VCI_area: bigint, BIOPAR_VCI_validArea: bigint, BIOPAR_VPI: double, BIOPAR_VPI_area: bigint, BIOPAR_VPI_validArea: bigint, BIOPAR_WB_V1: double, BIOPAR_WB_V1_area: bigint, BIOPAR_WB_V1_validArea: bigint, BIOPAR_WB_V2: double, BIOPAR_WB_V2_area: bigint, BIOPAR_WB_V2_validArea: bigint, PROBAV_L3_S10_TOC_RED_333M: double, PROBAV_L3_S10_TOC_RED_333M_area: bigint, PROBAV_L3_S10_TOC_RED_333M_validArea: bigint, PROBAV_L3_S10_TOC_NIR_333M: double, PROBAV_L3_S10_TOC_NIR_333M_area: bigint, PROBAV_L3_S10_TOC_NIR_333M_validArea: bigint, PROBAV_L3_S10_TOC_BLUE_333M: double, PROBAV_L3_S10_TOC_BLUE_333M_area: bigint, PROBAV_L3_S10_TOC_BLUE_333M_validArea: bigint, PROBAV_L3_S10_TOC_SWIR_333M: double, PROBAV_L3_S10_TOC_SWIR_333M_area: bigint, PROBAV_L3_S10_TOC_SWIR_333M_validArea: bigint, PROBAV_L3_S10_TOC_NDVI_333M: double, PROBAV_L3_S10_TOC_NDVI_333M_area: bigint, PROBAV_L3_S10_TOC_NDVI_333M_validArea: bigint, RAINFALL: double, RAINFALL_area: bigint, RAINFALL_validArea: bigint]
Next, we want to register this data in a temporary table so we can query it using Spark's SQLContext.
data.registerTempTable("data")
Now although we instructed Spark where the data was and to register it, it did not actually read anything. This is because Spark is lazy: it will postpone doing any task until it the last moment. Requesting data will force it to load it, as follows:
data.take(1)
[Row(landcover=u'Water_bodies', zone=3219, date=u'2007-11-01', BIOPAR_ALB_BHV_V1=0.14052462320017708, BIOPAR_ALB_BHV_V1_area=11163, BIOPAR_ALB_BHV_V1_validArea=11076, BIOPAR_ALB_DHV_V1=0.14695226555861848, BIOPAR_ALB_DHV_V1_area=11163, BIOPAR_ALB_DHV_V1_validArea=11076, BIOPAR_BA_V1=0.0009853981904505958, BIOPAR_BA_V1_area=11163, BIOPAR_BA_V1_validArea=11, BIOPAR_DMP=None, BIOPAR_DMP_area=None, BIOPAR_DMP_validArea=None, BIOPAR_FAPAR_V1=0.21222769551344417, BIOPAR_FAPAR_V1_area=11163, BIOPAR_FAPAR_V1_validArea=10995, BIOPAR_FCOVER_V1=0.11557305349856734, BIOPAR_FCOVER_V1_area=11163, BIOPAR_FCOVER_V1_validArea=10995, BIOPAR_LAI_V1=0.28491411937776656, BIOPAR_LAI_V1_area=11163, BIOPAR_LAI_V1_validArea=10995, BIOPAR_NDVI_V1=0.27314899691044103, BIOPAR_NDVI_V1_area=11163, BIOPAR_NDVI_V1_validArea=10995, BIOPAR_NDVI_V2=None, BIOPAR_NDVI_V2_area=None, BIOPAR_NDVI_V2_validArea=None, BIOPAR_SWI10_V3=None, BIOPAR_SWI10_V3_area=None, BIOPAR_SWI10_V3_validArea=None, BIOPAR_TOCR_B0=None, BIOPAR_TOCR_B0_area=None, BIOPAR_TOCR_B0_validArea=None, BIOPAR_TOCR_B2=None, BIOPAR_TOCR_B2_area=None, BIOPAR_TOCR_B2_validArea=None, BIOPAR_TOCR_B3=None, BIOPAR_TOCR_B3_area=None, BIOPAR_TOCR_B3_validArea=None, BIOPAR_TOCR_MIR=None, BIOPAR_TOCR_MIR_area=None, BIOPAR_TOCR_MIR_validArea=None, BIOPAR_VCI=None, BIOPAR_VCI_area=None, BIOPAR_VCI_validArea=None, BIOPAR_VPI=None, BIOPAR_VPI_area=None, BIOPAR_VPI_validArea=None, BIOPAR_WB_V1=None, BIOPAR_WB_V1_area=None, BIOPAR_WB_V1_validArea=None, BIOPAR_WB_V2=None, BIOPAR_WB_V2_area=None, BIOPAR_WB_V2_validArea=None, PROBAV_L3_S10_TOC_RED_333M=None, PROBAV_L3_S10_TOC_RED_333M_area=None, PROBAV_L3_S10_TOC_RED_333M_validArea=None, PROBAV_L3_S10_TOC_NIR_333M=None, PROBAV_L3_S10_TOC_NIR_333M_area=None, PROBAV_L3_S10_TOC_NIR_333M_validArea=None, PROBAV_L3_S10_TOC_BLUE_333M=None, PROBAV_L3_S10_TOC_BLUE_333M_area=None, PROBAV_L3_S10_TOC_BLUE_333M_validArea=None, PROBAV_L3_S10_TOC_SWIR_333M=None, PROBAV_L3_S10_TOC_SWIR_333M_area=None, PROBAV_L3_S10_TOC_SWIR_333M_validArea=None, PROBAV_L3_S10_TOC_NDVI_333M=None, PROBAV_L3_S10_TOC_NDVI_333M_area=None, PROBAV_L3_S10_TOC_NDVI_333M_validArea=None, RAINFALL=2.1919554834854207, RAINFALL_area=1406, RAINFALL_validArea=1406)]
Using Spark's SQLContext, we can query our data with SQL-like expressions. Here we are interested in the zone, landcover, date, FAPAR and area columns.
We will also filter out unwanted rows by specifying that we do not want rows for which we do not have a FAPAR value.
data = sqlCtx.sql("SELECT zone, landcover, date date, BIOPAR_FAPAR_V1 fapar, BIOPAR_FAPAR_V1 " +
"FROM data " +
"WHERE BIOPAR_FAPAR_V1 is not null").cache()
data.take(1)
[Row(zone=3219, landcover=u'Water_bodies', date=u'2006-08-11', fapar=0.29711740277250515, BIOPAR_FAPAR_V1=0.29711740277250515)]
As you can see, although null values have been filtered out, some FAPAR values still remain nan. We can filter them out using the filter method from Spark. For that, we can either use the DataFrame's method filter, or we can use the DataFrame as a regular RDD.
The difference is that the DataFrame contains Rows, for which we can refer to the columns with their names, whereas the RDD contains tuples, where we have to select columns based on their index. However, with a regular RDD, we can pass any method we want to filter, map, reduce, ..., whereas with a DataFrame, we have to use pre-defined functions.
def fapar_not_nan(row):
return row[3] is not None and not np.isnan(row[3])
withoutNaNs = data.rdd.filter(fapar_not_nan).cache()
withoutNaNs.take(1)
[Row(zone=3219, landcover=u'Water_bodies', date=u'2006-06-21', fapar=0.24140815792410703, BIOPAR_FAPAR_V1=0.24140815792410703)]
For grouping, we could use spark's groupBy method, or we could first map our data to a (Key, Value) pair and use the groupByKey method. We will use the latter method as we have more fined control over what goes into Value.
def by_zone(row):
return ((row[0]), (row[1], row[2], row[3], row[4]))
by_zone = withoutNaNs.map(by_zone).groupByKey().cache()
by_zone.take(1)
[(672, <pyspark.resultiterable.ResultIterable at 0x601efd0>)]
The result is a list of pairs (Zone, Values) where Values is a list containing all the values for the specific Zone we grouped by.
Spark will combine the different values into a ResultIterable object, not really suited for the analysis we will be doing. Instead, we are going to use Pandas and Numpy. So first thing we want to do is to convert our ResultIterable into an actual Pandas DataFrame.
def iterable_to_pd(row):
return (row[0], pd.DataFrame(list(row[1]), columns=["landcover", "date", "fapar", "area"]))
by_zone = by_zone.map(iterable_to_pd).cache()
by_zone.take(1)
[(672, landcover date \ 0 Tree_cover_broadleaved_evergreen_closed_to_open 2000-07-11 1 Tree_cover_broadleaved_evergreen_closed_to_open 2006-11-01 2 Tree_cover_broadleaved_evergreen_closed_to_open 2010-11-21 3 Tree_cover_broadleaved_evergreen_closed_to_open 2011-02-11 4 Mosaic_herbaceous_cover__tree_and_shrub 2000-06-01 5 Mosaic_herbaceous_cover__tree_and_shrub 2004-06-21 6 Mosaic_herbaceous_cover__tree_and_shrub 2010-10-11 7 Mosaic_herbaceous_cover__tree_and_shrub 2011-01-01 8 Mosaic_herbaceous_cover__tree_and_shrub 2015-01-21 9 Sparse_vegetation_tree_shrub_herbaceous_cover 2000-04-21 10 Sparse_vegetation_tree_shrub_herbaceous_cover 2006-08-11 11 Sparse_vegetation_tree_shrub_herbaceous_cover 2012-12-01 12 Sparse_vegetation_tree_shrub_herbaceous_cover 2016-12-21 13 Shrubland 2002-10-11 14 Shrubland 2003-01-01 15 Shrubland 2007-01-21 16 Shrubland 2013-05-11 17 Cropland_rainfed 1999-05-01 18 Cropland_rainfed 2003-05-21 19 Cropland_rainfed 2009-09-11 20 Mosaic_natural_vegetation 2008-04-11 21 Mosaic_natural_vegetation 2014-08-01 22 Urban_areas 2002-04-01 23 Urban_areas 2006-04-21 24 Urban_areas 2012-08-11 25 Tree_cover_broadleaved_deciduous_closed_to_open 2001-02-21 26 Tree_cover_broadleaved_deciduous_closed_to_open 2007-06-11 27 Tree_cover_broadleaved_deciduous_closed_to_open 2013-10-01 28 Water_bodies 2000-05-11 29 Water_bodies 2006-09-01 ... ... ... 11685 Mosaic_tree_and_shrub_herbaceous_cover 2002-01-01 11686 Mosaic_tree_and_shrub_herbaceous_cover 2006-01-21 11687 Mosaic_tree_and_shrub_herbaceous_cover 2012-05-11 11688 Shrub_or_herbaceous_cover_flooded_fresh_saline... 2000-06-11 11689 Shrub_or_herbaceous_cover_flooded_fresh_saline... 2006-10-01 11690 Shrub_or_herbaceous_cover_flooded_fresh_saline... 2010-10-21 11691 Shrub_or_herbaceous_cover_flooded_fresh_saline... 2011-01-11 11692 Tree_cover_broadleaved_deciduous_open 2002-06-11 11693 Tree_cover_broadleaved_deciduous_open 2008-10-01 11694 Tree_cover_broadleaved_deciduous_open 2012-10-21 11695 Tree_cover_broadleaved_deciduous_open 2013-01-11 11696 Mosaic_cropland 1999-02-21 11697 Mosaic_cropland 2005-06-11 11698 Mosaic_cropland 2011-10-01 11699 Mosaic_cropland 2015-10-21 11700 Mosaic_cropland 2016-01-11 11701 Tree_cover_broadleaved_deciduous_closed 2000-08-01 11702 Tree_cover_broadleaved_deciduous_closed 2004-08-21 11703 Tree_cover_broadleaved_deciduous_closed 2010-12-11 11704 Tree_cover_broadleaved_deciduous_closed 2011-03-01 11705 Tree_cover_broadleaved_deciduous_closed 2015-03-21 11706 Grassland 2004-12-11 11707 Grassland 2005-03-01 11708 Grassland 2009-03-21 11709 Grassland 2015-07-11 11710 Tree_cover_flooded_saline_water 2000-04-11 11711 Tree_cover_flooded_saline_water 2006-08-01 11712 Tree_cover_flooded_saline_water 2010-08-21 11713 Tree_cover_flooded_saline_water 2016-12-11 11714 Tree_cover_flooded_saline_water 2017-03-01 fapar area 0 0.781023 0.781023 1 0.739872 0.739872 2 0.714261 0.714261 3 0.795091 0.795091 4 0.682756 0.682756 5 0.617121 0.617121 6 0.384000 0.384000 7 0.568926 0.568926 8 0.498292 0.498292 9 0.512471 0.512471 10 0.376571 0.376571 11 0.415657 0.415657 12 0.412000 0.412000 13 0.553395 0.553395 14 0.598780 0.598780 15 0.606932 0.606932 16 0.673394 0.673394 17 0.674976 0.674976 18 0.636602 0.636602 19 0.527672 0.527672 20 0.731046 0.731046 21 0.628583 0.628583 22 0.549303 0.549303 23 0.660212 0.660212 24 0.509302 0.509302 25 0.627513 0.627513 26 0.692247 0.692247 27 0.549743 0.549743 28 0.559624 0.559624 29 0.394943 0.394943 ... ... ... 11685 0.469723 0.469723 11686 0.665650 0.665650 11687 0.653948 0.653948 11688 0.638316 0.638316 11689 0.405949 0.405949 11690 0.503323 0.503323 11691 0.501129 0.501129 11692 0.602000 0.602000 11693 0.296000 0.296000 11694 0.546000 0.546000 11695 0.572000 0.572000 11696 0.630002 0.630002 11697 0.814023 0.814023 11698 0.605181 0.605181 11699 0.597991 0.597991 11700 0.645325 0.645325 11701 0.576503 0.576503 11702 0.611666 0.611666 11703 0.615693 0.615693 11704 0.598025 0.598025 11705 0.624073 0.624073 11706 0.503609 0.503609 11707 0.602599 0.602599 11708 0.483708 0.483708 11709 0.498409 0.498409 11710 0.722473 0.722473 11711 0.675400 0.675400 11712 0.662609 0.662609 11713 0.610634 0.610634 11714 0.619562 0.619562 [11715 rows x 4 columns])]
Now that we are ready to write our Combine task, let's detail the algorithm we are going to write to process each zone:
def combine_fapar(zone):
series = zone[1]
series["date"] = pd.to_datetime(series["date"])
series = series.set_index("date")
# Group the dataset by landcover, and for each:
by_landcover = series.groupby("landcover")
# Resample the data monthly (instead of 10-daily) with padding in order to fill some void if there are any
by_landcover = by_landcover.resample("1m", how="mean").fillna(method='pad')
series = by_landcover.reset_index()
# Reweight the FAPAR values by the area the landcover takes over the total area
total_area = by_landcover["area"].mean()
series["fapar"] *= series["area"] / total_area
series = series.drop("area", axis=1)
# Regroup the dataset by date and take the (monthly) sum
series = series.groupby("date").sum()
# Resample the data yearly and remove the 1st and last year of data (to only include full years)
yearly_sum = pd.DataFrame(series["fapar"].resample("12m", "mean")[1:-1])
return (zone[0], series, yearly_sum)
combined = by_zone.map(combine_fapar).cache()
combined.take(1)
[(672, fapar date 1999-01-31 9.717959 1999-02-28 8.868936 1999-03-31 10.370095 1999-04-30 12.420800 1999-05-31 12.032194 1999-06-30 9.968254 1999-07-31 9.284237 1999-08-31 8.657068 1999-09-30 6.982681 1999-10-31 5.440120 1999-11-30 5.537638 1999-12-31 8.929892 2000-01-31 10.512340 2000-02-29 11.302637 2000-03-31 13.014589 2000-04-30 14.539168 2000-05-31 14.625471 2000-06-30 13.036348 2000-07-31 10.837346 2000-08-31 9.071030 2000-09-30 6.932734 2000-10-31 8.162503 2000-11-30 7.231002 2000-12-31 10.725195 2001-01-31 11.366307 2001-02-28 10.276144 2001-03-31 10.517193 2001-04-30 10.990723 2001-05-31 10.960789 2001-06-30 11.189507 ... ... 2014-10-31 8.205650 2014-11-30 8.801627 2014-12-31 11.202606 2015-01-31 10.501179 2015-02-28 9.387215 2015-03-31 10.382771 2015-04-30 11.646911 2015-05-31 11.264060 2015-06-30 12.828613 2015-07-31 11.889360 2015-08-31 10.821384 2015-09-30 8.733411 2015-10-31 7.264628 2015-11-30 6.536714 2015-12-31 7.512584 2016-01-31 7.997100 2016-02-29 10.642866 2016-03-31 9.514068 2016-04-30 9.566114 2016-05-31 8.566643 2016-06-30 7.837038 2016-07-31 8.002204 2016-08-31 6.676644 2016-09-30 5.125340 2016-10-31 5.245660 2016-11-30 6.372412 2016-12-31 9.893308 2017-01-31 10.128927 2017-02-28 9.886582 2017-03-31 10.078414 [219 rows x 1 columns], fapar date 2000-01-31 9.083688 2001-01-31 10.903694 2002-01-31 9.291626 2003-01-31 10.591492 2004-01-31 9.401098 2005-01-31 11.984670 2006-01-31 12.621007 2007-01-31 10.920098 2008-01-31 10.022505 2009-01-31 10.516419 2010-01-31 11.414949 2011-01-31 11.175359 2012-01-31 11.270612 2013-01-31 11.411795 2014-01-31 11.138742 2015-01-31 10.768098 2016-01-31 9.688729 2017-01-31 8.130935)]
Finally, we can compute the trend using a linear regression (OLS).
def min_years(datas):
return len(datas[2]) >=5
def ols(datas):
yearly_series = datas[2]
x, y = yearly_series.index.astype(int) // (10**9 * 60 * 24 * 30), yearly_series["fapar"]
X = np.c_[np.ones(len(x)), x]
return (datas[0], datas[1], datas[2], np.linalg.pinv(X).dot(y))
trends = combined.filter(min_years).map(ols).cache()
Next we will plot the trends. We are interested in the highest and lowest trends in our data.
Let's define our plotting function and plot the trends by descending order.
#copy auxiliary data from hdfs
!hdfs dfs -copyToLocal /tapdata/GAUL2013/zonecodes.csv ./
17/03/22 13:25:52 WARN hdfs.DFSClient: DFSInputStream has been closed already
zonecodes = pd.read_csv("./zonecodes.csv")
def plotTopTrends(trendsCollected):
for zone, series, yearly_series, trend in trendsCollected:
if not np.isnan(trend[0]) and not np.isnan(trend[1]):
codes = zonecodes[zonecodes['ADM1_CODE'] == zone][["ADM0_NAME", "ADM1_NAME"]]
codes["Trend"] = trend[1]
print codes
pdf = series.copy()
pdf["ols"] = pdf.index.astype(int) / (10**9 * 60 * 24 * 30) * trend[1] + trend[0]
pdf[["fapar", "ols"]].plot(figsize=(16,4))
#sns.regplot(x="unix_date", y="fapar", data=series, scatter=False)
plt.show()
plotTopTrends(trends.sortBy(lambda x: x[3][1], ascending=False).take(10))
ADM0_NAME ADM1_NAME Trend 117 American Samoa Administrative unit not available 0.001467
/opt/rh/python27/root/usr/lib64/python2.7/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'sans-serif'] not found. Falling back to DejaVu Sans (prop.get_family(), self.defaultFamily[fontext]))
ADM0_NAME ADM1_NAME Trend 793 Egypt Al Bahr/al Ahmar (redsea) 0.000554
ADM0_NAME ADM1_NAME Trend 1311 Kiribati Administrative unit not available 0.000551
ADM0_NAME ADM1_NAME Trend 1662 Montserrat Plymouth 0.000495
ADM0_NAME ADM1_NAME Trend 1827 Pakistan Punjab 0.000426
ADM0_NAME ADM1_NAME Trend 1101 India Rajasthan 0.000421
ADM0_NAME ADM1_NAME Trend 805 Egypt As Ismailiyah (ismailia) 0.000421
ADM0_NAME ADM1_NAME Trend 104 Algeria Saida 0.000418
ADM0_NAME ADM1_NAME Trend 112 Algeria Tiaret 0.000405
ADM0_NAME ADM1_NAME Trend 2561 Turkey Denizli 0.0004
plotTopTrends(trends.sortBy(lambda x: x[3][1], ascending=True).take(10))
ADM0_NAME ADM1_NAME Trend 574 Christmas Island Administrative unit not available -0.001041
ADM0_NAME ADM1_NAME Trend 904 French Polynesia Administrative unit not available -0.000726
ADM0_NAME ADM1_NAME Trend 3314 Martinique La Trinite -0.000519
ADM0_NAME ADM1_NAME Trend 731 Djibouti Obock -0.000301
ADM0_NAME ADM1_NAME Trend 3102 Guinea Conakry -0.000285
ADM0_NAME ADM1_NAME Trend 984 Grenada St. George's -0.000231
ADM0_NAME ADM1_NAME Trend 1674 Mozambique Inhambane -0.000223
ADM0_NAME ADM1_NAME Trend 1676 Mozambique Maputo -0.000222
Now we can create a world map and color each zone by its trend. First, let's only take each region and its corresponding trend.
def mean_trends(zone):
return (zone[0], zone[3][1])
trend_by_zone = dict(trends.map(mean_trends).filter(lambda x: not np.isnan(x[1])).collect())
We will also draw a colormap but by default matplotlib will center the colormap based on the mean of the values. We would like to center the colormap on 0.0, so as to fill zones with no trend (0) in yellow, zones with an upwards trend in green and zones with a downwards trend in red.
We define a function to shift the colormap and center it around the point we want.
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from matplotlib.patches import PathPatch
from matplotlib.cm import ScalarMappable
from mpl_toolkits.axes_grid1 import AxesGrid
def shiftedColorMap(cmap, start=0, midpoint=0.5, stop=1.0, name='shiftedcmap'):
'''
Function to offset the "center" of a colormap. Useful for
data with a negative min and positive max and you want the
middle of the colormap's dynamic range to be at zero
Input
-----
cmap : The matplotlib colormap to be altered
start : Offset from lowest point in the colormap's range.
Defaults to 0.0 (no lower ofset). Should be between
0.0 and `midpoint`.
midpoint : The new center of the colormap. Defaults to
0.5 (no shift). Should be between 0.0 and 1.0. In
general, this should be 1 - vmax/(vmax + abs(vmin))
For example if your data range from -15.0 to +5.0 and
you want the center of the colormap at 0.0, `midpoint`
should be set to 1 - 5/(5 + 15)) or 0.75
stop : Offset from highets point in the colormap's range.
Defaults to 1.0 (no upper ofset). Should be between
`midpoint` and 1.0.
'''
cdict = {
'red': [],
'green': [],
'blue': [],
'alpha': []
}
# regular index to compute the colors
reg_index = np.linspace(start, stop, 257)
# shifted index to match the data
shift_index = np.hstack([
np.linspace(0.0, midpoint, 128, endpoint=False),
np.linspace(midpoint, 1.0, 129, endpoint=True)
])
for ri, si in zip(reg_index, shift_index):
r, g, b, a = cmap(ri)
cdict['red'].append((si, r, r))
cdict['green'].append((si, g, g))
cdict['blue'].append((si, b, b))
cdict['alpha'].append((si, a, a))
newcmap = matplotlib.colors.LinearSegmentedColormap(name, cdict)
plt.register_cmap(cmap=newcmap)
return newcmap
Now we can actually draw the map.
!mkdir GAUL2013
!hdfs dfs -copyToLocal /tapdata/GAUL2013/GAUL1* ./GAUL2013/
def plot_map(zones_trend, lllon=-180, lllat=-90, urlon=180, urlat=90):
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111)
map = Basemap(llcrnrlon=lllon, llcrnrlat=lllat,
urcrnrlon=urlon, urcrnrlat=urlat,
resolution='c',
projection='cyl')
#map.drawmapboundary(fill_color='aqua')
#map.fillcontinents(color='#ddaa66', lake_color='aqua')
#map.drawcoastlines()
map.readshapefile('./GAUL2013/GAUL1', 'GAUL1')
patches = []
colors = []
for info, shape in zip(map.GAUL1_info, map.GAUL1):
zone_trend = zones_trend.get(info['ADM1_CODE'])
if zone_trend is not None:
patches.append( Polygon(np.array(shape), True ))
colors.append(zone_trend)
vmax = np.percentile(colors, 95)
vmin = np.percentile(colors, 5)
midpoint = 1 - vmax/(vmax + abs(vmin))
colormap = shiftedColorMap(plt.get_cmap('RdYlGn'), midpoint=midpoint, name='shifted')
pc = PatchCollection(patches, cmap=colormap, linewidths=1., zorder=2)
pc.set_array(np.array(colors))
pc.set_clim([vmin, vmax])
ax.add_collection(pc)
plt.colorbar(pc)
plt.show()
17/03/22 13:33:13 WARN hdfs.DFSClient: DFSInputStream has been closed already 17/03/22 13:33:13 WARN hdfs.DFSClient: DFSInputStream has been closed already 17/03/22 13:33:13 WARN hdfs.DFSClient: DFSInputStream has been closed already 17/03/22 13:33:13 WARN hdfs.DFSClient: DFSInputStream has been closed already 17/03/22 13:33:14 WARN hdfs.DFSClient: DFSInputStream has been closed already 17/03/22 13:33:14 WARN hdfs.DFSClient: DFSInputStream has been closed already 17/03/22 13:33:14 WARN hdfs.DFSClient: DFSInputStream has been closed already
This is what it looks like worldwide.
plot_map(trend_by_zone)
Europe:
plot_map(trend_by_zone, lllon=-31.5, lllat=27, urlon=39, urlat=80)
Africa:
plot_map(trend_by_zone, lllon=-21.6, lllat=-38.7, urlon=55.2, urlat=39.8)