
Based on issue #73: request: process files from BMKG data

Description:

  • process the Excel files obtained from BMKG's online data so they are ready to use
  • check the condition of the data

Expected functions:

General

  • Check whether the data is complete. If not, which variables and on which dates?
  • Check whether there are "no data / not observed" values (9999) or "not measured" values (8888). If so, which variables and on which dates?
  • Display the "slices" of rows that have no data / no measurement.

DATASET

In [0]:
# ACCESS GOOGLE DRIVE
from google.colab import drive
drive.mount('/content/gdrive')
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
In [0]:
# DRIVE PATH
DRIVE_DROP_PATH = '/content/gdrive/My Drive/Colab Notebooks/_dropbox'
DRIVE_DATASET_PATH = '/content/gdrive/My Drive/Colab Notebooks/_dataset/uma_pamarayan'
In [0]:
DATASET_PATH = DRIVE_DATASET_PATH + '/klimatologi_geofisika_tangerang_1998_2009.xlsx'

FUNCTIONS

In [0]:
import pandas as pd
import numpy as np
from operator import itemgetter
from itertools import groupby

# Read a BMKG Excel export: skip the 8-row header block and the 16-row
# footer notes, and use the 'Tanggal' column (dd-mm-yyyy) as a datetime index
def _read_bmkg(io):
    return pd.read_excel(
        io, skiprows=8, skipfooter=16, header=0, index_col=0, parse_dates=True,
        date_parser=lambda x: pd.to_datetime(x, format='%d-%m-%Y')
    )

def _have_nan(dataset):
    # True if any cell in the dataframe is NaN
    return bool(dataset.isna().any().any())

def _get_index1D(array1D_bool):
    # Positions (integer locations) where a 1-D boolean array is True
    return np.argwhere(array1D_bool).reshape(-1,)

def _get_nan(dataset):
    # Map each column name to the list of positions holding NaN
    nan = {}

    for col in dataset.columns:
        nan[col] = _get_index1D(dataset[col].isna().values).tolist()

    return nan

def _get_missing(dataset):
    # Map each column name to the positions flagged as not measured (8888)
    # or no data (9999)
    missing = {}

    for col in dataset.columns:
        masking = (dataset[col] == 8888) | (dataset[col] == 9999)
        missing[col] = _get_index1D(masking.values)

    return missing

def _check_nan(dataset):
    if _have_nan(dataset):
        return _get_nan(dataset)
    else:
        return None

def _get_nan_columns(dataset):
    # Names of the columns that contain at least one NaN
    return dataset.columns[dataset.isna().any()].tolist()

def _group_as_list(array):

    # based on https://stackoverflow.com/a/15276206 
    group_list = []
    for _, g in groupby(enumerate(array), lambda x: x[0]-x[1]):
        single_list = sorted(list(map(itemgetter(1), g)))
        group_list.append(single_list)
    
    return group_list

def _group_as_index(
    group_list, index=None, date_format='%Y%m%d',
    format_date='{}-{}'
):
    # Translate grouped positions into labels from `index`; a DatetimeIndex
    # is rendered with `date_format`, and runs of more than one entry are
    # joined using the `format_date` template.
    group_index = []
    date_index = isinstance(index, pd.DatetimeIndex)

    for item in group_list:
        if len(item) == 1:
            if date_index:
                group_index.append(index[item[0]].strftime(date_format))
            else:
                group_index.append(index[item[0]])
        else:
            if date_index:
                group_index.append(
                    format_date.format(
                        index[item[0]].strftime(date_format),
                        index[item[-1]].strftime(date_format)
                    )
                )
            else:
                group_index.append(
                    format_date.format(
                        index[item[0]], index[item[-1]]
                    )
                )
            
    return group_index
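
Note: newer pandas releases deprecate the date_parser argument used in _read_bmkg(). A minimal, version-tolerant sketch (assuming the same 8-row header and 16-row footer layout; the helper name _read_bmkg_compat is not part of the original notebook) reads the sheet first and parses the index afterwards:

# Hypothetical alternative to _read_bmkg() that avoids date_parser;
# it parses the 'Tanggal' index after reading the sheet.
def _read_bmkg_compat(io):
    dataset = pd.read_excel(io, skiprows=8, skipfooter=16, header=0, index_col=0)
    dataset.index = pd.to_datetime(dataset.index, format='%d-%m-%Y')
    return dataset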

USAGE

Function _read_bmkg()

Purpose: Import a BMKG Excel file into a dataframe.

In [0]:
dataset = _read_bmkg(DATASET_PATH)
dataset.head()
Out[0]:
Tn Tx Tavg RH_avg RR ss ff_x ddd_x ff_avg ddd_car
Tanggal
1998-03-01 24.0 32.0 27.7 89.0 0.0 5.0 2.0 78.0 2.0 W
1998-03-02 23.9 31.2 27.6 92.0 16.0 2.0 4.0 283.0 3.0 SE
1998-03-03 24.2 33.0 27.3 90.0 7.0 2.0 5.0 330.0 2.0 S
1998-03-04 24.0 30.4 26.9 90.0 1.0 0.2 3.0 285.0 0.0 SW
1998-03-05 24.0 32.0 27.4 89.0 0.0 3.7 0.0 345.0 0.0 E
In [0]:
dataset.tail()
Out[0]:
Tn Tx Tavg RH_avg RR ss ff_x ddd_x ff_avg ddd_car
Tanggal
2008-12-28 24.7 33.0 28.6 78.0 0.0 5.1 13.0 270.0 3.0 W
2008-12-29 23.2 29.4 26.7 84.0 1.0 0.0 15.0 270.0 8.0 W
2008-12-30 24.6 31.2 27.2 82.0 2.0 1.5 13.0 315.0 4.0 W
2008-12-31 24.6 32.2 27.5 82.0 3.0 1.5 13.0 270.0 4.0 W
2009-01-01 24.5 32.1 26.2 90.0 18.0 1.6 8.0 270.0 2.0 W

Function _have_nan()

Purpose: Check whether the table contains any missing values (np.nan).

In [0]:
_have_nan(dataset)
Out[0]:
True

Function _get_index1D()

Purpose: Obtain the positions of missing data from a 1-D boolean array.

In [0]:
_get_index1D(dataset['RH_avg'].isna().values)
Out[0]:
array([ 852, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046,
       1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057,
       1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1220,
       1221, 1222, 1223, 1224, 1628, 1629, 1697, 2657])

Function _get_nan()

Purpose: Obtain the positions of missing data for each column, as a dictionary.

In [0]:
_get_nan(dataset).keys()
Out[0]:
dict_keys(['Tn', 'Tx', 'Tavg', 'RH_avg', 'RR', 'ss', 'ff_x', 'ddd_x', 'ff_avg', 'ddd_car'])
In [0]:
print(_get_nan(dataset)['RH_avg'])
[852, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1220, 1221, 1222, 1223, 1224, 1628, 1629, 1697, 2657]

Function _get_nan_columns()

Purpose: Obtain the names of the columns that contain missing (NaN) values.

In [0]:
_get_nan_columns(dataset)
Out[0]:
['RH_avg', 'ss', 'ff_x', 'ddd_x', 'ff_avg', 'ddd_car']

Function _check_nan()

Purpose: A combination of _have_nan() and _get_nan(). Checks whether the dataset contains NaN; if it does, returns the result of _get_nan(), otherwise returns None.

In [0]:
_check_nan(dataset).items()
Out[0]:
dict_items([('Tn', []), ('Tx', []), ('Tavg', []), ('RH_avg', [852, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1220, 1221, 1222, 1223, 1224, 1628, 1629, 1697, 2657]), ('RR', []), ('ss', [97, 103, 113, 117, 131, 132, 133, 139, 148, 207, 212, 238, 273, 308, 313, 314, 315, 324, 332, 333, 340, 343, 344, 349, 354, 357, 359, 360, 363, 379, 604, 609, 610, 616, 617, 623, 642, 655, 656, 663, 664, 665, 669, 678, 687, 690, 696, 700, 701, 708, 709, 710, 712, 721, 733, 768, 824, 825, 840, 949, 976, 977, 978, 984, 985, 986, 1000, 1068, 1073, 1074, 1077, 1080, 1084, 1087, 1089, 1101, 1125, 1135, 1174, 1182, 1185, 1187, 1199, 1355, 1363, 1366, 1384, 1400, 1401, 1408, 1411, 1419, 1429, 1430, 1431, 1436, 1437, 1441, 1443, 1445, 1446, 1454, 1456, 1468, 1470, 1479, 1492, 1531, 1749, 1762, 1764, 1792, 1803, 1810, 1818, 1823, 1840, 1844, 1864, 1938, 2028, 2070, 2086, 2090, 2091, 2096, 2105, 2112, 2117, 2120, 2121, 2122, 2125, 2129, 2157, 2165, 2166, 2167, 2169, 2174, 2175, 2177, 2178, 2179, 2180, 2181, 2182, 2196, 2198, 2199, 2209, 2254, 2293, 2295, 2415, 2416, 2417, 2457, 2461, 2467, 2489, 2500, 2517, 2522, 2526, 2534, 2538, 2539, 2540, 2550, 2553, 2565, 2575, 2663, 2664, 2665, 2666, 2710, 2712, 2714, 2757, 2787, 2788, 2826, 2827, 2846, 2847, 2848, 2849, 2857, 2859, 2865, 2879, 2883, 2885, 2921, 2922, 2923, 2925, 2940, 2941, 3004, 3020, 3028, 3220, 3259, 3594, 3595, 3608, 3624, 3630, 3631, 3632, 3633, 3634, 3635, 3636, 3637, 3638, 3641, 3642, 3644, 3645, 3646, 3648, 3649, 3651, 3652, 3695, 3697, 3705, 3709, 3758, 3759, 3761, 3890]), ('ff_x', [1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1706, 1707, 1708, 1709, 1710, 1711, 1712, 1713, 1714, 1715, 1716, 1717, 1718, 1719, 1720, 1721, 1722, 1723, 1724, 1725, 1726, 1727, 1728, 1729, 1730, 1731, 1732, 1733, 1734, 1735, 1736, 1737, 1738, 1739, 1740, 1741, 1742, 1743, 1744, 1745, 1746, 1747, 1748, 1749, 1750, 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761, 1762, 1763, 1764, 1765, 1766]), ('ddd_x', [1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1706, 1707, 1708, 1709, 1710, 1711, 1712, 1713, 1714, 1715, 1716, 1717, 1718, 1719, 1720, 1721, 1722, 1723, 1724, 1725, 1726, 1727, 1728, 1729, 1730, 1731, 1732, 1733, 1734, 1735, 1736, 1737, 1738, 1739, 1740, 1741, 1742, 1743, 1744, 1745, 1746, 1747, 1748, 1749, 1750, 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761, 1762, 1763, 1764, 1765, 1766, 3286]), ('ff_avg', [1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1286, 1699, 1705, 1706, 1707, 1708, 1709, 1710, 1711, 1712, 1713, 1714, 1715, 1716, 1717, 1718, 1719, 1720, 1721, 1722, 1723, 1724, 1725, 1726, 1727, 1728, 1729, 1730, 1731, 1732, 1733, 1734, 1735, 1736, 1737, 1738, 1739, 1740, 1741, 1742, 1743, 1744, 1745, 1746, 1747, 1748, 1749, 1750, 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761, 1762, 1763, 1764, 1765, 1766, 3396]), ('ddd_car', [1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 
1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1286, 1699, 1705, 1706, 1707, 1708, 1709, 1710, 1711, 1712, 1713, 1714, 1715, 1716, 1717, 1718, 1719, 1720, 1721, 1722, 1723, 1724, 1725, 1726, 1727, 1728, 1729, 1730, 1731, 1732, 1733, 1734, 1735, 1736, 1737, 1738, 1739, 1740, 1741, 1742, 1743, 1744, 1745, 1746, 1747, 1748, 1749, 1750, 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761, 1762, 1763, 1764, 1765, 1766])])
In [0]:
# If the dataset has no NaN values
print(_check_nan(dataset.drop(_get_nan_columns(dataset), axis=1)))
None

Function _group_as_list()

Purpose: Group consecutive values (runs of sequential numbers) from an array into separate lists.

Reference: https://stackoverflow.com/a/15276206 (modified for Python 3.x and readability)

In [0]:
missing_dict = _get_nan(dataset)
missing_RH_avg = missing_dict['RH_avg']
print(missing_RH_avg)
[852, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1220, 1221, 1222, 1223, 1224, 1628, 1629, 1697, 2657]
In [0]:
print(_group_as_list(missing_RH_avg))
[[852], [1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067], [1220, 1221, 1222, 1223, 1224], [1628, 1629], [1697], [2657]]
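
As a quick illustration on a hand-made list (not taken from the dataset), consecutive values collapse into a single sub-list while each gap starts a new one:

# Illustrative only: runs of consecutive values become one sub-list
print(_group_as_list([1, 2, 3, 7, 8, 10]))
# [[1, 2, 3], [7, 8], [10]]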

Function _group_as_index()

Purpose: Convert the grouping result into the dataset's index type (in this case dates, rather than the dataset's integer positions).

In [0]:
_group_as_index(_group_as_list(missing_RH_avg), index=dataset.index, date_format='%d %b %Y')
Out[0]:
['30 Jun 2000',
 '01 Jan 2001-31 Jan 2001',
 '03 Jul 2001-07 Jul 2001',
 '15 Aug 2002-16 Aug 2002',
 '23 Oct 2002',
 '09 Jun 2005']
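
When index is not a DatetimeIndex, the positions are formatted directly with the format_date template. A small sketch using a plain integer index (purely illustrative, not part of the original notebook):

# Illustrative only: a non-date index falls back to plain formatting
print(_group_as_index(_group_as_list([1, 2, 3, 7, 8]), index=list(range(10))))
# ['1-3', '7-8']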

Function _get_missing()

Purpose: Obtain, for each column, the positions whose values were not measured (coded 8888 or 9999).

In [0]:
_get_missing(dataset)
Out[0]:
{'RH_avg': array([], dtype=int64),
 'RR': array([2090, 2467, 2468, 2489, 2506, 2606, 3208, 3209, 3229, 3712, 3876,
        3904, 3946]),
 'Tavg': array([], dtype=int64),
 'Tn': array([], dtype=int64),
 'Tx': array([], dtype=int64),
 'ddd_car': array([], dtype=int64),
 'ddd_x': array([], dtype=int64),
 'ff_avg': array([], dtype=int64),
 'ff_x': array([], dtype=int64),
 'ss': array([], dtype=int64)}

Application

Displaying the problematic rows

Purpose: After obtaining the positions from _get_missing() or _get_nan(), display those slices of the dataframe.

In [0]:
dataset.iloc[_get_missing(dataset)['RR']]
Out[0]:
Tn Tx Tavg RH_avg RR ss ff_x ddd_x ff_avg ddd_car
Tanggal
2003-11-20 24.1 28.8 26.2 85.0 8888.0 NaN 1.0 90.0 1.0 E
2004-12-01 23.7 29.9 26.2 87.0 8888.0 NaN 10.0 270.0 0.0 W
2004-12-02 24.0 34.0 28.5 78.0 8888.0 2.0 10.0 270.0 0.0 W
2004-12-23 23.6 31.2 27.1 86.0 8888.0 NaN 6.0 270.0 0.0 W
2005-01-09 23.2 30.6 26.5 89.0 8888.0 4.4 3.0 270.0 0.0 W
2005-04-19 24.8 33.0 27.9 84.0 8888.0 5.1 10.0 315.0 0.0 E
2006-12-12 25.4 32.8 28.0 81.0 8888.0 5.0 5.0 45.0 2.0 N
2006-12-13 23.2 33.2 28.5 80.0 8888.0 4.2 5.0 45.0 2.0 N
2007-01-02 22.2 31.6 26.6 83.0 8888.0 5.6 15.0 270.0 4.0 W
2008-04-29 21.0 32.4 28.1 82.0 8888.0 6.0 2.0 360.0 0.0 N
2008-10-10 24.6 35.1 29.6 72.0 8888.0 7.3 10.0 360.0 4.0 N
2008-11-07 25.0 31.6 27.5 83.0 8888.0 2.5 3.0 270.0 0.0 W
2008-12-19 25.2 31.6 27.3 81.0 8888.0 4.0 5.0 360.0 2.0 N
In [0]:
_group_as_list(_get_missing(dataset)['RR'])
Out[0]:
[[2090],
 [2467, 2468],
 [2489],
 [2506],
 [2606],
 [3208, 3209],
 [3229],
 [3712],
 [3876],
 [3904],
 [3946]]
In [0]:
_group_as_index(_, index=dataset.index, date_format='%d %b %Y', format_date='{} sampai {}')
Out[0]:
['20 Nov 2003',
 '01 Dec 2004 sampai 02 Dec 2004',
 '23 Dec 2004',
 '09 Jan 2005',
 '19 Apr 2005',
 '12 Dec 2006 sampai 13 Dec 2006',
 '02 Jan 2007',
 '29 Apr 2008',
 '10 Oct 2008',
 '07 Nov 2008',
 '19 Dec 2008']
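
The pieces above can be chained into a single report. A minimal sketch (the helper name _report_problems is not part of the original functions) that lists, per column, the date ranges that are empty (NaN) or unmeasured (8888/9999):

# Hypothetical helper combining _get_nan()/_get_missing() with
# _group_as_list() and _group_as_index() into one per-column report
def _report_problems(dataset, date_format='%d %b %Y'):
    report = {}
    for label, getter in (('nan', _get_nan), ('unmeasured', _get_missing)):
        for col, positions in getter(dataset).items():
            positions = list(positions)
            if positions:
                report[(col, label)] = _group_as_index(
                    _group_as_list(positions),
                    index=dataset.index, date_format=date_format
                )
    return report

_report_problems(dataset)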

Changelog

- 20190928 - 1.0.0 - Initial

Copyright © 2019 Taruma Sakti Megariansyah

Source code in this notebook is licensed under an MIT License. Data in this notebook is licensed under a Creative Commons Attribution 4.0 International license.