Notebook

Pandas¶

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Ref : http://pandas.pydata.org/pandas-docs/stable/#pandas-powerful-python-data-analysis-toolkit

In [2]:

from __future__ import print_function

import pandas as pd

Creating Pandas Series and DataFrame¶

In [2]:

pd.Series?

In [11]:

pd.DataFrame?

In [12]:

ser = pd.Series(range(10, 20))
print(ser)

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64

In [6]:

indexed_ser = pd.Series([1,2,3], index=['a', 'b', 'c'])
print(indexed_ser)

a    1
b    2
c    3
dtype: int64

In [3]:

df = pd.DataFrame({'a': range(5), 'b': range(10, 15)})
print(df)

In [7]:

indexed_df = pd.DataFrame({'a': range(5), 'b': range(10, 15)}, index=['u', 'v', 'w', 'x', 'y'])
print(indexed_df)

Exercises¶

Understand the various keyword arguments that pandas.Series and pandas.DataFrame can take and their functionality.
Create a Pandas Series and a DataFrame from a Python list.
Set the column names on a Pandas DataFrame generated from a Python list.
Create a Pandas Series which is of boolean type.

Saving a Pandas Series to a file¶

Ref : See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html for more information on the various keyword arguments and their defaults on the Series/DataFrame.to_csv method. A number of other methods are also available on the Series/DataFrame objects that can be used to export the contents of the Series/DataFrame to various formats, such as -

DataFrame.to_excel
DataFrame.to_hdf
DataFrame.to_json
DataFrame.to_sql

You can read more about all such methods at - http://pandas.pydata.org/pandas-docs/stable/api.html#id12

In [6]:

ser.to_csv('pd_ser.csv', sep=',')

In [7]:

!cat pd_ser.csv

0,10
1,11
2,12
3,13
4,14
5,15
6,16
7,17
8,18
9,19

In [15]:

df.to_csv?

Exercises¶

Understand the various keyword arguments to pandas.Series.to_csv and pandas.DataFrame.to_csv and their functionality.
Understand the various methods available on the Series/DataFrame objects to save data into various file formats.
Write a Pandas Series to a file without the Index values.
Write a Pandas DataFrame to a csv file without a header column.
Write a Pandas DataFrame to an Excel file.

Reading a Pandas Series from a file¶

Ref : http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html for more information on the various keywords arguments and their defaults on the pandas.read_table function. A few other freqently used functions to generated DataFrames in pandas are

pandas.read_csv - which uses the , value as the default delimiter
pandas.read_excel
pandas.read_html
pandas.read_hdf

Ref : http://pandas.pydata.org/pandas-docs/stable/api.html#input-output

pd.read_csv will assume that the first row of data in the file contains columns names. Similarly, it assumes that the file doesn't contain any information about the index column.

In [8]:

ser = pd.read_table('pd_ser.csv', sep=',', index_col=0, header=None)
print(ser)

Exercises¶

Understand the various functions available in Pandas to read data from files into Series/DataFrames.
Read a Pandas DataFrame from a HTML page

Digging into the Pandas Series, DataFrame objects¶

Ref

In [11]:

print(dir(pd.Series(range(10))))

['T', '_AXIS_ALIASES', '_AXIS_IALIASES', '_AXIS_LEN', '_AXIS_NAMES', '_AXIS_NUMBERS', '_AXIS_ORDERS', '_AXIS_REVERSED', '_AXIS_SLICEMAP', '__abs__', '__add__', '__and__', '__array__', '__array_prepare__', '__array_priority__', '__array_wrap__', '__bool__', '__bytes__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__div__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__idiv__', '__imul__', '__init__', '__int__', '__invert__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__le__', '__len__', '__long__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__unicode__', '__weakref__', '__xor__', '_accessors', '_add_numeric_operations', '_add_series_only_operations', '_add_series_or_dataframe_operations', '_agg_by_level', '_align_frame', '_align_series', '_allow_index_ops', '_at', '_binop', '_box_item_values', '_can_hold_na', '_check_inplace_setting', '_check_is_chained_assignment_possible', '_check_percentile', '_check_setitem_copy', '_clear_item_cache', '_consolidate_inplace', '_construct_axes_dict', '_construct_axes_dict_for_slice', '_construct_axes_dict_from', '_construct_axes_from_arguments', '_constructor', '_constructor_expanddim', '_constructor_sliced', '_convert', '_create_indexer', '_dir_additions', '_dir_deletions', '_expand_axes', '_from_axes', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cacher', '_get_index_resolvers', '_get_item_cache', '_get_numeric_data', '_get_repr', '_get_values', '_get_values_tuple', '_get_with', '_iat', '_iget_item_cache', '_iloc', '_index', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_init_mgr', '_internal_names', '_internal_names_set', '_is_cached', '_is_datelike_mixed_type', '_is_mixed_type', '_is_numeric_mixed_type', '_is_view', '_ix', '_ixs', '_loc', '_make_cat_accessor', '_make_dt_accessor', '_make_str_accessor', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_needs_reindex_multi', '_protect_consolidate', '_reduce', '_reindex_axes', '_reindex_axis', '_reindex_indexer', '_reindex_multi', '_reindex_with_indexers', '_reset_cache', '_reset_cacher', '_set_as_cached', '_set_axis', '_set_axis_name', '_set_is_copy', '_set_item', '_set_labels', '_set_name', '_set_subtyp', '_set_values', '_set_with', '_set_with_engine', '_setup_axes', '_slice', '_stat_axis', '_stat_axis_name', '_stat_axis_number', '_typ', '_unpickle_series_compat', '_update_inplace', '_validate_dtype', '_values', '_where', '_xs', 'abs', 'add', 'add_prefix', 'add_suffix', 'align', 'all', 'any', 'append', 'apply', 'argmax', 'argmin', 'argsort', 'as_blocks', 'as_matrix', 'asfreq', 'asobject', 'asof', 'astype', 'at', 'at_time', 'autocorr', 'axes', 'base', 'between', 'between_time', 'bfill', 'blocks', 'bool', 'clip', 'clip_lower', 'clip_upper', 'combine', 'combine_first', 'compound', 'compress', 'consolidate', 'convert_objects', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'data', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'dropna', 'dtype', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'ewm', 'expanding', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'from_array', 'from_csv', 'ftype', 'ftypes', 'ge', 'get', 'get_dtype_counts', 'get_ftype_counts', 'get_value', 'get_values', 'groupby', 'gt', 'hasnans', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iget', 'iget_value', 'iloc', 'imag', 'index', 'interpolate', 'irow', 'is_copy', 'is_monotonic', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_time_series', 'is_unique', 'isin', 'isnull', 'item', 'itemsize', 'iteritems', 'iterkv', 'ix', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lt', 'mad', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'name', 'nbytes', 'ndim', 'ne', 'nlargest', 'nonzero', 'notnull', 'nsmallest', 'nunique', 'order', 'pct_change', 'pipe', 'plot', 'pop', 'pow', 'prod', 'product', 'ptp', 'put', 'quantile', 'radd', 'rank', 'ravel', 'rdiv', 'real', 'reindex', 'reindex_axis', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'repeat', 'replace', 'resample', 'reset_index', 'reshape', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'searchsorted', 'select', 'sem', 'set_axis', 'set_value', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort', 'sort_index', 'sort_values', 'sortlevel', 'squeeze', 'std', 'strides', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dense', 'to_dict', 'to_frame', 'to_hdf', 'to_json', 'to_msgpack', 'to_period', 'to_pickle', 'to_sparse', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'tolist', 'transpose', 'truediv', 'truncate', 'tshift', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'valid', 'value_counts', 'values', 'var', 'view', 'where', 'xs']

In [14]:

print(dir(pd.DataFrame({'a':range(10)})))

['T', '_AXIS_ALIASES', '_AXIS_IALIASES', '_AXIS_LEN', '_AXIS_NAMES', '_AXIS_NUMBERS', '_AXIS_ORDERS', '_AXIS_REVERSED', '_AXIS_SLICEMAP', '__abs__', '__add__', '__and__', '__array__', '__array_wrap__', '__bool__', '__bytes__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__div__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__idiv__', '__imul__', '__init__', '__invert__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__le__', '__len__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__unicode__', '__weakref__', '__xor__', '_accessors', '_add_numeric_operations', '_add_series_only_operations', '_add_series_or_dataframe_operations', '_agg_by_level', '_align_frame', '_align_series', '_apply_broadcast', '_apply_empty_result', '_apply_raw', '_apply_standard', '_at', '_box_col_values', '_box_item_values', '_check_inplace_setting', '_check_is_chained_assignment_possible', '_check_percentile', '_check_setitem_copy', '_clear_item_cache', '_combine_const', '_combine_frame', '_combine_match_columns', '_combine_match_index', '_combine_series', '_combine_series_infer', '_compare_frame', '_compare_frame_evaluate', '_consolidate_inplace', '_construct_axes_dict', '_construct_axes_dict_for_slice', '_construct_axes_dict_from', '_construct_axes_from_arguments', '_constructor', '_constructor_expanddim', '_constructor_sliced', '_convert', '_count_level', '_create_indexer', '_dir_additions', '_dir_deletions', '_ensure_valid_index', '_expand_axes', '_flex_compare_frame', '_from_arrays', '_from_axes', '_get_agg_axis', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cacher', '_get_index_resolvers', '_get_item_cache', '_get_numeric_data', '_get_values', '_getitem_array', '_getitem_column', '_getitem_frame', '_getitem_multilevel', '_getitem_slice', '_iat', '_iget_item_cache', '_iloc', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_info_repr', '_init_dict', '_init_mgr', '_init_ndarray', '_internal_names', '_internal_names_set', '_is_cached', '_is_datelike_mixed_type', '_is_mixed_type', '_is_numeric_mixed_type', '_is_view', '_ix', '_ixs', '_join_compat', '_loc', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_needs_reindex_multi', '_protect_consolidate', '_reduce', '_reindex_axes', '_reindex_axis', '_reindex_columns', '_reindex_index', '_reindex_multi', '_reindex_with_indexers', '_repr_fits_horizontal_', '_repr_fits_vertical_', '_repr_html_', '_repr_latex_', '_reset_cache', '_reset_cacher', '_sanitize_column', '_series', '_set_as_cached', '_set_axis', '_set_axis_name', '_set_is_copy', '_set_item', '_setitem_array', '_setitem_frame', '_setitem_slice', '_setup_axes', '_slice', '_stat_axis', '_stat_axis_name', '_stat_axis_number', '_typ', '_unpickle_frame_compat', '_unpickle_matrix_compat', '_update_inplace', '_validate_dtype', '_values', '_where', '_xs', 'a', 'abs', 'add', 'add_prefix', 'add_suffix', 'align', 'all', 'any', 'append', 'apply', 'applymap', 'as_blocks', 'as_matrix', 'asfreq', 'asof', 'assign', 'astype', 'at', 'at_time', 'axes', 'between_time', 'bfill', 'blocks', 'bool', 'boxplot', 'clip', 'clip_lower', 'clip_upper', 'columns', 'combine', 'combineAdd', 'combineMult', 'combine_first', 'compound', 'consolidate', 'convert_objects', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'dropna', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'floordiv', 'from_csv', 'from_dict', 'from_items', 'from_records', 'ftypes', 'ge', 'get', 'get_dtype_counts', 'get_ftype_counts', 'get_value', 'get_values', 'groupby', 'gt', 'head', 'hist', 'iat', 'icol', 'idxmax', 'idxmin', 'iget_value', 'iloc', 'index', 'info', 'insert', 'interpolate', 'irow', 'is_copy', 'isin', 'isnull', 'iteritems', 'iterkv', 'iterrows', 'itertuples', 'ix', 'join', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lookup', 'lt', 'mad', 'mask', 'max', 'mean', 'median', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ndim', 'ne', 'nlargest', 'notnull', 'nsmallest', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_axis', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select', 'select_dtypes', 'sem', 'set_axis', 'set_index', 'set_value', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort', 'sort_index', 'sort_values', 'sortlevel', 'squeeze', 'stack', 'std', 'style', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dense', 'to_dict', 'to_excel', 'to_gbq', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_msgpack', 'to_panel', 'to_period', 'to_pickle', 'to_records', 'to_sparse', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'transpose', 'truediv', 'truncate', 'tshift', 'tz_convert', 'tz_localize', 'unstack', 'update', 'values', 'var', 'where', 'xs']

Slicing Pandas Series, DataFrame¶

Ref : http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-and-selecting-data

In [15]:

df = pd.DataFrame({'a': range(5, 10), 'b': range(10, 15)}, index=[5,3,7,1,9])
print(df)

DataFrame[][] expects two values - the column name and the row index.
DataFrame.loc[] expects two values - the first value should be an index label and the second value should be a column label.
DataFrame.iloc[] expects two values - the first value should be the integer location of the row and the integer location of the column.

In [16]:

# Refers to all columns in the row labeled `1`
df.loc[1, :]

Out[16]:

a     8
b    13
Name: 1, dtype: int64

In [17]:

# Refers to all columns in the second row of the DataFrame
df.iloc[1, :]

Out[17]:

a     6
b    11
Name: 3, dtype: int64

In [18]:

# Refers to all rows in the column labeled `a`
df.loc[:, 'a']

Out[18]:

5    5
3    6
7    7
1    8
9    9
Name: a, dtype: int64

In [19]:

# Refers to all rows in the second column of the DataFrame
df.iloc[:, 1]

Out[19]:

5    10
3    11
7    12
1    13
9    14
Name: b, dtype: int64

In [20]:

# When we call the iloc method, we are asking for the data in the second row, second column
# When we call the loc method, we are asking for the data in the row labeled `3` and the column labeled `b`
print(df.iloc[1, 1], df.loc[3, 'b'])

11 11

In [21]:

# Refers to the value in the `a` column and the row labeled `1`
print(df['a'][1])

Plotting a Pandas DataFrame¶

Ref : http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html for more information on the various keyword arguments and their defaults set on the DataFrame.plot method. A number of other methods are available to generate scatter plots, bar plots and more, information on which can be found at - http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-plotting

In [22]:

%matplotlib inline

In [23]:

df = pd.DataFrame({'a': range(10, 20), 'b': range(20, 30)})

In [24]:

# Plot all available columns against the Index of the DataFrame
df.plot()

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x1123b6490>

In [25]:

# Choose which columns represent x and y and plot them against one another
df.plot(x = 'a', y = 'b')

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x114a74510>

Exercises¶

Create a bar plot from data in a DataFrame
Create a scatter plot from data in a DataFrame

Pandas¶

Creating Pandas Series and DataFrame¶

Exercises¶

Saving a Pandas Series to a file¶

Exercises¶

Reading a Pandas Series from a file¶

Exercises¶

Digging into the Pandas Series, DataFrame objects¶

Slicing Pandas Series, DataFrame¶

Plotting a Pandas DataFrame¶

Exercises¶

References¶