Creating Holidays Datasets

In this project, we will use the Nager.Date API to create datasets of public holidays and long weekends in different countries around the world in 2019. At the end of the project, we will have two datasets that can be analyzed later:

  • world_holidays_2019
  • world_long_weekends_2019

The world_holidays_2019 dataset will have the following columns:

  • date: the date of the holiday.
  • local_name: the name of the holiday in the local language.
  • english_name: the name of the holiday in English.
  • country_code: the two-letter country code.
  • country_name: the full country name.
  • fixed_date: whether the holiday is celebrated on the same date every year.
  • global_holiday: whether the holiday is celebrated globally.
  • counties: the codes of the federal states where the holiday applies.
  • launch_year: the year the holiday was launched.
  • type: the holiday's type.

And the world_long_weekends_2019 dataset will have these columns:

  • start_date: the date the long weekend starts.
  • end_date: the date the long weekend ends.
  • weekend_length: how many days the weekend lasts.
  • need_bridge_day: whether the weekend requires a bridge day.
  • country_code: two-letter country code.
  • country_name: full country name.
In [1]:
# Import the necessary libraries
import json
import time

import pandas as pd
import requests
import requests_cache
from IPython.display import clear_output

Creating the Dataframes

In this section, we will create the dataframes, clean them a bit, and transform them into two csv files:

  • world_holidays_2019.csv containing the holidays in different countries in 2019.
  • world_long_weekends_2019.csv containing the long weekends in different countries in 2019.
In [2]:
# Function to prettify the returned list from response
def prettify_json(python_obj):
    text = json.dumps(python_obj, indent=4)
    print(text)

Before we proceed, we need to extract the two-letter country code for each country; we will pass these codes as parameters to the API URLs.

In [3]:
url_countries = "https://date.nager.at/Api/v2/AvailableCountries"
available_countries = requests.get(url_countries)

# Return the first 3 dictionaries to study the structure
prettify_json(available_countries.json()[:3])
[
    {
        "key": "AD",
        "value": "Andorra"
    },
    {
        "key": "AL",
        "value": "Albania"
    },
    {
        "key": "AR",
        "value": "Argentina"
    }
]

We are going to add both country codes and names to our final datasets, so we have to create a dataframe of all available countries that we will later merge with the final dataframes. We can also pass each country code from this dataframe to the API URLs to extract information about holidays and long weekends for each country.
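The merge described above is a standard join on the shared country code column. A minimal sketch with toy stand-in data (the actual merge on the real dataframes is performed in the cleaning section below):

```python
import pandas as pd

# Toy stand-ins for one country's holiday rows and the countries dataframe
holidays = pd.DataFrame({"date": ["2019-01-01"], "country_code": ["AD"]})
countries_df = pd.DataFrame(
    {"country_code": ["AD", "AL"], "country_name": ["Andorra", "Albania"]}
)

# Joining on the shared `country_code` column adds the full country name
merged = pd.merge(holidays, countries_df, on="country_code")
print(merged)
```

Countries without matching holiday rows (here, Albania) simply drop out of the result, since `pd.merge` performs an inner join by default.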

In [4]:
# Create the dataframe of available countries
countries_df = pd.DataFrame(available_countries.json())

# Rename columns
countries_df.columns = ["country_code", "country_name"]

# Check the dataframe
countries_df.head()
Out[4]:
country_code country_name
0 AD Andorra
1 AL Albania
2 AR Argentina
3 AT Austria
4 AU Australia

Now we are ready to extract information about public holidays and long weekends in 2019. We will loop over the country codes and request an API response for each country, appending the results to two lists: one for holidays and one for long weekends.

For long weekends, we will append lists of dictionaries directly, because we want to add the missing country code to each dictionary; for holidays, we will append the Response objects themselves.

In [5]:
# Create a local cache for holidays requests
requests_cache.install_cache("holidays")

# Initialize empty lists for responses
holidays_responses = []
weekends_responses = []

# Loop over the country codes to pass each one to the API URLs
for code in countries_df["country_code"]:

    holiday_url = "https://date.nager.at/Api/v2/PublicHolidays/2019/{}".format(code)

    # Get the API response and append it to the list of responses
    holiday_response = requests.get(holiday_url)
    holidays_responses.append(holiday_response)

    # If the response was not served from the cache, sleep to respect rate limits
    if not getattr(holiday_response, "from_cache", False):
        time.sleep(0.5)

    # Repeat the procedure to get responses for long weekends
    weekend_url = "https://date.nager.at/Api/v2/LongWeekend/2019/{}".format(code)
    weekend_response = requests.get(weekend_url).json()

    # Add a country code to each dictionary of long weekends
    for d in weekend_response:
        d.update({"country_code": code})

    # Append the API response to the list of responses
    weekends_responses.append(weekend_response)

Let's now look at how our data is organized in pandas dataframes.

In [6]:
pd.DataFrame(holidays_responses[0].json())
Out[6]:
date localName name countryCode fixed global counties launchYear type
0 2019-01-01 Any nou New Year's Day AD True True None None Public
1 2019-03-14 Dia de la Constitució Constitution Day AD True True None None Public
2 2019-03-14 Mare de Déu de Meritxell National Holiday AD True True None None Public
3 2019-12-25 Nadal Christmas Day AD True True None None Public
In [7]:
pd.DataFrame(weekends_responses[0])
Out[7]:
startDate endDate dayCount needBridgeDay country_code
0 2018-12-29 2019-01-01 4 True AD
1 2019-03-14 2019-03-17 4 True AD

We can now create two lists of dataframes from the collected responses and concatenate the dataframes in each list.

In [8]:
# Create two lists of dataframes
holidays_frames = [pd.DataFrame(x.json()) for x in holidays_responses]
weekends_frames = [pd.DataFrame(x) for x in weekends_responses]

# Concatenate the dataframes from the lists
holidays = pd.concat(holidays_frames, ignore_index=True)
weekends = pd.concat(weekends_frames, ignore_index=True)

Data Cleaning and Exporting

Before we export the dataframes to csv files we will:

  • Rename the columns to make them more readable and descriptive.
  • Add a column with full country names.
  • Reorder the columns in a more logical way.

It is possible to do more data cleaning, but since we are just preparing the datasets for further analysis, we will not dive into the details.

In [9]:
# Rename columns in the `holidays` dataframe
holidays = holidays.rename(
    columns={
        "localName": "local_name",
        "name": "english_name",
        "countryCode": "country_code",
        "fixed": "fixed_date",
        "global": "global_holiday",
        "launchYear": "launch_year",
    }
)

# Check if everything is correct
holidays.head()
Out[9]:
date local_name english_name country_code fixed_date global_holiday counties launch_year type
0 2019-01-01 Any nou New Year's Day AD True True None None Public
1 2019-03-14 Dia de la Constitució Constitution Day AD True True None None Public
2 2019-03-14 Mare de Déu de Meritxell National Holiday AD True True None None Public
3 2019-12-25 Nadal Christmas Day AD True True None None Public
4 2019-01-01 Viti i Ri New Year's Day AL True True None None Public
In [10]:
# Rename columns in the `weekends` dataframe
weekends = weekends.rename(
    columns={
        "startDate": "start_date",
        "endDate": "end_date",
        "dayCount": "weekend_length",
        "needBridgeDay": "need_bridge_day",
    }
)

# Check if everything is correct
weekends.head()
Out[10]:
start_date end_date weekend_length need_bridge_day country_code
0 2018-12-29 2019-01-01 4 True AD
1 2019-03-14 2019-03-17 4 True AD
2 2018-12-29 2019-01-02 5 True AL
3 2019-03-14 2019-03-17 4 True AL
4 2019-03-22 2019-03-24 3 False AL

After renaming the columns we can export the dataframes to csv files for future analysis. Before doing so, we will add full country names to both dataframes.

In [11]:
# Merge the dataframes
holidays = pd.merge(holidays, countries_df, on="country_code")
weekends = pd.merge(weekends, countries_df, on="country_code")

# Reorder the columns in the `holidays` dataframe to have `country_name` after `country_code`
cols = [
    "date",
    "local_name",
    "english_name",
    "country_code",
    "country_name",
    "fixed_date",
    "global_holiday",
    "counties",
    "launch_year",
    "type",
]

holidays = holidays[cols]

# Check the `holidays` dataframe
holidays.head()
Out[11]:
date local_name english_name country_code country_name fixed_date global_holiday counties launch_year type
0 2019-01-01 Any nou New Year's Day AD Andorra True True None None Public
1 2019-03-14 Dia de la Constitució Constitution Day AD Andorra True True None None Public
2 2019-03-14 Mare de Déu de Meritxell National Holiday AD Andorra True True None None Public
3 2019-12-25 Nadal Christmas Day AD Andorra True True None None Public
4 2019-01-01 Viti i Ri New Year's Day AL Albania True True None None Public
In [12]:
# Check the `weekends` dataframe
weekends.head()
Out[12]:
start_date end_date weekend_length need_bridge_day country_code country_name
0 2018-12-29 2019-01-01 4 True AD Andorra
1 2019-03-14 2019-03-17 4 True AD Andorra
2 2018-12-29 2019-01-02 5 True AL Albania
3 2019-03-14 2019-03-17 4 True AL Albania
4 2019-03-22 2019-03-24 3 False AL Albania

Now we can export the dataframes to csv files.

In [13]:
# Export the dataframes in csv files
holidays.to_csv("world_holidays_2019.csv", index=False)
weekends.to_csv("world_long_weekends_2019.csv", index=False)
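As an optional sanity check, an exported file can be read back to confirm that the structure survives the round trip. A sketch using a toy frame and a hypothetical sample path (not the real export):

```python
import pandas as pd

# Toy frame standing in for the exported `weekends` dataframe
weekends = pd.DataFrame(
    {
        "start_date": ["2018-12-29"],
        "end_date": ["2019-01-01"],
        "weekend_length": [4],
        "need_bridge_day": [True],
        "country_code": ["AD"],
        "country_name": ["Andorra"],
    }
)

# Hypothetical sample path, not the real export file
path = "long_weekends_sample.csv"
weekends.to_csv(path, index=False)

# Read the file back and confirm shape and column order survived
restored = pd.read_csv(path)
print(restored.shape)                                    # (1, 6)
print(list(restored.columns) == list(weekends.columns))  # True
```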

Next Steps

One can use these datasets to answer the following questions (and not only these!):

  • Which countries have the most holidays?
  • Which countries have the most free days?
  • Which holidays are truly global (yes, there are a lot of mistakes in the global_holiday column)?
  • Which month has the most free days worldwide? In each country?
  • Are there similar holidays in different countries? If so, which?
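For example, the first question reduces to a groupby on the holidays dataset. A sketch on a tiny hand-made sample standing in for world_holidays_2019.csv:

```python
import pandas as pd

# Tiny hand-made sample standing in for world_holidays_2019.csv
holidays = pd.DataFrame(
    {
        "country_name": ["Andorra", "Andorra", "Albania", "Albania", "Albania"],
        "english_name": [
            "New Year's Day",
            "Christmas Day",
            "New Year's Day",
            "Summer Day",
            "Nowruz",
        ],
    }
)

# Count holidays per country and sort in descending order
counts = (
    holidays.groupby("country_name")["english_name"]
    .count()
    .sort_values(ascending=False)
)
print(counts)
```

On the real dataset one would first load the csv with `pd.read_csv("world_holidays_2019.csv")` and could also deduplicate holidays listed per federal state before counting.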

Conclusions

In this project, we used the Nager.Date API to create two datasets of holidays and long weekends in countries all over the globe. We also did some data cleaning to prepare the data for more comfortable analysis, and we proposed some questions that can be answered using these datasets.