Use Data Science to Handle Enterprise Resource Planning Data with Ease

"Excel at Python"

  • author: Dmytro Karabash, Maxim Korotkov
  • categories: [management, python, erp, enterprise resource planning, data science]
  • image: images/unsplash-scott-graham-5fNmWej4tAA-640.jpg
  • permalink: /teamdata/
In [146]:
# hide
# loading functions

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

verbosity = 5

pd.options.display.max_rows = 7
pd.options.display.float_format = '{:.2f}'.format


You’re a boss now. You have a team or a business unit doing consultancy. Probably a bunch of Consultants helping clients, some Project Managers leading your projects in a nice Agile or whatever way. Everyone fills their timesheets (and you cut their bonuses if they don’t) – the time is billed to clients, and you have Fixed Fee projects as well. Maybe smaller teams being parts of your organization also – you can lead ten, a hundred or a thousand people (hey, dude – you run a unit of 1000 employees? Don’t read this article – hire someone who did). You probably even have an ERP (Enterprise Resource Planning) or something and a CFO in a lovely corner office. Do you have an excellent way to estimate how much profit every team member and project brings and ways to predict it with reasonable precision? If you have a decent ERP dashboard which gives you all that – then you’re lucky, and you don’t need this article either. The chances are that you get a simple calculation end of the year – like “let’s take all revenue finance recognized on your projects in a year and man day cost at USD 800 to get your unit profitability”. USD 800 can seem crazily high or unacceptably low, depending on the business you’re in. So that’s the precision you have to count the money your team brings to the firm? Do we need to remind you that this is one of the reasons why your job exists?

What else can you get? All the timesheets projects with their budgets, and you can even get to approximate costs (let’s discuss it later) — a shitload of data for a year, even for a team of 10. What we show is that you don’t need an ERP to do the rest – the notebook in Jupyter will do. Keep this in mind - management starts with measurement. How can you supervise your team and projects if you don’t have their data at your fingers?

Why do you need Data Science here? Well, you can have all data at your fingers. Eventually, you would want to calculate your profit like

profit = revenue - cost

Not only as subtracting two numbers, but also on the level of tables - so the output of the above statement can be a table with profit per consultant per month, like this:

In [147]:
# hide
profit = pd.DataFrame([[6187.500, 8006.25], [8050.000, 3762.50]],
                      index=['CATHY THE NEW MANAGER', 'FRANK THE CONSTULANT'],
                      columns=['2020-02', '2020-03'])
# next a bit of formating
profit = profit.applymap(lambda x: "{:10.2f}".format(x))
In [148]:
2020-02 2020-03
CATHY THE NEW MANAGER 6187.50 8006.25
FRANK THE CONSTULANT 8050.00 3762.50

Or get a list of 3 people with most billable hours in March via the following statement

t.where(t['Billable'] & (t['month'] == '2020-03')

The code above is indeed not executable, but we promise to get you there in a few paragraphs. And yes, there is some learning of python required. The huge difference from anything else you’ve been doing yourself (ERP reports, Excel, other tools) is that any analysis stays in the notebook and can be re-applied after your data changes.

Data Science

So, let’s get this done. First – yes, you need to know a bit of python to get the files in. The basic level will do. If you are a manager in 2020 and can’t write a simple script – hmm, you probably missed something. The objective is not to build an ERP and not even to have an easy-to-use toolkit - we want to show you how to make a toolkit for yourself (but feel free to reuse our code). What you will see is a straightforward set of examples - spend an hour or two to load your files into the notebook and then start playing with it - just make the analysis you want. You can end up checking a few numbers or building your charts and dashboards. It is ok, even if you work for a corporation (that’s where you need it most) - just install Anaconda and download the notebook. So, we first get and transform our input files (if you want to follow text AND all the code - read it on colab).

We are loading projects, timesheets, and roles below - more details will follow on what we presumed is inside. Note - files are loaded from the GitHub repository so that they will open in colab or your local machine just the same.

In [149]:
data = {}
data_path = ''
url = data_path + 'roles.csv'
data['roles'] = pd.read_csv(url, index_col=0, sep=";")

url = data_path + 'project_data.xlsm'
project_data = pd.ExcelFile(url)
['TS Feb', 'TS Mar', 'TS Apr', 'Projects', 'Employees']

Don’t focus too much on the next code block - we have monthly timesheets in separate tabs and need to stack them one on top of another. Plus - we have errors in excel (Sat.1 column) - data cleanup is also required, quite usual for those who touched data science. Collapsed blocks in our notebook contain the code, which is not critical for understanding. If you’d rather read it all, we suggest you switch to either GitHub or colab, so you can also play with it.

In [150]:
# collapse

timesheets = []
for sheet in project_data.sheet_names:
    if 'TS' in sheet:
        timesheets += [pd.read_excel(project_data, sheet, header=2)]
        tmp = pd.read_excel(project_data, sheet)
        data[sheet] = tmp
if 'Sat.1' in timesheets[0]:  # cleaning from Sat.1
    timesheets[0] = timesheets[0].rename(
        columns={'Sat': 'Sun', 'Sat.1': 'Sat'})
data['timesheet'] = pd.concat(timesheets, sort=False)
d = {'Billable': True, 'Non-Billable': False}
data['timesheet']['Billable'] = data['timesheet']['Billing'].replace(d)
data['timesheet'].drop(['Billing'], axis=1, inplace=True)
data['timesheet'] = data['timesheet'][~data['timesheet'].User.isnull()]

So, let's see what we've got here:

In [151]:
for key in data.keys():
    string_format = "{:20} shape: {:>10} rows \t and {:>10} columns"
    print(string_format.format(key, *data[key].shape))
dict_keys(['roles', 'Projects', 'Employees', 'timesheet'])
roles                shape:          8 rows 	 and          4 columns
Projects             shape:         48 rows 	 and         10 columns
Employees            shape:         35 rows 	 and          5 columns
timesheet            shape:       1792 rows 	 and         14 columns

Data we loaded

Let us summarize it here – most probably if you are a team leader or business unit manager – you can get

  • Dictionaries
    • Cost per region or default cost for “outside” contributors
    • Average revenue per hour for projects not “owned” by you
In [152]:
# show roles
roles = data["roles"]
Seniority EU US UK
Unit Head SENIOR 1100 1300 1200
Program Manager EXPERIENCED 900 1000 900
Project Manager SENIOR 800 800 800
... ... ... ... ...
Consultant SENIOR 800 800 900
Consultant EXPERIENCED 550 550 550
Consultant JUNIOR 500 500 500

8 rows × 4 columns

We need to set default values and bring it to format easy to use, which is very easy in python:

In [153]:
default_revenue = 1200
default_cost = 850

# wide to long format
roles_long = pd.melt(roles.reset_index(),
                     id_vars=['Position', 'Seniority'],
Position Seniority region cost
0 Unit Head SENIOR EU 1100
1 Program Manager EXPERIENCED EU 900
2 Project Manager SENIOR EU 800
... ... ... ... ...
21 Consultant SENIOR UK 900
22 Consultant EXPERIENCED UK 550
23 Consultant JUNIOR UK 500

24 rows × 4 columns

  • HeadCount
    • List of your employees, with their grades (or costs)
    • Engagement model (employee or contractor)
In [154]:
# show head count
headcount = data["Employees"]
headcount = headcount.merge(
    roles_long[['Position', 'Seniority', 'region', 'cost']],
    left_on=['Seniority', 'Position', 'Country'],
    right_on=['Seniority', 'Position', 'region'])
headcount['cost'] = headcount['cost'].fillna(default_cost)
Name Engagement Country Position Seniority region cost
1 JOHN DOE STAFF FR Consultant JUNIOR NaN 850.00
... ... ... ... ... ... ... ...

35 rows × 7 columns

  • Projects with budgets, effort estimates, dates, types of revenue recognition (Time and Material, Fixed Fee or something else), et cetera
In [155]:
# show projects

projects = data["Projects"]
Client ID Project ID Project PM Region Start End Funding Daily Rate TCV
0 13154 68454 Medium consutancy 1 SUSAN THE MANAGER Europe 2019-01-16 08:00:00 2020-07-22 14:00:00 Time and Materials 1200 39,000 USD
1 2764 70285 Our biggest project EXT 6a8e0ca747 Europe 2019-10-01 08:00:00 2021-04-30 11:00:00 Time and Materials 1200 1,450,000 USD
2 12916 68093 Upgrade to new version TIM THE LEAVER Europe 2018-09-28 08:00:00 2020-04-27 17:00:00 Fixed Fee 1200 127,500 USD
... ... ... ... ... ... ... ... ... ... ...
45 12899 63852 Huge consultancy pool JACK THE MANAGER Europe 2017-08-01 08:00:00 2020-04-27 17:00:00 Time and Materials 1200 198,000 USD
46 12901 71248 Small upgrade JACK THE MANAGER Europe 2020-01-29 08:00:00 2020-12-31 17:00:00 Time and Materials 1200 108,000 USD
47 17719 70618 Small consultancy pool SUSAN THE MANAGER Europe 2019-11-06 08:00:00 2020-06-30 17:00:00 Time and Materials 1200 27,200 USD

48 rows × 10 columns

  • Timesheets for your team / your projects. It is possible that other teams can contribute to your projects and your team might work in outside projects
    • Typically arrive monthly or weekly
    • (optional) Planned effort
      • There might be none, in one of the next posts we'll talk about autofill
In [156]:
# show timesheets

timesheet = data["timesheet"]
Project Activity User Period Starting Effort Type Total (Days) Sun Mon Tue Wed Thu Fri Sat Billable
0 PRJ f74fe3647d Consultancy EXT 800982ab08 2020-02-02 Actual 2.50 0.00 0.50 0.50 0.50 0.50 0.50 0.00 True
1 PRJ f74fe3647d Consultancy EXT c8c92ca432 2020-02-02 Actual 2.50 0.00 0.50 0.50 0.50 0.50 0.50 0.00 True
2 PRJ f74fe3647d Consultancy EXT c8c92ca432 2020-02-09 Actual 2.50 0.00 0.50 0.50 0.50 0.50 0.50 0.00 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
620 PRJ 67df00a7e9 Consultancy EXT ca02603616 2020-04-12 Actual 1.00 0.00 0.50 0.00 0.50 0.00 0.00 0.00 True
621 PRJ 67df00a7e9 Consultancy EXT ca02603616 2020-04-19 Actual 1.00 0.00 0.50 0.00 0.00 0.50 0.00 0.00 True
622 PRJ 67df00a7e9 Consultancy EXT ca02603616 2020-04-26 Reported 1.50 0.00 0.50 0.50 0.00 0.50 nan nan True

1792 rows × 14 columns

EXT users here are outside of our headcount (let’s say we are not getting the real names, but rather some ids from ERP in such case)

Why not just have things done by ERP? Your ERP probably does not present you anything nicely in terms of parameters you’d want to control on your level (otherwise, you won’t read this anyway). There may be a project running to improve it by 2025 - maybe one of the Big4 firms is interviewing you for requirements. If you are steering this boat long enough – you might have ended up with quick and dirty calculation, SQL queries to SAP (with field names still in German), or an Excel file. Why? Your bosses don’t care – they already hired you, and a better ERP is coming in 2025 anyway. So they want to know how much money your team makes (preferably - per person, per month, per project, with charts, projections, and comparisons) and to know why it is not so profitable (because it never is).

To simplify your way forward we are going to create a timeseries out of timesheets which is a bit more involved so you can skip it for now and come back to it later, but the point is that at the end you will get a lovely monthly pandas DataFrame that looks like this

In [157]:
# collapse

dd = []
timesheet['Period Starting'] = pd.to_datetime(timesheet['Period Starting'])
weekdays = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
for i, weekday in enumerate(weekdays):
    columns = [col for col in timesheet if col not in weekdays or col == weekday]
    tmp = timesheet[columns].copy()
    tmp = tmp.rename(columns={weekday: 'effort'})
    tmp['date'] = tmp['Period Starting']+pd.to_timedelta('{}D'.format(i))
    tmp['weekday'] = weekday
    tmp = tmp.drop(['Period Starting', 'Total (Days)'], axis=1)
    dd += [tmp]

timeseries = pd.concat(dd, sort=False).sort_values(['User', 'date'])
timeseries = timeseries.reset_index().drop('index', axis=1)
timeseries['date'] = pd.to_datetime(timeseries['date']).astype('str')
timeseries['month'] = timeseries['date'].str[:7]

def isweekend(x): return x in ['Sun', 'Sat']

timeseries['weekend'] = timeseries['weekday'].apply(isweekend)
timeseries['workweek'] = ~timeseries['weekday'].apply(isweekend)
Project Activity User Effort Type effort Billable date weekday month weekend workweek
0 Upgrade in two contracts - 1 Consultancy CATHY THE NEW MANAGER Actual 0.00 True 2020-02-16 Sun 2020-02 True False
1 Upgrade in two contracts - 2 Consultancy CATHY THE NEW MANAGER Actual 0.00 True 2020-02-16 Sun 2020-02 True False
2 A pilot for huge opportunity Consultancy CATHY THE NEW MANAGER Actual 0.00 True 2020-02-16 Sun 2020-02 True False
... ... ... ... ... ... ... ... ... ... ... ...
12541 Strategic project in a new region Project Management TOM THE TEAM LEADER Reported nan True 2020-05-01 Fri 2020-05 False True
12542 Strategic project in a new region 2 Consultancy TOM THE TEAM LEADER Actual nan False 2020-05-02 Sat 2020-05 True False
12543 Strategic project in a new region Project Management TOM THE TEAM LEADER Reported nan True 2020-05-02 Sat 2020-05 True False

12544 rows × 11 columns

It is common in data science to have one big DataFrame (similar to excel tab or SQL table) that you can group and filter easily. It is helpful because it makes things very easy.


Timesheets and project budgets are simplified this way, and you can manipulate them the way you want in a few simple statements. If you studied excel formulas - you could do this also. Now, let’s look at profitability in the most straightforward manner possible. As a manager, we suggest you do that to set benchmarks for your team. So, profit is $$\mbox{Revenue  –  Cost}$$ and we intend to keep it simple. You can implement the calculations you like.

Revenue: Let’s say that for a Time & Material project, you generate as much revenue as you bill (on agreed rate) up to its budget. We are not doing any Fixed Fee recognition formula. Different companies do it differently, and you’ll either need historical data or actual recognized value depending on how you operate.

We also assume that we are interested only by users in our headcount (so we filter out all EXT users).

In [158]:
# revenue calculation for a person for a month 
# (SUM REV of all timesheet records * corresp project rates)
revenue_df = timeseries[['User', 'Project',
                         'Billable', 'effort', 'date', 'month']]
revenue_df = revenue_df.merge(projects[['Project', 'Daily Rate']],
revenue_df['Daily Rate'].fillna(default_revenue, inplace=True)
revenue_df = revenue_df[(revenue_df['User'].str[:3] != 'EXT') &

revenue_df['daily_revenue'] = revenue_df['Daily Rate'] * \
revenue = revenue_df.groupby(['User', 'month'])[
revenue = revenue[revenue.sum(1) > 0]
month 2020-02 2020-03 2020-04 2020-05
CATHY THE NEW MANAGER 11550.00 19350.00 9412.50 0.00
FRANK THE CONSTULANT 15750.00 12425.00 14875.00 0.00
IVAN THE PROGRAMMER 25062.50 29043.75 25643.75 0.00
JACK THE EXPERT 29093.75 31331.25 24843.75 0.00
JACK THE MANAGER 24090.62 22318.75 18821.88 0.00

We got revenue per month per person. Not too sophisticated, right?

Cost: Let’s start with the fact that just using the “default cost rate” is not enough - whenever margins are under pressure, you should do better. You might have people working in different countries and of totally different levels. Talk to your finance team and get some estimates from them (or make it together). We’d say they owe you that much if you were appointed to run a team or business unit. We will assume below that you managed to get a few grades per Country (see roles). Costs per individual will be nice to have as well. The beauty of this tool (compared to doing Excel by yourself) is that you can just add it - it will be a few lines of code. Let’s calculate direct cost per month: here we check the cost of effort posted on timesheets, presuming they are full - can check per number of working days in a month also and compare. We are not interested in external resources, so we’ll filter them out again.

In [160]:
# cost calculation for a project 
# (SUM COST of all timesheet records * corresp cost rates - see roles)
cost_df = timeseries[['User', 'Project', 'effort', 'date', 'month']]
cost_df = cost_df.merge(headcount[['Name', 'cost']],
                        how='left', left_on='User', right_on='Name')
cost_df = cost_df[cost_df['User'].str[:3] != 'EXT']
cost_df['daily_cost'] = cost_df['cost'] * cost_df['effort']
cost = cost_df.groupby(['User', 'month'])['daily_cost'].sum()
cost = cost.unstack().fillna(0)
cost = cost[cost.sum(1) > 0]
month 2020-02 2020-03 2020-04 2020-05
CATHY THE NEW MANAGER 5362.50 11343.75 9332.81 0.00
FRANK THE CONSTULANT 7700.00 8662.50 7768.75 0.00
IVAN THE PROGRAMMER 11481.25 12925.00 11618.75 0.00
JACK THE EXPERT 19921.88 21621.88 18168.75 0.00
JACK THE MANAGER 11800.00 10900.00 9600.00 0.00

Now, we can get to profit per user per month using operations on dataframes. And here it bears some fruit. Profit = revenue - cost. Indeed, it required some data cleanup first - but not too much

In [161]:
profit = revenue - cost
month 2020-02 2020-03 2020-04 2020-05
CATHY THE NEW MANAGER 6187.50 8006.25 79.69 0.00
FRANK THE CONSTULANT 8050.00 3762.50 7106.25 0.00
IVAN THE PROGRAMMER 13581.25 16118.75 14025.00 0.00
JACK THE EXPERT 9171.88 9709.38 6675.00 0.00
JACK THE MANAGER 12290.62 11418.75 9221.88 0.00

That's what we've promised, right? Ok, second one - people who entered most Billable hours in March

In [162]:
t = timeseries  # concatenating just in case you are reading from phone
t.where(t['Billable'] & (t['month'] == '2020-03')
TOM THE EXPERT      26.75
Name: effort, dtype: float64

What else?

Now let’s look at how to apply some python and data science techniques (we will get to more details in our next posts) to data you’ve seen above and how to visualize it nicely.

First, let’s take a PM and visualize revenue on his/her projects per month.

In [163]:
%matplotlib inline

pm_selected = "CATHY THE NEW MANAGER"
drawdt = revenue.loc[pm_selected].T, drawdt.values, color="green",
        width=1, align='center', edgecolor='black');
plt.xticks(range(len(drawdt.index)), drawdt.index);
plt.ylabel("Revenue / month: "+pm_selected);

That was simple. Then, some fun for those who know slightly more advanced python - you can make an interactive chart in few lines, e.g., here we want to make it visual in terms of if the value is above or below the benchmark (works if you copy notebook, is not clickable on the blog)

In [164]:
# collapse

%matplotlib notebook

# colors of bars
def get_colors(v):
    colors = []
    for i in range(len(drawdt.index)):
        color = "yellow"
        if (drawdt[i] < v*0.9):
            color = "red"
        if (drawdt[i] > v*1.1):
            color = "green"
    return colors

# plot drawing
def plt_all(isdef, yvalue):
    if isdef:
        bcolors = "white"
        xtitle = 'Set value'
        a = np.empty(len(drawdt.index))
        plt.plot(range(len(drawdt.index)), a, color="black", alpha=0.5)
        bcolors = get_colors(yvalue)
        xtitle = 'At ${:.0f}/m is'.format(yvalue), drawdt.values, color=bcolors,
            width=1, align='center', edgecolor='black')
    plt.xticks(range(len(drawdt.index)), drawdt.index)
    plt.ylabel("Revenue / month: "+pm_selected)
    red_patch = mpatches.Patch(color='red', label='loss')
    yellow_patch = mpatches.Patch(color='yellow', label='in 10%')
    green_patch = mpatches.Patch(color='green', label='profit')
    plt.legend(handles=[red_patch, yellow_patch, green_patch],
               loc='upper left', title=xtitle, fontsize=7)
    plt.gca().set_title('Click on the plot to set benchmark')

plt_all(True, 0.0)

def onclick(event):
    plt_all(False, event.ydata)

plt.gcf().canvas.mpl_connect('button_press_event', onclick);

Ok, as a Team Leader, you might not do that. Sorry. Let’s get some calculations done. First, let’s identify “suspicious” time entered (e.g., a person who has more than three days in a row same effort on a given project and it is not 8 hours) - this is a quick check you do yourself, without asking PMO anything and making this official. I can call this suspicious because nature of our work (and yours might be different - so you look for another pattern) makes it highly unlikely that you do spend the same amount of time on one project a few days in a row (unless you are assigned full time). What you are doing is likely just splitting your working time between your projects in some manner.

In [165]:
# remove the weekend
working = timeseries[(timeseries['workweek'] == True)
                     & (timeseries.Billable)].copy()
working = working.groupby(["User", "Project", "date"]).sum().sort_index()
working['value_grp'] = (working.effort.diff(1) == 0).astype('int')

def streak(df):  # function that finds streak of 1s: 0,1,1,0,1 -> 0,1,2,0,1
    df0 = df != 0
    return df0.cumsum()-df0.cumsum().where(~df0).ffill().fillna(0).astype(int)

working['streak'] = streak(
    working['value_grp'])  # streak of identical effort
result = working[(0 < working.effort) &
                 (working.effort < 1) &
                 (working['streak'] > 3)].reset_index()

result = result[result.User.str[:3] != 'EXT'].groupby(['User', 'Project']).last()
effort date streak
User Project
CATHY THE NEW MANAGER Upgrade in two contracts - 1 0.06 2020-03-12 6
FRANK THE CONSTULANT Upgrade in two contracts - 1 0.12 2020-03-05 4
Upgrade in two contracts - 2 0.12 2020-03-05 4
JACK THE EXPERT Some upgrade project 0.12 2020-03-13 4
JOHN THE CONSULTANT PRJ f8b96bd2c4 0.25 2020-04-28 6
TIM THE LEAVER Big upgrade and rework 0.12 2020-03-09 4
Upgrade to new version 0.12 2020-03-10 4

To be clear, we do not recommend sending emails with the subject “The list of those whose timings are suspicious” based on the above. People will likely change their behaviors, and you might not easily find the next pattern. As a manager, you dig into your data, find insights, and act on them the way you see fit. You do not just tell your teenage kid that now you know where he hides cigarettes, do you?

Use cases

Here are some cases where the above can be helpful - we will look at some of them in our next posts.

  • Decision-making - e.g. identify top loss-making projects
  • Identify projects which require management attention - also apply Machine Learning here and identify the projects you'll pick up yourself
  • Better analysis of non-billable hours
  • Identify suspicious behaviors - anomaly detections
  • Revenue and effort projections based on existing patterns and not highlighting when the plan deviates
  • Consolidated analytic on demand (e.g. profitability forecast, revenue forecast, unallocated capacity) in case your ERP will not do anything like that

The point is that loading your data (excel, CSV, TSV - whatever) is simple, and manipulating it is simple - more straightforward than doing it in many excel files and faster than waiting for PMOs.

Stay tuned for our next post.

Copyright © Dmytro Karabash, Maxim Korotkov; 2020.
This notebook is licensed under a Creative Commons Attribution 4.0 International License.