Publishing data and interpretations together

Learning to explore published data in the Reflective Archaeological Practice course

Made by Rachel Opitz, Archaeology, University of Glasgow

Archaeological excavation publications will often provide specialist reports on finds. These reports might describe finds that are key to the interpretation of the site, provide a catalogue for key objects, and include charts that quantify finds and maps that show where they were found. If a project publishes its data, both descriptive and spatial, together with the excavation report, readers have the opportunity to explore and interrogate that data on their own, addressing different questions and perhaps even drawing different conclusions than the report's authors.

Understanding the meanings behind patterns of finds recovered through excavation is a tricky problem. We hope to distinguish activity areas, places devoted to domestic and industrial use, or inhabited places that are distinct from liminal ones. To successfully unravel these patterns, we must look not only at the distributions of different types of finds, but also at how they correlate with one another, at the character of the contexts in which they were recovered, and at their own physical and social characteristics. Are they likely to be curated? Are they light and likely to be moved from one area to another by post-depositional processes? It's all a bit of a mess.

The aim of this exercise is for you to:

  • learn to work with real special finds data from an excavation, in all its messiness
  • start thinking about quantitative and spatial approaches to finds data from excavations and how they can help us better understand the patterns we see
  • reflect on the added value and challenges of publishing excavation data together with excavation reports.

We'll undertake this reflective exercise using data collected by the Gabii Project, a 10+ year excavation in central Italy.

The Gabii Project Reports Vol. 1 is available at:

The data for the whole project is available at:

To start working with the data from the Gabii Project, you need to put together your toolkit. You're currently working inside something called a Jupyter notebook, which will be a key part of your analysis toolkit. It's a place to keep notes, pictures, code, and maps together. You can add tools and data to your Jupyter notebook and then use them to ask questions and make maps and visualisations that help answer those questions.

Here's what you need to do: Work your way down the page, carefully reading the notes and comments in each cell (each box on the page) and looking through the code. Anything written with a # symbol in front of it is a comment I've included to explain to you what the code is doing. Comments will appear in blue type. Code will be a mix of black, green and purple type. In jupyter notebooks you hit 'Ctrl+Enter' to execute the code in each cell. Once you have read and understood the comments and code hit 'Ctrl+Enter' to execute the code and then think about the results.

You can make changes to the code to ask different questions. Simply double-click in a cell to be able to type in it, make your changes, and re-run the code by hitting 'Ctrl+Enter' again. If things go wrong, 'Ctrl+Z' will undo your changes and you can try again!

Let's get started... Hit 'Ctrl+Enter' to run the code in any cell on the page.

In [ ]:
%matplotlib inline
# Matplotlib is your tool for drawing graphs and basic maps. You need this!

import pandas as pd
import requests
import fiona
import geopandas as gpd
import matplotlib.pyplot as plt

# These are what we call prerequisites. They are basic tools you need to get started.
# Pandas manipulates data. GeoPandas manipulates geographic data. They're also black and white and like to eat bamboo... 
# You need these to manipulate your data!
# Fiona helps with reading and writing geographic data.
# Requests is for asking for things over the web. It's good to be able to ask for things.

# Remember to hit Ctrl+Enter to make things happen!
In [ ]:
# The first step is to bring the data into the jupyter notebook. We'll start by grabbing the spatial data.
# This data describes the location and shape of each context (Stratigraphic Unit) at Gabii. 

url = ''
# This is where I put the data. It's in a format called geojson, used to represent geometry (shapes) and attributes (text).

request = requests.get(url)
# Please get me the data at that web address (url)
b = bytes(request.content)
# I will use the letter 'b' to refer to the data, like a nickname
with fiona.BytesCollection(b) as f:
    crs = f.crs
    gabii_su_poly = gpd.GeoDataFrame.from_features(f, crs=crs)
# I use the fiona tool to wrap up all the data from 'b', check the coordinate system (crs) listed in the features,
# and print out the first few lines of the file so I can check everything looks ok. 
# Don't worry if you don't understand all the details of this part!
gabii_su_poly.head()
In [ ]:
# Now we have polygons, the shapes of our contexts. Let's visualise the data to double check that all is well

gabii_map1 = gabii_su_poly.plot(column='DESCRIPTIO', cmap='Blues', edgecolor='grey', figsize=(15, 15));
# 'plot' means draw me an image showing the geometry of each feature in my data. 
# We want to control things like the color of different types of features on our map. 
# I used the 'Blues' colorscale command (cmap stands for 'colour map') 
# and asked it to draw the polygons differently based on the type of feature.

The colorscale options are: Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, bone, bone_r, brg, brg_r, bwr, bwr_r, cividis, cividis_r, cool, cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray, gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg, gist_yarg_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, hot, hot_r, hsv, hsv_r, inferno, inferno_r, jet, jet_r, magma, magma_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, pink, pink_r, plasma, plasma_r, prism, prism_r, rainbow, rainbow_r, seismic, seismic_r, spring, spring_r, summer, summer_r, tab10, tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, viridis, viridis_r, winter, winter_r

Swap out 'Blues' in the cell above for any of these options...

In [ ]:
# Now bring in all the basic Gabii special finds data as a table: descriptions, object types, IDs, and the contexts from which they come.
# We've had a few special finds over the years.
sf_su = pd.read_csv("")

# Print a preview of any table by typing its name.
sf_su.head()

# SU is the context number and SF_ID is the special find ID number

You'll know from having explored the 'A mid-Republican House' publication that the stratigraphic units from the published house are in the 1000-1999 range. Looking through the publication also gives you an idea of some of the types of finds that were studied as part of the interpretation of the house, including things like bronze nails, bone hairpins, and loom weights. You might wonder how common some of these objects are across the entire site of Gabii, and how that compares to their frequency in the house that was studied for the publication. Are they only found in this house, or are they common everywhere? You might also want to explore the kinds of contexts in which these objects were found, to decide how relevant they might be to the overall interpretation. After all, it's very different if you find a loom weight on the floor of a house than if you find it in a refuse pit. Finally, you might want to have a look at the diversity of all the special finds from the published area. The published data allows you to explore these questions.

In [ ]:
# Let's start by getting an idea of all the types of special finds from the published house and how many of them there are.

# Get all the finds from the table where the 'SU' number is between 1000 and 1999
AreaB = sf_su.loc[sf_su['SU'].between(1000, 1999, inclusive='both')]

# Count how many of each 'SF_OBJECT_TYPE' (the type of object) there are, sorted from most to least common.
AreaB['SF_OBJECT_TYPE'].value_counts()
In [ ]:
# Looks like there are 4 pins from the published house, which should include the bone hairpin from the publication. 
# We can see what SUs they came from and a little more detail like this:

# declare the types of things you want to have in your table
types = ['Pin']

# select them from the big table, a bit like you selected the stratigraphic unit numbers you wanted to see.
B_Pins = AreaB.loc[AreaB['SF_OBJECT_TYPE'].isin(types)]

# type the name of the table to print it out
B_Pins

# First thing you learned from the data - the highlighted hairpin isn't the only pin...
In [ ]:
# Now we can make a map that shows the SUs with pins in blue and all the other SUs in white

# select all the SUs from Area B
areaB = gabii_su_poly.loc[gabii_su_poly['SU'].between(1000, 1999, inclusive='both')]

# select only the SUs that have pins (table above)
areaB_Pins = gabii_su_poly.loc[gabii_su_poly['SU'].isin([1016, 1165, 1168])]

# plot these two selections together and set their colours to white and blue respectively
fig, ax = plt.subplots(figsize=(15,15))
areaB.plot(ax=ax, color='white', edgecolor='black')
areaB_Pins.plot(ax=ax, color='blue', edgecolor='black');

A bit of healthy archaeological suspicion should be creeping into your minds right now. Those are some really big layers in which the pins were found. Look up their SU numbers in the project database: they are big colluvial accumulations. Sadly, finds from big colluvial accumulations are not generally all that meaningful, as they're unlikely to be in their original context.

Reflect for a moment: how has looking at the published data changed your reading of the discussion of the special finds in the publication?

In [ ]:
# Let's go a little further and look at the pins from across the whole site to see if the situation is similar.
types = ['Pin']
Pins = sf_su.loc[sf_su['SF_OBJECT_TYPE'].isin(types)]
In [ ]:
# You can see just above that there are 110 rows in the table, so 110 pins identified as special finds at Gabii.
# Now let's count up how many of these pins appear in each context (SU).
# This command will print out the number of pins in each SU next to that SU number.
Pins['SU'].value_counts()
In [ ]:
# Then let's combine our polygons representing context shape and location
# with the special finds data.
# We do this with a command called 'merge'.

gabii_pins = gabii_su_poly.merge(Pins, on='SU')

# adding .head() to the end of a dataframe name will print out just the first few rows.
gabii_pins.head()
In [ ]:
# Now plot the SUs with pins across the whole site against a background of all the SUs
# Follow the same procedure as above to map the Area B SUs with pins

fig, ax = plt.subplots(figsize=(15,15))
gabii_su_poly.plot(ax=ax, color='white', edgecolor='black')
gabii_pins.plot(ax=ax, color='blue', edgecolor='black');

OK, what do you see here? Do the pins mostly seem to be in large layers? What factors might contribute to this pattern? Do big layers simply aggregate lots of stuff?
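One quick way to probe that question is to check how strongly find counts track layer size. Here's a minimal sketch using invented numbers (the SU IDs, areas, and counts below are made up for illustration, not taken from the Gabii data):

```python
import pandas as pd

# Invented example layers: surface area (square metres) and pin counts.
layers = pd.DataFrame({
    'SU': [9001, 9002, 9003, 9004, 9005],
    'Shape_Area': [900.0, 450.0, 30.0, 12.0, 5.0],
    'pin_count': [12, 6, 1, 1, 0],
})

# Pearson correlation between layer size and find count:
# a value near 1 would suggest big layers simply collect more stuff.
r = layers['Shape_Area'].corr(layers['pin_count'])
print(round(r, 2))
```

On the real data you could build the counts with `Pins['SU'].value_counts()` and join them onto the polygon table's `Shape_Area` column before computing the correlation.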

In [ ]:
# We can try to see the relationship between layer size and count by sorting
# our list of finds by the surface area of each layer.
# We use the command 'sort_values'.

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(gabii_pins.sort_values(by='Shape_Area', ascending=False))
In [ ]:
# It looks like in fact the pins turn up in both large and small layers. 
# Let's remove the large ones from the map to see the small ones hiding under them.
# Cut out the large layers by keeping only SUs with a surface area smaller than 50.
gabii_pins_small = gabii_pins.loc[gabii_pins['Shape_Area']<50]

In [ ]:
# plot these smaller layers as a map
fig, ax = plt.subplots(figsize=(15,15))
gabii_su_poly.plot(ax=ax, color='white', edgecolor='black')
gabii_pins_small.plot(ax=ax, color='blue', edgecolor='black');

# You can change the maximum size of the layers allowed by changing '50' in the cell above and re-running it with 'Ctrl+Enter'
In [ ]:
# You might choose to explore the database to read more about 
# the smaller SUs where pins were found across the site, or about the pins themselves. 

# Get a list of the relevant SU numbers by displaying the table
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(gabii_pins_small)

So far we have been thinking about a specific class of objects. What if you wanted to know more about the published special finds assemblage as a whole? You can read about the finds individually as described in the publication or linked in the database. You can also use the same techniques you used to look at the pins to explore the whole assemblage of published Area B house finds spatially.

In [ ]:
# combine all the spatial data for the SUs with the special finds table using 'merge'
su_poly_sf = gabii_su_poly.merge(sf_su, on='SU')
# then select only the SUs for Area B
b_house_sf = su_poly_sf.loc[su_poly_sf['SU'].between(1000, 1999, inclusive='both')]
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(b_house_sf)
In [ ]:
# To investigate the whole assemblage further, we are going to need more tools.
import pysal
from sklearn import cluster
import seaborn as sns
import numpy as np

We're going to use cluster analysis to try and better understand our patterns. Clustering is a broad set of techniques for finding groups within a data set. When we cluster observations, we want items in the same group to be similar and items in different groups to be dissimilar. Clustering allows us to identify which things are alike on the basis of multiple characteristics. K-means clustering is a simple and frequently applied clustering method for splitting a dataset into a set of k (k being an arbitrary number you get to choose) groups.
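Before applying k-means to the Gabii finds, here's a minimal, self-contained sketch of the idea on invented data (none of these numbers come from the excavation):

```python
import numpy as np
from sklearn import cluster

# Six made-up 'contexts', each described by two find counts.
# The first three form a low-count group, the last three a high-count group.
X = np.array([[1, 2], [1, 1], [2, 2],
              [10, 11], [11, 10], [12, 12]])

# Ask k-means for k=2 groups; each row gets a cluster label (0 or 1).
km = cluster.KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)
```

The first three rows share one label and the last three the other; which group gets called 0 and which 1 is arbitrary. The cells below do the same thing with the real data, just with more columns (one per find type, plus context size).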

In [ ]:
# Next step: cluster together contexts where the pattern of all types of finds is similar,
# with respect to the size of the context.
# First, count how many of each find type appears in each SU and attach those counts to the table.
b_house_sf_c = b_house_sf.groupby('SU')['SF_OBJECT_TYPE'].value_counts().unstack().fillna(0)
b_house_sf_counts = b_house_sf.merge(b_house_sf_c, on='SU')

# Make 5 clusters, accounting for the size of the context and the counts of different types of finds.
# Drop all the other fields.
km5 = cluster.KMeans(n_clusters=5)
X = b_house_sf_counts.drop(['geometry', 'OBJECTID', 'DESCRIPTIO', 'Shape_Length', 'SU', 'SF_ID', 'SF_DESCRIPTION', 'SF_OBJECT_TYPE'], axis=1).values
km5cls = km5.fit(X)
km5cls_labels = km5cls.predict(X)
In [ ]:
# Plot the clusters in 3D, with a star marking the centre of each cluster.
from mpl_toolkits.mplot3d import Axes3D
centroids = km5cls.cluster_centers_
print(centroids) # cluster centre coordinates
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=km5cls_labels,
           s=50, cmap='Accent')
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2], marker='*', c='#050505', s=1000)

Each cluster produced should contain the SUs that are similar to one another on the basis of the counts of each type of find and the surface area of the SU.

In [ ]:
# Plot the clusters, groups of contexts that have similar overall assemblages.
# Give a different colour to the SUs that belong to each cluster.

f1, ax = plt.subplots(1, figsize=(15,15))

b_house_sf_counts.assign(cl1=km5cls.labels_)\
   .plot(column='cl1', categorical=True, legend=True, \
         linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax, alpha = 0.1)

In [ ]:
# Do the same, ignoring the size of the context.
km5 = cluster.KMeans(n_clusters=5)
X = b_house_sf_counts.drop(['geometry', 'OBJECTID', 'DESCRIPTIO', 'Shape_Length', 'SU', 'SF_ID', 'SF_DESCRIPTION', 'SF_OBJECT_TYPE', 'Shape_Area'], axis=1).values
km5cls2 = km5.fit(X)
km5cls2_labels = km5cls2.predict(X)
In [ ]:
# Plot the clusters, groups of contexts that have similar overall assemblages.
# Give a different colour to the SUs that belong to each cluster.

f1, ax = plt.subplots(1, figsize=(15,15))

b_house_sf_counts.assign(cl2=km5cls2.labels_)\
   .plot(column='cl2', categorical=True, legend=True, \
         linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax, alpha = 0.1)


The patterns are definitely different. How can we interpret the fact that context size affects the clustering of the finds? Do big units, which perhaps represent dumps or colluvial mashups, have a fundamentally different character than the varied small contexts?

In [ ]:
# Look at the difference with and without context size taken into account side by side.
fig, axes = plt.subplots(ncols=2, figsize=(15, 5))
b_house_sf_counts.assign(cl2=km5cls2.labels_)\
   .plot(column='cl2', categorical=True, legend=True, \
         linewidth=0.1, cmap='Accent', edgecolor='white', alpha = 0.1, ax=axes[0]).axis('equal')
b_house_sf_counts.assign(cl1=km5cls.labels_)\
   .plot(column='cl1', categorical=True, legend=True, \
         linewidth=0.1, cmap='Accent', edgecolor='white', alpha = 0.1, ax=axes[1]).axis('equal')

# Either way, 1016 (the big blue layer) has a different finds assemblage type from any other SU. 
# Maybe we should exclude it, and the other big colluvial layers.
In [ ]:
# assign the cluster IDs to each context permanently
b_house_sf_counts = b_house_sf_counts.assign(cl1=km5cls.labels_)
b_house_sf_counts = b_house_sf_counts.assign(cl2=km5cls2.labels_)
b_house_sf_counts_small = b_house_sf_counts.loc[b_house_sf_counts['Shape_Area']<100]
In [ ]:
# What happens when we change the number of clusters (groups), and ignore the big layers?
km7 = cluster.KMeans(n_clusters=7)
km7cls3 = km7.fit(b_house_sf_counts_small.drop(['geometry', 'OBJECTID', 'DESCRIPTIO', 'Shape_Length', 'SU', 'SF_ID', 'SF_DESCRIPTION', 'SF_OBJECT_TYPE', 'Shape_Area'], axis=1).values)
b_house_sf_counts_small = b_house_sf_counts_small.assign(cl3=km7cls3.labels_)

f3, ax = plt.subplots(1, figsize=(15,15))

b_house_sf_counts_small\
   .plot(column='cl3', categorical=True, alpha = 0.2, legend=True, \
         linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax)

That also changes things. Without going into too much detail, finding the ideal number of clusters is something of a black art. Try playing around with the number of clusters in the notebook, or with the size cut-off for inclusion.
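One common (if imperfect) aid for choosing k is the 'elbow' method: run k-means for a range of k values and watch how the inertia (the within-cluster sum of squares) falls. Here's a sketch on invented, well-separated data, not the Gabii counts:

```python
import numpy as np
from sklearn import cluster

# Three made-up groups of 20 points each, centred at 0, 5 and 10.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2)),
               rng.normal(10, 0.5, (20, 2))])

# Fit k-means for k = 1..6 and record the inertia for each k.
inertias = {}
for k in range(1, 7):
    km = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(k, round(v, 1))
```

The inertia drops sharply up to k=3 (the true number of groups here) and then flattens; that bend is the 'elbow'. Real archaeological data rarely gives such a clean elbow, which is exactly why experimenting with k is worthwhile.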

In [ ]:
# Now let's see if the type of SU has any influence on the pattern of finds. 
# We need to add some more data, pulled from the SU descriptions in the database. 
su = pd.read_csv("")
In [ ]:
# Group the third set of cluster assignments (the ones without big layers making a mess) by formation process
b_house = b_house_sf_counts_small.join(su.set_index('SU'), on='SU')

# Cross-tabulate formation process against cluster membership.
# NB: 'FORMATION_PROCESS' is a placeholder column name - check the actual field name in the SU table.
pd.crosstab(b_house['FORMATION_PROCESS'], b_house['cl3'])

# How can we interpret this? For example, it seems most accumulations are class 5, while most collapses are class 1.
# Intentional depositions are split fairly evenly across several classes.
# Would you say formation process is important to the assemblage's character based on this?
In [ ]:
# Group the third set of cluster assignments (the ones without big layers making a mess) by soil compaction
# NB: 'COMPACTION' is a placeholder column name - check the actual field name in the SU table.
pd.crosstab(b_house['COMPACTION'], b_house['cl3'])

# Would you say soil compaction is related to the character of the assemblage?

That concludes this tutorial.

You could continue to explore the data, looking at different classes of finds, assemblages found in different types of SUs, and the locations of specific finds discussed in the publication.

Hopefully you have:

  • learnt to work with real special finds data from an excavation, in all its messiness
  • started thinking about quantitative and spatial approaches to finds data from excavations and how they can help us better understand the patterns we see
  • reflected on the added value and challenges of publishing excavation data together with excavation reports.

You should continue to reflect on questions raised by this exercise:

  • How do the publication and the published data work together?
  • Is it easy to move back and forth between the published data and the publication?
  • How else might the interpretations of the authors and the data on which those interpretations are based be presented together?

You might be wondering who makes this data. In summer 2018, when the most recent maps were made, it was these people: The Gabii Team

Back in 2012, when we finished excavating the house, it was these people: The Gabii Team