Notebook

Visualizing Classical Music Influence with `networkx`¶

0. Introduction¶

I came across this 1927 graphic showing the musical descendants of Czerny:

What follows is my attempt to create a similar music-ancestry graph directly from data.

1. Scraping the data¶

Fortunately for us, Wikipedia has an incredible comprehensive series of lists of music students by teacher that is perfect for our purposes – each teacher is given a heading and followed by a list of students.

Let's try scraping it.

In [1]:

%matplotlib notebook
import matplotlib.pyplot as plt

In [2]:

import codecs
import json
import urllib

from requests import get
from bs4 import BeautifulSoup
import networkx as nx

In [3]:

pages = [
    'https://en.wikipedia.org/wiki/List_of_music_students_by_teacher:_A_to_B',
    'https://en.wikipedia.org/wiki/List_of_music_students_by_teacher:_C_to_F',
    'https://en.wikipedia.org/wiki/List_of_music_students_by_teacher:_G_to_J',
    'https://en.wikipedia.org/wiki/List_of_music_students_by_teacher:_K_to_M',
    'https://en.wikipedia.org/wiki/List_of_music_students_by_teacher:_N_to_Q',
    'https://en.wikipedia.org/wiki/List_of_music_students_by_teacher:_R_to_S',
    'https://en.wikipedia.org/wiki/List_of_music_students_by_teacher:_T_to_Z'
]

We'll go through each of the 7 pages and build up an adjacency list of teacher->student edges.

We'll skip over students or teachers that don't have an associated Wikipedia link to ensure that the people we're getting are all notable enough to have Wikipedia entries.

In [4]:

adjlist = ''

for url in pages:
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    for heading in soup.find_all('h3'):
        if not heading.find('a'):
            continue
        url = heading.find('a')['href']
        if '/wiki/' not in url:
            continue
        teacher_name = url.split('/wiki/')[1].split('_(')[0].split('#')[0]

        student_entries = heading.find_next_sibling("div", class_="columns").find_all('li')
        student_names = []
        for student in student_entries:
            if not student.find('a'):
                continue

            url = student.find('a')['href']
            if '/wiki/' in url:
                student_names.append(url.split('/wiki/')[1].split('_(')[0].split('#')[0])

        adjlist += '{} {}\n'.format(teacher_name, ' '.join(student_names))

In [5]:

adjlist[:1000]

Out[5]:

'Arkady_Abaza Nikolai_Roslavets\nChristian_Ferdinand_Abel Carl_Friedrich_Abel\nHermann_Abendroth Allard_de_Ridder Herbert_Eimert G%C3%BCnther_Herbig Wilhelm_Sch%C3%BCchter William_Steinberg\nDieter_Acker Susanne_Erding-Swiridoff\nAdolphe_Adam L%C3%A9o_Delibes Ferdinand_Poise Lo%C3%AFsa_Puget\nLouis_Adam Joseph_Daussoigne-M%C3%A9hul Ferdinand_H%C3%A9rold Friedrich_Kalkbrenner Henry_Lemoine\nJohn_Luther_Adams Corey_Dargel\nMurray_Adaskin Andrew_Dawes Boyd_McDonald Paul_Pedersen Rodney_Sharman Timothy_Williams\nGuido_Adler Karel_Navr%C3%A1til Anton_Webern Egon_Wellesz\nOskar_Adler Hans_Keller J%C3%B3zef_Koffler Dragan_Plamenac Arnold_Schoenberg\nSamuel_Adler Martin_Amlin Claude_Baker Roger_Briggs Jason_Robert_Brown David_Crumb Greg_Danner Eric_Ewazen Peng-Peng_Gong Jay_Greenberg Anthony_Iannaccone Kamran_Ince Michael_Isaacson Scott_Lindroth Marc_Mellits Carter_Pann Robert_Paterson Paul_Phillips Paul_Reller Michael_Alec_Rose Daria_Semegen Gordon_Stout Christopher_Theofanidis Michael_Sidney_Timpson Fi'

The names are URL-encoded because we took them from the link hrefs, so let's decode as UTF-8.

In [6]:

adjlist = urllib.unquote(adjlist).decode('utf-8')

In [7]:

print adjlist[:1000]

Arkady_Abaza Nikolai_Roslavets
Christian_Ferdinand_Abel Carl_Friedrich_Abel
Hermann_Abendroth Allard_de_Ridder Herbert_Eimert Günther_Herbig Wilhelm_Schüchter William_Steinberg
Dieter_Acker Susanne_Erding-Swiridoff
Adolphe_Adam Léo_Delibes Ferdinand_Poise Loïsa_Puget
Louis_Adam Joseph_Daussoigne-Méhul Ferdinand_Hérold Friedrich_Kalkbrenner Henry_Lemoine
John_Luther_Adams Corey_Dargel
Murray_Adaskin Andrew_Dawes Boyd_McDonald Paul_Pedersen Rodney_Sharman Timothy_Williams
Guido_Adler Karel_Navrátil Anton_Webern Egon_Wellesz
Oskar_Adler Hans_Keller Józef_Koffler Dragan_Plamenac Arnold_Schoenberg
Samuel_Adler Martin_Amlin Claude_Baker Roger_Briggs Jason_Robert_Brown David_Crumb Greg_Danner Eric_Ewazen Peng-Peng_Gong Jay_Greenberg Anthony_Iannaccone Kamran_Ince Michael_Isaacson Scott_Lindroth Marc_Mellits Carter_Pann Robert_Paterson Paul_Phillips Paul_Reller Michael_Alec_Rose Daria_Semegen Gordon_Stout Christopher_Theofanidis Michael_Sidney_Timpson Fisher_Tull Dan_Welcher Michael_Glenn_Will

Let's write this adjacency list to a file, and load it into networkx.

In [8]:

with codecs.open("adjlist.txt", "w", "utf-8") as temp:
    temp.write(adjlist)

In [9]:

G = nx.read_adjlist('adjlist.txt', create_using=nx.DiGraph())

In [10]:

G.number_of_nodes()

Out[10]:

Let's also export our graph to a JSON format that Cytoscape.js can understand. This might be useful later on if we want to make a JavaScript visualization of the data.

In [11]:

def pretty_node(node):
    """Given a node name, format it for display purposes - e.g. 'Carl_Friedrich_Abel' => 'C. Abel'."""
    return u'{}. {}'.format(node[0], node.split('_I')[0].split('_')[-1])

In [12]:

def export_to_cytoscape_json(graph, filename, weight_fn=None):
    def exported_node(node, weight_fn=None):
        return {
            'data': {
                'id': unicode(pretty_node(node)),
                'fullName': unicode(node.replace('_', ' ')),
                'weight': weight_fn(node) if weight_fn else 1
            }
        }
    
    # Topologically sort the nodes if possible (i.e. the graph is acyclic)
    try:
        nodes = [exported_node(i, weight_fn) for i in nx.topological_sort(graph)]
    except:
        nodes = [exported_node(i, weight_fn) for i in graph.nodes()]
    links = [{'data': {'source': pretty_node(u[0]), 'target': pretty_node(u[1])}} for u in graph.edges()]
    with codecs.open(filename, "w", "utf-8") as file:
        json.dump(nodes + links, file, indent=2)

In [13]:

export_to_cytoscape_json(G, 'cytoscape.json')

2. Who are the most influential teachers?¶

Now that we have our graph, let's analyze it!

For starters, let's see if we can figure out who the most influential teacher is.

But how do we quantify "influence"? One natural metric to use is Katz centrality, which works similarly to Google's PageRank – nodes that are connected to influential nodes themselves gain influence, and so on.

Katz centrality is a directional measure, and it turns out we need to compute it over the reversed graph to find the most influential teachers – otherwise we just end up getting the most "influential students".

In [14]:

c = nx.katz_centrality(G.reverse())

In [15]:

for teacher in sorted(c, key=c.get)[::-1][:20]:
    print teacher, c[teacher]

Nadia_Boulanger 0.274006486745
Darius_Milhaud 0.133865479678
Karlheinz_Stockhausen 0.111888232649
Olivier_Messiaen 0.107905535493
Roger_Sessions 0.0922829267002
André_Gedalge 0.082045493007
Vincent_d'Indy 0.0675442830938
Arnold_Schoenberg 0.0669900736399
Paul_Hindemith 0.0659519740423
Charles-Marie_Widor 0.0584518233985
Gabriel_Fauré 0.0566999118051
Franz_Liszt 0.0563658803324
Aaron_Copland 0.0560104591138
Antoine_François_Marmontel 0.0541201524149
Louis_Vierne 0.0538454383179
Louis_Andriessen 0.0532446143755
Charles_Villiers_Stanford 0.053224381422
Milton_Babbitt 0.0530784911786
Ferruccio_Busoni 0.052250859768
Salomon_Jadassohn 0.0499370156958

There's a lot of famous names there, but in the end, it's not even close. Nadia Boulanger has more than double the centrality of anybody else. Given that Boulanger taught everyone from Copland to Piazzolla to Quincy Jones, her rank is well-deserved.

And let's make a subgraph of just the 100 most-influential teachers – it'll be easier to work with than the whole graph of ~5,000 people.

In [16]:

most_important_teachers = sorted(c, key=c.get)[::-1][:100]

In [17]:

important_teachers_graph = nx.subgraph(G, most_important_teachers)

3. Let's visualize it!¶

I stole some code from StackOverflow to visualize hierarchical graphs in a pretty way:

In [18]:

# https://stackoverflow.com/questions/29586520/can-one-get-hierarchical-graphs-from-networkx-with-python-3/29597209
def hierarchy_pos(G, root, levels=None, width=1., height=1.):
    '''If there is a cycle that is reachable from root, then this will see infinite recursion.
       G: the graph
       root: the root node
       levels: a dictionary
               key: level number (starting from 0)
               value: number of nodes in this level
       width: horizontal space allocated for drawing
       height: vertical space allocated for drawing'''
    TOTAL = "total"
    CURRENT = "current"
    def make_levels(levels, node=root, currentLevel=0, parent=None):
        """Compute the number of nodes for each level
        """
        if not currentLevel in levels:
            levels[currentLevel] = {TOTAL : 0, CURRENT : 0}
        levels[currentLevel][TOTAL] += 1
        try:
            neighbors = G.neighbors(node)
            for neighbor in neighbors:
                if not neighbor == parent:
                    levels =  make_levels(levels, neighbor, currentLevel + 1, node)
        except:
            pass
        return levels

    def make_pos(pos, node=root, currentLevel=0, parent=None, vert_loc=0):
        dx = 1.0/levels[currentLevel][TOTAL]
        left = dx/2
        pos[node] = ((left + dx*levels[currentLevel][CURRENT])*width, vert_loc)
        levels[currentLevel][CURRENT] += 1
        try:
            neighbors = G.neighbors(node)
            for neighbor in neighbors:
                if not neighbor == parent:
                    pos = make_pos(pos, neighbor, currentLevel + 1, node, vert_loc-vert_gap)
        except:
            pass
        return pos
    
    if levels is None:
        levels = make_levels({})
    else:
        levels = {l:{TOTAL: levels[l], CURRENT:0} for l in levels}
    vert_gap = height / (max([l for l in levels])+1)
    return make_pos({})

And wrapped some helpful logic around it:

In [19]:

def draw_hierarchical_graph(graph, root=None):
    plt.figure(figsize=(10, 10))

    root = root or nx.topological_sort(graph).next()
    pos = hierarchy_pos(graph, root)
    reduced_graph = nx.subgraph(graph, pos.keys())
    
    relabeled_graph = nx.relabel_nodes(reduced_graph, pretty_node)
    relabeled_pos = {pretty_node(node): p for node, p in pos.items()}
    nx.draw(relabeled_graph, pos=relabeled_pos, with_labels=True, node_size=500, node_color='w', font_size=9, arrowsize=8)

Ok, let's try it! Note that if we don't provide a root node, the graph will be drawn from the first node in topological order (in other words, any node that doesn't have any parents):

In [20]:

draw_hierarchical_graph(important_teachers_graph)

A bit messy, but on the whole, that's a reasonable looking graph! It starts from the influential 19th-century French teacher and pianist Marmontel, and proceeds through his various students to such names as Boulanger, Messiaen, Milhaud, and Cage.

4. The missing links¶

Let's expand our graph a little bit.

One thing we could do is simply add more teachers from the most-influential list – say, take the top 150 instead of the top 100. But let's try something else.

What if we find all the people who are "in between" influential teachers but not influential teachers themselves – that is, people who both taught and were taught by people on the influential-teachers list?

In [21]:

most_important_teachers_students = set([n for t in most_important_teachers for n in G.successors(t)])

In [22]:

def has_important_students(teacher):
    students = [s for s in G.successors(teacher)]
    return any([student in most_important_teachers for student in students])

most_important_teachers_and_important_students = \
    set(most_important_teachers) | set([p for p in most_important_teachers_students if has_important_students(p)])           

In [23]:

len(most_important_teachers_and_important_students) - len(most_important_teachers)

Out[23]:

Looks like there are 32 of these students who are "in between" influential teachers. Who are they?

In [24]:

most_important_teachers_and_important_students - set(most_important_teachers)

Out[24]:

{u'Alban_Berg',
 u'Aleksander_Micha\u0142owski',
 u'Alexander_Zemlinsky',
 u'Anna_Yesipova',
 u'Charles_Koechlin',
 u'Charles_Wood',
 u'Ernst_Krenek',
 u'Eusebius_Mandyczewski',
 u'Franz_Lachner',
 u'Friedrich_Kiel',
 u'Fritz_Reiner',
 u'Gian_Carlo_Menotti',
 u'Hans_von_Koessler',
 u'Henri_Dutilleux',
 u'Henri_Vieuxtemps',
 u'Horatio_Parker',
 u'Isidor_Seiss',
 u'Istv\xe1n_Thom\xe1n',
 u'Jacques-Nicolas_Lemmens',
 u'Johann_Nepomuk_Hummel',
 u'Louis_Niedermeyer',
 u'Ludwig_Thuille',
 u'Ludwig_van_Beethoven',
 u'Marcel_Dupr\xe9',
 u'Marion_Bauer',
 u'Maurice_Ravel',
 u'Orpha-F._Deveaux',
 u'Paul_Pisk',
 u'Peter_Mennin',
 u'Roberto_Gerhard',
 u'Romain-Octave_Pelletier_I',
 u'Stephen_Heller'}

Ah yeah, some big names here, like Beethoven (taught by Salieri and Albrechtsberger and in turn taught Czerny).

Let's make a slightly bigger subgraph consisting of both the 100 most influential teachers and their "in-between" students:

In [25]:

teacher_and_student_subgraph = nx.subgraph(G, most_important_teachers_and_important_students)

Unfortunately, by bringing these extra people into the graph, we introduce some cycles, making a topological sort (which we need to visualize the graph) no longer possible:

In [26]:

try:
    print [x for x in nx.topological_sort(teacher_and_student_subgraph)]
except Exception:
    import traceback
    traceback.print_exc()

Traceback (most recent call last):
  File "<ipython-input-26-280518520b52>", line 2, in <module>
    print [x for x in nx.topological_sort(teacher_and_student_subgraph)]
  File "/Users/alex/src/music-graph/env/lib/python2.7/site-packages/networkx/algorithms/dag.py", line 208, in topological_sort
    raise nx.NetworkXUnfeasible("Graph contains a cycle or graph changed "
NetworkXUnfeasible: Graph contains a cycle or graph changed during iteration

How many cycles are there, anyway?

In [27]:

for cycle in nx.simple_cycles(teacher_and_student_subgraph):
    print cycle

[u'Claude_Champagne', u'Orpha-F._Deveaux']
[u'Simon_Sechter', u'Henri_Vieuxtemps']

Removing individual edges seems to be a bit of a pain in networkx, so let's just take the easy way out and arbitrarily remove one node from each of the cycles:

In [28]:

most_important_teachers_and_important_students.remove(u'Orpha-F._Deveaux')
most_important_teachers_and_important_students.remove(u'Henri_Vieuxtemps')

In [29]:

teacher_and_student_subgraph = nx.subgraph(G, most_important_teachers_and_important_students)

And now we can finally visualize this augmented subgraph. Unfortunately all the extra links introduced by the "in-between" students make it a little messy:

In [30]:

draw_hierarchical_graph(teacher_and_student_subgraph)

5. The Beethoven tree¶

If we take set Beethoven as the root of the visualization, we get something resembling the original graphic that inspired this:

In [31]:

draw_hierarchical_graph(teacher_and_student_subgraph, 'Ludwig_van_Beethoven')

If we compare it to the original image, there's some overlap but not much.

The original image identifies the three major branches of the tree starting from Czerny as Liszt, Leschetizky, and Kullak. In our case, we get Liszt and Leschetizky, but Heller instead of Kullak.

WHen we go one leveldeper, the list becomes unrecognizable. Aside from a few overlapping people (e.g. Anna Yesipova), the people in our list are completely different than those in hte original graphic.

Much of this is surely attributable to the diffferent focuses of the trees. The original tree specifically highlights notable pianists, while in our case we filtered to people who are extremely influential within the graph – in other words, great teachers notable for the number and relative influence of their students.

The original graphic is also a product of its time – perhaps the reason it doesn't show the Heller subtree is because his most famous grand-student, Aaron Copland was still relatively obscure in 1927.

6. Before Beethoven¶

Now that we've seen the tree rooted at Beethoven, let's try going one level deeper into the past by starting from one of his teachers instead. Who were his teachers, anyway?

In [32]:

[x for x in teacher_and_student_subgraph.predecessors('Ludwig_van_Beethoven')]

Out[32]:

[u'Antonio_Salieri', u'Johann_Georg_Albrechtsberger']

Let's Albrechtsberger (the Salieri subtree is actually very similar because they both taught Moscheles, Hummel, and Beethoven – guess it was a small world):

In [33]:

draw_hierarchical_graph(teacher_and_student_subgraph, u'Johann_Georg_Albrechtsberger')

Oof, that's a messy graph. What if instead of using teacher_and_student_subgraph we fall back to the more restrictive important_teachers_graph?

With Beethoven and Hummel gone, the graph beomces more manageable, and includes an interesting tree descending from César Franck. Even here it's hard to miss Nadia Boulanger's influence – she personally taught another 5 of the 100 most influential teachers!

In [34]:

draw_hierarchical_graph(important_teachers_graph, u'Johann_Georg_Albrechtsberger')

Is there a path between J.S. Bach and Beethoven? There is, but it passes through some low-ranked teachers who didn't make it into our subgraph – Homilius, Hiller, and Neefe:

In [35]:

for n in nx.shortest_path(G, 'Johann_Sebastian_Bach', 'Ludwig_van_Beethoven'):
    print u'{} [{:.4f}]'.format(n.replace('_', ' '), c[n])

Johann Sebastian Bach [0.0404]
Gottfried August Homilius [0.0169]
Johann Adam Hiller [0.0154]
Christian Gottlob Neefe [0.0124]
Ludwig van Beethoven [0.0172]

7. What's the longest path?¶

We can use networkx's dag_longest_path algorithm to find the longest path within our graph. The only catch is that the notion of "longest path" is meaningless in a graph that contains cycles:

In [36]:

try:
    print nx.dag_longest_path(G)
except Exception:
    import traceback
    traceback.print_exc()

Traceback (most recent call last):
  File "<ipython-input-36-00dbbde4bbf0>", line 2, in <module>
    print nx.dag_longest_path(G)
  File "<decorator-gen-376>", line 2, in dag_longest_path
  File "/Users/alex/src/music-graph/env/lib/python2.7/site-packages/networkx/utils/decorators.py", line 82, in _not_implemented_for
    return not_implement_for_func(*args, **kwargs)
  File "/Users/alex/src/music-graph/env/lib/python2.7/site-packages/networkx/algorithms/dag.py", line 653, in dag_longest_path
    for v in nx.topological_sort(G):
  File "/Users/alex/src/music-graph/env/lib/python2.7/site-packages/networkx/algorithms/dag.py", line 208, in topological_sort
    raise nx.NetworkXUnfeasible("Graph contains a cycle or graph changed "
NetworkXUnfeasible: Graph contains a cycle or graph changed during iteration

As before, let's just remove one node from each of the cycles:

In [37]:

for cycle in nx.simple_cycles(G):
    print cycle

[u'Simon_Sechter', u'Henri_Vieuxtemps']
[u'Wladimir_Vogel', u'Erik_Bergman', u'Heinz_Tiessen']
[u'Wladimir_Vogel', u'Erik_Bergman']
[u'Claude_Champagne', u'Orpha-F._Deveaux']
[u'\u0218tefan_Niculescu', u'Mihail_Jora']

In [38]:

G_without_cycles = nx.subgraph(G, [n for n in G.nodes if n not in [u'Henri_Vieuxtemps', u'Erik_Bergman', u'Orpha-F._Deveaux', u'Mihail_Jora']])

And let's double-check that we've actually gotten rid of all the cyles:

In [39]:

for cycle in nx.simple_cycles(G_without_cycles):
    print cycle

Great, now we can finally find the longest path:

In [40]:

for n in nx.dag_longest_path(G_without_cycles):
    print u'{} [{:.4f}]'.format(n.replace('_', ' '), c[n])

Orlande de Lassus [0.0147]
Andrea Gabrieli [0.0143]
Giovanni Gabrieli [0.0154]
Heinrich Schütz [0.0370]
Johann Schelle [0.0153]
Johann David Heinichen [0.0120]
Johann Georg Pisendel [0.0139]
Franz Benda [0.0193]
Johann Wilhelm Hertel [0.0120]
Carl Friedrich Christian Fasch [0.0137]
Carl Friedrich Zelter [0.0189]
Carl Friedrich Rungenhagen [0.0197]
Franz Wüllner [0.0224]
Hans von Koessler [0.0244]
Ernő Dohnányi [0.0284]
Béla Bartók [0.0330]
Fritz Reiner [0.0169]
Vincent Persichetti [0.0336]
Paavo Heininen [0.0171]
Jouni Kaipainen [0.0117]
Tomi Räisänen [0.0106]

Wow, an unbroken link from Orlande de Lassus (1532–1594) all the way to Tomi Räisänen (1976–)!

8. Could we see the whole network?¶

Are all of the 132 people in teacher_and_student_subgraph connected? We can figure this out by determining the weakly connected components of the graph (weakly meaning that the direction of each link doesn't matter):

In [41]:

for comp in nx.weakly_connected_components(teacher_and_student_subgraph):
    print len(comp), comp

110 set([u'Ernst_Krenek', u'Ton_de_Leeuw', u'George_Rochberg', u'Engelbert_Humperdinck', u'Louis_Vierne', u'Paul_Pisk', u'B\xe9la_Bart\xf3k', u'Heinrich_Neuhaus', u'Ralph_Vaughan_Williams', u'Henri_Dutilleux', u'Charles_Wood', u'Karlheinz_Stockhausen', u'Simon_Sechter', u'Th\xe9odore_Dubois', u'Roger_Sessions', u'Roger_Reynolds', u'Charles_Koechlin', u'Franz_Liszt', u'Otto_Luening', u'Alban_Berg', u'Mario_Davidovsky', u'Howard_Hanson', u'Maria_Curcio', u'Gabriel_Faur\xe9', u'Edward_Burlingame_Hill', u'Friedrich_Kiel', u'Isabelle_Vengerova', u'Paul_Dukas', u'Maurice_Ravel', u'Ludwig_van_Beethoven', u'Horatio_Parker', u'Aaron_Copland', u'Gian_Carlo_Menotti', u'Lazare_L\xe9vy', u'Peter_Mennin', u'Seymour_Shifrin', u'G\xe9rard_Grisey', u'Alexander_Goehr', u'Nadia_Boulanger', u'Robert_Erickson', u'No\xebl_Gallon', u'Ludwig_Thuille', u'Jacques-Nicolas_Lemmens', u'Paul_Hindemith', u'Moritz_Hauptmann', u'Isidor_Philipp', u'Samuel_Adler', u'Carl_Czerny', u'Hans_von_Koessler', u'Randall_Thompson', u'Antonio_Salieri', u'Egon_Petri', u'Josef_Rheinberger', u'Walter_Piston', u'Franz_Lachner', u'Vincent_Persichetti', u'Robert_Fuchs', u'Olivier_Messiaen', u'Johann_Nepomuk_Hummel', u'Leopold_Auer', u'Bernard_Rogers', u'Andr\xe9_Gedalge', u'Eusebius_Mandyczewski', u'Charles-Marie_Widor', u'Johann_Georg_Albrechtsberger', u'Arnold_Schoenberg', u'Ignaz_Moscheles', u'Stephen_Heller', u'Isidor_Seiss', u'Jacob_Druckman', u'Ernest_Bloch', u'Marcel_Dupr\xe9', u'Pauline_Oliveros', u'Theodor_Leschetizky', u'Ross_Lee_Finney', u'Jen\u0151_Hubay', u"Vincent_d'Indy", u'Albert_Lavignac', u'Anna_Yesipova', u'Anton_Reicha', u'Fran\xe7ois-Joseph_F\xe9tis', u'Ferruccio_Busoni', u'Claude_Champagne', u'Istv\xe1n_Thom\xe1n', u'Darius_Milhaud', u'Roberto_Gerhard', u'John_Weinzweig', u'Leonard_Stein', u'Rosario_Scalero', u'Ernest_Guiraud', u'Lukas_Foss', u'Percy_Goetschius', u'Fritz_Reiner', u'Marion_Bauer', u'Louis_Di\xe9mer', u'Salomon_Jadassohn', u'Romain-Octave_Pelletier_I', u'Fr\xe9d\xe9ric_Chopin', u'Carl_Reinecke', u'Georges_Mathias', u'C\xe9sar_Franck', u'Louis_Niedermeyer', u'Joseph_Joachim', u'Alexander_Zemlinsky', u'Jean_Gallon', u'Charles_Villiers_Stanford', u'Milton_Babbitt', u'Antoine_Fran\xe7ois_Marmontel', u'Aleksander_Micha\u0142owski', u'Artur_Schnabel'])
3 set([u'Louis_Andriessen', u'Luigi_Dallapiccola', u'Luciano_Berio'])
1 set([u'Johann_Sebastian_Bach'])
2 set([u'Franco_Donatoni', u'Goffredo_Petrassi'])
1 set([u'Joseph_Schwantner'])
1 set([u'Otakar_\u0160ev\u010d\xedk'])
1 set([u'Zbigniew_Drzewiecki'])
1 set([u'M\xe1ty\xe1s_Seiber'])
2 set([u'Manuel_Garc\xeda', u'Mathilde_Marchesi'])
1 set([u'Pran_Nath'])
1 set([u'Boris_Blacher'])
1 set([u'Wolfgang_Fortner'])
1 set([u'Nikolai_Rimsky-Korsakov'])
1 set([u'Jos\xe9_Rey_de_la_Torre'])
1 set([u'Louis_Plaidy'])
1 set([u'Dorothy_DeLay'])
1 set([u'Heinrich_Sch\xfctz'])

Looks like there's one massive connected component of 110 people (~85% of the whole graph), a few small 2-3 person chains, and about a dozen people on their own.

Let's try visualizing the graph made up of just the 110 people who are all connected:

In [42]:

connected_comp = nx.weakly_connected_components(teacher_and_student_subgraph).next()
connected_teacher_and_student_subgraph = nx.subgraph(teacher_and_student_subgraph, connected_comp)

In [43]:

plt.figure(figsize=(10, 10))
connected_teacher_and_student_subgraph_pretty = nx.relabel_nodes(connected_teacher_and_student_subgraph, pretty_node)
nx.draw(connected_teacher_and_student_subgraph_pretty, with_labels=True, node_size=500, node_color='w', font_size=9, arrowsize=8)

Well, everyone is certainly connected, but there doesn't seem to be a good way visualize the graph using networkx. Let's export it to JSON instead so we can play around with it using Cytoscape.js.

In [44]:

def remap(x, in_min, in_max, out_min, out_max):
    return (x - in_min) * (out_max - out_min) / (in_max - in_min) + out_min

export_to_cytoscape_json(
    connected_teacher_and_student_subgraph,
    'cytoscape_connected_111.json', 
    weight_fn=lambda node: remap(c[node], min(c.values()), max(c.values()), 10, 200)
)

After some messing around with Cytoscape.js, here is a reasonable interactive visualization.

9. In conclusion¶

Thanks for making it this far!

So, what have we accomplished?

We've:

managed to ingest student-teacher relationships for ~5,000 classical musicians from Wikipedia.
determined (using the Katz centrality metric) who the most influential teachers are. (Conclusion: Nadia Boulanger by a long shot, followed by Milhaud, Stockhausen, and Messiaen)
reproduced, as best as we could, the original graphic that this exploration was inspired by, and noted similarities and differences.
found the longest unbroken teacher->student link in the data, stretching through four centuries from Orlande de Lassus (1532–1594) all the way to Tomi Räisänen (1976–).
and finally, made a spiffy interactive visualization of a particularly important connected subgraph of teachers.

In [ ]:

Visualizing Classical Music Influence with networkx¶