Data extraction example

This notebook uses a digital collection described in MARCXML files that contain the descriptive metadata of the Moving Image Archive catalogue of the National Library of Scotland.
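To make the MARCXML layout concrete, here is a minimal sketch that reads one record using only the standard library. The XML snippet is hand-written for illustration (it reuses a title and link shown later in this notebook, but it is not copied from the dataset file), and the `campos` dictionary is just an illustrative structure. Each `datafield` carries a numeric tag (245 for the title, 856 for the link) and one or more coded `subfield` elements.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; not taken from the dataset file.
MARCXML = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">INTO THE MISTS.</subfield>
    </datafield>
    <datafield tag="856" ind1="4" ind2="0">
      <subfield code="u">http://movingimage.nls.uk/film/0004</subfield>
    </datafield>
  </record>
</collection>"""

# MARCXML elements live in the MARC21 "slim" namespace.
NS = {'marc': 'http://www.loc.gov/MARC21/slim'}
raiz = ET.fromstring(MARCXML)

# Map (tag, subfield code) -> text for every subfield found.
campos = {}
for campo in raiz.iterfind('.//marc:datafield', NS):
    for sub in campo.iterfind('marc:subfield', NS):
        campos[(campo.get('tag'), sub.get('code'))] = sub.text

print(campos[('245', 'a')])
print(campos[('856', 'u')])
```

The notebook itself relies on pymarc for this step; the sketch only shows what the tag and subfield codes used in the extraction cell refer to.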

Importing the libraries

In [1]:
# https://pypi.org/project/pymarc/
import pymarc, re, csv
import pandas as pd
from pymarc import parse_xml_to_array

Generating a CSV output file with the content processed from the original files

In [7]:
with open('registros_marc.csv', 'w', newline='', encoding='utf-8') as csv_fichero:
    csv_salida = csv.writer(csv_fichero, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
    csv_salida.writerow(['titulo', 'autor', 'lugar_produccion', 'fecha', 'extension', 'creditos', 'materias', 'resumen', 'detalles', 'enlace'])

    # parse_xml_to_array also accepts a file path, so no file handle is left open
    registros = parse_xml_to_array('Moving-Image-Archive/Moving-Image-Archive-dataset-MARC.xml')

    for registro in registros:

        titulo = autor = lugar_produccion = fecha = extension = creditos = materias = resumen = detalles = enlace =''

        # title (245 $a, plus $b when present)
        if registro['245'] is not None:
          titulo = registro['245']['a']
          if registro['245']['b'] is not None:
            titulo = titulo + " " + registro['245']['b']

        # author: the first of 100, 110, 700 or 710 that is present
        if registro['100'] is not None:
          autor = registro['100']['a']
        elif registro['110'] is not None:
          autor = registro['110']['a']
        elif registro['700'] is not None:
          autor = registro['700']['a']
        elif registro['710'] is not None:
          autor = registro['710']['a']

        # place of production
        if registro['264'] is not None:
          lugar_produccion = registro['264']['a']

        # date
        for f in registro.get_fields('264'):
            fechas = f.get_subfields('c')
            if len(fechas):
                fecha = fechas[0]
                # drop a trailing full stop
                if fecha.endswith('.'):
                    fecha = fecha[:-1]


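        # Physical description: extent (300 $a) and other details (300 $b)
        # Separate names for the lists keep an empty list from being written to the CSV
        for f in registro.get_fields('300'):
            extensiones = f.get_subfields('a')
            if len(extensiones):
                extension = extensiones[0]
                # TODO cleaning
            otros_detalles = f.get_subfields('b')
            if len(otros_detalles):
                detalles = otros_detalles[0]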

        # credits
        if registro['508'] is not None:
          creditos = registro['508']['a']

        # summary
        if registro['520'] is not None:
          resumen = registro['520']['a']

        # subjects: first $a of every 653 field, joined with ' -- '
        materias = ' -- '.join(
            f.get_subfields('a')[0]
            for f in registro.get_fields('653')
            if f.get_subfields('a')
        )


        # link
        if registro['856'] is not None:
          enlace = registro['856']['u']


        csv_salida.writerow([titulo,autor,lugar_produccion,fecha,extension,creditos,materias,resumen,detalles,enlace])
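The cell above repeats the same "does the field exist, then read a subfield" pattern for most columns. Depending on the installed pymarc version, `registro['245']` on a missing tag may return `None` or raise `KeyError` (worth checking against your release), whereas `get_fields` and `get_subfields`, already used above, return empty lists. A minimal sketch of a helper built only on those two methods follows. `primer_subcampo` is a hypothetical name, and `CampoFalso` and `RegistroFalso` are stand-ins written for this example so it runs without the dataset; they are not pymarc classes.

```python
def primer_subcampo(registro, etiqueta, codigo, defecto=''):
    """Return the first `codigo` subfield of the first `etiqueta` field that has one."""
    for campo in registro.get_fields(etiqueta):
        valores = campo.get_subfields(codigo)
        if valores:
            return valores[0]
    return defecto


# --- Stand-ins for demonstration only (hypothetical, not part of pymarc) ---
class CampoFalso:
    def __init__(self, subcampos):
        self.subcampos = subcampos  # list of (code, value) pairs

    def get_subfields(self, *codigos):
        return [valor for codigo, valor in self.subcampos if codigo in codigos]


class RegistroFalso:
    def __init__(self, campos):
        self.campos = campos  # dict: tag -> list of fields

    def get_fields(self, *etiquetas):
        return [c for e in etiquetas for c in self.campos.get(e, [])]


registro = RegistroFalso({
    '245': [CampoFalso([('a', 'INTO THE MISTS.')])],
    '856': [CampoFalso([('u', 'http://movingimage.nls.uk/film/0004')])],
})

titulo = primer_subcampo(registro, '245', 'a')
autor = primer_subcampo(registro, '100', 'a')   # no 100 field: falls back to ''
enlace = primer_subcampo(registro, '856', 'u')
print(titulo, repr(autor), enlace)
```

With a real pymarc record, a call such as `primer_subcampo(registro, '520', 'a')` could stand in for the `if registro['520'] is not None:` blocks. Note one small difference: the helper skips fields that lack the requested subfield instead of stopping at the first field.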

Reading the CSV file

In [8]:
# This command loads the content of the file into a pandas DataFrame
df = pd.read_csv('registros_marc.csv')
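Two details of the CSV round trip are easy to miss: fields that contain commas are quoted by `csv.QUOTE_MINIMAL`, and fields left as empty strings are written as nothing at all, which `pd.read_csv` reads back as `NaN` by default. The sketch below illustrates both with the standard library only. The sample row is made up from values shown in this notebook, and an in-memory buffer replaces the real file.

```python
import csv
import io

# Title with a comma, empty author, subjects string (sample values for illustration).
fila = ['(LAST DAY OF THE TRAMS, GLASGOW).', '',
        'Transport -- Glasgow -- documentary -- amateur']

bufer = io.StringIO(newline='')
escritor = csv.writer(bufer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
escritor.writerow(fila)

texto = bufer.getvalue()
print(texto)  # the first field is quoted, the empty one leaves two adjacent commas

# Reading the same text back recovers the three original values.
bufer.seek(0)
leida = next(csv.reader(bufer))
print(leida)
```

This is why records without an author, credits or summary show up as `NaN` once the file is loaded into the DataFrame.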

Inspecting the content

In [9]:
df
Out[9]:
 | titulo | autor | lugar_produccion | fecha | extension | creditos | materias | resumen | detalles | enlace
0 | (GLASGOW TRAMS AND BOTANIC GARDENS). | RUSSELL, Stanley Livingstone | [Place of production not identified] : | 1950.0 | (2.00 mins) : | Director, [filmed by Stanley L. Russell, Thame... | Bus Stations and Depots -- Buses and Coaches, ... | The Botanic Gardens, Glasgow with shots of the... | mute, colour | http://movingimage.nls.uk/film/0001
1 | (LAST DAY OF THE TRAMS, GLASGOW). | NaN | [Place of production not identified] : | 1962.0 | (28.00 mins) : | Director, [filmed by SAAC]. | Transport -- Glasgow -- documentary -- amateur | Footage of the last trams to run in Glasgow, a... | silent, colour | http://movingimage.nls.uk/film/0002
2 | INTO THE MISTS. | NaN | [Place of production not identified] : | 1956.0 | (10.04 mins) : | Director, [filmed by W.S. Dobson]. | Ceremonies -- Emotions, Attitudes and Behaviou... | The story of the last Edinburgh tram. Shots o... | silent, colour | http://movingimage.nls.uk/film/0004
3 | PASSING OF THE TRAMCAR, the. | NaN | [Place of production not identified] : | 1962.0 | (63.36 mins) : | NaN | Ceremonies -- Transport -- Glasgow | Footage of the last tram to run in Glasgow. Th... | silent, colour | http://movingimage.nls.uk/film/0005
4 | SCOTS OF TOMORROW. | Campbell Harper Productions | [Place of production not identified] : | 1959.0 | (13.00 mins) : | Producer, Campbell Harper Films Ltd.. | Art and Artists, general -- Education -- edu... | Scottish school pupils studying scientific and... | sound, black and white | http://movingimage.nls.uk/film/0007
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
6012 | CITY OF BIRMINGHAM . | NaN | [Place of production not identified] : | 1948.0 | (6.11 mins) : | NaN | Ceremonies -- Construction and Engineering -- ... | Built and engined by John Brown & Co. Ltd. S... | silent, colour | http://movingimage.nls.uk/film/UCS0195
6013 | BUILDING THE BIG DREDGE - STAGE 1. | NaN | [Place of production not identified] : | 1964.0 | (8min20sec) : | Producer, Stephen Group Film Unit. | Construction and Engineering -- Ships and Ship... | Shots of Indonesian Sea Dredge No. 1, under co... | silent, colour | http://movingimage.nls.uk/film/UCS0204
6014 | ALEXANDER STEPHEN'S YARD. | NaN | [Place of production not identified] : | 1964.0 | (11.57 mins) : | Producer, . | Employment, Industry and Industrial Relations ... | Shots of the Alexander Stephen's yard, and the... | silent, colour | http://movingimage.nls.uk/film/UCS0207
6015 | QUEEN ELIZABETH Ship No. 552. | NaN | [Place of production not identified] : | 1940.0 | (5min24sec) : | NaN | Employment, Industry and Industrial Relations ... | Built and engineered by John Brown & Co. Ltd. ... | silent, black and white | http://movingimage.nls.uk/film/UCS0213
6016 | RUAHINE. | NaN | [Place of production not identified] : | 1951.0 | (12.26 mins) : | NaN | Carriages -- Ceremonies -- Ships and Shipping ... | Footage of "Ruahine" ship being launched and t... | silent, black and white/colour | http://movingimage.nls.uk/film/UCS0214

6017 rows × 10 columns
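The `extension` column still carries the raw catalogue strings, such as `(2.00 mins) :` and `(8min20sec) :`, which is what the `# TODO cleaning` note in the extraction cell refers to. Here is one possible cleaning step, offered as a sketch. It assumes that `63.36 mins` means 63 minutes and 36 seconds (a reading suggested by the values above, not confirmed by the source), and `extension_a_segundos` is a hypothetical helper name.

```python
import re

# Assumed formats, taken from the values visible in the table above:
#   '(2.00 mins) :'  -> minutes.seconds
#   '(8min20sec) :'  -> minutes and seconds spelled out
PATRON_MM_SS = re.compile(r'(\d+)\.(\d{2})\s*mins?')
PATRON_MIN_SEC = re.compile(r'(\d+)\s*min\s*(\d+)\s*sec')


def extension_a_segundos(texto):
    """Return the running time in seconds, or None if the format is not recognised."""
    for patron in (PATRON_MM_SS, PATRON_MIN_SEC):
        coincidencia = patron.search(texto)
        if coincidencia:
            minutos, segundos = (int(g) for g in coincidencia.groups())
            return minutos * 60 + segundos
    return None


for muestra in ['(2.00 mins) :', '(63.36 mins) :', '(8min20sec) :', 'unknown']:
    print(muestra, '->', extension_a_segundos(muestra))
```

Returning `None` for unrecognised strings keeps unexpected formats visible instead of silently turning them into zero.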

Inspecting the columns

In [5]:
df.columns
Out[5]:
Index(['titulo', 'autor', 'lugar_produccion', 'fecha', 'extension', 'creditos',
       'materias', 'resumen', 'detalles', 'enlace'],
      dtype='object')

How many records are there?

In [6]:
len(df)
Out[6]:
6017

Exploring the subjects

We create a list of subjects and sort it alphabetically

In [15]:
df['materias'][2]
Out[15]:
'Ceremonies -- Emotions, Attitudes and Behaviour -- Local Government -- Transport -- Edinburgh -- amateur'
In [16]:
df['materias'].str.split('--', expand=True).stack()
Out[16]:
0     0                  Bus Stations and Depots 
      1               Buses and Coaches, general 
      2     Celebrations, Traditions and Customs 
      3                     Children and Infants 
      4                   Leisure and Recreation 
                            ...                  
6016  0                                Carriages 
      1                               Ceremonies 
      2                       Ships and Shipping 
      3                           Dunbartonshire 
      4                                 technical
Length: 23742, dtype: object
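The split above uses `'--'` as the separator, while the following cells use `' -- '`. The difference matters: the bare `'--'` leaves the surrounding spaces attached to each subject, which is why the values in the output above have trailing blanks. A small sketch with plain Python strings, using the subjects of record 2 shown earlier:

```python
materias_registro = ('Ceremonies -- Emotions, Attitudes and Behaviour -- '
                     'Local Government -- Transport -- Edinburgh -- amateur')

con_espacios = materias_registro.split('--')    # separator without spaces
sin_espacios = materias_registro.split(' -- ')  # separator with spaces

print(repr(con_espacios[0]))  # keeps the trailing space
print(repr(sin_espacios[0]))  # clean value

# Stripping the first result gives the same list as the second split.
print([m.strip() for m in con_espacios] == sin_espacios)
```

The same applies to `Series.str.split`, so splitting on `' -- '` (or stripping afterwards) avoids counting `'Transport'` and `'Transport '` as different subjects.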
In [17]:
# Get the unique values
materias = pd.unique(df['materias'].str.split(' -- ', expand=True).stack()).tolist()
for materia in sorted(materias, key=str.lower):
    print(materia)
Aberdeen
Aberdeenshire
advertising
Agriculture
Air displays and shows
Air Raids
Aircraft see also Helicopters
Airports
amateur
Angus
Animals
animation
Architecture and Buildings
Argyllshire
Art and Artists, general  
Arts and Crafts
Ayrshire
Banff
Berwickshire
biographical
Birds
Borders
British Empire, the
Broadcasting, general
Buddhism
Bulldozers
Bus Stations and Depots
Buses and Coaches, general
Butchers and Butcher Shops
Bute
Cafeterias and Canteens
Caithness
Camping
Canals
Canoeing
Carriages
Celebrations, Traditions and Customs
Celts and Celtic Culture
Ceremonies
Cheese and Cheese Making
Children and Infants
children's
Christmas  see also New Year
cine mag
Clackmannanshire
comedy
Construction and Engineering
crime
Crime, Punishment and Law Enforcement
dance
Dentistry
Depression, the
Disillusionment
documentary
Dumfriesshire
Dunbartonshire
Dundee
East Lothian
Easter
Edinburgh
Education
educational
Emotions, Attitudes and Behaviour
Employment, Industry and Industrial Relations
Environment
ethnographic
experimental
fantasy
Ferries
Fife
Fire Service
Fish and Fishing
Fish Gutting
Fish Markets
Fishing Boats
Fishwives
Food and Drink
Forth River
Glasgow
Gorbals, the
Healthcare
Highland Games
Highlands, the
historical
Hogmanay
Holiday Camps
Home Guard
Home Life
home movies and videos
horror
Housing and Living Conditions
industrial
Inner Hebrides
Institutional Care
instructional
Invernesshire
Kincardineshire
Kinrosshire
Kirkudbrightshire
Lanarkshire
Landscapes and Seascapes
Leisure and Recreation
Lifeboats
Lobster Fishing
Local Government
local topical
Loch Ness Monster, the
Media, Communication and the Creative Industries
medical
Midlothian
Military, the
Morayshire
Music
music
Music Hall
music video
Nairn
newsreel
Orkney Islands
Outer Hebrides
Paddle Steamers
parody
Peat and Peat Cutting
Peebles- shire
Perth
Politics
Power Resources
promotional
propaganda
public information
Religion
religion
Renfrewshire
Reptiles
Reservoirs
Residential Homes for the Elderly
Restaurants
Revenge
Riding of the Marches
Rodents
romance
Ross-shire 
Roxburghshire  
Royalty
Science and Technology
science fiction
scientific
Selkirkshire
Shetland Islands
Ships and Shipping
Special Needs Education
Spinning
sponsored
Sporting Activities
sports
Spring
Stained Glass
Stirling
Stirlingshire
Sutherland
technical
television arts
television documentary
television educational
television entertainment
television news
television sport
Tourism and Travel
training
Transport
travelogue
War
War Crimes
Water and Waterways
West Lothian
Wigtownshire
women film makers
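Some entries in the printed list carry stray whitespace (for example `'Art and Artists, general  '`, `'Christmas  see also New Year'` and `'Ross-shire '`), so they would not match a cleanly typed version of the same subject. A minimal sketch of a normalisation step follows; `normalizar` is a hypothetical helper, and the sample values are copied from the list above. Case is deliberately left alone, on the assumption that pairs such as `Music` and `music` are distinct terms (a subject and a genre), which the source does not state explicitly.

```python
import re


def normalizar(materia):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r'\s+', ' ', materia).strip()


muestras = ['Art and Artists, general  ', 'Christmas  see also New Year',
            'Ross-shire ', 'Music', 'music']

# Sort case-insensitively; the second key element makes ties deterministic.
unicas = sorted({normalizar(m) for m in muestras}, key=lambda m: (m.lower(), m))
for materia in unicas:
    print(repr(materia))
```

Applied to the DataFrame, the same function could be mapped over the split values before calling `pd.unique`.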
In [18]:
df['materias']
Out[18]:
0       Bus Stations and Depots -- Buses and Coaches, ...
1          Transport -- Glasgow -- documentary -- amateur
2       Ceremonies -- Emotions, Attitudes and Behaviou...
3                      Ceremonies -- Transport -- Glasgow
4       Art and Artists, general   -- Education -- edu...
                              ...                        
6012    Ceremonies -- Construction and Engineering -- ...
6013    Construction and Engineering -- Ships and Ship...
6014    Employment, Industry and Industrial Relations ...
6015    Employment, Industry and Industrial Relations ...
6016    Carriages -- Ceremonies -- Ships and Shipping ...
Name: materias, Length: 6017, dtype: object

We can also calculate how often each subject is used

In [19]:
# Split the subjects and count the number of occurrences of each one
materia_contador = df['materias'].str.split(' -- ').apply(lambda x: pd.Series(x).value_counts()).sum().astype('int').sort_values(ascending=False).to_frame().reset_index(level=0)
# Name the columns
materia_contador.columns = ['materia', 'contador']
# Display the counts as horizontal bars
display(materia_contador.style.bar(subset=['contador'], color='#d65f5f').set_properties(subset=['contador'], **{'width': '300px'}))
materia contador
0 amateur 2023
1 Leisure and Recreation 813
2 Glasgow 797
3 documentary 707
4 Transport 674
5 Employment, Industry and Industrial Relations 632
6 television news 542
7 Edinburgh 538
8 Sporting Activities 525
9 Celebrations, Traditions and Customs 453
10 Ships and Shipping 444
11 local topical 411
12 Children and Infants 407
13 Media, Communication and the Creative Industries 399
14 educational 379
15 Ceremonies 359
16 Education 356
17 Arts and Crafts 353
18 Tourism and Travel 352
19 Construction and Engineering 303
20 Agriculture 299
21 Fish and Fishing 272
22 sponsored 270
23 Emotions, Attitudes and Behaviour 267
24 promotional 264
25 Food and Drink 262
26 newsreel 256
27 Landscapes and Seascapes 245
28 Art and Artists, general 237
29 television documentary 224
30 Lanarkshire 222
31 Ayrshire 219
32 Fife 218
33 Aberdeen 214
34 home movies and videos 203
35 Military, the 198
36 War 197
37 Animals 192
38 Renfrewshire 188
39 Home Life 183
40 Politics 182
41 Power Resources 180
42 Environment 180
43 Science and Technology 180
44 Water and Waterways 176
45 Argyllshire 171
46 Architecture and Buildings 167
47 Religion 167
48 Forth River 166
49 Dunbartonshire 166
50 Aberdeenshire 166
51 Perth 161
52 Healthcare 159
53 women film makers 158
54 Birds 148
55 advertising 134
56 Fishing Boats 128
57 Housing and Living Conditions 127
58 comedy 125
59 Highlands, the 124
60 Royalty 122
61 West Lothian 119
62 Buses and Coaches, general 118
63 animation 118
64 Dundee 117
65 industrial 114
66 Invernesshire 111
67 Dumfriesshire 107
68 Music 105
69 Borders 104
70 Carriages 103
71 Inner Hebrides 102
72 Outer Hebrides 101
73 Ferries 93
74 Stirlingshire 91
75 Orkney Islands 90
76 technical 88
77 Local Government 87
78 sports 78
79 Shetland Islands 76
80 experimental 72
81 Crime, Punishment and Law Enforcement 67
82 East Lothian 65
83 travelogue 62
84 British Empire, the 61
85 Bute 61
86 Institutional Care 59
87 Ross-shire 58
88 Paddle Steamers 58
89 instructional 57
90 Stirling 57
91 Midlothian 55
92 Roxburghshire 54
93 Celts and Celtic Culture 53
94 Morayshire 47
95 Berwickshire 47
96 Peat and Peat Cutting 47
97 Caithness 47
98 Angus 41
99 Selkirkshire 41
100 Spinning 40
101 music 40
102 propaganda 37
103 television sport 37
104 Highland Games 37
105 Camping 36
106 Fish Markets 35
107 Aircraft see also Helicopters 34
108 biographical 33
109 Cafeterias and Canteens 33
110 Canals 32
111 Banff 31
112 Riding of the Marches 31
113 Fish Gutting 31
114 Christmas see also New Year 30
115 television arts 30
116 Sutherland 30
117 Bus Stations and Depots 29
118 television educational 28
119 Restaurants 28
120 Fishwives 27
121 religion 25
122 Wigtownshire 24
123 music video 23
124 children's 20
125 Canoeing 19
126 romance 19
127 Air displays and shows 19
128 medical 18
129 Peebles- shire 18
130 Gorbals, the 18
131 Butchers and Butcher Shops 18
132 Disillusionment 18
133 Reservoirs 16
134 Lobster Fishing 16
135 television entertainment 16
136 Airports 16
137 scientific 15
138 fantasy 15
139 dance 15
140 Special Needs Education 14
141 Bulldozers 13
142 public information 13
143 historical 13
144 Kincardineshire 13
145 Kirkudbrightshire 12
146 Lifeboats 12
147 crime 12
148 ethnographic 12
149 cine mag 11
150 Loch Ness Monster, the 11
151 Holiday Camps 11
152 Rodents 11
153 Home Guard 10
154 Clackmannanshire 10
155 training 10
156 horror 9
157 Residential Homes for the Elderly 8
158 science fiction 8
159 Dentistry 8
160 Nairn 8
161 Fire Service 7
162 Music Hall 7
163 parody 7
164 Revenge 5
165 Depression, the 5
166 Air Raids 5
167 Reptiles 4
168 Kinrosshire 4
169 Spring 4
170 Broadcasting, general 3
171 Hogmanay 3
172 Buddhism 2
173 Cheese and Cheese Making 2
174 War Crimes 1
175 Easter 1
176 Stained Glass 1
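The counts above can be cross-checked without pandas. The sketch below applies `collections.Counter` to a tiny made-up sample of `materias` strings (two of them copied from the records shown earlier); `None` stands in for a record with no subjects. Unlike the cell above, it strips each value, so variants that differ only in surrounding whitespace are counted together.

```python
from collections import Counter

# Small illustrative sample, not the full dataset.
filas = [
    'Transport -- Glasgow -- documentary -- amateur',
    'Ceremonies -- Transport -- Glasgow',
    None,  # record without 653 fields (NaN in the DataFrame)
]

contador = Counter(
    materia.strip()
    for fila in filas if fila
    for materia in fila.split(' -- ')
)

for materia, veces in contador.most_common():
    print(materia, veces)
```

On the real data, something like `Counter` over `df['materias'].dropna()` should give totals comparable to `materia_contador`, apart from the whitespace variants it merges.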