Gas and Oil Production in Argentina - Dataset Processing

Libraries and configuration

In [1]:

import pandas as pd
import requests
import os

pd.set_option("display.max_columns", None)

Downloading and merging the datasets

In [2]:

datasets = [
    {"anio": 2023, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/231c39b3-e81e-4398-af8d-b115807f2c25/download/produccin-de-pozos-de-gas-y-petrleo-2023.csv"},
    {"anio": 2022, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/876b3746-85e2-4039-adeb-b1354436159f/download/produccin-de-pozos-de-gas-y-petrleo-2022.csv"},
    {"anio": 2021, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/465be754-a372-4c31-b855-81dc5fe3309f/download/produccin-de-pozos-de-gas-y-petrleo-2021.csv"},
    {"anio": 2020, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/c4a4a6a0-e75a-4e12-ae5c-54d53a70348c/download/produccin-de-pozos-de-gas-y-petrleo-2020.csv"},
    {"anio": 2019, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/8bc0d61c-0408-43d4-a7bc-7178fcb5d37e/download/produccin-de-pozos-de-gas-y-petrleo-2019.csv"},
    {"anio": 2018, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/333fd72a-9b83-4bc1-bc94-0f5940b52331/download/produccin-de-pozos-de-gas-y-petrleo-2018.csv"},
    {"anio": 2017, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/df4857e1-7c3f-4980-b5b5-184fe78bfcf0/download/produccin-de-pozos-de-gas-y-petrleo-2017.csv"},
    {"anio": 2016, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/d8539ae8-0a71-4339-a16c-139b21bd2cd0/download/produccin-de-pozos-de-gas-y-petrleo-2016.csv"},
    {"anio": 2015, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/e375aa35-fd8d-41d6-aa0c-e6879ca567a1/download/produccin-de-pozos-de-gas-y-petrleo-2015.csv"},
    {"anio": 2014, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/cd9813a7-1e19-4f60-a02a-7903dd81aff7/download/produccin-de-pozos-de-gas-y-petrleo-2014.csv"},
    {"anio": 2013, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/bc7ac8fe-2cec-4dab-acdd-322ea1ccc887/download/produccin-de-pozos-de-gas-y-petrleo-2013.csv"},
    {"anio": 2012, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/0dce0e75-1556-47ee-8615-1955fbd54ade/download/produccin-de-pozos-de-gas-y-petrleo-2012.csv"},
    {"anio": 2011, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/4817272c-7365-4bdd-b02d-75b118218b10/download/produccin-de-pozos-de-gas-y-petrleo-2011.csv"},
    {"anio": 2010, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/364ca28e-d069-4bd6-8771-925f0db152a8/download/produccin-de-pozos-de-gas-y-petrleo-2010.csv"},
    {"anio": 2009, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/48585038-055a-4437-bb1d-4fe36073f453/download/produccin-de-pozos-de-gas-y-petrleo-2009.csv"},
    {"anio": 2008, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/3430b5a8-a516-42ca-a47d-2e1ce45925fb/download/produccin-de-pozos-de-gas-y-petrleo-2008.csv"},
    {"anio": 2007, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/be663a63-f020-4e28-8f31-c5f81d47554d/download/produccin-de-pozos-de-gas-y-petrleo-2007.csv"},
    {"anio": 2006, "link": "http://datos.energia.gob.ar/dataset/c846e79c-026c-4040-897f-1ad3543b407c/resource/4e1c55e5-1f1b-4fc8-aa37-2080d9795f29/download/produccin-de-pozos-de-gas-y-petrleo-2006.csv"}
]

In [4]:

# Make sure the target folder exists before writing into it
os.makedirs("datasets", exist_ok=True)

for i in datasets:
    # A timeout keeps the loop from hanging if the server stops responding
    response = requests.get(i["link"], timeout=60)
    if response.status_code == 200:
        file_path = f"datasets/{i['anio']}.csv"
        with open(file_path, "wb") as file:
            file.write(response.content)
        print(f'CSV for year {i["anio"]} downloaded successfully.')
    else:
        print(f'Failed to download CSV for year {i["anio"]} - Status code: {response.status_code}')

CSV for year 2023 downloaded successfully.
CSV for year 2022 downloaded successfully.
CSV for year 2021 downloaded successfully.
CSV for year 2020 downloaded successfully.
CSV for year 2019 downloaded successfully.
CSV for year 2018 downloaded successfully.
CSV for year 2017 downloaded successfully.
CSV for year 2016 downloaded successfully.
CSV for year 2015 downloaded successfully.
CSV for year 2014 downloaded successfully.
CSV for year 2013 downloaded successfully.
CSV for year 2012 downloaded successfully.
CSV for year 2011 downloaded successfully.
CSV for year 2010 downloaded successfully.
CSV for year 2009 downloaded successfully.
CSV for year 2008 downloaded successfully.
CSV for year 2007 downloaded successfully.
CSV for year 2006 downloaded successfully.
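Since the yearly files are large, it may be worth skipping the ones that are already on disk when the notebook is re-run. The sketch below is one possible way to do that; the helper `pendientes` and the demo data are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def pendientes(datasets, carpeta="datasets"):
    """Return the entries of `datasets` whose CSV is not yet in `carpeta`."""
    return [
        d for d in datasets
        if not os.path.exists(os.path.join(carpeta, f"{d['anio']}.csv"))
    ]


# Self-contained demo: a temporary folder and placeholder links (assumptions)
with tempfile.TemporaryDirectory() as carpeta_demo:
    demo = [
        {"anio": 2022, "link": "placeholder"},
        {"anio": 2023, "link": "placeholder"},
    ]
    # Pretend that the 2023 file was downloaded in a previous run
    open(os.path.join(carpeta_demo, "2023.csv"), "wb").close()
    faltantes = pendientes(demo, carpeta_demo)

print([d["anio"] for d in faltantes])
```

In the notebook, the download loop could then iterate over `pendientes(datasets)` instead of `datasets`.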

Selecting the relevant columns and their data types:

In [2]:

columnas_a_leer = {
    "prod_pet": float,
    "prod_gas": float,
    "prod_agua": float,
    "tef": float,
    "tipoextraccion": str,
    "tipoestado": str,
    "tipopozo": str,
    "empresa": str,
    "sigla": str,
    "formacion": str,
    "profundidad": float,
    "areayacimiento": str,
    "cuenca": str,
    "provincia": str,
    "tipo_de_recurso": str,
    "clasificacion": str,
    "fecha_data": str
}
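To illustrate what this dictionary does once it is passed to `pd.read_csv`, here is a minimal sketch on a made-up in-memory CSV. The keys select which columns are read (`usecols`) and the values set their types (`dtype`). The sample rows and the name `columna_extra` are invented for the example.

```python
import io

import pandas as pd

# Made-up sample with one column we do not need ("columna_extra")
csv_demo = io.StringIO(
    "prod_pet,columna_extra,provincia,fecha_data\n"
    "85.5,x,Neuquén,2023-01-31\n"
    "0,y,Chubut,2023-01-31\n"
)

columnas_demo = {"prod_pet": float, "provincia": str, "fecha_data": str}

# Only the dictionary keys are parsed; the extra column is never loaded
df_demo = pd.read_csv(csv_demo, usecols=list(columnas_demo), dtype=columnas_demo)
print(df_demo.dtypes)
```

Note that `usecols` keeps the column order of the file, not the order of the dictionary.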

Building the dataframe:

In [3]:

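# Keep only the per-year files (e.g. "2023.csv"); other CSVs saved to this
# folder later in the notebook would otherwise break a re-run of this cell
archivos = [i for i in os.listdir("datasets") if i.endswith(".csv") and i[:-4].isdigit()]

df = pd.concat(
    pd.read_csv(os.path.join("datasets", i), header=0, usecols=list(columnas_a_leer), dtype=columnas_a_leer)
    for i in archivos
)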

In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15006994 entries, 0 to 322869
Data columns (total 17 columns):
 #   Column           Dtype  
---  ------           -----  
 0   prod_pet         float64
 1   prod_gas         float64
 2   prod_agua        float64
 3   tef              float64
 4   tipoextraccion   object 
 5   tipoestado       object 
 6   tipopozo         object 
 7   empresa          object 
 8   sigla            object 
 9   profundidad      float64
 10  formacion        object 
 11  areayacimiento   object 
 12  cuenca           object 
 13  provincia        object 
 14  tipo_de_recurso  object 
 15  clasificacion    object 
 16  fecha_data       object 
dtypes: float64(5), object(12)
memory usage: 2.0+ GB
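The `+` in the memory figure means it is a lower bound: for `object` columns pandas counts only the 8-byte references by default, not the strings they point to. A small sketch on made-up data shows the difference between the shallow and deep measurements.

```python
import pandas as pd

# Made-up rows; dtype=object mirrors the string columns in the output above
df_demo = pd.DataFrame({
    "prod_pet": [85.0, 0.0, 0.0],
    "provincia": pd.Series(["Neuquén", "Chubut", "Neuquén"], dtype=object),
})

# Shallow: object columns are counted as one pointer per row
superficial = df_demo.memory_usage(deep=False).sum()
# Deep: the Python string objects themselves are measured too
profundo = df_demo.memory_usage(deep=True).sum()

print(superficial, profundo)
```

On the full dataframe, `df.info(memory_usage="deep")` reports the deep figure, at the cost of a slower call.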

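Processing the dataframe:

As a first step, to reduce the size of the dataframe, we build one lookup table per categorical column. Each lookup table holds the distinct values of that column together with an integer id, and the main table keeps only the id instead of the repeated text.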

In [5]:

columnas_tablas = [
    "tipoextraccion",
    "tipoestado",
    "tipopozo",
    "empresa",
    "formacion",
    "areayacimiento",
    "cuenca",
    "provincia",
    "tipo_de_recurso",
    "clasificacion",
    "sigla",
    "fecha_data"
]

In [6]:

tablas = {}

for i in columnas_tablas:
    # Lookup table: one row per distinct value, with a sequential integer id
    categorias_unicas = df[i].unique()
    tablas[i] = pd.DataFrame({i: categorias_unicas, i + "_id": range(len(categorias_unicas))})
    # Attach the id to the main table and drop the original text column
    df = df.merge(tablas[i], on=i, how="left")
    df = df.drop(columns=i)
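Running one merge per column over roughly 15 million rows can be slow. As a possible alternative, `pd.factorize` returns the integer codes and the distinct values in a single pass. The sketch below uses made-up data and hypothetical names (`df_demo`, `tablas_demo`). One difference to keep in mind: by default `pd.factorize` assigns `-1` to missing values and leaves them out of the distinct values, whereas the `unique()` approach gives NaN its own id.

```python
import pandas as pd

# Made-up rows standing in for the main table
df_demo = pd.DataFrame({
    "provincia": ["Neuquén", "Chubut", "Neuquén", "Santa Cruz"],
    "prod_pet": [85.0, 0.0, 12.0, 3.0],
})

tablas_demo = {}
for columna in ["provincia"]:
    # codigos: one integer per row; unicos: distinct values in order of appearance
    codigos, unicos = pd.factorize(df_demo[columna])
    tablas_demo[columna] = pd.DataFrame(
        {columna: unicos, columna + "_id": range(len(unicos))}
    )
    df_demo[columna + "_id"] = codigos
    df_demo = df_demo.drop(columns=columna)

print(df_demo)
print(tablas_demo["provincia"])
```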

In [7]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15006994 entries, 0 to 15006993
Data columns (total 17 columns):
 #   Column              Dtype  
---  ------              -----  
 0   prod_pet            float64
 1   prod_gas            float64
 2   prod_agua           float64
 3   tef                 float64
 4   profundidad         float64
 5   tipoextraccion_id   int64  
 6   tipoestado_id       int64  
 7   tipopozo_id         int64  
 8   empresa_id          int64  
 9   formacion_id        int64  
 10  areayacimiento_id   int64  
 11  cuenca_id           int64  
 12  provincia_id        int64  
 13  tipo_de_recurso_id  int64  
 14  clasificacion_id    int64  
 15  sigla_id            int64  
 16  fecha_data_id       int64  
dtypes: float64(5), int64(12)
memory usage: 2.0 GB

Next, we round the floating-point values and convert them to integers:

In [11]:

columnas_a_redondear = [
    "prod_pet",
    "prod_gas",
    "prod_agua",
    "tef",
    "profundidad"
]

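for i in columnas_a_redondear:
    # Explicit 32-bit integers, so the result does not depend on the platform default;
    # this assumes the columns have no missing values (astype would raise otherwise)
    df[i] = df[i].round().astype("int32")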

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15006994 entries, 0 to 15006993
Data columns (total 17 columns):
 #   Column              Dtype
---  ------              -----
 0   prod_pet            int32
 1   prod_gas            int32
 2   prod_agua           int32
 3   tef                 int32
 4   profundidad         int32
 5   tipoextraccion_id   int64
 6   tipoestado_id       int64
 7   tipopozo_id         int64
 8   empresa_id          int64
 9   formacion_id        int64
 10  areayacimiento_id   int64
 11  cuenca_id           int64
 12  provincia_id        int64
 13  tipo_de_recurso_id  int64
 14  clasificacion_id    int64
 15  sigla_id            int64
 16  fecha_data_id       int64
dtypes: int32(5), int64(12)
memory usage: 1.7 GB
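In the output above the id columns are still `int64`, even though most of them hold only a handful of distinct values. A possible further reduction, not applied in this notebook, is to downcast them with `pd.to_numeric`. The sketch below uses made-up ids to show how the chosen dtype depends on the largest value.

```python
import numpy as np
import pandas as pd

# Made-up id columns: one with few categories, one with many
df_demo = pd.DataFrame({
    "provincia_id": np.array([0, 1, 0, 2], dtype="int64"),
    "sigla_id": np.array([0, 1, 2, 70000], dtype="int64"),
})

antes = df_demo.memory_usage(index=False).sum()

for columna in df_demo.columns:
    # Picks the smallest unsigned integer type that fits every value
    df_demo[columna] = pd.to_numeric(df_demo[columna], downcast="unsigned")

despues = df_demo.memory_usage(index=False).sum()

print(df_demo.dtypes)
print(antes, despues)
```

This only affects memory usage; the size of the CSV written to disk does not depend on the integer width.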

We then save the main dataframe and every lookup table:

In [8]:

df.to_csv("datasets/produccion.csv", index=False)
for i in tablas:
    tablas[i].to_csv(f"datasets/{i}.csv", index=False)

In [13]:

df.head()

Out[13]:

	prod_pet	prod_gas	prod_agua	tef	profundidad	tipoextraccion_id	tipoestado_id	tipopozo_id	areayacimiento_id	clasificacion_id	sigla_id
0	85	1980	8	15	2185	0	0	0	0	0	0
1	0	0	0	0	2530	0	1	0	1	0	1
2	0	0	0	0	2058	0	1	0	0	0	2
3	0	0	0	0	2034	1	2	1	0	0	3
4	0	0	0	0	1920	0	1	0	0	1	4
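The id columns can be turned back into readable values by joining the main table with the matching lookup table. A minimal sketch with made-up frames follows; in practice the two frames would come from files such as `datasets/produccion.csv` and `datasets/provincia.csv`, following the naming used when saving.

```python
import pandas as pd

# Made-up stand-ins for the main table and one lookup table
produccion_demo = pd.DataFrame({"prod_pet": [85, 0, 12], "provincia_id": [0, 1, 0]})
provincia_demo = pd.DataFrame({"provincia": ["Neuquén", "Chubut"], "provincia_id": [0, 1]})

# A left join keeps every production row and adds the label for its id
legible = produccion_demo.merge(provincia_demo, on="provincia_id", how="left")
print(legible)
```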

Checking the on-disk size of each column (in MB):

In [9]:

def obtener_tamanio_columnas_csv(ruta_archivo):
    """Return a dict mapping each column of the CSV to its size in MB when saved alone."""
    # Read the CSV file with pandas
    dataframe = pd.read_csv(ruta_archivo)
    
    # Write each column to a temporary CSV and measure its size on disk
    tamanios = {}
    for columna in dataframe.columns:
        ruta_temporal = 'temp.csv'
        dataframe[columna].to_csv(ruta_temporal, index=False)
        tamanio_bytes = os.path.getsize(ruta_temporal)
        tamanio_mb = tamanio_bytes / (1024 * 1024)  # Convert bytes to MB
        tamanios[columna] = tamanio_mb
        
        # Delete the temporary file
        os.remove(ruta_temporal)
    
    return tamanios

display(obtener_tamanio_columnas_csv("datasets/produccion.csv"))

{'prod_pet': 85.86493682861328,
 'prod_gas': 81.28928089141846,
 'prod_agua': 88.31593704223633,
 'tef': 83.22790718078613,
 'profundidad': 106.83516883850098,
 'tipoextraccion_id': 43.06702518463135,
 'tipoestado_id': 43.656578063964844,
 'tipopozo_id': 42.93544578552246,
 'empresa_id': 50.04568958282471,
 'formacion_id': 49.84333515167236,
 'areayacimiento_id': 65.66456317901611,
 'cuenca_id': 42.93563938140869,
 'provincia_id': 43.344736099243164,
 'tipo_de_recurso_id': 42.935373306274414,
 'clasificacion_id': 42.93537139892578,
 'sigla_id': 97.9739236831665,
 'fecha_data_id': 64.64103317260742}
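The same measurement can be approximated without a temporary file by serializing each column to a string and counting its encoded bytes. This is only a sketch: the function name `tamanio_columnas_en_memoria` and the sample frame are hypothetical, and the figures may differ slightly from the on-disk ones depending on the line terminator used.

```python
import pandas as pd


def tamanio_columnas_en_memoria(dataframe):
    """Return a dict mapping each column to its CSV size in MB, computed in memory."""
    tamanios = {}
    for columna in dataframe.columns:
        # Without a path, to_csv returns the CSV text instead of writing a file
        texto = dataframe[columna].to_csv(index=False)
        tamanios[columna] = len(texto.encode("utf-8")) / (1024 * 1024)
    return tamanios


# Made-up sample frame
df_demo = pd.DataFrame({"prod_pet": [85, 0, 0], "sigla_id": [0, 1, 2]})
tamanios_demo = tamanio_columnas_en_memoria(df_demo)
print(tamanios_demo)
```

Each column is held as one string in memory while it is measured, so for very large columns the temporary-file version may be the safer choice.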