Notebook

Load and connect data¶

In [4]:

%pip install mysql-connector-python

Collecting mysql-connector-python
  Downloading mysql_connector_python-8.0.29-py2.py3-none-any.whl (342 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 342.0/342.0 kB 2.1 MB/s eta 0:00:0000:0100:01
Collecting protobuf>=3.0.0
  Downloading protobuf-4.21.1-cp37-abi3-manylinux2014_aarch64.whl (403 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 403.3/403.3 kB 7.2 MB/s eta 0:00:0000:01
Installing collected packages: protobuf, mysql-connector-python
Successfully installed mysql-connector-python-8.0.29 protobuf-4.21.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Note: you may need to restart the kernel to use updated packages.

In [70]:

import mysql.connector
import os
import pandas as pd


def query(sql):
    mydb = mysql.connector.connect(
      host='mage-development.cxj4djmtpwkx.us-west-2.rds.amazonaws.com',
      user='root',
      password=os.getenv('DB_PASSWORD'),
      database='materia_development',
    )
    mycursor = mydb.cursor()

    mycursor.execute(sql)

    return mycursor.fetchall()


def query_table(table_name):
    columns = [r[0] for r in query(f'DESCRIBE {table_name}')]
    rows = query(f'SELECT * FROM {table_name}')
    
    return pd.DataFrame(rows, columns=columns)

In [71]:

', '.join([r[0] for r in query('SHOW TABLES')])

Out[71]:

'auth_group, auth_group_permissions, auth_permission, auth_user, auth_user_groups, auth_user_user_permissions, authtoken_token, corsheaders_corsmodel, data_cleaning_pipeline, data_management_dataconnector, data_management_datadestination, data_management_datasource, data_management_entity, data_management_feature, data_management_featureset, data_management_featuresetsource, data_management_featuresetsourceoperation, data_management_featuresetversion, data_management_featuresetversionjoin, data_management_transformeraction, django_admin_log, django_content_type, django_migrations, django_session, django_site, experimentation_experiment, inference_predictionoutcome, inference_predictionresult, lifecycle_management_trainingfeature, lifecycle_management_trainingfeatureset, lifecycle_management_trainingset, model_with_versions, model_with_versions_from_staff, oauth2_provider_accesstoken, oauth2_provider_application, oauth2_provider_grant, oauth2_provider_refreshtoken, prototyping_custompredictionvalue, prototyping_model, prototyping_modelcollection, prototyping_modelversion, prototyping_trainingrun, suggestion_management_suggestion, suggestion_suggestion, users_group, users_groupapplication, users_groupmembership, users_invitation, users_userprofile'

In [78]:

import mage_ai

# df = query_table('materia_development.model_with_versions')
# mage_ai.connect_data(df, 'model_with_versions')

df = query_table('materia_development.features_with_feature_sets')
mage_ai.connect_data(df, 'features_with_feature_sets')

Out[78]:

<FeatureSet 35> Cleaning suggestions:
1. Remove columns with no values(['partition_filter_version']): Remove columns with no values to increase data quality.
2. Remove columns with high empty rate(['deleted_at_feature', 'window_feature', 'transformer_action_id_feature', 'deleted_at_version', 'parent_feature_set_version_id_version', 'reference_feature_set_version_id_version', 'insights_analyzed_at_version', 'deleted_at_feature_set', 'description_feature_set', 'source_type_feature_set']): Remove columns with many missing values may increase data quality.
3. Remove columns with single value(['structure_type_version']): Remove columns with a single unique value to reduce the amount of redundant data.
4. Fix syntax errors(['username_user']): Fix syntactical errors to reduce the amount of noise in the data.
5. Reformat values(['username_user', 'email_user', 'uuid_feature', 'name_feature', 'description_feature', 'name_feature_set']): Format entries in these columns as fully lowercase to improve data quality.
6. Fill in missing values(['deleted_at_feature', 'window_feature', 'transformed_feature_id_feature', 'transformer_action_id_feature', 'source_feature_id_feature', 'deleted_at_version', 'entity_feature_id_version', 'timestamp_feature_id_version', 'statistics_calculated_at_version', 'parent_feature_set_version_id_version', 'reference_feature_set_version_id_version', 'model_version_id_version', 'partition_filter_version', 'insights_analyzed_at_version', 'deleted_at_feature_set', 'description_feature_set', 'model_id_feature_set', 'source_type_feature_set']): Fill missing values with a placeholder to mark them as missing.
7. Fill in missing values(['user_id_feature', 'user_id_version']): For each column, fill missing entries with the median value.
8. Fill in missing values(['description_feature', 'operation_feature']): Fill missing entries using the previously occurring entry in the timeseries.
9. Remove outliers(['id_user']): Remove 633 outlier(s) to reduce the amount of noise in this column.
10. Remove outliers(['user_id_feature']): Remove 341 outlier(s) to reduce the amount of noise in this column.
11. Remove outliers(['transformer_action_id_feature']): Remove 6 outlier(s) to reduce the amount of noise in this column.
12. Remove outliers(['source_feature_id_feature']): Remove 28 outlier(s) to reduce the amount of noise in this column.
13. Remove outliers(['entity_feature_id_version']): Remove 428 outlier(s) to reduce the amount of noise in this column.
14. Remove outliers(['timestamp_feature_id_version']): Remove 313 outlier(s) to reduce the amount of noise in this column.
15. Remove outliers(['user_id_version']): Remove 633 outlier(s) to reduce the amount of noise in this column.
16. Remove outliers(['user_id_feature_set']): Remove 633 outlier(s) to reduce the amount of noise in this column.

Clean data using pipeline created from app¶

In [ ]:

mage_ai.clean(df, pipeline_uuid=1)