%pip install mysql-connector-python
Collecting mysql-connector-python Downloading mysql_connector_python-8.0.29-py2.py3-none-any.whl (342 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 342.0/342.0 kB 2.1 MB/s eta 0:00:0000:0100:01 Collecting protobuf>=3.0.0 Downloading protobuf-4.21.1-cp37-abi3-manylinux2014_aarch64.whl (403 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 403.3/403.3 kB 7.2 MB/s eta 0:00:0000:01 Installing collected packages: protobuf, mysql-connector-python Successfully installed mysql-connector-python-8.0.29 protobuf-4.21.1 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv Note: you may need to restart the kernel to use updated packages.
import mysql.connector
import os
import pandas as pd
def query(sql):
mydb = mysql.connector.connect(
host='mage-development.cxj4djmtpwkx.us-west-2.rds.amazonaws.com',
user='root',
password=os.getenv('DB_PASSWORD'),
database='materia_development',
)
mycursor = mydb.cursor()
mycursor.execute(sql)
return mycursor.fetchall()
def query_table(table_name):
columns = [r[0] for r in query(f'DESCRIBE {table_name}')]
rows = query(f'SELECT * FROM {table_name}')
return pd.DataFrame(rows, columns=columns)
', '.join([r[0] for r in query('SHOW TABLES')])
'auth_group, auth_group_permissions, auth_permission, auth_user, auth_user_groups, auth_user_user_permissions, authtoken_token, corsheaders_corsmodel, data_cleaning_pipeline, data_management_dataconnector, data_management_datadestination, data_management_datasource, data_management_entity, data_management_feature, data_management_featureset, data_management_featuresetsource, data_management_featuresetsourceoperation, data_management_featuresetversion, data_management_featuresetversionjoin, data_management_transformeraction, django_admin_log, django_content_type, django_migrations, django_session, django_site, experimentation_experiment, inference_predictionoutcome, inference_predictionresult, lifecycle_management_trainingfeature, lifecycle_management_trainingfeatureset, lifecycle_management_trainingset, model_with_versions, model_with_versions_from_staff, oauth2_provider_accesstoken, oauth2_provider_application, oauth2_provider_grant, oauth2_provider_refreshtoken, prototyping_custompredictionvalue, prototyping_model, prototyping_modelcollection, prototyping_modelversion, prototyping_trainingrun, suggestion_management_suggestion, suggestion_suggestion, users_group, users_groupapplication, users_groupmembership, users_invitation, users_userprofile'
import mage_ai
# df = query_table('materia_development.model_with_versions')
# mage_ai.connect_data(df, 'model_with_versions')
df = query_table('materia_development.features_with_feature_sets')
mage_ai.connect_data(df, 'features_with_feature_sets')
<FeatureSet 35> Cleaning suggestions: 1. Remove columns with no values(['partition_filter_version']): Remove columns with no values to increase data quality. 2. Remove columns with high empty rate(['deleted_at_feature', 'window_feature', 'transformer_action_id_feature', 'deleted_at_version', 'parent_feature_set_version_id_version', 'reference_feature_set_version_id_version', 'insights_analyzed_at_version', 'deleted_at_feature_set', 'description_feature_set', 'source_type_feature_set']): Remove columns with many missing values may increase data quality. 3. Remove columns with single value(['structure_type_version']): Remove columns with a single unique value to reduce the amount of redundant data. 4. Fix syntax errors(['username_user']): Fix syntactical errors to reduce the amount of noise in the data. 5. Reformat values(['username_user', 'email_user', 'uuid_feature', 'name_feature', 'description_feature', 'name_feature_set']): Format entries in these columns as fully lowercase to improve data quality. 6. Fill in missing values(['deleted_at_feature', 'window_feature', 'transformed_feature_id_feature', 'transformer_action_id_feature', 'source_feature_id_feature', 'deleted_at_version', 'entity_feature_id_version', 'timestamp_feature_id_version', 'statistics_calculated_at_version', 'parent_feature_set_version_id_version', 'reference_feature_set_version_id_version', 'model_version_id_version', 'partition_filter_version', 'insights_analyzed_at_version', 'deleted_at_feature_set', 'description_feature_set', 'model_id_feature_set', 'source_type_feature_set']): Fill missing values with a placeholder to mark them as missing. 7. Fill in missing values(['user_id_feature', 'user_id_version']): For each column, fill missing entries with the median value. 8. Fill in missing values(['description_feature', 'operation_feature']): Fill missing entries using the previously occurring entry in the timeseries. 9. Remove outliers(['id_user']): Remove 633 outlier(s) to reduce the amount of noise in this column. 10. Remove outliers(['user_id_feature']): Remove 341 outlier(s) to reduce the amount of noise in this column. 11. Remove outliers(['transformer_action_id_feature']): Remove 6 outlier(s) to reduce the amount of noise in this column. 12. Remove outliers(['source_feature_id_feature']): Remove 28 outlier(s) to reduce the amount of noise in this column. 13. Remove outliers(['entity_feature_id_version']): Remove 428 outlier(s) to reduce the amount of noise in this column. 14. Remove outliers(['timestamp_feature_id_version']): Remove 313 outlier(s) to reduce the amount of noise in this column. 15. Remove outliers(['user_id_version']): Remove 633 outlier(s) to reduce the amount of noise in this column. 16. Remove outliers(['user_id_feature_set']): Remove 633 outlier(s) to reduce the amount of noise in this column.
mage_ai.clean(df, pipeline_uuid=1)