PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that shortens the experiment cycle considerably and makes you more productive.
Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with just a few. This makes experiments much faster and more efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and many more.
The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen data scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more expertise. Seasoned data scientists are often difficult to find and expensive to hire, but citizen data scientists can be an effective way to close this gap and address data-related challenges in a business setting.
Official Website: https://www.pycaret.org
Documentation: https://pycaret.readthedocs.io/en/latest/
Installing PyCaret is easy and takes only a few minutes. We strongly recommend using a virtual environment to avoid potential conflicts with other libraries. PyCaret's default installation is a slim version that installs only the hard dependencies listed in requirements.txt. To install the default version:
pip install pycaret
When you install the full version of PyCaret, all the optional dependencies are installed as well. To install the full version:
pip install pycaret[full]
# check version
from pycaret.utils import version
version()
'2.3.10'
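The version can also be checked without importing PyCaret itself, which is handy when verifying a fresh virtual environment. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of PyCaret.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed PyCaret version, or None when PyCaret is not installed.
print(installed_version("pycaret"))
```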
from pycaret.datasets import get_data
data = get_data('insurance')
| | age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
data.shape
(1338, 7)
from pycaret.regression import *
# silent=True skips the interactive data-type confirmation prompt (PyCaret 2.x)
s = setup(data, target = 'charges', session_id = 123, silent=True)
| | Description | Value |
---|---|---|
0 | session_id | 123 |
1 | Target | charges |
2 | Original Data | (1338, 7) |
3 | Missing Values | False |
4 | Numeric Features | 2 |
5 | Categorical Features | 4 |
6 | Ordinal Features | False |
7 | High Cardinality Features | False |
8 | High Cardinality Method | None |
9 | Transformed Train Set | (936, 14) |
10 | Transformed Test Set | (402, 14) |
11 | Shuffle Train-Test | True |
12 | Stratify Train-Test | False |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | 9323 |
20 | Imputation Type | simple |
21 | Iterative Imputation Iteration | None |
22 | Numeric Imputer | mean |
23 | Iterative Imputation Numeric Model | None |
24 | Categorical Imputer | constant |
25 | Iterative Imputation Categorical Model | None |
26 | Unknown Categoricals Handling | least_frequent |
27 | Normalize | False |
28 | Normalize Method | None |
29 | Transformation | False |
30 | Transformation Method | None |
31 | PCA | False |
32 | PCA Method | None |
33 | PCA Components | None |
34 | Ignore Low Variance | False |
35 | Combine Rare Levels | False |
36 | Rare Level Threshold | None |
37 | Numeric Binning | False |
38 | Remove Outliers | False |
39 | Outliers Threshold | None |
40 | Remove Multicollinearity | False |
41 | Multicollinearity Threshold | None |
42 | Remove Perfect Collinearity | True |
43 | Clustering | False |
44 | Clustering Iteration | None |
45 | Polynomial Features | False |
46 | Polynomial Degree | None |
47 | Trignometry Features | False |
48 | Polynomial Threshold | None |
49 | Group Features | False |
50 | Feature Selection | False |
51 | Feature Selection Method | classic |
52 | Features Selection Threshold | None |
53 | Feature Interaction | False |
54 | Feature Ratio | False |
55 | Interaction Threshold | None |
56 | Transform Target | False |
57 | Transform Target Method | box-cox |
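The Transformed Train Set and Transformed Test Set rows follow from the default 70/30 hold-out split applied to the 1338 original rows. The sketch below reproduces the row counts under one assumption: the test-set size is rounded up, which is how scikit-learn's train_test_split treats a fractional test size. The helper split_sizes is hypothetical.

```python
import math
from typing import Tuple


def split_sizes(n_rows: int, train_size: float = 0.7) -> Tuple[int, int]:
    """Return (n_train, n_test) for a hold-out split.

    Assumption: the test-set size is rounded up and the training set
    receives the remaining rows.
    """
    n_test = math.ceil(n_rows * (1 - train_size))
    return n_rows - n_test, n_test


n_train, n_test = split_sizes(1338)
print(n_train, n_test)  # 936 402
```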
# check transformed X_train
get_config('X_train')
| | age | bmi | sex_female | children_0 | children_1 | children_2 | children_3 | children_4 | children_5 | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 36.0 | 27.549999 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
904 | 60.0 | 35.099998 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
670 | 30.0 | 31.570000 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
617 | 49.0 | 25.600000 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
373 | 26.0 | 32.900002 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1238 | 37.0 | 22.705000 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
1147 | 20.0 | 31.920000 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
106 | 19.0 | 28.400000 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1041 | 18.0 | 23.084999 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
1122 | 53.0 | 36.860001 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
936 rows × 14 columns
# list columns of transformed X_train
get_config('X_train').columns
Index(['age', 'bmi', 'sex_female', 'children_0', 'children_1', 'children_2', 'children_3', 'children_4', 'children_5', 'smoker_yes', 'region_northeast', 'region_northwest', 'region_southeast', 'region_southwest'], dtype='object')
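The 14 transformed columns come from one-hot encoding the four categorical features and keeping the two numeric ones. The sketch below counts the expected columns. The encoding rule it uses (a single column for a two-level feature, one column per level otherwise) is inferred from the column names above rather than taken from the library's documentation.

```python
# Distinct levels per categorical feature in the insurance data.
categorical_levels = {
    "sex": ["female", "male"],
    "children": [0, 1, 2, 3, 4, 5],
    "smoker": ["no", "yes"],
    "region": ["northeast", "northwest", "southeast", "southwest"],
}
numeric_features = ["age", "bmi"]


def encoded_width(levels):
    """Columns produced for one feature: one if binary, else one per level."""
    return 1 if len(levels) == 2 else len(levels)


n_columns = len(numeric_features) + sum(
    encoded_width(levels) for levels in categorical_levels.values()
)
print(n_columns)  # 14
```

Note that children is treated as categorical here because it takes only six distinct values, which matches the "Numeric Features: 2" and "Categorical Features: 4" rows in the setup summary.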
# train all models using default hyperparameters
best = compare_models()
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
---|---|---|---|---|---|---|---|---|
gbr | Gradient Boosting Regressor | 2702.7680 | 23242056.4409 | 4801.5704 | 0.8348 | 0.4397 | 0.3113 | 0.0310 |
rf | Random Forest Regressor | 2736.7455 | 24862762.2305 | 4970.6959 | 0.8213 | 0.4674 | 0.3294 | 0.1340 |
catboost | CatBoost Regressor | 2865.4462 | 25334554.7726 | 5017.8574 | 0.8193 | 0.4774 | 0.3419 | 0.5570 |
lightgbm | Light Gradient Boosting Machine | 2959.5584 | 25236477.0456 | 5013.0892 | 0.8171 | 0.5427 | 0.3685 | 0.1040 |
ada | AdaBoost Regressor | 4162.2323 | 28328260.0955 | 5316.6146 | 0.7985 | 0.6349 | 0.7263 | 0.0100 |
et | Extra Trees Regressor | 2814.2964 | 28815493.0260 | 5339.0879 | 0.7964 | 0.4889 | 0.3350 | 0.1280 |
xgboost | Extreme Gradient Boosting | 3302.3215 | 31739266.6000 | 5615.5941 | 0.7701 | 0.5661 | 0.4218 | 0.2300 |
llar | Lasso Least Angle Regression | 4315.7895 | 38355976.5080 | 6173.8740 | 0.7311 | 0.6105 | 0.4415 | 0.0060 |
ridge | Ridge Regression | 4336.2309 | 38381496.8000 | 6175.9541 | 0.7309 | 0.6193 | 0.4454 | 0.0060 |
lasso | Lasso Regression | 4323.0688 | 38375137.8000 | 6175.3801 | 0.7308 | 0.6140 | 0.4431 | 0.0100 |
br | Bayesian Ridge | 4333.6881 | 38381669.3629 | 6175.9476 | 0.7308 | 0.6151 | 0.4450 | 0.0060 |
lr | Linear Regression | 4323.6136 | 38380061.2000 | 6175.7164 | 0.7308 | 0.6175 | 0.4432 | 0.8810 |
lar | Least Angle Regression | 4450.2675 | 39682987.5896 | 6267.8924 | 0.7232 | 0.6467 | 0.4693 | 0.0070 |
dt | Decision Tree Regressor | 3148.3402 | 43766011.6491 | 6584.7198 | 0.6855 | 0.5331 | 0.3455 | 0.0070 |
huber | Huber Regressor | 3455.2997 | 48908984.4059 | 6971.2642 | 0.6545 | 0.4790 | 0.2174 | 0.0160 |
omp | Orthogonal Matching Pursuit | 5754.7768 | 57503216.4290 | 7566.7093 | 0.5997 | 0.7418 | 0.8990 | 0.0050 |
par | Passive Aggressive Regressor | 4164.7843 | 61324373.4835 | 7747.8332 | 0.5840 | 0.4724 | 0.2586 | 0.0070 |
en | Elastic Net | 7369.0573 | 90443346.8000 | 9468.6782 | 0.3791 | 0.7377 | 0.9256 | 0.0060 |
knn | K Neighbors Regressor | 7805.8425 | 126951808.0000 | 11221.6535 | 0.1218 | 0.8398 | 0.9147 | 0.0090 |
dummy | Dummy Regressor | 9192.5418 | 148516792.8000 | 12132.4733 | -0.0175 | 1.0154 | 1.5637 | 0.0060 |
print(best)
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None, presort='deprecated', random_state=123, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False)
type(best)
sklearn.ensemble._gb.GradientBoostingRegressor
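The comparison table is ordered by R2, and compare_models returns the model in the top row. The choice of metric matters: the sketch below re-ranks four rows copied from the table by R2 and by RMSE, and the order of the third and fourth models changes. In PyCaret the ranking metric can be changed through the sort parameter of compare_models; the code below is plain Python and does not call PyCaret.

```python
# (model id, R2, RMSE) for four rows of the comparison table above.
results = [
    ("rf", 0.8213, 4970.6959),
    ("gbr", 0.8348, 4801.5704),
    ("lightgbm", 0.8171, 5013.0892),
    ("catboost", 0.8193, 5017.8574),
]

by_r2 = [row[0] for row in sorted(results, key=lambda row: row[1], reverse=True)]
by_rmse = [row[0] for row in sorted(results, key=lambda row: row[2])]

print(by_r2)    # ['gbr', 'rf', 'catboost', 'lightgbm']
print(by_rmse)  # ['gbr', 'rf', 'lightgbm', 'catboost']
```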
# train individual model
dt = create_model('dt')
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
---|---|---|---|---|---|---|
0 | 3001.2294 | 37001480.2590 | 6082.8842 | 0.7790 | 0.4984 | 0.3140 |
1 | 3389.8885 | 49305179.5732 | 7021.7647 | 0.7133 | 0.5574 | 0.3361 |
2 | 2926.0191 | 42025684.6666 | 6482.7220 | 0.4679 | 0.6215 | 0.4025 |
3 | 2744.7144 | 34078761.4507 | 5837.7017 | 0.7154 | 0.5412 | 0.3740 |
4 | 3924.4816 | 59489464.3207 | 7712.9414 | 0.5575 | 0.6455 | 0.4796 |
5 | 3322.5435 | 42747575.4453 | 6538.1630 | 0.7250 | 0.4869 | 0.2928 |
6 | 3158.7047 | 49369669.1652 | 7026.3553 | 0.6641 | 0.4511 | 0.3089 |
7 | 2405.2970 | 31318616.6440 | 5596.3038 | 0.8278 | 0.4497 | 0.1434 |
8 | 3021.5461 | 39091793.3775 | 6252.3430 | 0.7475 | 0.5117 | 0.4381 |
9 | 3588.9772 | 53231891.5889 | 7296.0189 | 0.6571 | 0.5679 | 0.3653 |
Mean | 3148.3402 | 43766011.6491 | 6584.7198 | 0.6855 | 0.5331 | 0.3455 |
Std | 410.7953 | 8481549.4829 | 638.3390 | 0.1005 | 0.0631 | 0.0878 |
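The Mean and Std rows summarize the ten folds. The sketch below recomputes them for the MAE column from the fold values in the table. The use of the population standard deviation (dividing by n rather than n - 1) is inferred from the fact that it reproduces the reported figure.

```python
import statistics

# MAE per fold, copied from the decision tree table above.
fold_mae = [3001.2294, 3389.8885, 2926.0191, 2744.7144, 3924.4816,
            3322.5435, 3158.7047, 2405.2970, 3021.5461, 3588.9772]

mean_mae = statistics.fmean(fold_mae)
std_mae = statistics.pstdev(fold_mae)  # population std: divides by n

print(round(mean_mae, 4), round(std_mae, 4))
```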
print(dt)
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=123, splitter='best')
%%time
# tune hyperparameters of model
tuned_dt = tune_model(dt)
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
---|---|---|---|---|---|---|
0 | 1710.0867 | 18253568.8962 | 4272.4196 | 0.8910 | 0.3435 | 0.1349 |
1 | 2342.9618 | 33002910.7856 | 5744.8160 | 0.8081 | 0.4462 | 0.1421 |
2 | 1992.6884 | 23279759.5944 | 4824.9103 | 0.7053 | 0.4672 | 0.1580 |
3 | 2250.2711 | 25594847.8750 | 5059.1351 | 0.7863 | 0.4246 | 0.2126 |
4 | 2157.4516 | 24978154.4390 | 4997.8150 | 0.8142 | 0.4363 | 0.1531 |
5 | 1991.3288 | 18794342.2788 | 4335.2442 | 0.8791 | 0.3399 | 0.1565 |
6 | 1688.3935 | 20093049.8225 | 4482.5272 | 0.8633 | 0.3137 | 0.1210 |
7 | 2060.8145 | 26178263.6299 | 5116.4698 | 0.8561 | 0.4613 | 0.1332 |
8 | 2088.2260 | 23545921.7229 | 4852.4140 | 0.8479 | 0.3741 | 0.1592 |
9 | 2233.1985 | 27217915.9631 | 5217.0793 | 0.8247 | 0.4302 | 0.1662 |
Mean | 2051.5421 | 24093873.5007 | 4890.2830 | 0.8276 | 0.4037 | 0.1537 |
Std | 206.3066 | 4191347.7243 | 423.0902 | 0.0514 | 0.0529 | 0.0238 |
Wall time: 1.02 s
print(tuned_dt)
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mae', max_depth=6, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.002, min_impurity_split=None, min_samples_leaf=5, min_samples_split=5, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=123, splitter='best')
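By default, tune_model in PyCaret 2.x runs a randomized search: it samples a fixed number of hyperparameter combinations, scores each with cross-validation, and keeps the best one. The sketch below shows the idea in plain Python. The search space and the scoring function toy_cv_error are hypothetical stand-ins, not PyCaret's actual grid or metric.

```python
import random

# Hypothetical search space, for illustration only.
search_space = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 2, 5, 10],
}


def toy_cv_error(params):
    """Stand-in for a cross-validated error; lowest at depth 6, leaf size 5."""
    return (params["max_depth"] - 6) ** 2 + (params["min_samples_leaf"] - 5) ** 2


def random_search(space, score, n_iter=10, seed=123):
    """Sample n_iter random candidates and return the lowest-error one."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]
    return min(candidates, key=score), candidates


best_params, tried = random_search(search_space, toy_cv_error)
print(best_params, toy_cv_error(best_params))
```

Because only a sample of the space is tried, the result is the best of the candidates drawn, not necessarily the best combination overall.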
bagged_tunned_dt = ensemble_model(tuned_dt)
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
---|---|---|---|---|---|---|
0 | 1776.4799 | 18836339.8037 | 4340.0852 | 0.8875 | 0.3481 | 0.1376 |
1 | 2300.3532 | 30644279.5547 | 5535.7276 | 0.8218 | 0.4243 | 0.1539 |
2 | 1924.2258 | 20323874.4793 | 4508.2008 | 0.7427 | 0.4367 | 0.1694 |
3 | 2163.4615 | 22106257.5369 | 4701.7292 | 0.8154 | 0.3846 | 0.1967 |
4 | 2173.8514 | 25725889.5417 | 5072.0696 | 0.8087 | 0.4448 | 0.1605 |
5 | 2161.0406 | 18397565.9304 | 4289.2384 | 0.8817 | 0.3326 | 0.1567 |
6 | 1726.2661 | 19731367.8977 | 4442.0004 | 0.8657 | 0.3310 | 0.1356 |
7 | 2079.3486 | 24270253.8962 | 4926.4849 | 0.8665 | 0.4205 | 0.1367 |
8 | 1986.6675 | 20824366.8000 | 4563.3723 | 0.8655 | 0.3657 | 0.1718 |
9 | 2103.8869 | 25488147.2315 | 5048.5787 | 0.8358 | 0.4128 | 0.1513 |
Mean | 2039.5582 | 22634834.2672 | 4742.7487 | 0.8391 | 0.3901 | 0.1570 |
Std | 174.7402 | 3663777.6927 | 375.7245 | 0.0417 | 0.0412 | 0.0180 |
print(bagged_tunned_dt)
BaggingRegressor(base_estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mae', max_depth=6, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.002, min_impurity_split=None, min_samples_leaf=5, min_samples_split=5, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=123, splitter='best'), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=123, verbose=0, warm_start=False)
type(bagged_tunned_dt)
sklearn.ensemble._bagging.BaggingRegressor
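The BaggingRegressor above trains 10 copies of the tuned tree, each on a bootstrap sample of the training data, and averages their predictions. The sketch below illustrates the mechanism with a deliberately trivial base learner (it always predicts the mean of its training targets), so it demonstrates bootstrap aggregation only, not the decision tree itself. All function names are hypothetical.

```python
import random
import statistics


def fit_mean_predictor(targets):
    """Toy base learner: always predicts the mean of its training targets."""
    mean = statistics.fmean(targets)
    return lambda: mean


def bagged_predictors(targets, n_estimators=10, seed=123):
    """Fit one base learner per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        sample = rng.choices(targets, k=len(targets))
        models.append(fit_mean_predictor(sample))
    return models


# The first five charges from the dataset preview, used as toy targets.
charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]
models = bagged_predictors(charges)

# The bagged prediction is the average of the individual predictions.
bagged_prediction = statistics.fmean([model() for model in models])
print(round(bagged_prediction, 2))
```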
blender = blend_models([tuned_dt, bagged_tunned_dt])
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
---|---|---|---|---|---|---|
0 | 1684.9289 | 18347418.5622 | 4283.3887 | 0.8904 | 0.3422 | 0.1300 |
1 | 2268.6353 | 31524193.3944 | 5614.6410 | 0.8167 | 0.4271 | 0.1416 |
2 | 1926.8691 | 21534874.6378 | 4640.5684 | 0.7273 | 0.4481 | 0.1605 |
3 | 2159.7119 | 23453407.4775 | 4842.8718 | 0.8042 | 0.3988 | 0.1923 |
4 | 2139.0034 | 25136877.9484 | 5013.6691 | 0.8130 | 0.4384 | 0.1554 |
5 | 2052.6057 | 17787615.2115 | 4217.5366 | 0.8856 | 0.3296 | 0.1540 |
6 | 1677.8415 | 19736734.1414 | 4442.6044 | 0.8657 | 0.3187 | 0.1243 |
7 | 2027.8924 | 25005479.8275 | 5000.5480 | 0.8625 | 0.4340 | 0.1293 |
8 | 2014.0100 | 21739953.6415 | 4662.6123 | 0.8596 | 0.3648 | 0.1631 |
9 | 2135.5030 | 26042026.2510 | 5103.1389 | 0.8323 | 0.4166 | 0.1543 |
Mean | 2008.7001 | 23030858.1093 | 4782.1579 | 0.8357 | 0.3918 | 0.1505 |
Std | 186.2108 | 3922992.8879 | 402.2733 | 0.0461 | 0.0463 | 0.0192 |
print(blender)
VotingRegressor(estimators=[('dt', DecisionTreeRegressor(ccp_alpha=0.0, criterion='mae', max_depth=6, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.002, min_impurity_split=None, min_samples_leaf=5, min_samples_split=5, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=123, splitter='best')), ('Bagging', BaggingRegressor(base_estima... min_impurity_decrease=0.002, min_impurity_split=None, min_samples_leaf=5, min_samples_split=5, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=123, splitter='best'), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=123, verbose=0, warm_start=False))], n_jobs=-1, verbose=False, weights=None)
type(blender)
sklearn.ensemble._voting.VotingRegressor
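With weights=None, as in the output above, a VotingRegressor predicts the plain average of its estimators' predictions. The sketch below shows that averaging step in isolation. The function blend and the prediction values are hypothetical.

```python
def blend(predictions_per_model, weights=None):
    """Average the models' predictions row by row (equal weights by default)."""
    n_models = len(predictions_per_model)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions_per_model)
    ]


# Hypothetical predictions from two models for three rows.
tree_preds = [10000.0, 4000.0, 36000.0]
bagged_preds = [11000.0, 5000.0, 35000.0]

print(blend([tree_preds, bagged_preds]))  # [10500.0, 4500.0, 35500.0]
```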
stacker = stack_models([tuned_dt, bagged_tunned_dt])
Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
---|---|---|---|---|---|---|
0 | 2173.8615 | 16817381.9889 | 4100.9001 | 0.8996 | 0.3435 | 0.2386 |
1 | 2786.7597 | 28068236.9149 | 5297.9465 | 0.8368 | 0.4200 | 0.2644 |
2 | 2570.5098 | 19679843.5088 | 4436.1970 | 0.7508 | 0.4307 | 0.3009 |
3 | 2639.4898 | 21157821.8605 | 4599.7632 | 0.8233 | 0.4016 | 0.3216 |
4 | 2503.6254 | 23966267.1643 | 4895.5354 | 0.8218 | 0.4563 | 0.2546 |
5 | 2612.0938 | 17387471.0128 | 4169.8287 | 0.8882 | 0.3376 | 0.2613 |
6 | 2320.3529 | 18792686.8876 | 4335.0533 | 0.8721 | 0.3620 | 0.2697 |
7 | 2548.0337 | 23883895.8817 | 4887.1153 | 0.8687 | 0.4224 | 0.2371 |
8 | 2501.3542 | 18423443.3622 | 4292.2539 | 0.8810 | 0.3827 | 0.3059 |
9 | 2624.7099 | 24671225.4689 | 4967.0137 | 0.8411 | 0.4406 | 0.2781 |
Mean | 2528.0791 | 21284827.4051 | 4598.1607 | 0.8483 | 0.3997 | 0.2732 |
Std | 163.6628 | 3515187.8280 | 376.4910 | 0.0416 | 0.0394 | 0.0270 |
print(stacker)
StackingRegressor(cv=5, estimators=[('dt', DecisionTreeRegressor(ccp_alpha=0.0, criterion='mae', max_depth=6, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.002, min_impurity_split=None, min_samples_leaf=5, min_samples_split=5, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=123, splitter='best')), ('Bagging', BaggingRegressor(base... min_weight_fraction_leaf=0.0, presort='deprecated', random_state=123, splitter='best'), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=123, verbose=0, warm_start=False))], final_estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False), n_jobs=-1, passthrough=True, verbose=0)
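It helps to put the four variants side by side before choosing one. The sketch below copies the Mean rows from the tuned, bagged, blended, and stacked tables and picks the best model per metric. The stacker has the best RMSE and R2 but the worst MAE, so the choice depends on which error matters for the use case.

```python
# Mean cross-validated metrics copied from the four tables above.
cv_means = {
    "tuned_dt": {"MAE": 2051.5421, "RMSE": 4890.2830, "R2": 0.8276},
    "bagged": {"MAE": 2039.5582, "RMSE": 4742.7487, "R2": 0.8391},
    "blender": {"MAE": 2008.7001, "RMSE": 4782.1579, "R2": 0.8357},
    "stacker": {"MAE": 2528.0791, "RMSE": 4598.1607, "R2": 0.8483},
}

best_mae = min(cv_means, key=lambda name: cv_means[name]["MAE"])
best_rmse = min(cv_means, key=lambda name: cv_means[name]["RMSE"])
best_r2 = max(cv_means, key=lambda name: cv_means[name]["R2"])

print(best_mae, best_rmse, best_r2)  # blender stacker stacker
```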
evaluate_model(best)
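# SHAP summary plot (the default); interpret_model relies on the shap package
interpret_model(tuned_dt)
# SHAP force plot explaining the prediction for a single observation
interpret_model(tuned_dt, plot = 'reason', observation = 105)
# partial dependence plot for one feature
interpret_model(tuned_dt, plot = 'pdp', feature = 'age')
# Morris sensitivity analysis
interpret_model(tuned_dt, plot = 'msa')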
# predict on holdout / test set
pred_holdout = predict_model(best)
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
---|---|---|---|---|---|---|---|
0 | Gradient Boosting Regressor | 2386.2018 | 17296249.1379 | 4158.8759 | 0.8789 | 0.3985 | 0.2922 |
pred_holdout.head()
| | age | bmi | sex_female | children_0 | children_1 | children_2 | children_3 | children_4 | children_5 | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest | charges | Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49.0 | 42.680000 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9800.888672 | 10621.483595 |
1 | 32.0 | 37.334999 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 4667.607422 | 7290.151941 |
2 | 27.0 | 31.400000 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 34838.871094 | 36012.959871 |
3 | 35.0 | 24.129999 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 5125.215820 | 7553.788882 |
4 | 60.0 | 25.740000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 12142.578125 | 14904.032497 |
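The holdout metrics compare the charges column (actual values) with the Label column (predictions). The sketch below computes MAE and RMSE for the five rows displayed above only, so the numbers will not match the full holdout scores; it is meant to show how the two columns feed into the metrics.

```python
import math

# (actual charges, predicted Label) for the five rows displayed above.
rows = [
    (9800.888672, 10621.483595),
    (4667.607422, 7290.151941),
    (34838.871094, 36012.959871),
    (5125.215820, 7553.788882),
    (12142.578125, 14904.032497),
]

errors = [predicted - actual for actual, predicted in rows]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(round(mae, 2), round(rmse, 2))
```

RMSE is never smaller than MAE because squaring gives large errors more weight.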
# predict on new data (here, a copy of the original data without the target column)
data2 = data.copy()
data2.drop('charges', axis=1, inplace=True)
data2.head()
| | age | sex | bmi | children | smoker | region |
---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest |
1 | 18 | male | 33.770 | 1 | no | southeast |
2 | 28 | male | 33.000 | 3 | no | southeast |
3 | 33 | male | 22.705 | 0 | no | northwest |
4 | 32 | male | 28.880 | 0 | no | northwest |
# finalize model
best_final = finalize_model(best)
# predict on data2
predictions = predict_model(best_final, data=data2)
predictions.head()
| | age | sex | bmi | children | smoker | region | Label |
---|---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest | 18894.260073 |
1 | 18 | male | 33.770 | 1 | no | southeast | 3698.287534 |
2 | 28 | male | 33.000 | 3 | no | southeast | 6029.271578 |
3 | 33 | male | 22.705 | 0 | no | northwest | 8958.189116 |
4 | 32 | male | 28.880 | 0 | no | northwest | 3900.039002 |
save_model(best_final, 'insurance-pipeline')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=False, features_todrop=[], id_columns=[], ml_usecase='regression', numerical_features=[], target='charges', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None, numeric_strateg... learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None, presort='deprecated', random_state=123, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False)]], verbose=False), 'insurance-pipeline.pkl')
loaded_pipeline = load_model('insurance-pipeline')
Transformation Pipeline and Model Successfully Loaded
print(loaded_pipeline)
Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=False, features_todrop=[], id_columns=[], ml_usecase='regression', numerical_features=[], target='charges', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None, numeric_strateg... learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None, presort='deprecated', random_state=123, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False)]], verbose=False)
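Saving and loading amounts to serializing the whole fitted pipeline (preprocessing steps plus model) to a .pkl file and reading it back. The sketch below shows the same round trip with the standard library's pickle module, using a plain dictionary as a stand-in for the pipeline. It is an analogy under that assumption; PyCaret's own serialization may differ in detail.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the fitted pipeline (hypothetical contents).
pipeline_stand_in = {"steps": ["imputer", "dummify", "model"], "random_state": 123}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "insurance-pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipeline_stand_in, f)  # analogous to save_model
    with open(path, "rb") as f:
        restored = pickle.load(f)  # analogous to load_model

print(restored == pipeline_stand_in)  # True
```

As with any pickle-based format, a saved file should only be loaded from a trusted source and in an environment with compatible library versions.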
# visualize pipeline
from sklearn import set_config
set_config(display = 'diagram')
loaded_pipeline
DataTypes_Auto_infer(display_types=False, ml_usecase='regression', target='charges')
Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None, numeric_strategy='mean', target_variable=None)
New_Catagorical_Levels_in_TestData(replacement_strategy='least frequent', target='charges')
passthrough
passthrough
passthrough
passthrough
New_Catagorical_Levels_in_TestData(replacement_strategy='least frequent', target='charges')
Make_Time_Features(list_of_features=None, time_feature=Index([], dtype='object'))
passthrough
passthrough
passthrough
passthrough
passthrough
passthrough
passthrough
Dummify(target='charges')
Remove_100(target='charges')
Clean_Colum_Names()
passthrough
passthrough
passthrough
passthrough
GradientBoostingRegressor(random_state=123)