# P31 导入文件
import pandas as pd
df = pd.read_csv('/Users/emmett/Downloads/tel_data.csv')
print(df.head(10))
customerID gender SeniorCitizen Partner Dependents tenure PhoneService \ 0 7590-VHVEG Female 0 Yes No 1 No 1 5575-GNVDE Male 0 No No 34 Yes 2 3668-QPYBK Male 0 No No 2 Yes 3 7795-CFOCW Male 0 No No 45 No 4 9237-HQITU Female 0 No No 2 Yes 5 9305-CDSKC Female 0 No No 8 Yes 6 1452-KIOVK Male 0 No Yes 22 Yes 7 6713-OKOMC Female 0 No No 10 No 8 7892-POOKP Female 0 Yes No 28 Yes 9 6388-TABGU Male 0 No Yes 62 Yes MultipleLines InternetService OnlineSecurity ... DeviceProtection \ 0 No phone service DSL No ... No 1 No DSL Yes ... Yes 2 No DSL Yes ... No 3 No phone service DSL Yes ... Yes 4 No Fiber optic No ... No 5 Yes Fiber optic No ... Yes 6 Yes Fiber optic No ... No 7 No phone service DSL Yes ... No 8 Yes Fiber optic No ... Yes 9 No DSL Yes ... No TechSupport StreamingTV StreamingMovies Contract PaperlessBilling \ 0 No No No Month-to-month Yes 1 No No No One year No 2 No No No Month-to-month Yes 3 Yes No No One year No 4 No No No Month-to-month Yes 5 No Yes Yes Month-to-month Yes 6 No Yes No Month-to-month Yes 7 No No No Month-to-month No 8 Yes Yes Yes Month-to-month Yes 9 No No No One year No PaymentMethod MonthlyCharges TotalCharges Churn 0 Electronic check 29.85 29.85 No 1 Mailed check 56.95 1889.5 No 2 Mailed check 53.85 108.15 Yes 3 Bank transfer (automatic) 42.30 1840.75 No 4 Electronic check 70.70 151.65 Yes 5 Electronic check 99.65 820.5 Yes 6 Credit card (automatic) 89.10 1949.4 No 7 Mailed check 29.75 301.9 No 8 Electronic check 104.80 3046.05 Yes 9 Bank transfer (automatic) 56.15 3487.95 No [10 rows x 21 columns]
从上面的结果可以看出,这个文件一共有21列内容,有的是数据,有的是字符。
# P32 检查文件中是否有缺失值
df.isnull()
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7038 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
7039 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
7040 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
7041 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
7042 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
7043 rows × 21 columns
可以看到,所有没有缺失的单元格被标上了False。如果该单元格缺失,会被填上True。 但是这里,我们没有办法看到所有单元格的情况,我们按列进行统计看看。
# 按行求和,所有的False会被转化为0,所有True会被转化为1,如果一整列都是0,说明该列没有缺失。
df.isnull().sum()
customerID 0 gender 0 SeniorCitizen 0 Partner 0 Dependents 0 tenure 0 PhoneService 0 MultipleLines 0 InternetService 0 OnlineSecurity 0 OnlineBackup 0 DeviceProtection 0 TechSupport 0 StreamingTV 0 StreamingMovies 0 Contract 0 PaperlessBilling 0 PaymentMethod 0 MonthlyCharges 0 TotalCharges 0 Churn 0 dtype: int64
# P33 对数据类型进行修正
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7043 entries, 0 to 7042 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customerID 7043 non-null object 1 gender 7043 non-null object 2 SeniorCitizen 7043 non-null int64 3 Partner 7043 non-null object 4 Dependents 7043 non-null object 5 tenure 7043 non-null int64 6 PhoneService 7043 non-null object 7 MultipleLines 7043 non-null object 8 InternetService 7043 non-null object 9 OnlineSecurity 7043 non-null object 10 OnlineBackup 7043 non-null object 11 DeviceProtection 7043 non-null object 12 TechSupport 7043 non-null object 13 StreamingTV 7043 non-null object 14 StreamingMovies 7043 non-null object 15 Contract 7043 non-null object 16 PaperlessBilling 7043 non-null object 17 PaymentMethod 7043 non-null object 18 MonthlyCharges 7043 non-null float64 19 TotalCharges 7043 non-null object 20 Churn 7043 non-null object dtypes: float64(1), int64(2), object(18) memory usage: 1.1+ MB
# 在以上的结果中我们发现TotalCharges这一列的数据类型是object,然而这一列应该是数字。肯定是其中混入了一些不是数字的元素,我们对这一列进行统计频次,看看这一列都是啥东西。
df["TotalCharges"].value_counts()
11 20.2 11 19.75 9 20.05 8 19.9 8 .. 6849.4 1 692.35 1 130.15 1 3211.9 1 6844.5 1 Name: TotalCharges, Length: 6531, dtype: int64
# 发现:有个含空格的单元格出现了11次。现在把这些空单元格替换为这组数据的中位数。
# 中位数
median = df["TotalCharges"][df["TotalCharges"] != " "].median()
# 将median填到空格里
df.loc[df["TotalCharges"] == " ", "TotalCharges"] = median
# 再把整列的数据类型修改为float
df["TotalCharges"] = df["TotalCharges"].astype(float)
# 检查
df["TotalCharges"].value_counts()
1397.475 11 20.200 11 19.750 9 20.050 8 19.900 8 .. 6849.400 1 692.350 1 130.150 1 3211.900 1 6844.500 1 Name: TotalCharges, Length: 6531, dtype: int64
# P34 将数字类型外的其他列都改成分类变量(category)
# 先看下都有哪些列
df.columns
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'], dtype='object')
# 只有tenure, MonthlyCharges, TotalCharges是数字类型(float),其他的都改成分类变量(categorical)
# 先把这三列的列名挑出来
number_columns = ["tenure", "MonthlyCharges", "TotalCharges"]
for column in number_columns:
df[column] = df[column].astype(float)
# 修改其他列 set的减法:所有列名的set减去number_columns的set
for column in set(df.columns) - set(number_columns):
df[column] = pd.Categorical(df[column])
# 再查看每一列的数据类型
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7043 entries, 0 to 7042 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customerID 7043 non-null category 1 gender 7043 non-null category 2 SeniorCitizen 7043 non-null category 3 Partner 7043 non-null category 4 Dependents 7043 non-null category 5 tenure 7043 non-null float64 6 PhoneService 7043 non-null category 7 MultipleLines 7043 non-null category 8 InternetService 7043 non-null category 9 OnlineSecurity 7043 non-null category 10 OnlineBackup 7043 non-null category 11 DeviceProtection 7043 non-null category 12 TechSupport 7043 non-null category 13 StreamingTV 7043 non-null category 14 StreamingMovies 7043 non-null category 15 Contract 7043 non-null category 16 PaperlessBilling 7043 non-null category 17 PaymentMethod 7043 non-null category 18 MonthlyCharges 7043 non-null float64 19 TotalCharges 7043 non-null float64 20 Churn 7043 non-null category dtypes: category(18), float64(3) memory usage: 611.1 KB
# P35 Describe描述所有分类变量的列
df.describe(include=["category"])
customerID | gender | SeniorCitizen | Partner | Dependents | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 |
unique | 7043 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 4 | 2 |
top | 0002-ORFBO | Male | 0 | No | No | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | No |
freq | 1 | 3555 | 5901 | 3641 | 4933 | 6361 | 3390 | 3096 | 3498 | 3088 | 3095 | 3473 | 2810 | 2785 | 3875 | 4171 | 2365 | 5174 |
这里和数据类型的describe结果是不一样的
count表示了所有的个数,unique表示种类数,也就是去重之后的,比如gender里面,只有男女,所以unique是2。Top和freq是最多的分类和对应的频数。
# P36 看某一列的数据分布
df["Churn"].value_counts()
No 5174 Yes 1869 Name: Churn, dtype: int64
# P37 多维度查看月费字段统计
# Churn:用户是否流失
# PaymentMethod:支付方式
# MonthlyCharges:月费
# 查看按照Churn、PaymentMethod两个指标分组后的月费均值
df.groupby(["Churn", "PaymentMethod"])["MonthlyCharges"].mean()
Churn PaymentMethod No Bank transfer (automatic) 65.049417 Credit card (automatic) 64.562209 Electronic check 74.232032 Mailed check 41.403911 Yes Bank transfer (automatic) 77.875581 Credit card (automatic) 77.356034 Electronic check 78.700980 Mailed check 54.557143 Name: MonthlyCharges, dtype: float64
从这里可以很明显的的发现,无论是哪种支付方式,用户流失了(Churn=Yes)的月费均值都大于未流失的。这就是隐藏在数据中的规律。
# P38 Churn字段的数据映射
# 需要把Yes/No转化为1-0数据,这对于机器学习数据的处理是一个很有用的操作
df["Churn"] = df["Churn"].map({"Yes":1,"No":0})
df["Churn"].value_counts()
0 5174 1 1869 Name: Churn, dtype: int64
这和之前对于Churn的描述是一致的,只不过No被替换为了0,Yes被替换为了1。
# P39 查看字段的相关性矩阵
# 统计DataFrame的相关性矩阵
df.corr()
/var/folders/3g/cxwzy4z940n8qvfv8vrkc97m0000gn/T/ipykernel_84349/1043428941.py:4: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. df.corr()
tenure | MonthlyCharges | TotalCharges | |
---|---|---|---|
tenure | 1.000000 | 0.247900 | 0.825464 |
MonthlyCharges | 0.247900 | 1.000000 | 0.650864 |
TotalCharges | 0.825464 | 0.650864 | 1.000000 |
可以发现,相关性矩阵只会通过数值型的变量来计算。
# P40 数据的采样
# 随机挑选一定的行数
df.sample(10)
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4824 | 3339-EAQNV | Male | 1 | Yes | No | 72.0 | Yes | Yes | Fiber optic | No | ... | No | No | Yes | Yes | One year | Yes | Credit card (automatic) | 97.25 | 7133.10 | 0 |
1539 | 5071-FBJFS | Female | 0 | Yes | Yes | 4.0 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | No | Electronic check | 50.30 | 217.10 | 0 |
6006 | 1360-RCYRT | Male | 0 | Yes | Yes | 23.0 | No | No phone service | DSL | No | ... | No | Yes | No | No | Month-to-month | Yes | Bank transfer (automatic) | 30.35 | 678.75 | 0 |
4841 | 9058-MJLZC | Female | 0 | No | No | 24.0 | Yes | No | Fiber optic | Yes | ... | No | No | Yes | Yes | Month-to-month | Yes | Electronic check | 94.60 | 2283.15 | 0 |
6002 | 3727-RJMEO | Male | 0 | Yes | No | 6.0 | Yes | Yes | Fiber optic | No | ... | No | No | Yes | No | Month-to-month | Yes | Electronic check | 82.85 | 460.25 | 1 |
5155 | 8294-UIMBA | Female | 0 | No | No | 30.0 | Yes | No | Fiber optic | No | ... | No | Yes | Yes | Yes | One year | Yes | Bank transfer (automatic) | 94.40 | 2638.10 | 0 |
3793 | 6278-FEPBZ | Female | 0 | No | No | 9.0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Month-to-month | No | Mailed check | 20.25 | 186.15 | 0 |
2273 | 0219-QAERP | Male | 0 | Yes | No | 30.0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | One year | No | Credit card (automatic) | 19.80 | 576.65 | 0 |
5258 | 8634-CILSZ | Male | 0 | No | No | 69.0 | Yes | Yes | Fiber optic | No | ... | No | Yes | Yes | Yes | One year | Yes | Bank transfer (automatic) | 104.70 | 7220.35 | 1 |
2766 | 1229-RCALF | Female | 0 | Yes | No | 64.0 | Yes | Yes | Fiber optic | Yes | ... | Yes | Yes | Yes | Yes | One year | Yes | Electronic check | 115.00 | 7396.15 | 0 |
10 rows × 21 columns