
Now It’s Here: Prediction of User Churn with ML

In this post, let’s get our hands dirty with a full exercise: predicting user churn in telecommunications using JupyterLab and a public dataset.

The requirements for this exercise are:

  • Basic Python Knowledge
  • Jupyter Notebook
  • Telecom churn public dataset
  • Libraries for ML: scikit-learn, Pandas, NumPy, etc.

Understanding the Dataset

Dataset basic information:

  • 7043 rows
  • 21 columns

Attributes or Columns:

  1. customerID : Customer ID
  2. gender : Whether the customer is a male or a female
  3. SeniorCitizen : Whether the customer is a senior citizen or not (1, 0)
  4. Partner : Whether the customer has a partner or not (Yes, No)
  5. Dependents : Whether the customer has dependents or not (Yes, No)
  6. tenure : Number of months the customer has stayed with the company
  7. PhoneService : Whether the customer has a phone service or not (Yes, No)
  8. MultipleLines : Whether the customer has multiple lines or not (Yes, No, No phone service)
  9. InternetService : Customer’s internet service provider (DSL, Fiber optic, No)
  10. OnlineSecurity : Whether the customer has online security or not (Yes, No, No internet service)
  11. OnlineBackup : Whether the customer has online backup or not (Yes, No, No internet service)
  12. DeviceProtection : Whether the customer has device protection or not (Yes, No, No internet service)
  13. TechSupport : Whether the customer has tech support or not (Yes, No, No internet service)
  14. StreamingTV : Whether the customer has streaming TV or not (Yes, No, No internet service)
  15. StreamingMovies : Whether the customer has streaming movies or not (Yes, No, No internet service)
  16. Contract : The contract term of the customer (Month-to-month, One year, Two year)
  17. PaperlessBilling : Whether the customer has paperless billing or not (Yes, No)
  18. PaymentMethod : The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
  19. MonthlyCharges : The amount charged to the customer monthly
  20. TotalCharges : The total amount charged to the customer
  21. Churn : Whether the customer churned or not (Yes or No)
  • Column customerID will be excluded since it provides no predictive value to the model
  • Column Churn is the target (prediction) variable; a quick inspection of the raw file, shown below, confirms the basics
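As a quick sanity check, the shape and dtypes can be verified right after loading (a minimal sketch; it assumes the CSV is available locally as Telco_Churn.csv, the same file loaded in the cleaning step below):

import pandas as pd

df_raw = pd.read_csv('Telco_Churn.csv')
print(df_raw.shape)   # expected: (7043, 21)
print(df_raw.dtypes)  # note: TotalCharges loads as object because some rows contain blanks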

Jupyter Notebook Setup

Run the following script as initial setup:

%pip install pandas seaborn scikit-learn xgboost

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder

pd.options.display.max_columns = None
pd.options.display.max_rows = None

Cleaning and Formatting the Dataset

Then we can import the data and start preparing the dataset for further analysis.

df = pd.read_csv('Telco_Churn.csv')

# TotalCharges is read as a string because some rows contain blanks;
# replace the blanks with 0 and cast both charge columns to float
df['TotalCharges'] = df['TotalCharges'].replace(" ", 0).astype('float32')
df["MonthlyCharges"] = df["MonthlyCharges"].astype('float32')

# Drop the ID column and collapse the three-valued service columns,
# treating 'No phone service' / 'No internet service' as plain 'No'
df.drop(['customerID'], axis=1, inplace=True)
df = df.replace(to_replace='No phone service', value='No')
df = df.replace(to_replace='No internet service', value='No')

# For the binary Yes/No columns, let's map the values to 1 and 0
binary_columns = ['Partner', 'Dependents', 'PhoneService', 'MultipleLines',
                  'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                  'TechSupport', 'StreamingTV', 'StreamingMovies',
                  'PaperlessBilling', 'Churn']
for column in binary_columns:
    df[column] = df[column].map({'Yes': 1, 'No': 0})

# gender uses Female/Male rather than Yes/No
df['gender'] = df['gender'].map({'Female': 1, 'Male': 0})
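Before moving on, a quick verification (not part of the original flow) confirms that the mappings produced clean numeric columns:

# All mapped columns should now be numeric (the remaining object columns are
# InternetService, Contract, and PaymentMethod, which are encoded later)
print(df.dtypes)

# The Yes/No mappings should not have introduced any missing values
print(df.isnull().sum().sum())  # expected: 0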

Prediction of user churn – Exploratory Data Analysis

Let’s look at some charts to understand the dataset better.

ax = sns.catplot(y="Churn", kind="count", data=df, height=2.6, aspect=2.5, orient='h', hue="Churn")

Around 27% of the customers churned, so the classes are imbalanced.
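The exact split can be checked directly (a quick supplementary check):

print(df['Churn'].value_counts(normalize=True))
# 0 (stayed)  ≈ 0.73
# 1 (churned) ≈ 0.27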

plt.figure(figsize=(12, 6))
# factorize() turns every column into integer codes so a single
# correlation matrix can be computed across all attributes
corr = df.apply(lambda x: pd.factorize(x)[0]).corr()
ax = sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
                 linewidths=.2, cmap="YlGnBu")
  • There is some positive correlation among the add-on services: DeviceProtection, TechSupport, StreamingTV, and StreamingMovies.
  • There is also some positive correlation between Churn and SeniorCitizen and Partner (ranked below).
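Rather than eyeballing the heatmap, the correlations with Churn can be ranked directly (a small convenience step):

# Sort attributes by their (factorized) correlation with the target
print(corr['Churn'].sort_values(ascending=False))

Next, kernel density estimates show how the numeric features differ between churned and retained customers: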
def kdeplot(feature):
    """Plot the kernel density estimate of a feature, split by churn status."""
    plt.figure(figsize=(9, 4))
    plt.title("KDE for {}".format(feature))
    sns.kdeplot(data=df, x=feature, hue='Churn')

kdeplot('tenure')
kdeplot('MonthlyCharges')
kdeplot('TotalCharges')

ax = sns.catplot(x="Contract", y="MonthlyCharges", hue="Churn", kind="box", data=df, height=4.2, aspect=1.4)

  • Low-tenure customers are more likely to churn (quantified below).
  • Clients with higher MonthlyCharges are more likely to churn.
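These visual impressions can be quantified with a simple group comparison (a supplementary check, not in the original exercise):

# Median tenure and monthly charges, split by churn status
print(df.groupby('Churn')[['tenure', 'MonthlyCharges']].median())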

Prediction of user churn – Categories Columns Handling

To handle the remaining categorical columns, we will use one-hot encoding.

encoder = OneHotEncoder(sparse_output=False)
categorical_columns = ['InternetService', 'Contract', 'PaymentMethod']
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))
df_encoded = pd.concat([df, one_hot_df], axis=1)
df_encoded = df_encoded.drop(categorical_columns, axis=1)
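For reference, pandas offers an equivalent shortcut, and a quick look at the resulting columns confirms the encoding (the get_dummies variant is an alternative, not the route taken above; its column names and dtypes differ slightly):

# Alternative one-liner with pandas:
# df_encoded = pd.get_dummies(df, columns=categorical_columns)

print(df_encoded.columns.tolist())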

Prediction of user churn – Splitting the data

In this step, we split the data into 75% for training and 25% for testing. The model will be trained on the training dataset, and accuracy will be measured on the held-out test dataset.

X = df_encoded.drop('Churn', axis=1)
y = df_encoded['Churn']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=42)
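Because stratify=y was passed, both splits keep the original churn rate, which can be verified directly:

print(X_train.shape, X_test.shape)
print(f"Train churn rate: {y_train.mean():.3f}")
print(f"Test churn rate:  {y_test.mean():.3f}")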

Prediction of user churn – Model Training

Now, we can train multiple models and compare the accuracy.

from sklearn import metrics

– Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=8, min_samples_leaf=3, min_samples_split=3, n_estimators=5000, random_state=13)
clf = clf.fit(X_train, y_train)

prediction_test = clf.predict(X_test)
print(metrics.accuracy_score(y_test, prediction_test))

We got 80.6% accuracy with the Random Forest Classifier.
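Since the classes are imbalanced, accuracy alone can be misleading; a confusion matrix and per-class metrics give a fuller picture (a supplementary check, applicable to any of the models below):

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, prediction_test))
print(classification_report(y_test, prediction_test))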

– Logistic Regression

from sklearn.linear_model import LogisticRegression
# max_iter is raised because the default (100) often fails to converge on unscaled features
LR = LogisticRegression(max_iter=1000)
LR.fit(X_train, y_train)

prediction_test = LR.predict(X_test)
print(metrics.accuracy_score(y_test, prediction_test))

We got 80.7% accuracy for Logistic Regression

– Support Vector Machine (SVM)

from sklearn.svm import SVC
svm = SVC(kernel='linear') 
svm.fit(X_train,y_train)

preds = svm.predict(X_test)
metrics.accuracy_score(y_test, preds)

We got 78.9% accuracy for SVM
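SVMs are sensitive to feature scale, and columns like MonthlyCharges and TotalCharges are unscaled here; a variant that standardizes the features first may behave differently (a sketch, not part of the original exercise, so the accuracy above does not apply to it):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

svm_scaled = make_pipeline(StandardScaler(), SVC(kernel='linear'))
svm_scaled.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, svm_scaled.predict(X_test)))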

– Ada Boost Classifier

from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(X_train,y_train)

preds = ada.predict(X_test)
metrics.accuracy_score(y_test, preds)

We got 79.9% accuracy for Ada Boost

– XG Boost Classifier

from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train, y_train)

preds = xgb.predict(X_test)
metrics.accuracy_score(y_test, preds)

We got 77.9% accuracy for XG Boost

– KNN

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 11) 
knn.fit(X_train,y_train)
predicted_y = knn.predict(X_test)

accuracy_knn = knn.score(X_test,y_test)
print(accuracy_knn)

We got 78.4% accuracy for KNN

– Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

gb_pred = gb.predict(X_test)
print(metrics.accuracy_score(y_test, gb_pred))

We got 80.1% accuracy for Gradient Boosting Classifier
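Since every model above follows the same fit/predict/score pattern, the comparison can also be driven by a small loop over the already-fitted estimators (a refactoring sketch of the cells above):

models = {
    "Random Forest Classifier": clf,
    "Logistic Regression": LR,
    "SVM": svm,
    "Ada Boost": ada,
    "XG Boost": xgb,
    "KNN": knn,
    "Gradient Boosting": gb,
}
for name, model in models.items():
    accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracy:.3f}")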

Prediction of user churn – Conclusions

df_conc = pd.DataFrame(
    [80.6, 80.7, 78.9, 79.9, 77.9, 78.4, 80.1],
    index=["Random Forest Classifier", "Logistic Regression", "SVM", "Ada Boost",
           "XG Boost", "KNN", "Gradient Boosting"],
    columns=["Model Accuracy"])

df_conc
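A horizontal bar chart makes the comparison easier to scan:

df_conc.sort_values("Model Accuracy").plot.barh(legend=False)
plt.xlabel("Accuracy (%)")
plt.tight_layout()
plt.show()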

The Logistic Regression model provides the highest accuracy among all the models tested. However, it is recommended to keep applying machine-learning optimization techniques, such as hyperparameter tuning and feature engineering, in order to improve the model further.
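As a concrete example of such optimization, here is a hedged hyperparameter search over the Gradient Boosting model (the parameter grid is illustrative, not tuned values from this post):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)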
