Project | 02

Churn Analysis

churn.jpeg
Business Question
How can we best predict which customers are going to churn, so that the bank can proactively reach out to them and offer better services before they leave?
Dataset
The data was retrieved from Kaggle (https://www.kaggle.com/sakshigoyal7/credit-card-customers) and originates from https://leaps.analyttica.com/home. The dataset consists of 10,127 customer records, including each customer's attrition status (Attrited/Existing), which I used as the response variable, and nineteen features describing salary, age, transaction amounts, and so on.
dataset.PNG
Data Pre-processing
After checking for missing data, I found that there are no missing values in this dataset. Among the nineteen features, five are categorical and fourteen are numerical. Plotting the numerical variables showed that some features, such as Customer_Age and Months_on_book, are nearly normally distributed, while others are skewed; I decided not to apply any transformation or manipulation, since they were adequate as they were. A correlation heatmap of the numerical variables showed no significant multicollinearity: the plot is filled mostly with violet (low correlation) except along the diagonal. For feature engineering, the categorical variables first needed to be encoded so that the model can interpret them. Slightly different encoding methods were used: label encoding for the response variable and one-hot encoding for the other categorical variables.
coprr.PNG
plot.PNG
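As a minimal sketch of this encoding step (illustrative values; column names follow the Kaggle dataset), label encoding maps the response to 0/1 while one-hot encoding expands each categorical feature into indicator columns:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny illustrative frame with the same column names as the Kaggle data
toy = pd.DataFrame({'Attrition_Flag': ['Existing Customer', 'Attrited Customer'],
                    'Gender': ['F', 'M'],
                    'Card_Category': ['Blue', 'Silver']})

# Label encoding for the response: 'Attrited Customer' -> 0, 'Existing Customer' -> 1
toy['churn'] = LabelEncoder().fit_transform(toy['Attrition_Flag'])

# One-hot encoding for the other categoricals, e.g. Gender -> Gender_F, Gender_M
encoded = pd.get_dummies(toy.drop('Attrition_Flag', axis=1),
                         columns=['Gender', 'Card_Category'])
print(encoded)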
Assumptions/Simplifications
The primary assumption made in our analysis is that we reduce the dimensionality by removing three columns: the client number and the two Naive Bayes ratio columns. We can assume those are unnecessary for the prediction and the analysis. Also, during pre-processing we found that some of the numerical features are skewed, but we can assume that random forest handles those, as well as categorical features that are ordinal or nominal. Here, we also assume a budget of $200K to spend on boosting our Return On Investment as a result of predicting churning customers.
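As a minimal sketch, this column reduction is a single slice over the data frame, mirroring the code base in section (3) below:

import pandas as pd

df = pd.read_csv('~/Downloads/BankChurners.csv')  # path as used in the code base below
# Drop the client number (first column) and the two Naive Bayes ratio columns (last two)
df = df[df.columns[1:-2]]
print(df.shape)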
Modeling Approach
The encoded data was split into two parts, train and test sets, and the train set was fit to the baseline model, logistic regression. Logistic regression was chosen because of the binary response variable, as it is useful for explaining the relationship between a binary response and nominal/ordinal independent variables. After fitting the train set with 'LogisticRegression()' from the sklearn package and scoring the test set, the AUC ROC, recall, and specificity are reported below.
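A minimal sketch of this baseline fit (using synthetic stand-in data from make_classification; the real pipeline in section (3) below uses the encoded churn features and the same 80/20 split):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in binary data with numeric features only
x, y = make_classification(n_samples=2000, n_features=20, random_state=41)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=41)

basemodel = LogisticRegression(max_iter=1000)   # baseline model, as in section (3)
basemodel.fit(x_train, y_train)

y_pred = basemodel.predict(x_test)
print("AUC ROC:", roc_auc_score(y_test, y_pred))
print("recall :", recall_score(y_test, y_pred))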
Figure1.PNG
Figure 1 shows a confusion matrix with true negative, true positive, false negative, and false positive counts of 165, 1637, 59, and 165, respectively. Calculated from those values, the area under the ROC curve is 0.73, recall is 0.96, and specificity is 0.5. With specificity as low as 0.5, even though recall is as high as 0.96, logistic regression is not a good model for this prediction and is not suitable for this business case in particular. The reason it is best to focus on specificity (TN/(TN+FP)) is that predicting churning customers as non-churning matters critically in this business. Meanwhile, recall (TP/(TP+FN)) is relatively less important in this project, because predicting non-churning customers as churning does not critically affect the business.
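As a quick check of these metrics (a small sketch using the counts reported above):

TN, TP, FN, FP = 165, 1637, 59, 165      # counts from the logistic regression confusion matrix

recall = TP / (TP + FN)                  # 1637 / 1696 ~ 0.96
specificity = TN / (TN + FP)             # 165 / 330 = 0.50
print(f"recall={recall:.2f}, specificity={specificity:.2f}")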
Figure2-1.PNG
As can be seen in Figure 2-1, the coefficient values for each feature generated by logistic regression provide a basis for a feature importance score. Positive scores indicate features that predict existing customers, whereas negative scores indicate features that predict churning customers.
Figure2-2.PNG
Figure 2-2 shows the sorted coefficients for each feature. Change in transaction count (Total_Ct_Chng_Q4_Q1), total number of products held by the customer (Total_Relationship_Count), and 'Gender_M' are the primary features with high positive coefficients, explaining existing customers. On the other hand, the number of months inactive in the last 12 months (Months_Inactive_12_mon), the number of contacts in the last 12 months (Contacts_Count_12_mon), and 'Gender_F' are the primary features with negative coefficients, explaining churning customers. Furthermore, the predicted probability can be calculated from the logistic transformation of the added effects. For example, for a certain customer who is a woman, with a change in transaction count of three, two contacts, and so on, the added effect is -0.49 (Gender_F) + 30.74 (change in transaction count) - 20.57 (Contacts_Count) + … and so on. If the added effect is 2, the logistic transformation gives a probability of 1/(1+exp(-2)) = 0.88, which means a customer with those features has an 88% chance of remaining ("Existing"). Nonetheless, again, logistic regression is risky because of its low true negative rate (specificity).
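A quick numerical check of this logistic transformation (a minimal sketch; the added-effect value of 2 is the illustrative number from the text above):

import numpy as np

added_effect = 2.0                                 # illustrative sum of coefficient * feature effects
prob_existing = 1 / (1 + np.exp(-added_effect))    # logistic (sigmoid) transformation
print(round(prob_existing, 2))                     # 0.88 -> roughly an 88% chance of "Existing"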
Random Forest was performed next, since the data has many additional dummy features after the categorical encoding and Random Forest handles nonlinearities better than logistic regression. It was also chosen because it makes it easy to measure the relative importance of each feature for the prediction. Using the built-in class from the sklearn package, Random Forest achieved 0.96 accuracy, a 0.93 cross-validation score, 0.89 AUC ROC (area under the ROC curve), a 0.97 F1 score, 0.98 recall, and 0.79 specificity. The new model improves on the baseline not only with higher AUC ROC, F1 score, and recall, but also with a substantially higher true negative rate.
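A compact sketch of this step and of the specificity computation used throughout (again on synthetic stand-in data; the real run in section (3) below uses the encoded churn features):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in binary data; replace with the encoded churn features for the real run
x, y = make_classification(n_samples=2000, n_features=20, random_state=41)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=41)

rf = RandomForestClassifier()
rf.fit(x_train, y_train)

# Specificity = TN / (TN + FP), with class 0 (churning) treated as the negative class
cm = confusion_matrix(y_test, rf.predict(x_test), labels=[0, 1])
print("specificity:", cm[0][0] / cm[0].sum())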
Figure3.PNG
Figure 3 shows a confusion matrix with true negative, true positive, false negative, and false positive counts of 258, 1675, 21, and 72, respectively. Calculated from those values, the specificity increases to 0.79, meaning that 79% of the customers who actually churn are correctly predicted as churning.
Figure4.PNG
Figure 4 shows the feature importances of the Random Forest model, presenting its top fifteen features. Total transaction amount over the last 12 months (Total_Trans_Amt), total transaction count over the last 12 months (Total_Trans_Ct), and total revolving balance on the credit card (Total_Revolving_Bal) are the primary features for the model's predictions. Change in transaction count (Total_Ct_Chng_Q4_Q1) and total number of products held by the customer (Total_Relationship_Count) are ranked 3rd and 5th in the figure, although the 5th has fairly low importance.
Alternative Model
XGBoost was chosen as our alternative model because it supports both regression and classification predictive modeling with excellent computational speed. Figure 5 shows the confusion matrix from fitting XGBoost; the specificity in this case is 0.88, which is now high enough to be useful in our business case. As can be seen in Figure 6, in addition to the specificity, the alternative model performed excellently, with 0.97 accuracy, an area under the ROC curve of 0.93, and recall as high as 0.98.
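A minimal sketch of this alternative fit (same constructor arguments as in the code base below; the data here is a synthetic stand-in for the encoded churn features):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Stand-in binary data; the real pipeline uses the encoded churn features from section (3)
x, y = make_classification(n_samples=2000, n_features=20, random_state=41)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=41)

xgb = XGBClassifier(eval_metric='logloss', use_label_encoder=False)
xgb.fit(x_train, y_train)
print("test accuracy:", xgb.score(x_test, y_test))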
xgb1.PNG
xgb2.PNG
Figure 7 is the feature importance plot for XGBoost; in this case, the primary features are total transaction count, total revolving balance on the customer's credit card, and total relationship count, i.e., the total number of products held by the customer.
xgb3.PNG
Conclusion
Our assumed budget of $200K can be separated into two parts: customer acquisition cost and customer retention cost. False positives, which predict disengaging customers as existing, relate to customer acquisition cost (CAC): for the customers our model wrongly predicts will stay, we will have to work on acquiring replacements. Meanwhile, false negatives and true negatives, i.e., the customers the model predicts will churn, relate to customer retention cost (CRC): for them, we need to work on retention to keep them from leaving. The resulting ratio of customer acquisition cost to retention cost, which can be calculated from the confusion matrix, comes out to roughly 1 to 15, which makes $187.5K for customer retention and $12.5K for customer acquisition. The model also suggests where to spend the retention budget: the lower the total number of transactions and their amount, the more likely a customer is to churn, and the lower the change in transaction count and the total revolving balance, the more likely they are to churn as well. These are our target customers. For them, my suggestion is to reward customers with discounts and exclusive or special offers, and to run marketing campaigns, messaging, and offers at least once every thirty days. For customer acquisition, even though the budget is smaller than for retention, a schematic viral-marketing plan is highly recommended. Ultimately, this budget plan will maximize Return On Investment.
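A quick arithmetic check of this split (a minimal sketch; the 1:15 acquisition-to-retention ratio is the estimate from the analysis above):

BUDGET = 200_000                       # assumed total budget

# Acquisition : retention split of roughly 1 : 15, as estimated above
acquisition_cost = BUDGET * 1 / 16     # 12,500  -> $12.5K for customer acquisition
retention_cost = BUDGET * 15 / 16      # 187,500 -> $187.5K for customer retention
print(acquisition_cost, retention_cost)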
Architecture Diagram
Architecture_diagram.PNG
Instructions for running code
1. Open the GitHub repository jpark143-jp/jpark143-jp.github.io and go to the "Churn_Analysis" folder on the master branch.
2. Clone my GitHub directory folder, "Churn Analysis", to your working directory.
3. On your command line, change directory to "Churn Analysis" and try the following sample input: python main.py 54 1 40 2 3 1 1438.3 808 630.3 0.997 705 19 0.9 0.562 F Graduate Married 'Less than $40K' Blue
4. You can change the input values for the customer you want to predict.

(1) main.py

""" Module serve as Feature engineering + Random Forest(Final Model)

prepared for the streaming input;

- Sample usage : python main.py 54 1 40 2 3 1 1438.3 808 630.3 0.997 705 19 0.9 0.562 F Graduate Married 'Less than $40K' Blue

- important notes for input spec is on README.md

Author : JungHwan Park

Date : 03.05.2021

Email : jpark143@g.ucla.edu

Content : Final model prediction

"""

import urllib.request

import sys

import numpy as np

import joblib

from final_modules.encode_pred import onehotencoding_binding

if __name__ == "__main__":

CUSTOMER_AGE = float(sys.argv[1])

DEPENDENT_COUNT = float(sys.argv[2])

MONTHS_ON_BOOK = float(sys.argv[3])

TOTAL_RELATIONSHIP_COUNT = float(sys.argv[4])

MONTHS_INACTIVE_12_MON = float(sys.argv[5])

CONTACTS_COUNT_12_MON = float(sys.argv[6])

CREDIT_LIMIT = float(sys.argv[7])

TOTAL_REVOLVING_BAL = float(sys.argv[8])

AVG_OPEN_TO_BUY = float(sys.argv[9])

TOTAL_AMT_CHNG_Q4_Q1 = float(sys.argv[10])

TOTAL_TRANS_AMT = float(sys.argv[11])

TOTAL_TRANS_CT = float(sys.argv[12])

TOTAL_CT_CHNG_Q4_Q1 = float(sys.argv[13])

AVG_UTILIZATION_RATIO = float(sys.argv[14])

GENDER = sys.argv[15]

EDUCATION_LEVEL = sys.argv[16]

MARITAL_STATUS = sys.argv[17]

INCOME_CATEGORY = sys.argv[18]

CARD_CATEGORY = sys.argv[19]

#

# --- Print them out, for validation:

print(f"CUSTOMER_AGE: {CUSTOMER_AGE}")

print(f"DEPENDENT_COUNT: {DEPENDENT_COUNT}")

print(f"MONTHS_ON_BOOK: {MONTHS_ON_BOOK}")

print(f"TOTAL_RELATIONSHIP_COUNT: {TOTAL_RELATIONSHIP_COUNT}")

print(f"MONTHS_INACTIVE_12_MON: {MONTHS_INACTIVE_12_MON}")

print(f"CONTACTS_COUNT_12_MON: {CONTACTS_COUNT_12_MON}")

print(f"CREDIT_LIMIT: {CREDIT_LIMIT}")

print(f"TOTAL_REVOLVING_BAL: {TOTAL_REVOLVING_BAL}")

print(f"AVG_OPEN_TO_BUY: {AVG_OPEN_TO_BUY}")

print(f"TOTAL_AMT_CHNG_Q4_Q1: {TOTAL_AMT_CHNG_Q4_Q1}")

print(f"TOTAL_TRANS_AMT: {TOTAL_TRANS_AMT}")

print(f"TOTAL_TRANS_CT: {TOTAL_TRANS_CT}")

print(f"TOTAL_CT_CHNG_Q4_Q1: {TOTAL_CT_CHNG_Q4_Q1}")

print(f"AVG_UTILIZATION_RATIO: {AVG_UTILIZATION_RATIO}")

print(f"GENDER: {GENDER}")

print(f"EDUCATION_LEVEL: {EDUCATION_LEVEL}")

print(f"MARITAL_STATUS: {MARITAL_STATUS}")

print(f"INCOME_CATEGORY: {INCOME_CATEGORY}")

print(f"CARD_CATEGORY: {CARD_CATEGORY}")

#

# --- Create final variables for validation:

ONEHOTENCODING_LIST = onehotencoding_binding(GENDER, EDUCATION_LEVEL, MARITAL_STATUS,

INCOME_CATEGORY, CARD_CATEGORY)

ALLBINDED_LIST = {'Customer_Age':CUSTOMER_AGE,

'Dependent_count':DEPENDENT_COUNT,

'Months_on_book':MONTHS_ON_BOOK,

'Total_Relationship_Count':TOTAL_RELATIONSHIP_COUNT,

'Months_Inactive_12_mon':MONTHS_INACTIVE_12_MON,

'Contacts_Count_12_mon':CONTACTS_COUNT_12_MON,

'Credit_Limit':CREDIT_LIMIT,

'Total_Revolving_Bal':TOTAL_REVOLVING_BAL,

'Avg_Open_To_Buy':AVG_OPEN_TO_BUY,

'Total_Amt_Chng_Q4_Q1':TOTAL_AMT_CHNG_Q4_Q1,

'Total_Trans_Amt':TOTAL_TRANS_AMT,

'Total_Trans_Ct':TOTAL_TRANS_CT,

'Total_Ct_Chng_Q4_Q1':TOTAL_CT_CHNG_Q4_Q1,

'Avg_Utilization_Ratio':AVG_UTILIZATION_RATIO,

**ONEHOTENCODING_LIST}

FINAL_LIST = list(ALLBINDED_LIST.values())

MODEL_INPUT = np.reshape(FINAL_LIST, (1, -1))

URL = "https://stats404-junghwanpark.s3.ap-northeast-2.amazonaws.com/rf_Bankchurners.joblib"

FILENAME = "rf_Bankchurners.joblib"

# Save the file

with open(FILENAME, 'wb') as file:

joblib.dump(URL, file)

urllib.request.urlretrieve(URL, FILENAME)

# Load from file

with open(FILENAME, 'rb') as file:

MODEL = joblib.load(file)

FINAL_PRED = MODEL.predict(MODEL_INPUT)

print(f"Final_predicted_outcome : {FINAL_PRED}")

(2) encode_pred.py

"""Module containing helper function(s) for feature engineering

"""

def encoding_Gender(Gender):

"""Function to encode Gender"""

Gender_list={'Gender_F':0, 'Gender_M':0}

if Gender == 'F':

Gender_list['Gender_F']=1

elif Gender == 'M':

Gender_list['Gender_M']=1

else : raise ValueError("Error: Wrong input")

#Gender_list1= list(Gender_list.values())

return(Gender_list)

def encoding_Education_Level(Education_Level):

"""Function to encode Education_Level"""

Education_list= {'Education_Level_College' : 0, 'Education_Level_Doctorate':0, 'Education_Level_Graduate':0, 'Education_Level_High School':0,

'Education_Level_Post-Graduate':0, 'Education_Level_Uneducated':0, 'Education_Level_Unknown':0}

if Education_Level == 'College':

Education_list['Education_Level_College'] = 1

elif Education_Level == 'Doctorate':

Education_list['Education_Level_Doctorate'] = 1

elif Education_Level == 'Graduate':

Education_list['Education_Level_Graduate'] = 1

elif Education_Level == 'High School':

Education_list['Education_Level_High School'] = 1

elif Education_Level == 'Post-Graduate':

Education_list['Education_Level_Post-Graduate'] = 1

elif Education_Level == 'Uneducated':

Education_list['Education_Level_Uneducated'] = 1

elif Education_Level == 'Unknown':

Education_list['Education_Level_Unknown'] = 1

else : raise ValueError("Error: Wrong input")

return(Education_list)

def encoding_Marital_Status(Marital_Status):

"""Function to encode Marital_Status"""

Marital_Status_list={'Marital_Status_Divorced':0, 'Marital_Status_Married':0, 'Marital_Status_Single':0, 'Marital_Status_Unknown':0}

if Marital_Status == 'Divorced':

Marital_Status_list['Marital_Status_Divorced']=1

elif Marital_Status == 'Married':

Marital_Status_list['Marital_Status_Married']=1

elif Marital_Status == 'Single':

Marital_Status_list['Marital_Status_Single']=1

elif Marital_Status == 'Unknown':

Marital_Status_list['Marital_Status_Unknown']=1

else : raise ValueError("Error: Wrong input")

return(Marital_Status_list)

def encoding_Income_Category(Income_Category):

"""Function to encode Income_Category"""

Income_Category_list={'Income_Category_$120K +':0, 'Income_Category_$40K - $60K':0, 'Income_Category_$60K - $80K':0, 'Income_Category_$80K - $120K':0, 'Income_Category_Less than $40K':0, 'Income_Category_Unknown':0}

if Income_Category == '$120K +':

Income_Category_list['Income_Category_$120K +']=1

elif Income_Category == '$40K - $60K':

Income_Category_list['Income_Category_$40K - $60K']=1

elif Income_Category == '$60K - $80K':

Income_Category_list['Income_Category_$60K - $80K']=1

elif Income_Category == '$80K - $120K':

Income_Category_list['Income_Category_$80K - $120K']=1

elif Income_Category == 'Less than $40K':

Income_Category_list['Income_Category_Less than $40K']=1

elif Income_Category == 'Unknown':

Income_Category_list['Income_Category_Unknown']=1

else : raise ValueError("Error: Wrong input")

return(Income_Category_list)

def encoding_Card_Category(Card_Category):

"""Function to encode Card_Category"""

Card_Category_list={'Card_Category_Blue':0, 'Card_Category_Gold':0, 'Card_Category_Platinum':0, 'Card_Category_Silver':0}

if Card_Category == 'Blue':

Card_Category_list['Card_Category_Blue']=1

elif Card_Category == 'Gold':

Card_Category_list['Card_Category_Gold']=1

elif Card_Category == 'Platinum':

Card_Category_list['Card_Category_Platinum']=1

elif Card_Category == 'Silver':

Card_Category_list['Card_Category_Silver']=1

else : raise ValueError("Error: Wrong input")

return(Card_Category_list)

def onehotencoding_binding(Gender,Education_Level,Marital_Status,Income_Category,

Card_Category):

"""Function to bind them altogether"""

Gender_list=encoding_Gender(Gender)

Education_list=encoding_Education_Level(Education_Level)

Marital_Status_list=encoding_Marital_Status(Marital_Status)

Income_Category_list=encoding_Income_Category(Income_Category)

Card_Category_list=encoding_Card_Category(Card_Category)

one_hot_encoding_binding={**Gender_list,**Education_list,**Marital_Status_list,

**Income_Category_list,**Card_Category_list}

return(one_hot_encoding_binding)

(3) Code base before productionization

# Reading-in Bankchurners data
import pandas as pd
import numpy as np
%matplotlib inline
from collections import Counter
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import sklearn.metrics as metrics
import matplotlib.pyplot as pyplot
from sklearn.model_selection import cross_val_score

df = pd.read_csv('~/Downloads/BankChurners.csv')

df = df[df.columns[1:-2]]  # Removing unnecessary columns: client number and the two Naive Bayes ratio columns

x = df.loc[:, df.columns != 'Attrition_Flag']
y = df['Attrition_Flag']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=40)
train = pd.concat([x_train, y_train], axis=1) # Concatenating train for further analysis.

for col in df.columns:
    na = df[col].isna().sum()
    print(f'{col} : Missing -> {na}')  

y_train.hist(bins=3)

 

## Create a function, NumericalPlot(), which creates distribution plots for the numerical columns

def NumericalPlot(col) :
    f, ax = plt.subplots(2, 3,figsize=(30, 15))
    
    ind = 0
    for i in range(2):
        for j in range(3):
            sns.distplot(df.loc[:, col[ind]],
                         hist=True,
                         ax=ax[i][j])
            ax[i][j].set_title(col[ind])
            ind += 1

Numcol=['Customer_Age','Credit_Limit','Months_on_book','Avg_Utilization_Ratio','Avg_Open_To_Buy','Total_Trans_Amt']
NumericalPlot(Numcol)

## Checking Multicollinearity among numerical variables using correlation plot

AllNumcol=['Customer_Age','Credit_Limit','Months_on_book','Avg_Utilization_Ratio','Avg_Open_To_Buy','Total_Trans_Amt','Dependent_count',
                  'Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon','Total_Revolving_Bal',
                  'Total_Amt_Chng_Q4_Q1','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1']
corr = x_train.loc[:, AllNumcol].corr()

plt.figure(figsize=(30,15))
sns.heatmap(corr, annot=True)
plt.title("Correlation Plot of Numerical Variables ", fontsize = 30)
plt.show()

# Using LabelEncoder(), replace 'Attrition_Flag' with a new encoded column, 'churn'.
le = LabelEncoder()
df['churn'] = le.fit_transform(df['Attrition_Flag'])
df = df.drop('Attrition_Flag', axis=1)

# For one-hot-encoding, we make the lists of cats and nums after dropping 'churn' which is already encoded.
categorical_cols = df.drop('churn',axis=1).select_dtypes(exclude=['int64','float64']).columns
numerical_cols = df.drop('churn',axis=1).select_dtypes(include=['int64','float64']).columns
categorical_cols 

# Creating a function which 'One-Hot-Encodes' categorical variables
def OneHotEncoding_Binding(data, cols):
    dumm = pd.get_dummies(data[[cols]])
    one_hot = pd.concat([data, dumm], axis=1)
    one_hot = one_hot.drop([cols], axis=1)
    return(one_hot) 

for col in categorical_cols:
    df = OneHotEncoding_Binding(df, col)

# Splitting encoded data
x = df.drop('churn', axis=1).values
y = df['churn']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=41)

# 1. Baseline model : Logistic regression

# Model fitting
basemodel=LogisticRegression(C=100, solver='lbfgs', max_iter=1000)
basemodel.fit(x_train, np.ravel(y_train))

# Evaluation Metrics
score_train=basemodel.score(x_train,y_train)
score_test=basemodel.score(x_test,y_test)
y_pred = basemodel.predict(x_test)
cv = cross_val_score(basemodel, x_test, y_test).mean()
auc_roc = roc_auc_score(y_test, y_pred)
recall= metrics.recall_score(y_test, basemodel.predict(x_test))


cm =metrics.confusion_matrix(y_test, basemodel.predict(x_test),labels=[0,1]) # Constructing Confusion Matrix
ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax, fmt='g')
ax.set_xlabel('Predicted');ax.set_ylabel('True'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['Churning', 'Existing']); ax.yaxis.set_ticklabels(['Churning', 'Existing']);

specificity = cm[0][0]/sum(cm[0])  # TN/(TN+FP): share of actual churners correctly predicted

# Print the results
print(metrics.classification_report(y_test, basemodel.predict(x_test), labels=[1,0]))
print(f"(Logistic Regression : train accuracy {score_train:.2%})")
print(f"Logistic Regression : test accuracy {score_test:.2%}")
print(f"Logistic Regression : cross validation score {cv:.2%}")
print(f"Logistic Regression : AUC ROC {auc_roc:.2}")
print(f"Logistic Regression : recall {recall:.2%}")
print(f"Logistic Regression : specificity  {specificity:.2%}")

 

importance = basemodel.coef_[0]
# summarizing feature importance
for a,b in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (a,b))
# plotting feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

 

# Coefficient handling

feature_index=df.drop('churn', axis=1).columns
logistic_reg_coef_table = pd.DataFrame(index = feature_index, data =importance, columns=['coef'])
logistic_reg_coef_table.sort_values(by =['coef'], ascending=False)

# 2. RandomForest

# Model fitting
rf = RandomForestClassifier()
rf.fit(x_train, y_train)

# Evaluation Metrics
y_pred = rf.predict(x_test)
score = rf.score(x_test, y_test) 

cv = cross_val_score(rf, x_test, y_test).mean()
auc_roc = roc_auc_score(y_test, y_pred)
recall= metrics.recall_score(y_test, rf.predict(x_test))


cm =metrics.confusion_matrix(y_test, rf.predict(x_test), labels=[0,1]) # Constructing Confusion Matrix
ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax, fmt='g' )
ax.set_xlabel('Predicted');ax.set_ylabel('True'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['Churning', 'Existing']); ax.yaxis.set_ticklabels(['Churning', 'Existing']);

specificity=cm[0][0]/sum(cm[0])

# Print the results
print(metrics.classification_report(y_test, rf.predict(x_test), labels=[1,0]))
print(f"RandomForest : accuracy {score:.2%}")
print(f"RandomForest : cross validation score {cv:.2%}")
print(f"RandomForest : AUC ROC {auc_roc:.2}")
print(f"RandomForest : recall score {recall:.2%}")
print(f"RandomForest : specificity  {specificity:.2%}")

# Feature importance
features = list(df.drop('churn', axis=1).columns)
number_of_features = 15 # Top fifteen features
importances = rf.feature_importances_
index = np.argsort(importances)

ten_features = np.array(features)[index][-number_of_features:]
imp_values = importances[index][-number_of_features:]
y_ticks = np.arange(0, number_of_features)
fig, ax = plt.subplots()
ax.barh(y_ticks, imp_values)
ax.set_yticks(y_ticks)
ax.set_yticklabels(ten_features)
ax.set_title("RandomForest Feature Importances")
fig.tight_layout()
plt.show()

# 3. XGBoost : tree-based model + gradient boosting -> high accuracy

xgb = XGBClassifier(eval_metric='logloss',use_label_encoder =False)

xgb.fit(x_train, y_train)

y_pred = xgb.predict(x_test)
score = xgb.score(x_test, y_test) 

cv = cross_val_score(xgb, x_test, y_test).mean()
auc_roc = roc_auc_score(y_test, y_pred)
recall= metrics.recall_score(y_test, xgb.predict(x_test))

cm =metrics.confusion_matrix(y_test, xgb.predict(x_test), labels=[0,1]) # Constructing Confusion Matrix
ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax, fmt='g' )
ax.set_xlabel('Predicted');ax.set_ylabel('True'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['Churning', 'Existing']); ax.yaxis.set_ticklabels(['Churning', 'Existing']);

specificity=cm[0][0]/sum(cm[0])

# Print the results
print(metrics.classification_report(y_test, xgb.predict(x_test), labels=[1,0]))
print(f"XGBoost : accuracy {score:.2%}")
print(f"XGBoost : cross validation score {cv:.2%}")
print(f"XGBoost : AUC ROC {auc_roc:.2}")
print(f"XGBoost : recall {recall:.2%}")
print(f"XGBoost : specificity  {specificity:.2%}")

# Feature importance
features = list(df.drop('churn', axis=1).columns)
number_of_features = 15 # Top fifteen features
importances = xgb.feature_importances_
index = np.argsort(importances)

ten_features = np.array(features)[index][-number_of_features:]
imp_values = importances[index][-number_of_features:]
y_ticks = np.arange(0, number_of_features)
fig, ax = plt.subplots()
ax.barh(y_ticks, imp_values)
ax.set_yticks(y_ticks)
ax.set_yticklabels(ten_features)
ax.set_title("XGBoost Feature Importances")
fig.tight_layout()
plt.show()
