Probabilistic graphical models in customer analytics: comparison with classical predictive models
This paper applies several methods in order to build a comprehensive view of the customer churn problem. It identifies which algorithms can be used not only for churn prediction but also for churn prevention.
Subject | Management and labour relations |
Type | Diploma thesis |
Language | English |
Date added | 25.08.2020 |
File size | 1.3 MB |
Other
There is a pool of research that was not included in the previous sections; some of it is reviewed here. A large study was based on eleven datasets from several telecom companies worldwide (Verbeke et al., 2012). The number of observations in these datasets ranged from approximately 2,000 to almost 339,000. The train/test split was the same for all datasets: 67% for the training set and 33% for the testing set. The number of features ranged from 15 to 727. The authors used a considerable number of methods, from classical to more complicated and modern ones: rule-based classifiers, decision tree methods, neural networks, nearest-neighbour methods, Naive Bayes, Bayesian Belief Networks and a simple logistic regression model. As a result, the Alternating Decision Tree was found to perform best; nevertheless, some other methods were not far behind in terms of accuracy and overall performance.
In companies which predominantly provide services, especially airline companies, it is highly important to maintain the key aspects of customer satisfaction, identify those who are going to leave and prevent them from switching to another company. Logistic regression once again proves its applicability for customer churn prediction by dividing all the clients of an airline company into four groups, setting different patterns of the churning process, including the speed of churn, and analysing the customers' tendency to stop being loyal (Hu, 2019). Smart companies not only aim to identify churners when they are about to quit, but also take proactive measures to explore customers' needs and wants before the customers even think about them. Thus, it is quite important to do marketing research to understand the target audience's preferences and prevent them from churning long before the problem actually occurs (Bharadwaj et al., 2018). Along with the common logistic regression model, which showed 87.52% accuracy, a Multilayer Perceptron (MLP) Neural Network with a 0.01 learning rate was built, with an improved accuracy of more than 94%. This model is promising for companies that want to know their customers in detail and offer the things they truly appreciate and need, and thus to establish stable brand loyalty.
Identifying potential churners is a big deal; however, it is no less important to remember the company's key performance indicators, such as profit (Stripling et al., 2015). The logistic regression model becomes more valuable after adding the expected maximum profit metric. It has been shown that the new model, called ProfLogit, performs better than the original logistic regression model and helps increase profit on several real datasets containing variables from the cost of an offer to customer lifetime value. The telecommunication sector, more than most, needs advanced classifiers to become more profitable and sustainable.
Another interesting study was done by Zhang (2007) to test a hybrid k-nearest neighbour-logistic regression (KNN-LR) classifier in several fields simultaneously. Using four datasets, the research aimed to predict whether yearly income exceeds $50,000, whether a credit request is granted, and whether a sample from a patient's breast is malignant or benign. On all datasets, the KNN-LR model outperforms simple LR; the C4.5 decision tree outperforms KNN-LR on only one dataset. The approach of creating hybrid models is often found in papers on churn prediction. Frequently, the hybrid models outperform the original ones, such as k-means, the C4.5 decision tree, logistic regression, KNN, etc. (Huang & Kechadi, 2013). The dataset consisted of 104,000 records about telecom customers, including demographics, account and call information, with approximately a 5% churn rate and 121 features. The hybrid model-based classification learning system combines weighted k-means clustering with a FOIL classification method. The final model was produced by averaging the results over five validation sets, equally distributed parts of the initial dataset.
2.3 Graphical models for predictive analysis: Bayesian Belief Networks
Graphical models for predictive analysis, particularly Bayesian networks, make it possible to identify each influential variable that can turn a churner into a non-churner. Frequently, it turns out that price is not the only, or the optimal, lever to pull to keep a customer from churning. That is an interesting result coming from several Bayesian Belief Network studies. Many studies on customer churn prediction are based on machine learning techniques; however, Bayesian Belief Networks (BBN) proved to be no less useful in such predictions (Kisioglu & Topcu, 2011). A Turkish telecommunication company's data was used to conduct this research. It contained information about 2,000 subscribers, of whom approximately 25% were churners, gathered over a period of six months. After data processing, 9 out of 23 variables were kept for model construction. The dataset comprises such variables as age, tenure, average billing amount, average minutes of usage, and average frequency of usage. Continuous variables were discretized in order to use the Bayesian Belief Network. Correlation and collinearity were tested before, and sensitivity was analysed after, BBN execution. The results were presented as three scenarios. The first concerns those with the highest probability of churning: they have a below-average number of minutes of usage and a descending billing trend with short tenure. The second scenario deals with subscribers with low frequency of usage, who also have a higher probability of churning. The third scenario considers the influence of tariff type on churn: a subscriber using the right tariff has less chance of churning. The authors believe that a good promotion in all three scenarios will help decrease churning behaviour and retain customers.
Another industry which needs customer churn analysis is transportation, especially airline transportation. What is important when using a Bayesian Belief Network is that the more statistical information is gathered on the topic, the higher the probability of obtaining reasonable results (Chen et al., 2017). This research is based on the China Southern Airlines database of real customers. The dataset contains 35 variables covering personal data, behaviour patterns, preferences and perceptions of the customers. The results have shown that passengers pay the most attention to the services provided, delay notifications, and common and vacation destinations. Frequent users need product recommendations, links and the introduction of new routes. Those with higher status are more loyal to their choices of flights and airlines no matter which services they actually receive. As found in the literature review, the Bayesian Belief Network was also a promising method for studying customer satisfaction in railway transport (Chakraborty et al., 2016), because it can thoroughly demonstrate the relationships among customer satisfaction factors. The dataset was gathered from a large public transportation company in Australia. All variables were separated into two groups with direct and indirect impact on the travelling experience. The first group included such blocks as transportation facility, station facility, operation information and others; the second considered passenger and service factors. Building several scenarios, the authors concluded that any change in a node influences all following child nodes, whereas parallel nodes are not affected. This analysis of node changes makes the BBN an important tool for supporting decision making. For example, it was found that station equipment and comfort have the greatest influence on the first group's node, the journey components, whereas the latter has the biggest impact on overall satisfaction.
Of course, customer analytics is not the only field where the Bayesian network approach is applied. Before launching a new product, extending an existing one or discontinuing a product, a special analysis should be made in order to understand the chances of the product's success or failure. A conditional Bayesian network (CBN) was used to identify the failure rate of a product (Cai et al., 2011). The relationship between product variables and features and the target variable, failure rate, was investigated through the construction of the CBN. This model proved to be a powerful tool to predict the rate of product failure: after a case study, comparison with the decision tree method showed that the CBN model has higher classification accuracy and lower structural complexity. Predicting product success is important; however, the location of the company's facilities also plays a big role in the organisation's overall results. The Bayesian network method is useful in such questions as well and can be complemented by other tools, for example, total cost of ownership (Dogan, 2012). In this particular paper, a hybrid model comprises both qualitative and quantitative characteristics, where the Bayesian Belief Network approach brings causal links and understandable, clear relationships. The dataset contains different costs, such as investment costs which are unique to the location, fixed and variable costs related to the life cycle of the manufacturing facilities, as well as different external factors. The author deals with the uncertainty issue by combining the clear model structure of total cost of ownership (TCO) with the completeness of BN causal relations. The approach is presented in four steps: find the factors significant for the location decision, define the structure of the network by building the causal relationships, quantify the probabilities and, finally, make the decision based on total costs.
To reduce the model's susceptibility to environmental uncertainty, historical data were assessed along with expert judgements. However, this vulnerable aspect still needs further work.
Even the aeronautic industry is not left aside. Bayesian networks are also used to support maintenance decisions on aeroplanes and to reduce the costs of these actions (Ferreiro et al., 2012). Both saving costs on repairs and ensuring safety are target goals of an airline, which is why it is important to have a reliable model to predict brake wear. When creating the new maintenance model, the following cost components were taken into account: facilities, equipment and its testing, organisational processes, engineering, supervision, tooling, check-ins, logistics, data and record keeping. As a result, the model reduced delays by adding assessing and planning stages to the reporting and diagnosis stages, instead of mere preparation, so that the fixing stage could be scheduled outside prime time. Thus, the article suggests a useful tool for flexible and comprehensive planning that supports decision making and eliminates disruptions. This approach allows proactive actions to save time and costs, instead of reactive responses to the uncertainty and unplanned circumstances caused by delays and other disorders.
Studies exploring customer satisfaction are quite popular, and it is quite natural that more and more approaches are being tested to predict and understand this crucial key performance indicator. An interesting study was performed on a dataset covering fourteen European Union countries, exploring customer satisfaction with railway transport (Perucca & Salini, 2014). The data represent the results of a survey with more than 17,000 observations. The performance of two methods, logistic regression and Bayesian Belief Network, was compared, as well as the cases of their application. The results show that the Bayesian network approach gives more precise predictions than logistic regression. Another argument for the BN model is that it demonstrates the causal relationships between the variables. The Bayesian Belief Network model, in contrast to the regression model, confirmed the relation between satisfaction and attitude to railway transport, while refuting the link between personal characteristics and overall satisfaction with railway transport. The BN made it possible to find an intermediate factor influencing customer satisfaction and to show that there is no direct impact of personal characteristics on the level of passengers' contentment. To sum up, this study praises the BN model for its ability both to predict and to explain customers' behaviour.
A kind of breakthrough was achieved by the Korean researchers Lee & Jo (2010), who investigated not customer churn behaviour prediction but churn motivation. For this study, the telecom industry was used as the source of a dataset of clients' personal and behavioural information, and four types of Bayesian network classifiers were used as the methods. The dataset contains almost 5,000 observations and 14 different variables. The variables describe both personal and behavioural characteristics: age, device maker, service grade, payment method, number and frequency of calls, some loyalty variables and, logically, churn motivation itself. The study showed that BN classifiers can be suggested as a helpful tool to predict churning motivation and to build sound decisions on the analysis results.
Probabilistic models can perform on different tasks somewhere in between traditional methods and machine learning algorithms, but their interpretability in practice is less clear, while interpretability is quite important for the task of churn prevention. Thus, the methods listed in the literature review can be compared not only by the accuracy of the created models but also by interpretability, which allows the findings to be used in the future development of churn prevention strategies. As we can see from the review, machine learning algorithms are the best in predictive power. Traditional methods give accuracy comparable to machine learning algorithms, while additionally providing some explanation of the direct relationships between the variables. Bayesian networks have both predictive power and a visual structure of dependencies between variables, which allows us to see what actions can be taken to prevent churn. That is why, for comparison in this paper, we have chosen logistic regression as a traditional method, the random forest and XGBoost algorithms as representatives of machine learning, and Bayesian networks as a graphical tool.
3. Research design and methodology
3.1 Methods
For our research we have chosen four methods of analysis: Logistic Regression, Random Forest, Boosting and Bayesian Belief Networks. Nearly all of the articles listed in our literature review compare methods between themselves, and based on their results we chose the most accurate methods in order to see which of them can predict better and give actionable results. We use Logistic Regression as a representative of classical methods of predictive analysis, Random Forest and eXtreme Gradient Boosting as two examples of machine learning algorithms, and Bayesian Belief Networks as a graphical model example. We chose Random Forest and Boosting as the ML algorithms for churn prediction based on the literature review, where articles comparing different ML methods showed that these two gave the best results.
Logistic regression is a classification algorithm. Generally, logistic regression is well suited for checking hypotheses about relationships between a categorical outcome variable and one or more categorical or continuous predictor variables (Peng et al., 2002). Basic logistic regression is similar to linear regression in terms of modelling, except that the dependent variable in logistic regression is categorical with two levels, and the shape of the relationship is different (Bewick, Cheek, & Ball, 2005). The dependent variable, Y, can take only two discrete (binary) values, for example "Yes" or "No", or "True" or "False". Moreover, besides basic (binary) logistic regression there is multinomial logistic regression, where the dependent variable can have more than two values (Mood, 2010).
To choose the significant variables for predicting the dependent variable in logistic regression, we look at the significance of each variable. In R we look at the value in the "Pr(>|z|)" column and compare it with the chosen significance level, usually 5%. All variables whose value is lower than the chosen significance level are significant in the model. To see the independent variable with the strongest relation to the probability of the dependent variable Y, we look at the smallest p-value, and to interpret the relation between independent and dependent variables we look at the "Estimate" column. In our case, we predict whether a given customer will leave the company within a certain period or not, so our dependent variable will be "Churn" or "Non-churn". To make predictions, the data are divided into two parts: train and test samples. We use the train dataset to build the model, then feed the test data into the model to assess its quality and see how good its predictive power is. After getting the accuracy results from all the models, we compare them with each other and see which type of model is the best one for prediction.
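The train/test workflow described above can be sketched in a few lines. The thesis performs this in R (the `glm` function from "stats"); below is an illustrative Python/scikit-learn sketch on synthetic data standing in for the Telco set, with a made-up churn rule, not the thesis's actual model.

```python
# Minimal sketch of binary logistic regression for churn prediction.
# Synthetic data: churn is made more likely for short tenure and high
# monthly charges (a toy assumption for illustration only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
tenure = rng.uniform(0, 72, n)            # months with the company
monthly = rng.uniform(20, 110, n)         # monthly charges
logit = 1.5 - 0.08 * tenure + 0.02 * monthly
churn = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([tenure, monthly])
# 67/33 split, as in the Verbeke et al. (2012) study cited above.
X_tr, X_te, y_tr, y_te = train_test_split(X, churn,
                                          test_size=0.33, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, model.predict(X_te))
print(f"test accuracy: {acc:.3f}")
# A negative tenure coefficient means longer-tenured customers churn less.
print("coefficients (tenure, monthly):", model.coef_[0])
```

In R the analogous call would be `glm(Churn ~ ., family = binomial, data = train)`, with significance read from the `summary()` output.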
Random Forest uses a divide-and-conquer approach (Chen, Yang, & Lin, 2018). Each tree is trained on a random subset of the attributes, and based on this subcollection of features each tree grows until it reaches its maximum level. The final ensemble of decision trees is then used for making predictions on the test dataset. One of the strengths of this method is the ability to perform well on big datasets or datasets with many missing values, without deleting observations. Random forest is a method in which many ordinary classification trees are grown. To make a prediction and label the target variable with a class, the input vector of values is passed down each tree in the forest. Each tree gives a prediction and decides on a label; the forest then chooses the label with the largest number of votes. All the trees operate as an ensemble, and this works well because of the wisdom of crowds: since many models (trees) operate together, the decision of many can work better than the decision of only one. Also, the individual trees have low correlation with each other, which helps the algorithm make very accurate predictions: the trees protect each other from individual errors, and even if some trees give a wrong label, others will be right. Therefore, the power of the random forest algorithm lies in the set of uncorrelated trees, the forest, and in good feature selection for predicting the target variable.
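The majority-voting mechanism and the per-variable importance scores mentioned above can be made concrete. The thesis uses the R package "randomForest"; this is an illustrative scikit-learn sketch on synthetic data, where one feature (tenure) fully determines churn and the other is pure noise.

```python
# Sketch of random-forest majority voting and feature importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 800
tenure = rng.uniform(0, 72, n)
noise = rng.normal(size=n)                 # an uninformative feature
churn = (tenure < 15).astype(int)          # toy rule: short tenure churns

X = np.column_stack([tenure, noise])
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, churn)

# Each tree casts a vote for the first customer; the forest takes the majority.
votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_])
print("first 5 tree votes:", votes[:5])
print("forest prediction: ", forest.predict(X[:1])[0])
# The informative feature should receive almost all of the importance.
print("importances (tenure, noise):", forest.feature_importances_)
```

This variable-importance output is what the paper later relies on when Random Forest "shows the importance of each variable" for the prediction task.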
There are different boosting algorithms for making predictions. Boosting is a type of algorithm where each model builds on previous experience: the results always depend on the previous models' predictions. Boosting models are built sequentially by minimizing the errors of the previously built models, with increased influence given to well-performing models. Gradient boosting is a more proficient type of boosting which employs the gradient descent algorithm to minimize errors in the sequential models. The XGBoost algorithm is based on the Gradient Boosted Decision Tree (GBDT). XGBoost is a decision-tree-based algorithm which uses the gradient boosting framework, handles missing values, uses parallel processing and, to avoid overfitting, regularization (Si, Zhang, Keerthi, Mahajan, Dhillon, & Hsieh, 2017). It can be used for different problems: classification, regression, etc. One of the most distinctive features of XGBoost is its efficient handling of sparse data and its suitability for large-scale datasets. Why does XGBoost work better than other boosting algorithms? Like gradient boosting, this method boosts weak learners and applies optimization and algorithmic improvements. As was said before, XGBoost does sequential tree building and uses a parallelized implementation for it. The algorithm prunes trees backwards (depth-first), which improves its computational performance along with use of the "max_depth" parameter. One important note: since it optimizes the computation and the available memory, the algorithm is also efficient in its use of hardware resources. Another thing that distinguishes it from other algorithms is regularization: there are two types of regularization that prevent models from overfitting and can be applied when building the model, LASSO and Ridge (McNeish, 2015). Overfitting is a common issue for machine learning algorithms, and avoiding it allows models to make better predictions.
Also, this type of algorithm uses built-in cross-validation to select the optimal number of boosting iterations.
Generally, a Bayesian network is a graphical probabilistic model that represents a set of variables and their probabilistic dependencies. The mathematical apparatus of Bayesian Belief Networks (BBN, or simply BN) was created by the American scientist Judea Pearl, winner of the Turing Award (2011). Formally, a BN is a directed acyclic graph with a random variable corresponding to each vertex, and the arcs of the graph encode conditional independence relations between these variables. In other words, it is a directed graph without directed cycles, in which the vertices correspond to variables in the distribution and the edges connect "related" variables. Vertices can represent any type of variable: weighted parameters, hidden variables, or hypotheses. There are efficient methods for computing with and training Bayesian networks. If the variables of a BN are discrete random variables, such a network is called a discrete Bayesian network. BNs that model sequences of variables are called dynamic Bayesian networks. BNs that can contain both discrete and continuous variables are called hybrid Bayesian networks. A BN in which the arcs encode not only conditional independence relations but also causality relations is called a causal Bayesian network.
In order to really get valuable information from data, we look at the nodes and edges. There is the serial relationship, where one node affects a second, and the second affects a third. A serial relationship between three variables tells us that the outer variables are conditionally independent given the middle variable. The next possible option is a diverging relationship, where one variable affects both the second and the third. As in the previous case, the second and third variables are related only through the first one. In this manner, the diverging relationship between three variables tells us that the "effects" are conditionally independent given their "common cause": if the cause is known, the effects become independent; as long as the cause is unknown, the effects are connected through it. There is a third type of relation between three variables, the converging relationship, where the first and second variables together affect the third. Thus, in a converging relationship the two "causes" (the first and second variables) are independent, but only as long as the value of their "common effect" is unknown; once it becomes known, the causes become dependent.
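The serial (chain) case above can be verified numerically. The sketch below builds the joint distribution of a toy chain A -> B -> C from made-up conditional probability tables (all numbers invented for illustration; the thesis's own BN work is done in R with "bnlearn" and "gRain") and checks both claims: A and C are dependent marginally, but independent once B is known.

```python
# Numeric check of conditional independence in a chain A -> B -> C.
import itertools

P_A = {0: 0.6, 1: 0.4}
P_B_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(B|A)
P_C_given_B = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}   # P(C|B)

# Joint factorises as P(A) * P(B|A) * P(C|B) -- the chain structure.
joint = {(a, b, c): P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]
         for a, b, c in itertools.product([0, 1], repeat=3)}

def p(cond):
    """Total probability of all outcomes matching a partial assignment
    {position: value}, positions 0=A, 1=B, 2=C."""
    return sum(v for k, v in joint.items()
               if all(k[i] == x for i, x in cond.items()))

# Marginally, P(C=1 | A) changes with A: the chain transmits information.
print("P(C=1|A=0) =", p({0: 0, 2: 1}) / p({0: 0}))   # 0.36
print("P(C=1|A=1) =", p({0: 1, 2: 1}) / p({0: 1}))   # 0.78

# Given B, A adds nothing: P(C=1 | A, B=1) is identical for both values of A.
print("P(C=1|A=0,B=1) =", p({0: 0, 1: 1, 2: 1}) / p({0: 0, 1: 1}))  # 0.9
print("P(C=1|A=1,B=1) =", p({0: 1, 1: 1, 2: 1}) / p({0: 1, 1: 1}))  # 0.9
```

The diverging and converging cases can be checked the same way by refactoring the joint as P(B)·P(A|B)·P(C|B) or P(A)·P(C)·P(B|A,C) respectively.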
For performing the analysis, data cleaning, and model creation, the RStudio environment was used. All the functions used were taken from the public packages listed below; package versions are current as of 18 April 2020. The list of packages includes "dplyr" (data preprocessing), "caret", "car" (prediction results), "ggplot", "sjPlot", "sjmisc", "lme4", "sjlabelled", "Rgraphviz", "corrplot" and "ROCR" (data visualization), "bnlearn" and "gRain" (Bayesian network analysis), "stats" and "MASS" (logistic regression), "randomForest" and "xgboost" (Random Forest and eXtreme Gradient Boosting respectively).
3.2 Data
For our paper we have chosen the IBM public dataset from the Kaggle platform (the data are available at: https://www.kaggle.com/blastchar/telco-customer-churn). This dataset contains information about Telco customers and whether they left the company within the last month (churn). The data can be analysed and used to develop customer-focused retention programs. The dataset consists of 7043 observations and 21 variables. Each row represents a unique customer, while the columns (variables) contain information about the customer's services, account, and demographic data (Table 1 in Appendix).
All the variables, except for the “churn” variable, were divided into 5 groups:
1. Socio-demographic variables:
a. “Gender” shows whether the customer is male or female.
b. “SeniorCitizen” represents whether the customer is a senior citizen or not.
c. “Partner” highlights whether the person is in a relationship or not.
d. “Dependents” shows whether the customer has any dependents (children, senior parents, etc.).
2. Monetary variables:
a. “MonthlyCharges” gives the amount charged to the customer monthly.
b. “TotalCharges” indicates the total amount charged to the customer over the whole duration of service usage.
3. Phone variables:
a. “PhoneService” shows whether the customer uses the phone service or not.
b. “InternetService” provides information about the customer's usage of internet services and, if the person uses the internet, the provider type.
c. “MultipleLines” indicates whether the customer has multiple lines or not.
d. “PaperlessBilling” indicates whether the person has paperless billing or not.
e. “PaymentMethod” shows how the customer pays for the service (electronic or mailed check, bank transfer, credit card).
4. Specific Internet variables:
a. “OnlineSecurity” gives information about the customer's online protection: whether they use software that instantly blocks harmful and phishing websites.
b. “OnlineBackup” shows whether the customer's data are stored in the cloud or not.
c. “DeviceProtection” indicates whether the person uses security measures that protect the phone with anti-malware protection, location tracking, and blocking of a stolen device.
d. “TechSupport” tells whether the customer uses the tech support service.
e. “StreamingTV” indicates whether the customer has streaming TV or not.
f. “StreamingMovies” shows whether the customer has streaming movies or not.
5. Time variables:
a. “Tenure” gives the number of months the person has used the company's services.
b. “Contract” gives information about the type of the customer's contract and its duration.
The «Churn» variable contains information about whether the user left the company in the last month; it is used to determine the accuracy of the models and check their performance.
In this paper, we compare methods for two tasks: churn prediction and churn prevention. For the prediction task we compared methods based on the predictive power of the created models, so the main value for model comparison is accuracy. For the prevention task we decided to look at what actions can be taken in order to decrease the company's customer churn. Comparing the chosen methods, we can highlight the strengths and weaknesses of each. XGBoost and Random Forest, as representatives of machine learning algorithms, should give us high accuracy, which makes them well suited to the churn prediction task. Additionally, Random Forest shows the importance of each variable in the fitted model.
Besides, as mentioned before, we focus not only on the task of churn prediction but also on churn prevention. For the prevention task, logistic regression could have been a good tool, but it only shows the direct relationships between the independent and dependent variables. Indirect connections can be seen through an interaction effect, but this must be done manually and still allows the analysis to go only one level deeper into the relations. For Bayesian networks, the model's structure is formed automatically: we can build a network and see the connections between the independent variables. Besides, a Bayesian network visualizes the dependencies between the variables, with the opportunity to check how a change in one variable changes the probability of a particular level of another variable. This high level of interpretability allows future churn prevention strategies to be developed.
4. Analysis and results
4.1 Case descriptive statistics
Firstly, we look at the presence and number of missing values. In the data we have 11 missing values in the TotalCharges column; probably this means that these people simply did not pay for any of the services and ended up in the customer database by accident. Another explanation can be database filling mistakes: someone just accidentally forgot to fill in some fields. These values can be replaced with 0, or these customers can simply be deleted. We decided to delete all these customers in order to avoid errors in prediction due to manual corrections to the database. Our final dataset contains information about 7032 customers and their 21 attributes.
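The cleaning step above can be sketched as follows. The thesis does this in R with "dplyr"; here is an illustrative pandas version on a tiny made-up frame shaped like the Telco data (the column name "TotalCharges" matches the Kaggle file; the sample rows are invented).

```python
# Sketch of dropping customers with a missing TotalCharges value
# instead of imputing 0 (mirroring the choice made in the thesis,
# where 11 of 7043 rows were removed).
import pandas as pd

df = pd.DataFrame({
    "customerID":   ["a", "b", "c", "d"],
    "tenure":       [0, 12, 0, 30],
    "TotalCharges": [None, 350.5, None, 912.0],   # blanks for tenure-0 rows
})

clean = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(len(df) - len(clean), "rows removed;", len(clean), "rows kept")
```

Note that in the raw Kaggle CSV the blanks are empty strings, so `TotalCharges` must first be coerced to numeric (e.g. `pd.to_numeric(..., errors="coerce")`) before `dropna` sees them as missing.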
In this paper we predict whether the customer left in the previous month or not. To estimate model accuracy, we compare our predicted results with the real results from the «Churn» column. In our data we have 5163 non-churners and 1869 churners; in percentages, 73.4% are non-churners, while 26.6% are churners (Figure 2). As we can see from our data, around 27% of customers left the company in the previous month. The data contain 3 numeric variables: Monthly Charges, Total Charges, and tenure; the first two are monetary variables, and tenure is the time variable.
Figure 2. The bar plot showing the proportion of churners and non-churners in the dataset
Time variables
Clients who stayed with the company a rather small number of months are more likely to churn, and their tenure level is around 10 (Figure 3). This means the median number of months for those who left the company within the month is 10, while for those who did not churn in the previous month the median tenure is around 40. This tells us that clients who signed up for the service recently are more likely to churn. More than half of the customers (55%) prefer the month-to-month type of contract, and among churners the share of the month-to-month type exceeds 89%. Most of those with long-term contracts are non-churners. Thus, it can be concluded that those who churned tended to choose month-to-month contracts.
Figure 3. The violin plot showing two-sided specular distribution density for the tenure variables. In the middle of each violin plot, the median is placed (the red point). The bar plot showing the proportion of churners and non-churners, where colours represent levels of Contract variables
Monetary variables
The median monthly charge is higher for churners than for non-churners: 79.6 versus 64.4, respectively (Figure 4). Customers who churned in the previous month thus have a higher median monthly charge than those who did not. The shape of the total charges distribution is similar for churners and non-churners, although the level of total charges is lower for churners. The tenure and total charges variables are probably correlated, since those who spend more months with the company have cumulatively paid more for its services.
Figure 4. The violin plots showing two-sided specular distribution density for the numeric variables. In the middle of each violin plot, the median is placed (the red point)
Socio-demographic variables
The Telco dataset has 16 categorical attributes; Contract is one of them and was already discussed above. Another categorical variable is gender. Males and females are distributed almost equally among churners and non-churners (Figure 5): there are 2544 female and 2619 male non-churners, and 939 female and 930 male churners. This suggests that gender is not an important predictor of churn. Most customers in the Telco dataset are not senior citizens: 5890 out of 7032, or 84%, meaning only 16% are senior citizens. However, the churn rate for senior citizens is almost twice as high as for non-senior ones (42% versus 24%). Among non-churners there is virtually no difference between those with and without a partner (53% vs 47%), whereas among churners the share of those without a partner (64%) is higher than the share of those with one. Thus, it can be concluded that customers with partners have a lower churn rate. As for the Dependents variable, there are more customers without dependents (70%) than with them. Figure 5 shows that people with dependents have a lower churn rate compared to those without any dependents (18% in comparison to 46%). Thus, people without dependents churned more frequently than people with dependents in the previous month.
Figure 5. The bar plots showing the proportion of churners and non-churners, where colours represent levels of Gender, Senior Citizen, Partner and Dependents variables
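The group-wise comparisons above (for example, senior versus non-senior churn rates) boil down to row-normalised cross-tabulations. A minimal Python sketch with invented counts, not the real Telco figures:

```python
import pandas as pd

# Toy sample: SeniorCitizen flag and churn outcome (illustrative values only).
toy = pd.DataFrame({
    "SeniorCitizen": [1, 1, 0, 0, 0, 0, 1, 0],
    "Churn":         ["Yes", "No", "No", "No", "Yes", "No", "Yes", "No"],
})

# Row-normalised cross-tabulation: each row sums to 1, so the "Yes" column
# is the churn rate within that group.
rate = pd.crosstab(toy["SeniorCitizen"], toy["Churn"], normalize="index")

senior_rate = rate.loc[1, "Yes"]      # churn rate among seniors
nonsenior_rate = rate.loc[0, "Yes"]   # churn rate among non-seniors
```

The same pattern applies to Partner, Dependents, and the other categorical attributes discussed in this section.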
Phone variables
There is no big difference in phone service usage between churners and non-churners (Figure 6), and the share of those who do not use the phone service is small in both groups (no more than 10%). For the Multiple Lines variable the distribution is slightly different: the share of customers with multiple lines is higher among churners than among non-churners (46% vs 41%), so clients with multiple lines have a slightly higher churn rate. For the Payment Method variable, most of the Telco customers prefer the electronic check (34% of all customers), while the other payment methods are distributed almost equally. Churners clearly prefer the electronic check as a payment type (57%). Moreover, less than 17% of the customers who churned in the previous month used mailed checks, while the other 83% used electronic checks or automated payments. For those who stayed with the company, all payment methods are distributed equally. As for the Internet Service variable, those who do not have any Internet services are less likely to churn.
Figure 6. The bar plots showing the proportion of churners and non-churners, where colours represent levels of Phone Service, Multiple Lines, Internet Service and Payment Method variable
Specific Internet variables
As for the specific Internet services, clients with the Fiber Optic type of provider are more likely to churn: 69% of churners use Fiber Optic compared to 35% of non-churners (Figure 7). Those who use the DSL type of provider are less likely to churn. Hence, customers with Fiber Optic Internet are more likely to churn, whereas for DSL users the churn rate is much lower. The other services (Online Security, Online Backup, Device Protection, Tech Support, Streaming TV and Streaming Movies) can also be predictors of churn. All six of these variables depend on the Internet Service variable, because a client without Internet cannot have any of these services. The plots in Figure 7 reflect similar trends: those who did not churn in the previous month use the additional services roughly as often as not. For churners the split is not equal for Online Security, Online Backup, Device Protection and Tech Support: churners rarely used any of these additional services. Thus, those who had these additional services have a lower chance of churning.
Figure 7. The bar plots showing the proportion of churners and non-churners, where colours represent levels of Online Security, Online Backup, Device Protection and Tech Support variables
Correlation between numeric variables
In order to check the relationships between the numeric variables, a correlation analysis was performed (Figure 8). According to its results, the Total Charges variable is strongly and significantly related to tenure, which means that the more months people spend with the company, the more money they spend in total. Total Charges is also strongly related to Monthly Charges: the more people spend on the services each month, the higher the total charges. Additionally, tenure is significantly correlated with Monthly Charges, which suggests that the more months people spend with the company, the higher their monthly charges become.
Figure 8. The correlation plot showing the correlation between numeric variables
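The analysis behind Figure 8 is a standard Pearson correlation matrix over the three numeric columns. A small Python sketch on synthetic data that mimics the tenure/total-charges relationship (the generated numbers are assumptions, not the Telco values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic analogues of the three numeric Telco variables.
tenure = rng.integers(1, 72, size=200)            # months with the company
monthly = rng.uniform(20.0, 110.0, size=200)      # monthly charge
total = tenure * monthly                          # charges accumulate over tenure

num = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": total,
})

# Pairwise Pearson correlations, as visualised in the correlation plot.
corr = num.corr()
```

By construction, tenure and TotalCharges come out strongly positively correlated, reproducing the pattern reported in the text.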
4.2 Logistic Regression, Random Forest and XGBoost
As was written above, the Telco data contain some highly correlated variables. In particular, the variables Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, and Streaming Movies are directly related to Internet Service. When Internet Service equals "no", all six of the above variables take the same value, indicating that the subscriber does not have the Internet. To satisfy the model assumptions, strongly correlated variables should not be used together in a regression analysis. Moreover, it makes little sense to include six group-specific variables in the general model, since the results could then be extended not to the entire sample but only to a part of it. That is why we created a shorter dataset that includes only the observations where the Internet Service variable is not equal to "no". As a result, the new dataset consists of 5512 observations with the same 20 variables (the customer ID variable was deleted due to its lack of predictive ability).
After creating this additional data frame, both it and the main data were randomly divided into training and test sets. Initially, the training data included 70% of observations and the test data 30% for both datasets. However, after building the models and searching for the most suitable split, the proportions were changed to 68% and 32% (4782 and 2250 observations, respectively) for the general dataset and 63% and 37% (3473 and 2039 observations, respectively) for the dataset containing only observations where Internet Service takes values other than "no". All the model results and predictions below are based on this split. These proportions were chosen so that the specificity and sensitivity indicators were close to the same value (Figure 9). When making predictions we mostly look at accuracy, namely the ratio of correctly predicted observations to the total number of observations. However, it is important to also consider sensitivity and specificity, which give a more detailed picture of the prediction and help to choose the optimal cut-off. The sensitivity of the test reflects the probability that the churn prediction is positive among the observations that actually churned (the churn variable equals "yes"). In contrast, the specificity of the test reflects the probability that the churn prediction is negative among the observations that did not in fact churn (the churn variable equals "no"). In other words, sensitivity is the number of true positive assessments divided by the number of all positive cases, and specificity is the number of true negative assessments divided by the number of all negative cases.
Figure 9. The optimal cut-off for the general dataset and dataset with only positive Internet Service observations. The optimal cut-off is at the intersection of specificity and sensitivity lines.
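The cut-off search illustrated in Figure 9 can be sketched as a scan over thresholds, keeping the one where sensitivity and specificity are closest to each other. The scores below are toy values, not the real model output:

```python
import numpy as np

# Toy true labels (1 = churn) and predicted churn probabilities.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
p_hat  = np.array([0.90, 0.70, 0.40, 0.30, 0.20, 0.10, 0.30, 0.80, 0.15, 0.05])

def sens_spec(y, p, cut):
    """Sensitivity and specificity at a given probability cut-off."""
    pred = (p >= cut).astype(int)
    sens = ((pred == 1) & (y == 1)).sum() / (y == 1).sum()  # true-positive rate
    spec = ((pred == 0) & (y == 0)).sum() / (y == 0).sum()  # true-negative rate
    return sens, spec

# Scan a grid of cut-offs and pick the one where the two curves intersect,
# i.e. where |sensitivity - specificity| is smallest.
cuts = np.linspace(0.01, 0.99, 99)
gaps = [abs(s - sp) for s, sp in (sens_spec(y_true, p_hat, c) for c in cuts)]
best_cut = float(cuts[int(np.argmin(gaps))])
```

On the real data the same scan is what produces the intersection point shown in Figure 9.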
The results of the best logistic regression model for the general dataset are presented in Table 2. Initially, all variables except the six Internet-specific ones were included in the model. Then, using the stepAIC function, the model with the lowest Akaike information criterion (AIC) was found. AIC is the main quality indicator for logistic regression models and is used to compare different models and select the best one. It is based on the maximum likelihood method and selects the model so that the risks of both overfitting and underfitting are minimised. The AIC of the best model on the general data turned out to be 4086. It includes the following 9 independent variables: Senior Citizen, Dependents, Tenure, Phone Service, Contract, Paperless Billing, Payment Method, Monthly Charges and Total Charges. The dependent variable in all models was churn. However, after checking for multicollinearity, one of the variables was removed from the model (Table 3).
Table 2. Results of the best logistic regression model for the general dataset, where the dependent variable is churn
Note: The first model shows the best model according to the AIC. The second model is the model that was used in the analysis due to multicollinearity in the first model and better AIC than in the third model.
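For reference, AIC combines model fit and complexity as AIC = 2k − 2 ln L, where k is the number of estimated parameters and ln L the maximised log-likelihood; lower is better. A short Python sketch (the numbers below are illustrative, not taken from the fitted model):

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*ln(L); lower values are better."""
    return 2 * n_params - 2 * log_likelihood

# Illustrative example: a model with 13 estimated parameters and a
# log-likelihood of -2030 (hypothetical values) yields an AIC of 4086.
example = aic(-2030.0, 13)
```

This is the quantity that stepAIC minimises when it walks through candidate models.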
Multicollinearity appears when there is collinearity between two or more variables. It can be evaluated by calculating the variance inflation factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model. In the absence of multicollinearity, the VIF takes its smallest possible value, one. As a rule of thumb, a VIF value exceeding 5 indicates a problematic amount of collinearity (James et al., 2014). In our case, there is strong collinearity between two variables: tenure and total charges. This can be explained by the strong correlation between them: the more months a customer stays with the company, the higher the total amount of money charged to that customer for the whole duration of service usage. When faced with multicollinearity, one of the collinear variables should be removed, since its presence implies that the information these variables provide about the response is redundant given the other variables (James et al., 2014; Bruce & Bruce, 2017). Both correlated variables were removed one at a time and the resulting models were compared. According to the results (Table 2), the AIC of the model without the monetary variable was lower, which means this model is better than the one without the tenure variable. We used this best model in the further analysis and churn prediction.
Table 3. VIF output for the best logistic regression models for the general dataset

| Variables | GVIF for Model 1 | GVIF for Model 2 |
|---|---|---|
| Senior Citizen | 1.10 | 1.09 |
| Dependents | 1.05 | 1.05 |
| Tenure | 14.64 | 1.82 |
| Phone Service | 1.32 | 1.30 |
| Contract | 1.44 | 1.41 |
| Paperless Billing | 1.10 | 1.10 |
| Payment Method | 1.34 | 1.31 |
| Monthly Charges | 2.93 | 1.92 |
| Total Charges | 18.25 | - |
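The VIF values in Table 3 can be reproduced in spirit by regressing each predictor on the remaining ones and taking 1 / (1 − R²). A self-contained Python sketch on synthetic data, in which the third column is deliberately made collinear with the first, much as tenure and Total Charges are in our data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic predictors: x3 is a near-copy of x1, creating strong collinearity.
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + rng.normal(scale=0.1, size=300)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on the other columns, return 1/(1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
```

The collinear pair gets VIF values far above the rule-of-thumb threshold of 5, while the independent predictor stays near 1, mirroring the tenure/Total Charges pattern in Model 1.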
Turning to the results of this regression model, first of all it is worth noting that all variables except Dependents turned out to be significant at the 1% level (p < 0.01); the Dependents variable is significant at the 5% level. Clients with dependents are 17% less likely to churn than those without. The intercept is significant at the 1% level, which means that when all model variables are at zero (the reference category for the factor variables and the initial value for the numeric variables), the estimated odds of churn are 0.302. Passing directly to the variables, tenure turned out to be significant: for a one-unit increase in tenure, the expected change in the odds is 0.965. Thus, for each additional month with the company we expect a 3.5% decrease in the odds of churn; in other words, the more months the customer has stayed with the company, the lower the chance to leave it. The electronic check level of the Payment Method variable and the Senior Citizen variable are also significant at the 1% level. The odds of churn for customers who pay via automatic bank transfer are 37% lower than for the group paying via electronic check. This result can probably be explained by the automation of the bank transfer: the client does not even notice how the money disappears from his account, or notices but is not affected much, because the transfer is automatic and he does not need to take any action to "lose" his money, which is less painful from a psychological point of view. Senior clients are 33% more likely to churn than non-senior ones.
Another significant variable is Phone Service. The odds of leaving the company for clients who have the phone service option are about 56% lower than for clients who do not. This seems logical: a lack of the phone service option means the client uses only TV or Internet services in this company and probably uses phone services elsewhere. Usually people prefer one company for everything because it is more convenient, so such clients are likely to churn in order to move all their services to another company. As for the Contract variable, the odds of churn for those with a month-to-month contract are about 63% higher than for a one-year contract. Compared with the two-year contract the difference is even larger: clients with the longest possible contracts are almost 84% less likely to churn than those with month-to-month contracts. The results also show that the odds of churn for clients with the paperless billing option over the odds for those without it are 1.575; in terms of percent change, the odds for clients with paperless billing are 57.5% higher than for those without. Finally, the coefficient for monthly charges says that, holding all other variables fixed, a one-unit increase in monthly charges increases the odds of leaving the company by 2.7%, since the odds ratio is 1.027. In other words, the more money the customer pays to the company each month, the higher the chance of churning, probably because everyone tries to reduce their costs and wants to pay as little as possible for services.
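The percent-change statements above are all read off the coefficients in the same way: the odds ratio is exp(β), and (exp(β) − 1) · 100 gives the percent change in the odds per one-unit increase in the predictor. A short Python sketch using the tenure odds ratio quoted in the text (the coefficient itself is back-derived for illustration, not taken from the model output):

```python
import math

# Coefficient implied by the quoted tenure odds ratio of 0.965.
beta_tenure = math.log(0.965)

# Odds ratio and the corresponding percent change in the odds of churn
# per one additional month of tenure.
or_tenure = math.exp(beta_tenure)
pct_change = (or_tenure - 1) * 100   # about -3.5% per extra month
```

The same arithmetic yields the 37% (bank transfer), 57.5% (paperless billing) and 2.7% (monthly charges) figures from the respective coefficients.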
Initially, as for the general model, all variables were included in the model for the Internet dataset. However, the best AIC was again achieved by a model with multicollinearity, so the Total Charges variable was deleted from the model. The results of the best logistic regression model for the Internet dataset are presented in Table 4. In contrast to the general model, the intercept of the Internet model as well as the Dependents variable turned out to be insignificant, while the Payment Method variable became significant at the 1% level. Thus, the odds of churn for clients paying via electronic check over the odds for those with automatic payment via bank transfer are 1.480, which means the clients with automatic transfer are less likely to leave the company.
Table 4. Results of the best logistic regression models for the Internet dataset
Note: The first model shows the best model according to the AIC. The second model is the model that was used in the analysis due to multicollinearity in the first model and better AIC than in the third model.
Once again, the Tenure and Contract variables turned out to be significant at the 1% level, telling us that time is a very important factor in churn prediction. The tenure odds ratio is almost the same as in the general model: for each additional month with the company we expect around a 3% decrease in the odds of churn. The contract results, in contrast, turned out to be somewhat different. The odds of leaving the company for those with a month-to-month contract are about 45% higher than for a one-year contract, compared to 63% in the general model. The effect for the two-year contracts also decreased by 16%: clients with month-to-month contracts are 67% more likely to churn than those with a two-year contract, compared to 83% in the general model. These results tell us that adding the Internet-specific variables to the model decreases the effect of the Contract variable: customers attach less importance to contract duration in the presence of other important indicators, such as Internet service, streaming TV, and online security. Compared to the general model, the odds ratios for the Phone Service and Paperless Billing variables also decreased, by 26% and 17% respectively. Now the odds of churn for clients with the paperless billing option over the odds for those without it are 1.409. For the Phone Service variable the odds ratio equals 0.701 at the 5% significance level; in terms of percent change, the odds of leaving for clients with phone service are around 30% lower than for those without.