Machine Learning-based clustering algorithms and graph analysis as an approach to improving the anti-fraud model in credit institutions


Federal State Autonomous Educational Institution for Higher Education National Research University

"Higher School of Economics"

Machine Learning-based clustering algorithms and graph analysis as an approach to improving the anti-fraud model in credit institutions

Alla Beatris Felises Ganpantsurova

Moscow, 2019

Abstract

In recent years, the development and implementation of new technologies in banking services has been proceeding at an extremely fast pace, with both positive effects (for example, faster and more easily accessible online services) and negative ones (the scale of fraud has increased significantly).

Therefore, forward-thinking organizations keep exploring ways to increase the efficiency of the anti-fraud models they use. Machine learning methods stand out among the most effective tools.

This paper explores the possibility of improving an existing anti-fraud model used in real banking practice. The aim of this work is to improve the quality of the above-mentioned model so that it identifies fraud more accurately.

To achieve this aim, loan applications were first clustered using machine-learning-based methods. The following clustering methods were used: k-means, k-prototypes and DBSCAN. These methods were compared with respect to the corresponding metrics: Elbow, Silhouette and IV (Information Value). It was then investigated how the resulting clusters affect the score of the existing anti-fraud model.

Then, an analysis of graphs built from application data was carried out. The following metrics, calculated on the basis of the social graphs, were used: centrality metrics (degree centrality, betweenness centrality, closeness centrality), PageRank, cluster number and size, and shortest paths to the nearest fraud entity.

Both of the above-mentioned approaches to improving the model were compared with each other.

The work resulted in some improvement of the original anti-fraud model.

Keywords: fraud, anti-fraud model, loan applications, machine learning, clustering, k-means, k-prototypes, DBSCAN, graph, centrality metrics, PageRank, shortest paths to the nearest fraud entity.


Table of contents

Introduction

1. Context description

1.1 Fraud, anti-fraud, types of loan application fraud

1.2 Loan applications in credit institutions

1.3 Machine learning techniques and their application in anti-fraud models

2. Clustering

2.1 Overview of selected clustering methods

2.2 Metrics for comparing clustering methods

2.3 Clusters in anti-fraud model

3. Graph Analysis

3.1 Graph metrics in anti-fraud model

4. Discussion

Bibliography

Appendix

Introduction

This paper is devoted to finding ways to improve an existing anti-fraud model using techniques such as machine-learning-based clustering and graph analysis.

The object of the work is loan application data.

The subject matter of the work is data analysis aimed at identifying factors and metrics that enhance the anti-fraud model.

The methods used are clustering and graph analysis.

Fraud has always been a serious problem for the business of credit institutions.

Nowadays, since digitization and the migration of banking services online have made it possible to submit a loan application in a couple of clicks, the number of loan application fraud cases is constantly growing.

Therefore, business is interested in the continuous improvement of anti-fraud models: they must be flexible enough to resist fraudsters who keep trying new strategies to break through the anti-fraud barrier.

For this purpose, many researchers of anti-fraud solutions increasingly use techniques such as machine learning and graph analysis.

So it seemed interesting to see how much machine learning algorithms and graph analysis can affect the quality of an anti-fraud model. Therefore, it was decided to take an existing model (based on XGBoost, an open-source software library that provides a gradient boosting framework for C++, Java, Python and R), which is used in a real bank, and try to improve it using machine learning methods and the results of graph analysis.

Thus, the aim of this work is to improve the quality of the above-mentioned model so that it identifies fraud more accurately.

The work is organized as follows:

Section "1. Context description" introduces terms, definitions and examples to acquaint the reader with the context of this work.

Section "2. Clustering" discusses clustering (methods and metrics for comparing them) and demonstrates the results of clustering the available data. The obtained cluster labels are added to the existing model to see whether it improves.

Section "3. Graph Analysis" demonstrates graphs based on data from loan applications and shows the connections between nodes marked as fraud.

This section solves the following tasks: determining the entities involved in the formation of the graphs, forming hypotheses, enumerating the metrics that can be collected from a graph, and building a correlation matrix to select the metrics that will be used in the model. Then, all selected metrics are added to the model in order to verify that the desired effect is achieved.

And finally, Section "4. Discussion" summarizes the results of the work: it compares the two approaches (clustering and graph analysis) with each other, selects the approach that provides the more substantial improvement of the model, and demonstrates the final result: an improvement in the quality of the original anti-fraud model.

1. Context description

1.1 Fraud, anti-fraud, types of loan application fraud

Fraud, as an attempt to gain unauthorized access to cash, has always existed and has never stopped. The concept of "fraud" implies "an intention on the part of some party or individual presumably planning to commit fraud… it is usually less important whether or not intentional fraud has occurred, or some erroneous information was introduced into the system by a customer" [1, subsection "Fraud" vs. "Erroneous" Claims, Information, etc.].

Recently, any business that provides access to financial transactions online has become increasingly vulnerable to attacks by fraudsters.

The topic of fraud detection is relevant to many industries, including the banking sector, insurance, e-commerce, and many more.

To prevent fraud, businesses began to use anti-fraud models with different focuses (for example, the protected objects can be payments, loans, insurance claims, traffic tariff payments, and so on).

Since this paper considers loan application fraud in credit institutions, examples of this kind of fraud are given below:

"Traditionally application fraud is at the first-party level, where the fraudster is the customer" [2]: usually a client provides some fake information to deceive the bank's credit scoring model (for example, overstates their income).

"Application fraud can also occur at the third-party level, where the fraudster steals the identity of a customer and applies for a new credit product"[2].

Further, it is advisable to consider the topic of loan applications more closely.

1.2 Loan applications in credit institutions

A loan application is a request for a loan. It requires potential borrowers to provide certain information.

Loan Application Information

"In all types of credit applications the information requested by banks as lenders is typically the same: a lending decision will be made after a hard credit inquiry, based on a borrower's credit score and credit history"[3, subsection "Credit application information"].

Loan Application Process

A credit application can be filled out in different ways: for conservative borrowers who seek more personal interaction and are used to being served in a bank branch, traditional banks offer branches across the nation with offline customer service.

In recent years, with the arrival of new technologies in the lending market, the application process has moved online. For example, "credit card applications are typically processed through an online credit application often providing the borrower with an immediate approval" [3, subsection "Credit application processes"], decisively accelerating and simplifying the process. Neobanks and fintech companies increasingly offer online lending options available to borrowers through a fully automated credit application. Classical banks "have also followed this trend adding many new online lending services for both consumers and businesses" [3, "Credit application processes"].

This trend opens up new opportunities for lending to a wider range of borrowers. However, while expanded access to credit is one side of the coin, an increase in fraud attempts is another, and the implications can be shocking.

This is why new machine-learning-based methods are needed to improve models and prevent the increase in fraud related to the simplified loan application procedure.

1.3 Machine learning techniques and their application in anti-fraud models

In practice, Artificial Intelligence is a group of technologies that facilitate the discovery and analysis of information for the purposes of making predictions and recommendations, supporting decision making, facilitating interactions, and automating certain responses [4, subsection "Investment possibilities with Artificial Intelligence"].

As for Machine learning, this is "a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention".[5]

There are many types of machine learning (Figure 1) and they are used for different purposes.

Figure 1. Classification of machine learning techniques [6, p.18]

"Machine learning has various iterations, including

• supervised learning,

• unsupervised learning,

• deep and reinforcement learning.

The purpose of supervised learning is to establish a relationship between two datasets and to use one dataset to forecast the other. The purpose of unsupervised learning is to try to understand the structure of data and to identify the main drivers behind it. The purpose of deep learning is to use multi-layered neural networks to analyze a trend, while reinforcement learning encourages algorithms to explore and find the most effective strategies" [7, section 5].

Supervised learning assumes the existence of a "target": the model should adapt to the "target" and learn from it. Unsupervised learning, by contrast, has no "target": the model must classify the data by itself, looking for dependencies in the data.

The object chosen for this study is an anti-fraud model for credit applicants, which is based on boosting (XGBoost) and used in real everyday bank practice. The following sections consider how clustering, based on unsupervised learning, and graph analysis affect the quality of this model.

2. Clustering

2.1 Overview of selected clustering methods

Typically, the anti-fraud model is formed on the basis of "rules" such as: if a client has taken loans with the same address in different banks in the last month, this client is likely to be a cheater. Such rules have been collected by bank experts over many years, and currently there are approximately two hundred of them.

So, using these rules, a model was created with the help of boosting (a classification model based on decision trees).

This paper looks at how clustering can affect the quality of the existing model. For this purpose, following the bank experts' opinion, the following significant attributes were selected for clustering clients:

• Sex

• Age

• City

• Credit amount

• Income

• Good category

An example of such a data sample is given in Appendix 1.

Considering the fact that this study deals with a large data set (about 1,000,000 credit applicants' records), different approaches to clustering were chosen: the k-means, k-prototypes and DBSCAN methods.

The k-means algorithm was chosen because it is rather simple and is one of the most frequently used clustering methods.

It was also decided to use the k-prototypes algorithm, since this method is well justified when dealing with mixed data (with numeric and categorical values). Moreover, DBSCAN was used as an approach to clustering opposite to the k-means model.

Let's start with the simplest clustering method: k-means.

K-means

"The k-means algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to large number of samples and has been used across a large range of application areas in many different fields" [8, subsection 2.3.2. K-means].

However, despite the fact that k-means-based methods are quite effective for processing large data sets, their main disadvantage is that they are limited to numeric data. Thus, to work with categorical data, it is necessary to turn it into numeric data (dummy variables). So the variable "sex" was converted to binary values 0 and 1; as for the other two categorical attributes, the most populated cities and the most frequent categories of products bought on credit were selected for dummy-variable conversion.
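A minimal sketch of this preprocessing followed by k-means is given below (the input file, column names and top-N cutoffs are illustrative, not taken from the thesis data):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('applications.csv')  # hypothetical input with the six attributes above
df['sex'] = (df['sex'] == 'M').astype(int)  # "sex" -> binary 0/1

# Keep only the most frequent cities / good categories, collapse the rest
for col, top_n in [('city', 20), ('good_category', 10)]:
    top = df[col].value_counts().nlargest(top_n).index
    df[col] = df[col].where(df[col].isin(top), 'other')
X = pd.get_dummies(df, columns=['city', 'good_category'])

# Scale so that wide numeric ranges (income, credit amount) do not dominate distances
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=54, random_state=0).fit_predict(X_scaled)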

Since both categorical and numeric attributes are needed for clustering, another method is proposed, based on the k-means paradigm but free of its disadvantage (numeric-only data processing).

"The algorithm clusters objects with numeric and categorical attributes in a way similar to k-means. Because objects are clustered against k-prototypes instead of k-means of clusters, it is so called the k-prototypes algorithm"[9, p.1].

K-prototypes

It can be described as proposed in Zhexue Huang's work [9, p.6]:

"(1) Select k initial prototypes from a data set X, one for each cluster.

(2) Allocate each object in X to a cluster whose prototype is the nearest to it according to equation:

d(X_i, Q_l) = Σ_{j=1..m_r} (x_ij^r − q_lj^r)² + γ_l Σ_{j=1..m_c} δ(x_ij^c, q_lj^c)

where x_ij^r and q_lj^r are values of numeric attributes, whereas x_ij^c and q_lj^c are values of categorical attributes for object i and the prototype of cluster l; m_r and m_c are the numbers of numeric and categorical attributes; δ(p, q) equals 0 for p = q and 1 otherwise; γ_l is a weight for categorical attributes for cluster l.

Update the prototype of the cluster after each allocation.

(3) After all objects have been allocated to a cluster, retest the similarity of objects against the current prototypes. If an object is found such that its nearest prototype belongs to another cluster rather than its current one, reallocate the object to that cluster and update the prototypes of both clusters.

(4) Repeat (3) until no object has changed clusters after a full cycle test of X".

It should be noted that this method is very time-consuming: training the model even on a reduced data sample (about 200,000 records) and for a small number of clusters took about 3 hours.
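For reference, this clustering can be reproduced with the open-source kmodes package, which implements Huang's algorithm (a sketch; the categorical column indices are illustrative):

from kmodes.kprototypes import KPrototypes

# X is the mixed feature matrix; columns 0-2 are assumed categorical
# (sex, city, good category), the remaining columns numeric
kproto = KPrototypes(n_clusters=80, init='Cao', n_jobs=-1, random_state=0)
labels = kproto.fit_predict(X, categorical=[0, 1, 2])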

One more clustering method used in this work is DBSCAN.

DBSCAN

"The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster" [8, subsection 2.3.7. DBSCAN].

Thus, the t-SNE procedure (t-distributed Stochastic Neighbor Embedding, see Appendix 2) is performed first, and then DBSCAN clustering follows.
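A minimal sketch of this two-step procedure (eps = 2.5 matches the value reported below; min_samples is an assumption):

from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

embedding = TSNE(n_components=2, random_state=0).fit_transform(X_scaled)  # 2D projection
labels = DBSCAN(eps=2.5, min_samples=10).fit_predict(embedding)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise points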

The graph below (Figure 2) shows the results of the DBSCAN clustering procedure.

Figure 2. DBSCAN clustering

20 clusters were found with eps = 2.5. Unfortunately, not a single complete cluster can be seen on the graph; the clusters turned out to be torn to pieces.

2.2 Metrics for comparing clustering methods

Elbow

Elbow is one of the methods for validating the number of clusters. "The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10 in the examples above), and for each value of k calculate the sum of squared errors (SSE). Like this:

var sse = {};
for (var k = 1; k <= maxK; ++k) {
    sse[k] = 0;
    clusters = kmeans(dataset, k);
    clusters.forEach(function(cluster) {
        mean = clusterMean(cluster);
        cluster.forEach(function(datapoint) {
            sse[k] += Math.pow(datapoint - mean, 2);
        });
    });
}" [10]

Figure 3. Selecting optimal number of clusters with the Elbow Method

On this graph (Figure 3), the values of the Elbow metric are calculated for numbers of clusters from 3 to 100 using the k-means method. According to the graph, the optimal number of clusters for this metric is about 54.
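In Python, the same curve is a short loop over the KMeans inertia_ attribute, which is exactly the within-cluster sum of squares (a sketch using the scaled feature matrix X_scaled from the preprocessing step above):

from sklearn.cluster import KMeans

sse = {}
for k in range(3, 101):
    sse[k] = KMeans(n_clusters=k, random_state=0).fit(X_scaled).inertia_
# the "elbow" is where the curve stops dropping sharply (about k = 54 here)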

Although the Elbow method is widely used, it is only a sensible measure for spherical clusters, so it is not suitable for the DBSCAN model. Similar reasoning applies to most internal measures: most are designed around centroid-based cluster models, not arbitrarily shaped clusters. This metric also cannot be applied to the k-prototypes method, since the data contain categorical attributes.

Silhouettes

"If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

a: The mean distance between a sample and all other points in the same class.

b: The mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient s for a single sample is then given as:

s = (b − a) / max(a, b)" [8, subsection 2.3.9.5. Silhouette Coefficient]

In the previous section, the Elbow metric was calculated for the k-means model, and the optimal number of clusters was found to be about 54. The Silhouette coefficient was likewise calculated for numbers of clusters from 3 to 100.

Figure 4. Selecting optimal number of clusters with the Silhouette

It can be seen from the graph above (Figure 4) that, according to the maximum values (0.76-0.79) of the Silhouette coefficient, the optimal number of clusters is in the same range (from 54 to 56).
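The Silhouette curve is obtained with scikit-learn's silhouette_score (a sketch; sampling keeps the quadratic metric tractable on roughly 1,000,000 records):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sil = {}
for k in range(3, 101):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X_scaled)
    sil[k] = silhouette_score(X_scaled, labels, sample_size=10000, random_state=0)
best_k = max(sil, key=sil.get)  # peaks around 54-56 on these data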

For comparison, the Silhouette coefficient was calculated for the DBSCAN model (Figure 5), and it turned out to be significantly lower than for the k-means model:

For n_clusters = 20, the average silhouette_score is 0.442

Figure 5. Silhouette for DBSCAN

However, metrics such as Elbow and Silhouette cannot be used for k-prototypes clustering due to the presence of categorical variables in the dataset. Therefore, one more metric, Information Value, was used to compare the models with each other.

Information Value

"Information Value is one of the most popular metrics in the banking and marketing spheres, useful for selecting important variables in a predictive model. It helps to rank variables on the basis of their importance. The IV is calculated using the following formula:

IV = Σ (% of non-events − % of events) * WOE

where WOE is Weight of Evidence, another metric, which helps to transform a continuous independent variable into a set of groups or bins based on the similarity of the dependent variable distribution, i.e. the number of events and non-events" [11].
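A sketch of this computation for a single discrete variable (the cluster label) against the fraud flag, assuming a pandas DataFrame df with hypothetical columns 'cluster' and 'fraud':

import numpy as np

def information_value(df, var, target, eps=1e-6):
    grouped = df.groupby(var)[target].agg(['sum', 'count'])
    events = grouped['sum']                       # fraud cases per bin
    non_events = grouped['count'] - grouped['sum']
    pct_events = events / events.sum() + eps      # eps avoids log(0) for empty bins
    pct_non_events = non_events / non_events.sum() + eps
    woe = np.log(pct_non_events / pct_events)     # Weight of Evidence per bin
    return ((pct_non_events - pct_events) * woe).sum()

iv = information_value(df, 'cluster', 'fraud')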

The IV values can be interpreted using the rule of thumb described below (Table 1).

Table 1. Information Value interpretation [11, section "Information Value"]

Information Value | Predictive Power
< 0.02            | useless for prediction
0.02 to 0.1       | weak predictor
0.1 to 0.3        | medium predictor
0.3 to 0.5        | strong predictor
> 0.5             | suspicious or too good to be true

On the graph below (Figure 6) it can be seen that the number of clusters with strong predictive power (IV values from 0.3 to 0.5) varies from approximately 54 to 86.

However, after summing up the results of the three metrics for the k-means model, it can be assumed that the optimal number of clusters is approximately 54.

Figure 6. Selecting optimal number of clusters with IV

This metric was also used for k-prototypes, but not for all cluster numbers, because the procedure takes too much time. The largest Information Value belongs to 80 clusters, which differs significantly from the k-means result. As for DBSCAN, IV = 0.34, which also indicates strong predictive power.

To sum up, different estimation metrics were calculated for the three selected models, but their results differ too much; that is why it was decided to check the impact of clustering on the existing anti-fraud model and use that as an additional metric.

2.3 Clusters in anti-fraud model

After calculating different metrics for the particular clustering models, it became clear that these metrics are not sufficient for comparing the models. In order to verify the quality of the clustering results, it was decided to insert the cluster numbers into the anti-fraud model and check the delta of the model's AUC score (Area Under the ROC (Receiver Operating Characteristic) Curve).

The AUC calculation for k-means clusters is provided in Appendix 3.

The process was organized as follows:

First, data on customer credit applications received in 2016 were taken for fitting the model. From these data, the cluster centers were calculated. Initially, the k-means algorithm was used. Then the 2017 data were taken for predicting which cluster each client falls into.

After defining the cluster number for each client, the dataset was divided into two parts: a training and a testing sample. The training data were first used for the boosting procedure, where the variables were those "rules" and the target was the fraud indicator (1 - true, 0 - false). Then predictions were calculated on the test data, as well as the model's AUC score. Then everything was done the same way, but the variables also included the number of the cluster to which the client was assigned. For k-means, after the score of the second version of the model (with the cluster number) was obtained, it was observed that the delta between the two AUC scores was only 0.0002. As for the k-prototypes algorithm, the delta is larger, approximately 0.01, which indicates a slight improvement of the existing anti-fraud model. DBSCAN resulted in a delta of 0.00007, which certainly cannot be evidence of any significant improvement.
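A condensed sketch of this comparison (the full parameter tuning is in Appendix 3; the table t and target flag_3pd30 follow the appendix, while rules is assumed to be the list of the ~200 rule columns and 'cluster' the added cluster label):

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

train, test = train_test_split(t, test_size=0.3, random_state=0)
params = {'objective': 'binary:logistic', 'eval_metric': 'auc', 'eta': 0.05, 'max_depth': 6}

def auc_for(features):
    booster = xgb.train(params, xgb.DMatrix(train[features], train['flag_3pd30']),
                        num_boost_round=200)
    preds = booster.predict(xgb.DMatrix(test[features]))
    return roc_auc_score(test['flag_3pd30'], preds)

delta = auc_for(rules + ['cluster']) - auc_for(rules)  # ~0.01 for k-prototypes labels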

So, in this case, we can say that k-prototypes is the clustering method best suited to increasing the score of the existing anti-fraud model.

The next section considers a different approach to improving the anti-fraud model, namely graph analysis.

3. Graph Analysis

"A graph is a collection of points, called vertices (or nodes), and lines between those points, called edges (or links)"[12].

Graph analysis is actively used as a tool in many areas such as social networks, marketing, epidemiology and so on.

As for the object of this work, loan application data, it can also be represented as graphs.

Graph analysis can reveal hidden connections between loan applicants and thus contribute to a more accurate identification of fraudsters.

Building the graph

For graph analysis and metrics calculation, new functions and algorithms were created on the basis of the "igraph" library in R (see Appendix 10).

Initially, a complex adjacency matrix was formed, consisting of 4 entities:

• Loan application ID

• Applicant ID

• Employee who was reviewing this application

Hypothesis: the employee may be in collusion with the client.

• Client's phone number

Hypothesis: cheaters can reuse phone numbers.

For clarity, in order to understand what our graph looks like, a small region (Primorsky Krai) and a period of 6 months (02.2018 - 08.2018) were selected first, and the graph was built on the data from this region for the above period (see Figure 7). The graph consists of 2,724 nodes and 5,498 connections between them.

Figure 7. Graph built on application data
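The thesis built the graph with R's igraph (Appendix 10); a minimal equivalent sketch in Python with networkx, assuming an applications table with the four entity columns listed above (column names are illustrative):

import networkx as nx
import pandas as pd

apps = pd.read_csv('applications.csv')  # hypothetical input
G = nx.Graph()
for row in apps.itertuples():
    app = ('app', row.application_id)
    # Typed node tuples keep IDs of different entity types from colliding
    G.add_edge(app, ('client', row.applicant_id))
    G.add_edge(app, ('employee', row.employee_id))
    G.add_edge(app, ('phone', row.phone))
print(G.number_of_nodes(), G.number_of_edges())  # 2724 and 5498 for the sample above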

Due to the large number of connections, it was quite difficult to perform a visual analysis. Therefore, it was decided to build various subgraphs for further analysis.

The subgraphs below (Figure 8) show that some fraudulent clients could be found through the bank employees who were accepting their applications.

Figure 8. Subgraphs with connections through employees

Also, if an application contains the same phone number as an application from another person, this may be a trigger indicating an increased likelihood of fraud (see Figure 9).

Figure 9. Subgraphs with connections through employees and phone numbers

The degree distribution shows an inversely proportional relationship between the number of nodes and the number of connections: as the number of connections grows, the number of nodes with that many connections decreases (see Figure 10).

Figure 10. Degree distribution

Metrics description

In this subsection, the following metrics were calculated: the centrality metrics (degree centrality, betweenness centrality, closeness centrality), PageRank, shortest paths to fraud, and cluster number and size.

Degree centrality is the number of links that a node has:

C_D(x) = Σ_j a_xj (1)

where a_xj is the edge between node x and its neighbour node j.

On the graph below, one can see the loan officers who received the largest number of loan applications; unfortunately, no fraud can be noticed among them (see Figure 11).

Figure 11. Degree centrality

Betweenness centrality is the number of shortest paths going through a node:

C_B(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st (2)

where σ_st is the total number of shortest paths from node s to node t, and σ_st(v) is the number of those shortest paths passing through node v.

Here, again, only the employees through whom the shortest paths pass stand out.

Figure 12. Betweenness centrality

Closeness centrality reflects the minimal number of steps needed to reach the other nodes in the graph (see Appendix 4):

C(x) = 1 / Σ_y d(y, x) (3)

where d(y, x) is the distance between vertices x and y.

Unfortunately, the graphs with the centrality metrics shown above do not reveal a clear connection between the metric values and fraud.

"PageRank measures the transitive influence or connectivity of nodes.

It can be computed by either iteratively distributing one node's rank (originally based on degree) over its neighbours or by randomly traversing the graph and counting the frequency of hitting each node during these walks" [13].

PageRank indicates the importance of nodes in the graph. The hypothesis here is that the presence of important vertices near an application means the client is less likely to be a cheater.
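Both the centrality metrics and PageRank can be read off the constructed graph in a few lines; a networkx sketch mirroring the igraph calculations used in the thesis:

import networkx as nx

degree = nx.degree_centrality(G)             # normalized number of links per node
betweenness = nx.betweenness_centrality(G)   # pass k=... to sample on large graphs
closeness = nx.closeness_centrality(G)
pagerank = nx.pagerank(G, alpha=0.85)        # alpha is the usual damping factor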

Shortest paths to fraud

A node can have several paths to another node in the graph. However, in this part of the work, the most important thing is to find the shortest distance to a fraudulent node. If the node for which the metric is calculated is itself fraudulent, then the shortest distance to the next fraudulent node is taken.

The hypothesis is as follows: the shorter the path to a fraud entity, the higher the probability that the client is fraudulent.
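One way to compute this metric for all nodes at once is a multi-source shortest-path search started from every fraudulent node (a networkx sketch; fraud_nodes, the set of nodes flagged as fraud, is assumed to be known from the labelled data):

import networkx as nx

def shortest_paths_to_fraud(G, fraud_nodes):
    # One multi-source search gives each node its distance to the nearest fraud node
    base = nx.multi_source_dijkstra_path_length(G, fraud_nodes)
    result = {}
    for node in G.nodes:
        if node in fraud_nodes:
            # For a fraudulent node, take the distance to the *next* fraud node
            # (recomputing per fraud node is slow but simple)
            others = fraud_nodes - {node}
            d = nx.multi_source_dijkstra_path_length(G, others) if others else {}
            result[node] = d.get(node)      # None if no other fraud node is reachable
        else:
            result[node] = base.get(node)   # None if no fraud node is reachable
    return result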

Cluster number and size

In the case of a graph, clusterization is defined as a search for strongly connected components in the graph. The figure below represents the most connected part of the whole graph (see Figure 13).

Figure 13. Strongly connected component in graph
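For the undirected application graph used here, this amounts to finding connected components and recording, for every node, which component it belongs to and how large that component is (a networkx sketch; the thesis computed the components with R's igraph):

import networkx as nx

components = list(nx.connected_components(G))  # each component is a set of nodes
cluster_id = {n: i for i, comp in enumerate(components) for n in comp}
cluster_size = {n: len(comp) for comp in components for n in comp}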

After calculating all the metrics above, it was decided to build a correlation matrix showing the correlations between these values, in order to determine the influence of each metric on fraud.

Figure 14. Correlation matrix

From the correlation matrix (see Figure 14), one can see that the presence of fraud is influenced by such metrics as the shortest paths to fraud, cluster size and PageRank. Therefore, the next subsection focuses on how these metrics affect the current anti-fraud model.

3.1 Graph metrics in anti-fraud model

The process of calculating graph metrics for the training and test datasets differs from the process described in subsection 2.3 (Clusters in anti-fraud model). As previously mentioned, a graph is a representation of the connections between entities. In our case, new loan applications sent by customers are connected with applications received in the past period, so it would be incorrect simply to divide the entire sample of applications into two parts, a training dataset and a test dataset.

Therefore, it was decided to set aside part of the data for training and calculate the metrics on that sample; for the test dataset, it was decided to sort loan applications by arrival time and then run a cycle: each new application is added to the training sample, a graph is constructed, and the metrics are calculated for this application. The next loan application is added to the previous applications and the training dataset, and the same process is repeated. Thus, the online learning process of the model is simulated.
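Schematically, this simulated online scoring looks as follows (a sketch; add_edges_for and compute_metrics are hypothetical helpers standing for the edge insertion from the construction sketch above and the metric calculations described in this section):

# test_apps is sorted by application arrival time; G holds the training-period graph
metrics = {}
for app in test_apps.itertuples():
    add_edges_for(G, app)  # hypothetical helper: adds applicant/employee/phone edges
    metrics[app.application_id] = compute_metrics(G, ('app', app.application_id))
    # the application stays in the graph and is visible to all later applications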

To verify whether the resulting metrics improve the current model, it is necessary to calculate the AUC of the current model, then add the new metrics to the training data and select the boosting parameters (depth, number of trees) (see Appendix 8). The algorithm obtained on the training dataset is then applied to the test dataset.

After that, by calculating the AUC of the new model, the two AUCs can be compared with each other, making it clear whether the model with the new metrics is improved and gives more accurate results. The metrics are added one at a time. First, the shortest paths to fraud are added, with the following results:

• The current model's AUC score is 0.743 (to accelerate the AUC calculation, it was decided to take the final score calculated by the current model instead of the 200 rules).

• After adding the shortest paths to fraud to the model, the AUC increased to 0.797.

• So the Gini coefficient (which equals 2*AUC - 1) rose by 0.108.

Figure 15. Shortest paths influence on target in anti-fraud model

The graph with SHAP values (SHapley Additive exPlanations) above shows exactly how the factors affect the fraud level: the shorter the path from an application to another, fraudulent application, the higher the likelihood that this application comes from a fraudster (see Figure 15).
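Plots like this are produced with the shap package from the trained XGBoost model (a short sketch; model and X_test stand for the trained booster and the test features, including the new graph metrics):

import shap

explainer = shap.TreeExplainer(model)         # TreeExplainer supports XGBoost models
shap_values = explainer.shap_values(X_test)   # per-sample, per-feature contributions
shap.summary_plot(shap_values, X_test)        # the kind of plot shown in Figure 15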

After the shortest-path variable was added to the model, the PageRank variable was added next, with the expectation that the predictive power of the model with the new factors would not decrease. In fact, that is exactly what happened: the results became slightly better:

• After adding PageRank to the model, the AUC increased to 0.814.

• The Gini coefficient rose by 0.034.

Figure 16. PageRank influence on target in anti-fraud model

Despite the fact that the results have improved, the influence of PageRank on fraud cannot be unambiguously described, due to a certain concentration of fraud cases at SHAP values corresponding to a low probability of fraud (see Figure 16).

Further, if the cluster number is added to the model, the indicator increases, though only slightly, to 0.825.

The graph below shows a completely logical picture: the smaller the cluster to which an application belongs, the greater the likelihood of fraud. It can also be seen that the influence of PageRank in this model became more evident, but it still disproves the hypothesis formulated in the Metrics description subsection: Figure 17 demonstrates that the presence of important vertices means a higher probability of fraud.

Figure 17. Factors influence on target in anti-fraud model

Summing up this section, it can be concluded that not all metrics revealed a correlation with the presence of fraud: only the shortest paths to fraud, PageRank and cluster size did. Degree, betweenness and closeness centrality did not show any connection with the presence of fraud, which is why they were not added to the existing model. However, the metrics with identified connections to fraud yielded an improvement in the fraud detection result.

4. Discussion

In recent years, society and business have been widely discussing the ongoing digital revolution and the risks it brings. For the banking business, one of the most serious risks is the increased amount of fraud due to the opening of online access to credit for new, underbanked clients.

The way out of this situation should be to improve fraud prevention models using machine learning methods. For businesses where speed, scale and efficiency are paramount, there is no other solution if they want to reduce fraud significantly.

So this work was motivated by the following problem: to improve an existing anti-fraud model, used in real banking practice, by means of credit application clustering and graph analysis.

As a result of clustering, the following conclusions were obtained.

Verifying the quality of the k-means clustering with all three metrics (Elbow, Silhouette and Information Value), we obtained an optimal number of clusters of about 54. DBSCAN clustering yields 20 clusters.

As for k-prototypes, because of the categorical data in the data set, the only metric we could use was Information Value, and according to this metric 80 clusters have strong predictive power.

Since the selected clustering methods provide extremely different results and the metrics cannot guarantee the most correct comparison of the models, we decided to put the cluster values obtained by the different methods into the final anti-fraud model and see how much the model improved.

The best results were shown by k-prototypes clustering: it increased the model's AUC score by 0.01. This is a slight improvement, but nevertheless an improvement in the accuracy of the model.

In the graph analysis, the following metrics were calculated: degree centrality, betweenness centrality, closeness centrality, shortest paths to fraud, PageRank, and cluster number and size. As a result, it should be noted that some metrics demonstrated no correlation with fraud: degree, betweenness and closeness centrality. However, the metrics with identified connections to fraud (shortest paths to fraud, PageRank and cluster size) showed a rather significant improvement of the fraud detection result.

Compared with the previous approach, the metrics collected from the graph improved the model's AUC by 0.082, which is a significantly better result.

Thus, of the two approaches to improving the anti-fraud model (clustering and graph analysis), the better result was obtained with the second one. Graph analysis has greatly helped in this research and can be recommended for improving the fraud detection models that are usually used in credit institutions.

In the future, in continuation of this work, it will be possible to further extend the graph with other entities (addresses, IPs, device fingerprints and others) in order to reveal even more connections between bank customers and to strengthen the predictive power of the anti-fraud model.

Bibliography

1. StatSoft, Fraud Detection, Electronic Statistics Textbook, Tulsa, 2013. URL: http://www.statsoft.com/Textbook/Fraud-Detection

2. SAS, Banking application fraud: The enemy at the gates, Types of application fraud, White paper, pp. 2-3. URL: https://www.sas.com/content/dam/SAS/bp_de/doc/whitepaper1/ff-wp-banking-application-fraud-2329159.pdf

3. Investopedia, Personal Finance, Credit Application, 2018. URL: https://www.investopedia.com/terms/c/credit-application.asp

4. JPMorgan Chase & Co, Innovations in finance with Machine Learning, Big Data and Artificial Intelligence: Summary of the latest research and trends, 2018. URL: https://www.jpmorgan.com/global/research/machine-learning

5. SAS, Machine learning: What it is and why it matters, 2019. URL: https://www.sas.com/en_us/insights/analytics/machine-learning.html

6. Marko Kolanovic, Rajesh T. Krishnamachari, Big Data and AI Strategies: Machine Learning and Alternative Data Approach to Investing, J.P. Morgan, 2017. URL: https://www.cfasociety.org/cleveland/Lists/Events%20Calendar/Attachments/1045/BIG-Data_AI-JPMmay2017.pdf

7. eFinancialCareers, J.P. Morgan's massive guide to machine learning, 2017. URL: https://news.efinancialcareers.com/dk-en/285249/machine-learning-and-big-data-j-p-morgan

8. Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, 2011. URL: http://scikit-learn.org/stable/modules/clustering.html#clustering

9. Zhexue Huang, Clustering large data sets with mixed numeric and categorical values, The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1997. URL: https://grid.cs.gsu.edu/~wkim/index_files/papers/kprototype.pdf

10. Robert Gove's Block, Using the elbow method to determine the optimal number of clusters for k-means clustering, 2017. URL: https://bl.ocks.org/rpgove/0060ff3b656618e9136b

11. Deepanshu Bhalla, Weight of Evidence (WOE) and Information Value (IV) explained, Listen Data, 2015. URL: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

12. Study.com, Graphs in Discrete Math: Definition, Types & Uses, 2019. URL: https://study.com/academy/lesson/graphs-in-discrete-math-definition-types-uses.html

13. Neo4j, The Neo4j Graph Algorithms User Guide v3.5, 5.1. The PageRank algorithm. URL: https://neo4j.com/docs/graph-algorithms/current/algorithms/page-rank/

14. PwC, Forensic Services Practice, Fraud: A guide to its prevention, detection and investigation, 2008. URL: https://www.pwc.com.au/consulting/assets/risk-controls/fraud-control-jul08.pdf

15. Alex Lipton, David Shrier, Alex Pentland, Digital Banking Manifesto: The End of Banks?, Connection Science & Engineering, Massachusetts Institute of Technology, 2016. URL: https://www.getsmarter.com/career-advice/wp-content/uploads/2017/07/mit_digital_bank_manifesto_report.pdf

16. David Arthur, Sergei Vassilvitskii, k-means++: The advantages of careful seeding, 2007. URL: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf

17. Albert-Laszlo Barabasi, Network Science, Cambridge University Press, 2015. URL: http://networksciencebook.com/

18. D. Easley, J. Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World, Cambridge University Press, 2010. URL: http://www.cs.cornell.edu/home/kleinber/networks-book/

19. Eric Kolaczyk, Gabor Csardi, Statistical Analysis of Network Data with R, Springer, 2014.

20. Mark Newman, Networks: An Introduction, Chapters 6, 8, Oxford University Press, 2010. URL: http://math.sjtu.edu.cn/faculty/xiaodong/course/Networks%20An%20introduction.pdf

21. S.E. Schaeffer, Graph clustering, Computer Science Review, Vol. 1, pp. 27-64, 2007. URL: http://www.leonidzhukov.net/hse/2018/sna/papers/GraphClustering_Schaeffer07.pdf

22. V.D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, J. Stat. Mech. P10008, 2008. URL: http://www.leonidzhukov.net/hse/2018/sna/papers/blondel2008.pdf

23. Didier Lavion, Global Economic Crime and Fraud Survey 2018: Pulling fraud out of the shadows, PwC. URL: https://www.pwc.com/gx/en/forensics/global-economic-crime-and-fraud-survey-2018.pdf

24. D.S. Rawat, Dinesh Anand, Current fraud trends in the financial sector, ASSOCHAM, PwC, 2015. URL: https://www.pwc.in/assets/pdfs/publications/2015/current-fraud-trends-in-the-financial-sector.pdf

25. Felises Ganpantsurova Alla Beatris, Credit applicants' clustering model for fraud prevention based on machine learning algorithms, Course work, HSE, 2018.

Appendix

Data sample

AUC calculation for k-means clusters

In [110]:

import xgboost as xgb
import numpy as np
from sklearn.metrics import roc_auc_score

# read_sql and cv_score are in-house helper functions of the bank's environment
t = read_sql("""select * from vp_fps_clust""", refresh=True)
features = list(t.columns[3:])   # the rule columns
feat = features + ['score_fps']

def score(params):
    seed = 777
    cv_res = xgb.cv(params, xgb.DMatrix(t[feat], t['flag_3pd30']), maximize=True,
                    early_stopping_rounds=20, num_boost_round=1000, nfold=5,
                    seed=seed, verbose_eval=False)
    w = np.array(cv_res[cv_res['test-auc-mean'] == cv_res['test-auc-mean'].max()])[0]
    score = w[0]
    std = w[1]
    best_iter = cv_res[cv_res['test-auc-mean'] == cv_res['test-auc-mean'].max()].index[0]
    print(score, std, params['max_depth'], params['subsample'], params['eta'],
          params['gamma'], params['min_child_weight'], best_iter)
    return (score, std, best_iter, seed)

# Grid search over the boosting hyperparameters (reduced here to a single point)
best = (0,)
for max_depth in [6]:
    for eta in [0.05]:
        for sub in [0.7]:
            for g in [0]:
                for m in [1]:
                    params = {
                        'eta': eta,
                        'max_depth': max_depth,
                        'objective': 'binary:logistic',
                        'booster': 'gbtree',
                        'subsample': sub,
                        'gamma': g,
                        'min_child_weight': m,
                        'eval_metric': 'auc'
                        # 'tree_method': 'approx'
                    }
                    res = score(params)
                    if res[0] > best[0]:
                        best = res + (params,)

Out [110]: (0.77680360000000004, 0.0039207245554871871, 6, 0.7, 0.05, 0, 1, 202)

In [111]:

# best_params is the params dict stored in best; 202 is the best iteration found above
t = cv_score(df=t, params=best_params, num_trees=202, features=feat,
             target='flag_3pd30', n_splits=5, name_score='raw_fps_score')

In [112]:

roc_auc_score(t.flag_3pd30, t.raw_fps_score)

Out[112]: 0.77618592419599719

In [113]:

roc_auc_score(t.flag_3pd30, t.raw_clust_score)

Out[113]: 0.77635355848420351

Closeness centrality

Eigenvector centrality

Bonacich power centrality

Alpha centrality

AUC graphs for XGB parameters tuning

XGB parameters tuning

Graph metrics calculation
