Credit scoring in retail banking

Predicting creditworthiness of borrowers

Prashant Dimri

Masters in Data Analytics

Dublin City University

Dublin, Ireland

Abstract- Credit scoring techniques are used to determine the creditworthiness of a borrower, that is, to decide whether or not to grant a loan based on the borrower's credit score. The higher the score, the safer it is for a bank to lend to the borrower. The aim of this paper is to develop a retail credit scoring model using techniques such as logistic regression, clustering, and the propensity scoring method, and to investigate several questions: how incorporating more variables improves the accuracy of the model, which five factors most influence risk, how important credit and demographic variables are, and which cut-off probability to choose, since the cut-off affects the number of correctly and incorrectly classified events and hence the confusion matrix.

Keywords- Credit Scoring, Logistic Regression, Retail banking, Credit Risk

1. Introduction

A primary aim of commercial banks is to extend credit to borrowers. Credit risk arises when a borrower defaults on the repayment of a loan, which can happen for various reasons, such as insolvency of the borrower or willful default (when the borrower intentionally does not pay). History shows that ineffective credit risk management can lead a bank or financial institution to bankruptcy. It is therefore imperative for banks and financial institutions to gather information about potential borrowers and to review the performance of accepted borrowers over time in order to remain solvent; the quality of loans is central to both survival and profitability. Credit scoring addresses this need by reducing the cost and time of decision making, thus improving the profitability of banks. The need for a formal credit scoring process first arose in the 1960s, when the credit card business boomed and automated decision making became vital for business growth (Trainer, 2015).

Credit scoring is a method of reducing the likelihood of default among customers, which can maximize the profitability of a bank or financial institution by minimizing the risk it bears. Techniques such as regression analysis, logistic regression, support vector machines, decision trees, and neural networks are widely used in building credit scoring models (Pointon, 2011).

Anderson (2007) broke credit scoring into two parts. Credit means buy now, pay later, and scoring refers to a numerical tool that rank-orders cases according to their real or perceived quality in order to discriminate between them and to ensure objective and consistent decisions. Credit scoring is thus simply the use of statistical models to transform data into numerical form for better decision making. A credit score indicates how risky a borrower is. The higher the score, the safer it is for a bank to lend to the borrower. The aim of this paper is to build an appropriate retail credit scoring model to predict the creditworthiness of a borrower and to identify the most important variables in the decision-making process.

2. Literature Review

Most credit scoring work concerns non-retail loans, as the data tend to be more readily available. In addition, the amounts lent to the non-retail sector tend to be higher than those lent to the retail sector (Kocenda and Vojtek, 2009). In retail lending, socio-demographic variables are combined with customers' credit bureau variables to make predictions about the client portfolio. From these variables, a credit score is developed to estimate the probability of default on retail loans. Blazy and Weill (2006) state that riskier loans should be collateralized or else not financed.

According to the Basel II capital accord (Basel Committee on Banking Supervision, 2015), any loan on which payment is more than 90 days overdue is considered a Non-Performing Loan. The initial decision to sanction a loan is normally based on a judgmental approach that simply analyzes the details on the borrower's application form (Pope). It rests on the so-called 5 C's principle: Character, Capital, Capacity, Collateral, and Condition.

With the help of a statistical model, credit scoring converts data based on the traditional 5 C's of credit into numerical form to support the credit decision, that is, to predict whether future customers will default. Because a credit scoring model reduces the time and cost a loan officer spends on loan assessment and tends to lower the default ratio, it is far better than the traditional approach to loan assessment (Caire and Kossmann, 2003). Hand and Henley (1997) reviewed statistical techniques such as logistic regression, neural networks, and recursive partitioning for building credit scoring models. They concluded that, beyond classifying customers as good or bad based on their initial application characteristics, credit scoring raises further statistical challenges, such as loan review functions (knowing when to approach customers for repayment of their loans), fraud, and deciding when and how to act on a delinquent loan.
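
As a minimal sketch of how a statistical model can turn application data into a numerical score, the following fits a logistic regression by plain gradient descent on toy data. The two features (a debt-to-income ratio and a scaled job tenure), the data values, the learning rate, and the function names are illustrative assumptions and are not taken from the dataset used in this paper.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    # w[0] is the intercept; w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the average log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Hypothetical applicants: [debt_to_income, scaled_job_tenure]; label 1 = default.
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.6, 0.5],
     [0.3, 0.6], [0.2, 0.9], [0.1, 0.8], [0.4, 0.3]]
y = [1, 1, 1, 0, 0, 0, 0, 1]

w = fit_logistic(X, y)
p_high_risk = predict_proba(w, [0.9, 0.1])
p_low_risk = predict_proba(w, [0.1, 0.9])
print("weights:", [round(v, 3) for v in w])
print("P(default) high-risk vs low-risk:", round(p_high_risk, 3), round(p_low_risk, 3))
```

The predicted probability of default can be reported directly or rescaled into a score on which a higher value means a safer borrower.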

Hand (2005) examined predictive models in which scorecards assign customers to classes, so that the appropriate course of action is taken depending on whether a customer's predicted score lies above or below a given threshold. Common statistical measures such as the Gini coefficient and the Kolmogorov-Smirnov statistic may not use relevant information about the magnitude of the scores, which can lead to misclassification and degrade decision quality. Anderson (2007) proposed dividing the term credit scoring into two parts, credit and scoring. Credit is taking money now and paying it back later. Scoring is assigning numbers that indicate a customer's creditworthiness, that is, whether the customer is worthy of a loan. The higher the credit score, the better suited the customer is to receive a loan.
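
To illustrate the point about score magnitude, the sketch below computes the Kolmogorov-Smirnov statistic and the Gini coefficient (taken here as 2 x AUC - 1) for a toy set of scores, and then again after squaring every score. Both measures depend only on the rank order of the scores, so they come out unchanged. The scores, labels, and function names are illustrative assumptions.

```python
def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of the two classes."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

def auc(scores, labels):
    """Share of (class 1, class 0) pairs in which the class-1 case scores higher.

    Ties count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    return 2.0 * auc(scores, labels) - 1.0

# Hypothetical scores and class labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

# A monotone transformation changes magnitudes but not the ordering.
squared = [s * s for s in scores]

print("KS:", ks_statistic(scores, labels), ks_statistic(squared, labels))
print("Gini:", gini(scores, labels), gini(squared, labels))
```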

Dinh and Kleimeier (2007) proposed a credit scoring model for Vietnamese retail loans and concluded that credit risk modeling helps banks reduce the time and cost spent on loan assessment and thus increases the profitability of the business. Hasan (2016) built a retail credit scoring model on scarce data to estimate the probability of default on retail loans and concluded that a model can be constructed even with scarce data, helping decision makers expedite the credit appraisal process. Kocenda and Vojtek (2009) concluded that socio-demographic factors are important in the process of granting credit and therefore should not be excluded from the specification of a credit scoring model.

Hand and Zhou (2009) studied two behavioral classifications (settle immediately versus not settle immediately, and make some repayment versus make no repayment) and used a rule to predict the class to which each customer belongs. The aim was to construct a rule that assigns objects to one of two classes (here 0 and 1). The rule is constructed from past data for a sample of objects. Two fundamental aspects of classification rules were considered when evaluating performance. The first was the score distribution for each of the two classes, 0 and 1. The second was the choice of classification threshold t, such that objects with scores greater than t are predicted to belong to class 1 and all others to class 0. Misclassification arises when an object with a score above t belongs to class 0 or an object with a score below t belongs to class 1. Evaluating performance is vital for choosing a rule appropriately and thus obtaining accurate predictions of customers' future behavior.

Bekhet and Eletter (2014) proposed two credit scoring models (logistic regression and radial basis function) using data mining techniques to support loan decisions for Jordanian banks. They argued that evaluating loan applications in this way would improve the effectiveness of credit decisions, help control loan office tasks, and save analysis time and cost. They concluded that the logistic regression model was slightly better than the radial basis function model in overall accuracy, but that the radial basis function model was better at identifying customers who might default. Karwański and Grzybowska (2015) observed that models are meant not only to estimate the probability of default but also to identify risk drivers, and used the propensity scoring method to detect risk drivers with logistic regression, random forest, and gradient boosting. Abdou et al. (2016) compared the performance of logistic regression, classification and regression trees, and a cascade correlation neural network (CCNN), using ROC curves and Gini coefficients as evaluation criteria and the Kolmogorov test for robustness. They found that CCNN was superior to the other techniques. Variables such as previous occupation, the functioning of the borrower's account, guarantees, other loans, and monthly expenses were identified as key variables for the forecasting and decision-making processes of a credit policy.
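
The threshold rule described by Hand and Zhou (2009) can be sketched as follows: an object is assigned to class 1 when its score exceeds t and to class 0 otherwise, and the resulting confusion matrix shifts as t moves. The scores, labels, candidate thresholds, and function name are illustrative assumptions.

```python
def confusion_counts(scores, labels, t):
    """Predict class 1 when score > t, class 0 otherwise, and count the outcomes."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for s, y in zip(scores, labels):
        pred = 1 if s > t else 0
        if pred == 1 and y == 1:
            counts["tp"] += 1
        elif pred == 1 and y == 0:
            counts["fp"] += 1  # score above t, but object belongs to class 0
        elif pred == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1  # score not above t, but object belongs to class 1
    return counts

# Hypothetical scores and true classes.
scores = [0.92, 0.81, 0.67, 0.55, 0.48, 0.33, 0.21, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for t in (0.3, 0.5, 0.7):
    c = confusion_counts(scores, labels, t)
    accuracy = (c["tp"] + c["tn"]) / len(labels)
    print(f"t={t}: {c} accuracy={accuracy:.2f}")
```

On this toy data the three thresholds give the same overall accuracy but trade false positives against false negatives, which is why the choice of cut-off matters beyond accuracy alone.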

3. Research Plan

My research concerns the consumer credit risk of credit cards. Based on a review of the credit risk management literature, techniques such as logistic regression can be used to build a scoring model, that is, a predictive model that estimates credit card risk on the dataset. Several forms of model validation can be used to check the accuracy of the model, such as the gain/lift curve, the concordance/discordance ratio, and the ROC curve. A gain chart shows how much better one can do with a predictive model than without it. It is somewhat like propensity score matching in that the observations are divided into 10 equal groups and the cumulative number of actual events across the groups is computed; for a good model, the cumulative share of events captured with the model should lie above the share expected without it. A confusion matrix reports four counts: true positives, false positives, true negatives, and false negatives. The ROC curve plots the true positive rate against the false positive rate. The greater the number of true positives and true negatives, the better the model. Once the model performs well, it will help manage the risk associated with new data and thus improve the quality of loans.
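
A minimal sketch of two of the validation measures named above, under stated assumptions: the gain table ranks observations by predicted score, splits them into 10 equal groups, and reports the cumulative share of actual events captured; the concordance and discordance ratios are the shares of (event, non-event) pairs in which the event receives the higher or the lower score. The scores, labels, and function names are illustrative.

```python
def gain_table(scores, labels, groups=10):
    """Rows of (group, share of population, cumulative share of events captured)."""
    # Rank observations from highest to lowest predicted score.
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    n, total = len(ranked), sum(ranked)
    rows, captured = [], 0
    for g in range(1, groups + 1):
        lo, hi = (g - 1) * n // groups, g * n // groups
        captured += sum(ranked[lo:hi])
        rows.append((g, hi / n, captured / total))
    return rows

def concordance(scores, labels):
    """Return (concordant share, discordant share) over event/non-event pairs."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    pairs = len(events) * len(non_events)
    conc = sum(e > n for e in events for n in non_events)
    disc = sum(e < n for e in events for n in non_events)
    return conc / pairs, disc / pairs

# Hypothetical predicted scores (1.00 down to 0.05) and observed events.
scores = [i / 20 for i in range(20, 0, -1)]
labels = [1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

rows = gain_table(scores, labels)
for group, population_share, event_share in rows:
    # Without a model, the expected event share equals the population share.
    print(group, population_share, round(event_share, 3))

conc, disc = concordance(scores, labels)
print("concordant:", round(conc, 3), "discordant:", round(disc, 3))
```

In the printed table, an event share above the population share in the early groups indicates that the model concentrates events among the highest-scored observations.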

Clustering can be used to segment the dataset into homogeneous clusters, and credit scoring can then be performed on each homogeneous segment. Although this can add cost for development, implementation, and maintenance, it can also improve performance.
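
One way to sketch this segmentation step is with plain k-means, assumed here as the clustering method since the paper does not name one. Borrowers are grouped on two scaled features, and a per-segment default rate stands in for the separate scoring model that would be fitted on each segment. The features, data values, starting centroids, and function names are illustrative assumptions.

```python
def kmeans(points, initial_centroids, iters=20):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iters):
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def segment_default_rates(labels, defaults):
    """Observed default rate within each segment."""
    rates = {}
    for k in sorted(set(labels)):
        seg = [d for l, d in zip(labels, defaults) if l == k]
        rates[k] = sum(seg) / len(seg)
    return rates

# Hypothetical borrowers: [scaled_age, scaled_income]; 1 = default.
points = [[0.20, 0.10], [0.25, 0.15], [0.30, 0.20], [0.22, 0.12],
          [0.80, 0.90], [0.85, 0.80], [0.90, 0.85], [0.78, 0.88]]
defaults = [1, 1, 0, 1, 0, 0, 1, 0]

# Start from one point in each apparent group so the run is deterministic.
labels, centroids = kmeans(points, [points[0], points[4]])
rates = segment_default_rates(labels, defaults)
print("segments:", labels)
print("default rate per segment:", rates)
```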

4. Conclusions

In this paper, a retail credit scoring model will be built on a dataset of 150 data points with 11 variables, divided into two groups: borrower credit bureau variables and borrower demographic variables. Based on this predictive model, we will be able to assess the creditworthiness of a borrower, that is, whether the borrower should be given a loan. Techniques such as logistic regression, clustering, and the propensity scoring method will be used. Further, several forms of model validation, such as the gain/lift curve, the concordance/discordance ratio, and the ROC curve, can be used to check the accuracy of the model.