It's been quite a long time since my last post. It has been a busy time for me and a lot of things have changed. One obvious change was the overdue change of my blog title from IPython to Jupyter notebooks. My work life has also shifted over time from a Quant to a Data Scientist role. The most recent change is that I relocated to Bangkok this year. I am still working in the finance industry and now lead a small Data Science team in the Operational Risk area.

After getting used to the new life here in South East Asia, I've decided to continue my blog. There will be a slight change in topics, though: I plan to write more about Machine Learning and Deep Learning in Python and R and less about pricing and XVAs. I will also have the chance to look into some pricing model validation in my new role, so there may be some quant-related posts coming as well.

In the next three posts, we will see how to build a fraud detection (classification) system with TensorFlow. We will start by building a logistic regression classifier in scikit-learn (sklearn). In the next step we will build a logistic regression classifier in TensorFlow from scratch. In the third post we will add a hidden layer to our logistic regression and build a neural network.

You can find the complete source code on GitHub or on Kaggle.

For this example we use a publicly available real-world data set, which you can find on Kaggle. The data set contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. It contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, no further background information about the data is available. Features V1, V2, … V28 are the principal components obtained with PCA.

# Install requirements

The easiest way is to install the Anaconda Python distribution from Anaconda Cloud. Windows, Mac and Linux are supported. I use the 64-bit version of Python 3.6. Most of the required packages already come with the basic installation. To install Keras and TensorFlow, open the Anaconda prompt (shell) and install the missing packages via conda (a package manager, similar to apt-get on Ubuntu):

    conda install -c conda-forge keras tensorflow

# Fraud detection with logistic regression in Scikit-Learn

First we load the required libraries and functions.

    # This Python 3 environment comes with many helpful analytics libraries installed
    # It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
    # For example, here's several helpful packages to load in

    import numpy as np
    import pandas as pd
    import tensorflow as tf
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score, confusion_matrix
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Input data files are available in the "../input/" directory.
    # For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

    import os
    print(os.listdir("../input"))

    # Any results you write to the current directory are saved as output.

pandas is a powerful library for data wrangling and is used here for loading the data. If you are not familiar with pandas, check out the tutorials on the pandas project website. numpy is the underlying numerical library for pandas and scikit-learn. seaborn and matplotlib are used for visualisation.

## Load and visualize the data

First we load the data and try to get an overview of it.

We load the CSV file with the function `read_csv` and store it as a data frame in memory.

    credit_card = pd.read_csv('../input/creditcard.csv')

Next we try to get an overview of the fraud vs. non-fraud distribution. We use the seaborn `countplot` function to produce a bar chart.

    f, ax = plt.subplots(figsize=(7, 5))
    sns.countplot(x='Class', data=credit_card)
    _ = plt.title('# Fraud vs NonFraud')
    _ = plt.xlabel('Class (1==Fraud)')

We cannot even see the bar for the fraud cases: almost all transactions are non-fraudulent. Such a problem is also called an imbalanced class problem.

99.8% of all transactions are non-fraudulent. The simplest classifier would always predict no fraud and would be correct in almost all cases. Such a classifier would have a very high accuracy but would be quite useless.
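To make this concrete, the accuracy of such an always-"no fraud" classifier can be computed directly from the class counts given in the data set description (492 frauds out of 284,807 transactions). This is only a back-of-the-envelope sketch; the variable names are my own.

```python
# Class counts taken from the data set description above.
n_total = 284_807
n_fraud = 492

# A classifier that always predicts "no fraud" is right on every genuine
# transaction and wrong on every fraudulent one.
baseline_accuracy = (n_total - n_fraud) / n_total
baseline_recall = 0 / n_fraud  # it never detects a single fraud

print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline recall:   {baseline_recall:.0%}")
```

Any model we train has to be judged against this baseline, which is why accuracy alone tells us very little here.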

For such imbalanced classes we could use over- or undersampling methods to try to balance the classes (see imbalanced-learn for example: https://github.com/scikit-learn-contrib/imbalanced-learn), but this is out of the scope of today's post. We will come back to this in a later post.

As accuracy is not very informative in this case, the AUC (area under the curve) is a better metric to assess the model quality. In a two-class classification problem, the AUC score is equal to the probability that our classifier assigns a higher fraud score to a randomly chosen fraudulent transaction than to a randomly chosen genuine one. Random guessing would yield a probability of 50%.
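The probabilistic reading of the AUC can be checked by brute force on a toy example: compare every (fraud, genuine) pair and count how often the fraud case gets the higher score. The labels and scores below are made up, and `pairwise_auc` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """AUC as the share of (fraud, genuine) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, purely for illustration.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.7, 0.3, 0.8, 0.4]

auc = pairwise_auc(y_true, scores)
print(auc)

# A classifier that gives every transaction the same score cannot rank at all.
auc_constant = pairwise_auc(y_true, [0.5] * len(y_true))
print(auc_constant)
```

On real data one would of course use `roc_auc_score`; the quadratic loop here only serves to show what the number means.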

We now create the feature matrix **X** and the result vector **y**. We drop the column *Class* from the data frame and store the result in a new data frame **X**, and we select the column *Class* as our vector **y**.

    X = credit_card.drop(columns='Class')
    y = credit_card.Class.values

Due to the construction of the data set (PCA-transformed features, which minimizes the correlation between the factors), we don't have any highly correlated features. Multicollinearity could cause problems in a logistic regression.

To test for multicollinearity one could look at the correlation matrix (this works only for non-categorical features, which is all we have today), run partial regressions and compare the standard errors, or use pseudo-R^2 values and calculate variance inflation factors (VIFs).
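As a rough sketch of the variance inflation factor idea: for feature $i$, the VIF is $1 / (1 - R_i^2)$, where $R_i^2$ comes from regressing feature $i$ on all other features. For standardized features this equals the $i$-th diagonal entry of the inverse correlation matrix, which is what the snippet uses. The data is synthetic (not the credit card data) and all names are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Synthetic features: x3 is almost a linear combination of x1 and x2,
# so these three are strongly collinear. x4 is independent of the rest.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)
x4 = rng.normal(size=n)
X_demo = np.column_stack([x1, x2, x3, x4])

# VIF_i = 1 / (1 - R_i^2) is the i-th diagonal entry of the inverse
# correlation matrix.
corr_demo = np.corrcoef(X_demo, rowvar=False)
vif = np.diag(np.linalg.inv(corr_demo))

for name, value in zip(["x1", "x2", "x3", "x4"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

A VIF close to 1 means a feature is essentially uncorrelated with the others; a common rule of thumb treats values above roughly 5 to 10 as a warning sign.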

    corr = X.corr()

    # Mask the upper triangle so each correlation is shown only once
    # (np.bool was removed from newer NumPy releases, so we use the builtin bool)
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    f, ax = plt.subplots(figsize=(11, 9))
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})

## Short reminder of Logistic Regression

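In logistic regression the logits (logs of the odds) are assumed to be a linear function of the features

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{j} \beta_j x_j.$$

Solving this equation for $p$ yields

$$p = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j} \beta_j x_j\right)}}.$$

The parameters can be derived by maximum likelihood estimation (MLE). The likelihood for a given observation with label $y \in \{0, 1\}$ is

$$p^{y} (1 - p)^{1 - y}.$$

Finding the maximum of the likelihood is equivalent to minimizing the negative logarithm of the likelihood (negative log-likelihood) over all $N$ observations

$$-\sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],$$

which is numerically more stable. The negative log-likelihood has the same form as the cross-entropy error function for a discrete distribution

$$H(y, p) = -\sum_{k} y_k \log p_k.$$

So finding the maximum likelihood estimator is the same problem as minimizing the average cross-entropy error function.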
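The relationship between logits, probabilities and the cross-entropy loss can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up toy data set; the function names are my own and it is not how scikit-learn implements the model internally.

```python
import math


def sigmoid(z):
    """Logistic function: maps a logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, weights, bias):
    """The logit is a linear function of the features; sigmoid turns it into p."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(logit)


def avg_cross_entropy(X, y, weights, bias):
    """Average negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, target in zip(X, y):
        p = predict_probability(x, weights, bias)
        total -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return total / len(y)


# Tiny made-up data set with two features.
X_toy = [[0.5, 1.0], [-1.0, 0.2], [1.5, -0.3], [-0.7, -1.2]]
y_toy = [1, 0, 1, 0]

# With all parameters at zero every probability is 0.5, so the loss is log(2).
loss_zero = avg_cross_entropy(X_toy, y_toy, weights=[0.0, 0.0], bias=0.0)
# Parameters that separate the two classes give a lower loss.
loss_good = avg_cross_entropy(X_toy, y_toy, weights=[2.0, 0.5], bias=0.0)

print(loss_zero, loss_good)
```

Minimizing `avg_cross_entropy` over the weights and the bias is exactly the maximum likelihood problem described above, up to the regularization term that scikit-learn adds.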

At the time of writing, scikit-learn by default uses a coordinate descent algorithm (the liblinear solver) to find the minimum of the L2-regularized version of the loss function (see http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression). More recent releases default to the L-BFGS solver instead.

The main difference between L1 (Lasso) and L2 (Ridge) regularization is that L1 prefers a sparse solution (the higher the regularization parameter, the more parameters will be zero), while L2 enforces small parameter values.
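The different behaviour of the two penalties is easiest to see in a one-dimensional toy problem with a quadratic loss, where both minimizers have a closed form. This is a deliberately simplified stand-in for the logistic loss; `ridge_solution` and `lasso_solution` are hypothetical helper names.

```python
def ridge_solution(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2: shrinks a proportionally."""
    return a / (1.0 + lam)


def lasso_solution(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w): soft thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0


# Hypothetical unpenalized parameter estimates and penalty strength.
unpenalized = [3.0, 0.8, -0.3]
lam = 1.0

ridge = [ridge_solution(a, lam) for a in unpenalized]
lasso = [lasso_solution(a, lam) for a in unpenalized]

print("L2:", ridge)
print("L1:", lasso)
```

L2 makes every parameter smaller but leaves it non-zero, while L1 sets the two small parameters exactly to zero and keeps only the large one.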

## Train the model

### Training and test set

First we split our data set into a training and a test set by using the function `train_test_split`.

    np.random.seed(42)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

### Model definition

    scaler = StandardScaler()
    lr = LogisticRegression()
    model1 = Pipeline([('standardize', scaler),
                       ('log_reg', lr)])
    model1.fit(X_train, y_train)

In preparation we standardize our features to have zero mean and unit standard deviation, which improves the convergence of gradient descent algorithms. We use the class `StandardScaler`. It has the method `fit_transform()`, which learns the mean $\mu_i$ and standard deviation $\sigma_i$ of each feature $i$ and returns a standardized version $\frac{x_i - \mu_i}{\sigma_i}$. We learn the mean and standard deviation on the training data. We can then apply the same standardization to the test set with the method `transform()`.
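What the scaler does can be reproduced by hand with NumPy. The sketch below uses synthetic stand-ins for the training and test features (not the credit card data) and shows the important detail: the mean and standard deviation are learned on the training data only and then reused for the test data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for a training and a test feature matrix, with two
# features on very different scales.
train = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(1000, 2))
test = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(200, 2))

# "fit": learn mean and standard deviation on the training data only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)

# "transform": apply the training statistics to both sets.
train_std = (train - mu) / sigma
test_std = (test - mu) / sigma

print(train_std.mean(axis=0), train_std.std(axis=0))
print(test_std.mean(axis=0), test_std.std(axis=0))
```

The standardized training features have exactly zero mean and unit standard deviation, while the test features only get close to that, since their statistics were never used.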

The logistic regression is implemented in the class `LogisticRegression`; for now we will use the default parameterization. The model can be fit using the method `fit()`. After fitting, the model can be used to make predictions with `predict()` or to return the estimated class probabilities with `predict_proba()`.

We combine both steps into a `Pipeline`. The pipeline performs both steps automatically. When we call the method `fit()` of the pipeline, it invokes the method `fit_transform()` for all but the last step and the method `fit()` of the last step, which is equivalent to `lr.fit(scaler.fit_transform(X_train), y_train)`. If we invoke the method `predict()` of the pipeline, it is equivalent to `lr.predict(scaler.transform(X_train))`.
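The equivalence between the pipeline and the two manual steps can be verified on a small example. The sketch below uses synthetic two-class data instead of the credit card data, and the names `pipe`, `X_fit` and `X_new` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic two-class data with features on very different scales.
X_demo = rng.normal(size=(500, 3)) * [1.0, 10.0, 100.0]
noise = rng.normal(scale=0.5, size=500)
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10.0 + noise > 0).astype(int)
X_fit, X_new = X_demo[:400], X_demo[400:]
y_fit = y_demo[:400]

# Variant 1: the pipeline does scaling and fitting in one call.
pipe = Pipeline([("standardize", StandardScaler()),
                 ("log_reg", LogisticRegression())])
pipe.fit(X_fit, y_fit)

# Variant 2: the same two steps done by hand.
scaler = StandardScaler()
lr = LogisticRegression()
lr.fit(scaler.fit_transform(X_fit), y_fit)

pipe_probs = pipe.predict_proba(X_new)[:, 1]
manual_probs = lr.predict_proba(scaler.transform(X_new))[:, 1]

print(np.allclose(pipe_probs, manual_probs))
```

The practical advantage of the pipeline is that it is impossible to forget the scaling step, or to accidentally refit the scaler on the test data.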

### Training score and Test score

`confusion_matrix()` returns the confusion matrix $C$, where $C_{0,0}$ are the true negatives (TN) and $C_{0,1}$ the false positives (FP), and correspondingly $C_{1,0}$ the false negatives (FN) and $C_{1,1}$ the true positives (TP) in the second row. We use the function `accuracy_score()` to calculate the accuracy of our model on the training and test data. We see that the accuracy is quite high (99.9%), which is expected in such an imbalanced class problem. With the function `roc_auc_score()` we can get the area under the receiver operating characteristic curve (AUC) for our simple model.

    y_train_hat = model1.predict(X_train)
    y_train_hat_probs = model1.predict_proba(X_train)[:,1]

    train_accuracy = accuracy_score(y_train, y_train_hat)*100
    train_auc_roc = roc_auc_score(y_train, y_train_hat_probs)*100

    print('Confusion matrix:\n', confusion_matrix(y_train, y_train_hat))
    print('Training accuracy: %.4f %%' % train_accuracy)
    print('Training AUC: %.4f %%' % train_auc_roc)

    Confusion matrix:
     [[213200     26]
     [   137    242]]
    Training accuracy: 99.9237 %
    Training AUC: 98.0664 %

    y_test_hat = model1.predict(X_test)
    y_test_hat_probs = model1.predict_proba(X_test)[:,1]

    test_accuracy = accuracy_score(y_test, y_test_hat)*100
    test_auc_roc = roc_auc_score(y_test, y_test_hat_probs)*100

    print('Confusion matrix:\n', confusion_matrix(y_test, y_test_hat))
    print('Test accuracy: %.4f %%' % test_accuracy)
    print('Test AUC: %.4f %%' % test_auc_roc)

    Confusion matrix:
     [[71077    12]
     [   45    68]]
    Test accuracy: 99.9199 %
    Test AUC: 97.4810 %

Our model is able to detect 68 fraudulent transactions out of 113 (a recall of 60%) and produces 12 false positives (<0.02%) on the test data.
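These figures follow directly from the entries of the test confusion matrix printed above. As a quick sanity check, the short sketch below recomputes them by hand; the variable names are my own.

```python
# Test-set confusion matrix as printed above: rows are the true classes,
# columns the predicted classes.
tn, fp = 71077, 12
fn, tp = 45, 68

recall = tp / (tp + fn)               # share of frauds that were detected
precision = tp / (tp + fp)            # share of fraud alarms that were real
false_positive_rate = fp / (fp + tn)  # share of genuine transactions flagged
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"Recall:              {recall:.2%}")
print(f"Precision:           {precision:.2%}")
print(f"False positive rate: {false_positive_rate:.4%}")
print(f"Accuracy:            {accuracy:.4%}")
```

Note how the accuracy is dominated by the true negatives, while recall and precision only look at the rare fraud class.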

To visualize the receiver operating characteristic (ROC) curve we use the function `roc_curve`. It returns the true positive rate (recall) and the false positive rate (the probability of a false alarm) for a range of different thresholds. This curve shows the trade-off between recall (detecting fraud) and the false alarm probability.

If we classify all transactions as fraud, we would have a recall of 100% but also the highest false alarm rate possible (100%). The naive way to minimize the false alarm probability is to classify all transactions as genuine, which in turn reduces the recall to 0%.
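The two extreme cases are the end points of the ROC curve; every threshold in between gives one more point. A minimal sketch of this threshold sweep in plain Python, with made-up labels and scores and a hypothetical helper `roc_point`, could look like this:

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) for one threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos


# Made-up labels (1 == fraud) and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.6, 0.15, 0.9, 0.7, 0.4, 0.25]

# Threshold 0.0 flags everything as fraud, a threshold above 1 flags nothing.
for threshold in [0.0, 0.25, 0.5, 0.75, 1.01]:
    fpr_t, tpr_t = roc_point(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr_t:.2f}  TPR={tpr_t:.2f}")
```

`roc_curve` does essentially the same, but picks the thresholds from the observed scores and computes all points in one pass.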

    fpr, tpr, thresholds = roc_curve(y_test, y_test_hat_probs, drop_intermediate=True)

    f, ax = plt.subplots(figsize=(9, 6))
    _ = plt.plot(fpr, tpr, [0,1], [0, 1])
    _ = plt.title('AUC ROC')
    _ = plt.xlabel('False positive rate')
    _ = plt.ylabel('True positive rate')
    plt.style.use('seaborn')  # this style is named 'seaborn-v0_8' in newer matplotlib releases
    plt.savefig('auc_roc.png', dpi=600)

Our model classifies all transactions with a fraud probability ≥ 50% as fraud. If we choose a higher threshold, we can reach a lower false positive rate but will also miss more fraudulent transactions. If we choose a lower threshold, we can catch more fraud but need to investigate more false positives.

Depending on the costs of each type of error, it can make sense to select another threshold.
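One simple way to make this choice explicit is to assign a cost to each error type and pick the threshold with the lowest total cost. The sketch below does this on made-up scores; the cost figures, the candidate thresholds and the helper names are purely hypothetical.

```python
def total_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Total cost of missed frauds and false alarms at a given threshold."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * cost_fn + fp * cost_fp


# Made-up labels (1 == fraud) and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.6, 0.15, 0.9, 0.7, 0.4, 0.25]
candidates = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9]


def best_threshold(cost_fn, cost_fp):
    """Candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: total_cost(y_true, scores, t, cost_fn, cost_fp))


# Missed frauds are expensive: a low threshold wins.
print(best_threshold(cost_fn=100.0, cost_fp=1.0))
# False alarms are expensive: a high threshold wins.
print(best_threshold(cost_fn=1.0, cost_fp=100.0))
```

In practice the costs would come from the business side, for example the average loss per undetected fraud versus the cost of manually reviewing a flagged transaction.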

If we set the threshold to 90%, the recall decreases from 60% to 45%, while the false positive rate stays the same. We can see that our model assigns some non-fraudulent transactions a very high probability of being fraud.

    y_hat_90 = (y_test_hat_probs > 0.90) * 1
    print('Confusion matrix:\n', confusion_matrix(y_test, y_hat_90))
    print(classification_report(y_test, y_hat_90, digits=6))

If we lower the threshold to 10%, we can detect around 78% of all fraud cases (88 out of 113) but roughly double our number of false positives (now 25 false alarms).

    Confusion matrix:
     [[71064    25]
     [   25    88]]
                 precision    recall  f1-score   support

              0     0.9996    0.9996    0.9996     71089
              1     0.7788    0.7788    0.7788       113

    avg / total     0.9993    0.9993    0.9993     71202

## Where to go from here?

We just scratched the surface of sklearn and logistic regression. For example, we could

- spend much more time on feature selection / engineering (which is a bit hard without any background information about the features),
- try techniques to counter the data imbalance,
- use cross-validation to fine-tune the hyperparameters,
- try a different regularization (Lasso / Elastic Net),
- try a different optimizer (stochastic gradient descent or mini-batch SGD),
- adjust class weights to shift the decision boundary (make missed frauds more expensive in the loss function),
- and finally try different classifier models in sklearn like decision trees, random forests, kNN, naive Bayes or support vector machines.
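To give a flavour of the class weight idea from the list above: scikit-learn's `class_weight='balanced'` option weights each class inversely proportional to its frequency, using `n_samples / (n_classes * count)`. The sketch below computes these weights by hand from the class counts of our data set; the variable names are my own.

```python
# Class counts from the data set description (1 == fraud).
n_samples = 284_807
class_counts = {0: n_samples - 492, 1: 492}
n_classes = len(class_counts)

# Same heuristic as class_weight='balanced' in scikit-learn:
# weight_c = n_samples / (n_classes * count_c)
class_weights = {c: n_samples / (n_classes * count)
                 for c, count in class_counts.items()}

for c, w in class_weights.items():
    print(f"class {c}: weight {w:.4f}")
```

With these weights a missed fraud costs several hundred times more in the loss function than a misclassified genuine transaction. They could be passed to the model as `LogisticRegression(class_weight=class_weights)`, or more simply via `class_weight='balanced'`.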

But for now we will stop here. In the next part we will implement the logistic regression model with stochastic gradient descent in TensorFlow and then extend it to a neural net; we will come back to these points at a later time. In the meantime, feel free to play with the notebook, change the parameters and see how the model changes.

So long…

This was quite the change! Best of luck with the new job.

Thank you very much.