Fast Monte-Carlo Pricing and Greeks for Barrier Options using GPU computing on Google Cloud Platform in Python

In this tutorial we will see how to speed up Monte-Carlo Simulation with GPU and Cloud Computing in Python using PyTorch and Google Cloud Platform. Motivated from my experience developing a RNN for anomaly detection in PyTorch I wanted to port the option pricing code from my previous posts from TensorFlow to PyTorch. PyTorch offers similar to TensorFlow auto-gradients, also known as algorithmic differentiation, but the programming style is quite different to TensorFlow. We will see how easy it is to run our code on a GPU with PyTorch. The calculation of a NPV and all first order Greeks of a simple Down-and-Out Option is 45 times faster (7.3 s vs. 160ms) on a Nvidia K80 compared to a calculation on the CPU of my MacBook Pro 15 from 2016 with a 2.6 GHz i7 and 16 GB Ram.

  • Update 1 The purpose of this example is to illustrate how to use Algorithmic Differentiation and GPU Computing with PyTorch in Python. There are more appropriate pricing models and methods for Barrier Options. This is a very naive approach in Black Scholes setting without taking any volatility smile into account.

The notebooks are available on GitHub https://github.com/mgroncki/IPythonScripts/PricingPyTorch

First we will look into analytical pricing of plain vanilla options and compare a pure Numpy versus a PyTorch and the previous TensorFlow implementation before we implement the Monte-Carlo Simulation in PyTorch.

Analytical Plain Vanilla Pricing

Lets start with plain vanilla call option in a Black Scholes World.

Maturtiy: 1 year
Spot : 100
Strike : 101
Volatility: 30.0 %
Risk free rate: 1.0 %

Numpy Implementation

We are using the same implementation as in the TensorFlow notebook

## Plain Vanilla Call in TensorFlow

def blackScholes_py(S_0, strike, time_to_expiry, implied_vol, riskfree_rate):
    S = S_0
    K = strike
    dt = time_to_expiry
    sigma = implied_vol
    r = riskfree_rate
    Phi = stats.norm.cdf
    d_1 = (np.log(S_0 / K) + (r+sigma**2/2)*dt) / (sigma*np.sqrt(dt))
    d_2 = d_1 - sigma*np.sqrt(dt)
    return S*Phi(d_1) - K*np.exp(-r*dt)*Phi(d_2)

Screenshot 2018-11-18 at 16.38.38

And as expected its super fast. No surprises here. Next we will implement the same formula in PyTorch.

PyTorch Implementation

There are only minimal code changes compared to the numpy version required. In the actual pricing function we just need to replace np with torch and exchange the cdf function to use the PyTorch one and we have to convert our input into torch.tensor.

def blackScholes_pyTorch(S_0, strike, time_to_expiry, implied_vol, riskfree_rate):
    S = S_0
    K = strike
    dt = time_to_expiry
    sigma = implied_vol
    r = riskfree_rate
    Phi = torch.distributions.Normal(0,1).cdf
    d_1 = (torch.log(S_0 / K) + (r+sigma**2/2)*dt) / (sigma*torch.sqrt(dt))
    d_2 = d_1 - sigma*torch.sqrt(dt)
    return S*Phi(d_1) - K*torch.exp(-r*dt)*Phi(d_2)

Screenshot 2018-11-18 at 16.41.06

We see now significant speed difference between Numpy or PyTorch.

Seems the PyTorch version is even faster as the pure numpy version. How can we use the auto-grad function in PyTorch to get the greeks?

Greeks in PyTorch

We just need to call the .backward() operator of the tensor which stores the prices and we can access the greeks with the .grad properity.

%%timeit
S_0 = torch.tensor([100.],requires_grad=True)
K = torch.tensor([101.],requires_grad=True)
T = torch.tensor([1.],requires_grad=True)
sigma = torch.tensor([0.3],requires_grad=True)
r = torch.tensor([0.01],requires_grad=True)
npv_pytorch = blackScholes_pyTorch(S_0, K, T, sigma, r)
npv_pytorch.backward()

639 µs ± 9.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Its almost 2.5-3 times slower but it gives us five greeks. A naive finite-difference approximation would costs us at least 6 calculations and would be only an numerical approximation. Here we have ‘exact’ derivates.

Second order greeks in Pytorch

The 2nd order greeks are a bit tricky and not so straight-forward. We need
to create a computational graph of the gradient. We use the function .grad() from the autograd module. And can then call the backward method on it.

gradient = torch.autograd.grad(npv_pytorch, S_0, create_graph=True)
delta, =  gradient
delta.backward(retain_graph=True)
print('Delta: ', delta)
print('Gamma', S_0.grad)

gamma_plot.png

Now let’s recap the results from the TensorFlow implementation.

TensorFlow implementation

Using the same code as in the original notebook (but I removed the calculation of the 2nd order greeks. There is a bit of overhead for constructing the computational graph.

def blackScholes_tf_pricer():
    # Build the static computational graph
    S = tf.placeholder(tf.float32)
    K = tf.placeholder(tf.float32)
    dt = tf.placeholder(tf.float32)
    sigma = tf.placeholder(tf.float32)
    r = tf.placeholder(tf.float32)
    Phi = tf.distributions.Normal(0.,1.).cdf
    d_1 = (tf.log(S / K) + (r+sigma**2/2)*dt) / (sigma*tf.sqrt(dt))
    d_2 = d_1 - sigma*tf.sqrt(dt)
    npv =  S*Phi(d_1) - K*tf.exp(-r*dt)*Phi(d_2)
    greeks = tf.gradients(npv, [S, sigma, r, K, dt])
    def execute_graph(S_0, strike, time_to_expiry, implied_vol, riskfree_rate):
        with tf.Session() as sess:
            res = sess.run([npv, greeks], 
                           {
                               S: S_0,
                               K : strike,
                               r : riskfree_rate,
                               sigma: implied_vol,
                               dt: time_to_expiry})
        return res
    return execute_graph 

Screenshot 2018-11-18 at 16.48.52.png

We are roughly factor 1000 times slower in TensorFlow. Maybe my implementation is just bad. Any feedback how to improve it would be very appreciated.

First Observation

So if anyone has already some pricing routines in Python using Numpy it seems to be much easier to port them to PyTorch instead to TensorFlow. And porting the code wouldn’t have any significant negative impacts on the speed.

Now we want to come to the core of this tutorial.

Monte Carlo Pricing for Single Barrier Option

Our example option is a down-and-out barrier with

Maturtiy: 2 year
Spot : 100
Strike : 110
Volatility: 20.0 %
Risk free rate: 3.0 %
Barrier at 90

It’s the same option as in my previous post and we gonna use the same Numpy implementation

def monte_carlo_down_out_py(S_0, strike, time_to_expiry, implied_vol, riskfree_rate, barrier, steps, samples):
    stdnorm_random_variates = np.random.randn(samples, steps)
    S = S_0
    K = strike
    dt = time_to_expiry / stdnorm_random_variates.shape[1]
    sigma = implied_vol
    r = riskfree_rate
    B = barrier
    # See Advanced Monte Carlo methods for barrier and related exotic options by Emmanuel Gobet
    B_shift = B*np.exp(0.5826*sigma*np.sqrt(dt))
    S_T = S * np.cumprod(np.exp((r-sigma**2/2)*dt+sigma*np.sqrt(dt)*stdnorm_random_variates), axis=1)
    non_touch = (np.min(S_T, axis=1) > B_shift)*1
    call_payout = np.maximum(S_T[:,-1] - K, 0)
    npv = np.mean(non_touch * call_payout)
    return np.exp(-time_to_expiry*r)*npv

For a simulation with 100.000 paths and 1.000 time steps we need approximately 4.8 sec.

Screenshot 2018-11-18 at 17.49.35.png

And again the PyTorch implementation looks very familiar. We have to replace the random number generator and replace the numpy functions with torch functions. There are two cumberstones, there is no equivalent to np.maximum, so we have to mask the negative payoff ourself and set it to zero and we have to convert the boolean tensor into a float tensor.

def monte_carlo_down_out_torch(S_0, strike, time_to_expiry, implied_vol, riskfree_rate, barrier, steps, samples):
    stdnorm_random_variates = torch.distributions.Normal(0,1).sample((samples, steps))
    S = S_0
    K = strike
    dt = time_to_expiry / stdnorm_random_variates.shape[1]
    sigma = implied_vol
    r = riskfree_rate
    B = barrier
    # See Advanced Monte Carlo methods for barrier and related exotic options by Emmanuel Gobet
    B_shift = B*torch.exp(0.5826*sigma*torch.sqrt(dt))
    S_T = S * torch.cumprod(torch.exp((r-sigma**2/2)*dt+sigma*torch.sqrt(dt)*stdnorm_random_variates), dim=1)
    non_touch = torch.min(S_T, dim=1)[0] > B_shift
    call_payout = S_T[:,-1] - K
    call_payout[call_payout<0]=0
    npv = torch.mean(non_touch.type(torch.FloatTensor) * call_payout)
    return torch.exp(-time_to_expiry*r)*npv

The Monte-Carlo Simulation including calculating the pathwise greeks take the same time as the pure NPV Monte-Carlo Simulation in Numpy.

Screenshot 2018-11-18 at 17.02.13.png

Second Observation

Running Monte-Carlo Simulations in PyTorch on a CPU seems to be the same speed as Numpy implementation (double duration but calculate also greeks in the same time).

Now we want to make things really fast and run it on a GPU.

Monte-Carlo Simulation of Barrier Options on GPUs

GCP Setup of GPU computing

Since I have no Nvidia GPU in my laptop we will run our experiment on Google Cloud Platform. You can easily signup for GCP and you can receive 300 USD free credit for 12 months. To use GPU computing you need to check in which zones GPUs are available. First time users need to request the GPU usage first, the approval takes usually less than 1 day. You can follow the tutorial here:
View story at Medium.com

Just replace the step 8 with the AISE PyTorch NVidia GPU Notebook.

The costs for instance with a Nvidia K80 is less than 2 USD per hour, and our example notebook will run only a few minutes.

Once you the virtual machine is started you can connect directly to the Jupyter notebook server and work in your browser as it would run on our local machine.

In my GitHub repository there is a dedicated notebook BarrierOptionGPU.ipynb which we can upload to the server.

Cuda Implementation of our Monte-Carlo Simulation

Again we need only minimal changes to our code. We need to change the random number generator to run on the CUDA device torch.cuda.FloatTensor(steps, samples).normal_() and after casting our boolean tensor we need to move it to the GPU non_touch = non_touch.type(torch.cuda.FloatTensor) and we have to copy all our inputs to the GPU as well torch.tensor([...],requires_grad=True, device='cuda').

def monte_carlo_down_out_torch_cuda(S_0, strike, time_to_expiry, implied_vol, riskfree_rate, barrier, steps, samples):
    stdnorm_random_variates = torch.cuda.FloatTensor(steps, samples).normal_()
    S = S_0
    K = strike
    dt = time_to_expiry / stdnorm_random_variates.shape[1]
    sigma = implied_vol
    r = riskfree_rate
    B = barrier
    # See Advanced Monte Carlo methods for barrier and related exotic options by Emmanuel Gobet
    B_shift = B*torch.exp(0.5826*sigma*torch.sqrt(dt))
    S_T = S * torch.cumprod(torch.exp((r-sigma**2/2)*dt+sigma*torch.sqrt(dt)*stdnorm_random_variates), dim=1)
    non_touch = torch.min(S_T, dim=1)[0] > B_shift
    non_touch = non_touch.type(torch.cuda.FloatTensor)
    call_payout = S_T[:,-1] - K
    call_payout[call_payout<0]=0
    npv = torch.mean(non_touch * call_payout)
    return torch.exp(-time_to_expiry*r)*npv

Screenshot 2018-11-18 at 17.21.42.png

The pricing and greek calculation is 45 times faster than on CPUs.

Conclusion

Writing and coding this tutorial was great fun for me. PyTorch feels for me much easier and cleaner to use for writing pricing algorithm compared to TensorFlow, which maybe will change with TensorFlow 2.0 which is a major redesign. With PyTorch it’s very easy to implement Monte-Carlo Simulations with Adjoint Greeks and running the code on GPUs is seamless even without experience in GPU code in C++. I wonder how fast a C++/CUDA implementation would be? Maybe something for future post.

So long…

Advertisements

Fraud detection: Behavioural modeling and unsupervised anomaly detection with deep learning

Fraud detection is the like looking for a needle in a haystack. The behaviour of a fraudster will differ from the behaviour of a legitimate user but the fraudsters will also try to conceal their activities and they will try to hide in the mass of legitimate transactions. Machine Learning can help to spot these transactions but supervised learning methods may have problems to detect complete new fraud patterns and most of the data isn’t labeled at all, so we can’t apply supervised learning methods.

This tutorial will show how to reuse ideas from language modeling and apply deep learning, recurrent neural networks (LSTM) and embedding layers in particular to learn behavioural patterns/profiles from transactions and detect anomalies in these patterns (which could be a fraudulent transaction). The basic idea is to use a neural network to learn a lower dimensional representation of the input and then apply a classical outlier detection method on this. This approach doesn’t rely on labeled data. The network is implemented in Python using PyTorch.

First one off-topic comment

I decided to clean up my GitHub repository and split it by topics. So there is a new dedicated repository about my fraud detection blog posts (https://www.github.com/mgroncki/frauddetection) and there will another one about Quantitative Finance. Since I don’t want to break all the links in my old post, I will keep the old ones repos and mark them as legacy repos. Maybe there will be also a third repo about general data science topics later. But now let’s come back to the topic.

Problem description

Assuming we have transactional data (e.g. payment history, a log file of an application or website) and we want to identify suspicious (unusual) activities. We will use an RNN to learn user profiles based on their transactional behaviour and search for anomalies in these profiles. We will use an example with artificial data to train and test the network.

In our example the users can login in our system and can perform 5 different actions (action_1, …, action_5) we log all activities together with the user id, time/date of the activity and session id. An example session/activity look like this:

login -> action_1 -> action_2 -> action_5 -> logout

We have two different kind of users (e.g. supervisor and regular staff or retail and wholesale customer, etc) who differ in their behavior.

We simulate two hundred sessions for two hundred users (80% role A and 20% role B) using two different discrete Markov processes.

Brief reminder: In a Markov process the probability of the next action (state) depends only on the current action (state) and the Markov chain can be represented with a stochastic matrix where the entry in the i-th row and j-th colum is transition probability from state i to state j.

Here the transition matrices for our example:

actions = ['start', 'end',
'action_1', 'action_2', 'action_3',
'action_4', 'action_5']

# Normal behavior Role 1
np.array([
[0.00, 0.00, 0.20, 0.20, 0.20, 0.20, 0.20],
[1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
[0.00, 0.01, 0.09, 0.30, 0.30, 0.15, 0.15],
[0.00, 0.60, 0.05, 0.10, 0.05, 0.05, 0.15],
[0.00, 0.50, 0.05, 0.25, 0.05, 0.10, 0.05],
[0.00, 0.60, 0.01, 0.10, 0.10, 0.10, 0.09],
[0.00, 0.60, 0.09, 0.10, 0.10, 0.10, 0.01],
]),

# Normal behavior Role 2
np.array([
[0.00, 0.00, 0.20, 0.10, 0.10, 0.30, 0.30],
[1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
[0.00, 0.10, 0.20, 0.20, 0.20, 0.10, 0.20],
[0.00, 0.70, 0.05, 0.05, 0.05, 0.05, 0.10],
[0.00, 0.70, 0.05, 0.05, 0.05, 0.10, 0.05],
[0.00, 0.50, 0.01, 0.01, 0.01, 0.10, 0.37],
[0.00, 0.60, 0.09, 0.10, 0.10, 0.10, 0.01],
]),

The transition probabilities of both user roles differ slightly and are even the same for some states.

Let’s now talk about the fraudster in our example.

Two percent of our users are potential fraudsters and for each session there is a 20% chance that the potential fraudster will commit fraud. If he is in the state ‘Fraud’ the session will be sampled from the fraud transition matrix.

np.array([
[0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00],
[1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
[0.00, 0.20, 0.70, 0.025, 0.025, 0.025, 0.025],
[0.00, 0.40, 0.40, 0.05, 0.05, 0.05, 0.05],
[0.00, 0.40, 0.40, 0.05, 0.05, 0.05, 0.05],
[0.00, 0.50, 0.01, 0.01, 0.01, 0.10, 0.37],
[0.00, 0.60, 0.09, 0.10, 0.10, 0.10, 0.01],
])

As fraudster have a much higher probability to perform action_1 repeatably or return to action_1 from other states (e.g. search customer information to steal data).

In total we have 40.000 transaction of which 111 transactions are fraudulent committed by 3 users out of 200 users.

The Jupyter notebook ‘SampleData’ in the project folder will generate this data sample and can be easily modified.

This example is just for educational purposes and it is by construction so simple that we could spot the fraudulent behaviour with very simple methods using feature engineering (e.g. users with highest count of action_1 in average), but in real world applications we would have maybe 100 or 1000 of different actions with more than two different type of users and the fraudulent behaviour would become more complex.

So how can a neural network learn the typical behaviour of our users?

Language models, Word Embeddings and how user sessions/activities are related to it

To determine if a sequence of actions (activity or session) is an anomaly we need to know the underlying distribution. Our five actions can be compared to words in a natural language and user sessions to sentences or paragraphs in a text. So maybe we can solve our problems with techniques which are used in NLP (natural language processing).

In a language model the probability of the next word is depending on the previous words in the same context. During the training of a language model one try to predict the next word given the previous words. It’s like fill in the blank in a text. While minimizing the prediction error the model will learn the conditional probabilities. It’s a supervised learning task and recurrent neural networks with embedding layers are very successful applied to it. So the basic idea is to use a language model network with an embedding layer and feed our sequences into it and use the latent representation (embeddings) to derive user profiles.

There are several papers in which the authors transferring the idea of embeddings and RNNs from NLP into context of user profiling for recommendation systems.

For more details on recurrent networks, language models, embeddings (word2vec) have a look here:

Network design and training

For our simple example we use a very simple network architecture:

  • 3 dimensional Embedding Layer
  • 2 layer LSTM network with 20 nodes per layer and 20% dropouts
  • Decoder layer

This is a very simple RNN compared to recent state of the art networks in natural language processing. So there is much space left for improvements. But this design is sufficient to present the idea of this approach.

We split our data in 80% training and 20% validation data and train in mini batches of 100 users sessions per mini batch. Since the sessions differ in their length we apply zero-padding (filling the time series).

We train the network for 20 epochs using RMSProp and learning rate decay with an initial learning rate of 0.05.

Implementation in PyTorch

You can download the complete source code from GitHub (https://github.com/mgroncki/FraudDetection).

This is my first PyTorch project and the implementation was quite simple and the documentation of the project is very good and there are many good tutorials available.

My code it’s based on the example on the official PyTorch language model example (https://www.github.com/pytorch/examples/tree/master/word_language_model) with some modification/simplifications:

First we prepare the sequential data and transfer the action strings into integers. We do it in this example by hand create a dictionary of action to id and apply this dictionary to our list of actions. There also NLP libraries which provide these functionalities (e.g. gensim).


logfile['SessionActivityInt'] = logfile.SessionActivity.map(lambda ls: np.array([action2id[a] for a in ls]+[action2id['start']]))

The result is a column of list of sessions, represented as integers.
In the next step we create a function which generates the mini batches and pad the sequences and convert everything to PyTorch Tensors and copy it to a GPU if available.

def split_train_test(input, device, prop=0.8, seed=42):
    np.random.seed(42)
    mask = np.random.uniform(size=input.shape[0])0].sum() / y[y>0].size(0)
    return y_probs, y_predict, y, loss, acc

def get_batch(i, batch_size, input):
    '''
    Takes a column/list of activity tensors of variable lenght
    and returns the padded i-th minibatch of batch_size activities 
    '''
    data = input[i*batch_size : (i+1) * batch_size]
    data = sorted(data, key=len, reverse=True)
    x = nn.utils.rnn.pad_sequence([x[:-1] for x in data])
    y = nn.utils.rnn.pad_sequence([y[1:] for y in data])
    return x, y

We build the network following the example from the official PyTorch example with some slights modifications. Adding support for padding sequences.

class BehaviourNet(nn.Module):
    '''
    Very simple network consisting of an embedding layer, LSTM layers and a decoder with dropouts
    '''
    def __init__(self, n_actions=6, embedding_size=3, n_nodes=6, n_layers=2, dropout=0.2, 
                 padding_idx=0, initrange=0.5):
        super(VerySimpleBehaviorNet, self).__init__()
        self.dropout = nn.Dropout(dropout)
        self.embedding = nn.Embedding(n_actions, embedding_size, padding_idx)
        self.rnn = nn.LSTM(embedding_size, n_nodes, n_layers, dropout=dropout)
        self.decoder = nn.Linear(n_nodes, n_actions)
        self.init_weights(initrange)
        self.n_nodes = n_nodes
        self.n_layers = n_layers
        
    def init_weights(self, initrange=0.1):
        self.embedding.weight.data.uniform_(-initrange, initrange)
        # Set the first row to zero (padding idx)
        self.embedding.weight.data[0,:] = 0
        print(self.embedding.weight)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)
        
    def init_hidden(self, batch_size):
        weight = next(self.parameters())
        return (weight.new_zeros(self.n_layers, batch_size, self.n_nodes),
                weight.new_zeros(self.n_layers, batch_size, self.n_nodes))
    
    def forward(self, input, hidden):
        emb = self.dropout(self.embedding(input))
        output, hidden = self.rnn(emb, hidden)
        output = self.dropout(output)
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden

We are going to use the standard cross-entropy loss function, which offers support for padded sequences, so there is no worry during the training but for the evaluation we want also to calculate the accuracy of the model on the validation data set and there we need to mask the padded time steps and exclude from the calculation.

def training(model, optimizer, scheduler, loss_function, data, batch_size, n_actions, clipping=0.5):
    model.train()
    n_batch = int(np.ceil(len(data) // batch_size))
    hidden = model.init_hidden(batch_size)
    scheduler.step()
    total_loss = 0.0
    for batch in range(n_batch):
        hidden = tuple(h.detach() for h in hidden)
        x,y = get_batch(batch, batch_size, data)
        optimizer.zero_grad()
        output, hidden = model(x, hidden)
        output_flatten = output.view(-1, n_actions)
        y_flatten = y.view(-1)
        loss = loss_function(output_flatten, y_flatten)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clipping)
        optimizer.step()
        total_loss += loss
    return total_loss / n_batch

def evaluate(model, loss_function, data, n_actions):
    model.eval()
    batch_size = len(data)
    hidden = model.init_hidden(batch_size)
    x,y = get_batch(0, batch_size, data)
    output, hidden = model(x, hidden)
    output_flatten = output.view(-1, n_actions)
    y_flatten = y.view(-1)
    loss = loss_function(output_flatten, y_flatten)
    y_probs = nn.Softmax()(output)
    y_predict = t.argmax(output, 2)
    y_predict[y==0]=0
    acc = (y_predict==y).double()[y>0].sum() / y[y>0].size(0)
    return y_probs, y_predict, y, loss, acc

Whats left is the training loop:

modelname = 'model_1'
model = VerySimpleBehaviorNet(initrange=10, n_layers=2, n_nodes=20, n_actions=len(id2action)).to(device)
loss_func = nn.CrossEntropyLoss(ignore_index=0)
optimizer = t.optim.RMSprop(model.parameters(), lr=0.05)
scheduler = t.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(20):
    training_loss = training(model, optimizer, scheduler, loss_func, train, 100, n_actions=len(id2action))
    y_prob, y_pred, y_true, test_loss, test_acc = evaluate(model, loss_func, test, n_actions=len(id2action))
    print(f'Epoch {epoch}\nTrain Loss : {training_loss} \t Val loss: {test_loss} \t Val Acc {test_acc}')

Results

First we have a look into the learned latent representation of the actions.

Screenshot 2018-11-11 at 15.02.31

At this moment the interpretation of this representation is, because of the nature of the data, quite difficult (no meaningful actions and random transitions probabilities).

But if we convert a sequence of activities into the same latent space by converting each action into the corresponding embedding vector and calculate the average of all actions in one sequence in each dimension we can observe a quite interesting pattern.

We observe that Role A and Role B user have overlapping transactions but fraudulent user sessions are more likely on the left upper corner.

Screenshot 2018-11-11 at 15.02.39

If we now average all sessions of a user in this 3 dimensional space, we can see that our network learned a representation of the activities which allows us to identify what role each user has and more important the three fraudsters are clearly outliers (red):

Screenshot 2018-11-11 at 15.02.47

Remark / Conclusion

In our example the network was able learn a meaningful representation of the user actions which can then be used to identify the fraudulent users and the suspicious transactions with classical outlier detection methods (in our simple case just through visual inspection). I think the idea of using embeddings to represent user profiles very promising. But be aware in this simple example you can reach the same result with much simpler methods, which would preferable (e.g. counting the number of actions in one session and normalise the vector and a apply a PCA on it). This is a very simple approach and there is much space for further extensions (e.g. add user embedding analog to document2vec) or we can reuse the embeddings in other models (transfer learning).

Using PyTorch for this project was very very straight forward (comparable to using numpy) and much easier to debug compared to the low level api of TensorFlow and good fun. But TensorFlow 2.x will address some of the issues (e.g. default eager mode, cleaner api, etc). Although I am a big fan of TensorFlow and the Estimator API, especially in connection with Google Cloud Platform, it will be definitely not the last time that I used PyTorch.

Thats its for today, and i hope you enjoyed reading this post.

On my bucket list for the next posts is are

  • porting the signature detection project from KNIME to Python and train it on the GCP and

  • extending this network and try it on real world data

  • exploring GANs and variational auto-encoders for fraud detection

  • tutorial about up and down sampling methods to handle imbalance data

Haven’t decided whats coming next, so if you have any comments or questions please drop me a message.

So long…

Signature Verification with deep learning / transfer learning using Keras and KNIME

In the previous posts we applied traditional Machine Learning methods and Deep Learning  in Python and KNIME to detect credit card fraud, in this post we will see how to use a pretrained deep neural networks to classify images of offline signatures into genuine and forged signatures. A neural network like this could support experts to fight cheque fraud.  We will use the VGG16 network architecture pertained on ImageNet. The technique we are going to apply is called transfer learning and allows us to use a deep network even if we have only limited data, as in our case.  The idea for this post is based on the paper ‘Offline Signature Verification with Convolutional Neural Networks‘ by Gabe Alvarez, Blue Sheffer and Morgan Bryant. We combine native KNIME nodes for the data preparation and extend the workflow with Python code in some nodes using Keras for designing and the network and the transfer learning process.

We will use the same data source for our training set: The signature collection of the ICDAR 2011 Signature Verification Competition (SigComp2011) which contains offline and online signature samples. The offline dataset comprises PNG images, scanned at 400 dpi, RGB color. The data can be downloaded from http://www.iapr-tc11.org/mediawiki/index.php/ICDAR_2011_Signature_Verification_Competition_(SigComp2011).

Data Preparation

0206012_02.png

The offline signatures images have all different size, so in the first step we resize them to have all the same size 150×150 and divide all pixel by 255 to scale our features in the [0,1] space.

With the List Files node we filter all images (pngs) in our training set folder.

Screen Shot 2018-08-26 at 11.33.46

Screen Shot 2018-08-26 at 11.34.03

In the next step (Rule Node) we mark the first 123 images are forged or fraudulent and the remaining images are genuine signatures.

Screen Shot 2018-08-26 at 11.34.12

In the following Metanode we read all images from the files using the image reader (table) node and then standardise the features and resize all images.

Screen Shot 2018-08-26 at 11.35.03

Screen Shot 2018-08-26 at 11.37.24.png

Screen Shot 2018-08-26 at 11.34.43

Screen Shot 2018-08-26 at 11.34.51

In the next step we perform a 80/20 split into training and test data and apply a data augmentation step (using Keras ImageDataGenerator to create more training data) which finalise our data preparation. We will cover the data augmentation in one of the coming posts.

Deep Learning network

We will load the weights of a VGG16 network pertained on ImageNet classification tasks using Keras Applications and strip of the last layers and replace them with our new dense layers and then train and fine tune the parameters our data. Instead of training the complete network which would require a lot more data and training time we will use pretrained weights and leverage the previous experience of the network. We just learn the last layers on our classification task.

Screen Shot 2018-08-26 at 11.44.08.png

Step 1: Load the pretrained network

We use the DL Python Network Creator Node and need to write a few line of Python code to load the network. We are using Keras, which will automatically download the weights.

Screen Shot 2018-08-26 at 11.50.58.png

Step 2: Replace Top Layers and freeze weights

In the next step we modify the network, we add 3 Dense Layer with Dropouts and ReLu and a Sigmoid activation at the end and freeze the weights of the original VGG network and again we need to write a few line of Python code:

Screen Shot 2018-08-26 at 11.53.44.png

Step 3: Train Top Layers

Then we train the new layers for 5 epochs with the Keras Network Learner Node:

This slideshow requires JavaScript.

Step 4 and 5: Unfreeze and fine tune

We modify the resulting network and unfreeze the last layers of the VGG16 network to fine-tune the pre-learned weights (3 layers) and train the network for another 10 epochs.

Screen Shot 2018-08-26 at 12.00.17

Screen Shot 2018-08-26 at 12.00.43

Apply Network and Test Results

In the last step we apply the network to the test data, convert the predicted probability into a class (p>0.5 -> Forged) and calculate the Confusion Matrix and AUC curve.

Screen Shot 2018-08-26 at 12.03.06

Confusion Matrix:

Screen Shot 2018-08-26 at 12.03.20

AUC Curve:

Screen Shot 2018-08-26 at 12.03.35

The results look very impression, actually a bit to good. We are maybe overfitting the data, since the test data may contains signatures (genuine and forged) from the same reference authors (since there are only 10 reference authors in the complete training set). The author of the paper noted that the performance of their network is very good on signatures of persons whose signatures has been seen in the training phase but that on unseen data it’s only little bit better then a naive baseline. In one of the next posts we will to check the network performance on new unseen signatures and try to train and test the model also on the SigComp2009 data which have signatures of 100 authors and we will look into detail in the data augmentation and maybe we compare this network to some more other network architectures.

So long…

 

 

 

 

Fooling Around with KNIME cont’d: Deep Learning

In my previous post I wrote about my first experiences with KNIME and we implemented three classical supervised machine learning models to detect credit card fraud. In the meantime I found out that the newest version of KNIME (at this time 3.6) supports also the deep learning frameworks TensorFlow and Keras. So I thought lets revisit our deep learning model for the fraud detection and try to implement in KNIME using Keras without writing one line of Python code.

Install the required packages

The requirement is you have Python with TensorFlow and Keras (you can install it with pip or conda, if you using the anaconda distribution) on your machine. Then you need to install the Keras integration extensions in KNIME, you can follow the official tutorial on https://www.knime.com/deeplearning/keras.

Workflow in KNIME

The first part of the workflow is quite similar to the previous workflow

Screen Shot 2018-08-04 at 20.30.33.png

We load the data, remove the time column, split the data in train, validation and test sets and normalise the features.

The configuration of the column filter node is quite straight forward, we specify the columns we want to include and exclude (no big surprise in the configuration).

Screen Shot 2018-08-04 at 20.32.33.png

Design of the deep network

For building our very simple 3 layer network we need 3 different new nodes, the Keras Input-Layer-Node, the Dense-Layer-Node and the DropOut-Node:

Screen Shot 2018-08-04 at 20.38.09.png

We start with the input layer and we have to specify the dimensionality of our input, in our case we have 29 features, we can also specify here the batch size.

Screen Shot 2018-08-04 at 20.39.24

The next layer is a dense (fully connected) layer. We can specify we number of nodes, the activation function.

Screen Shot 2018-08-04 at 20.40.41.png

After the dense layer we apply a drop-out with an dropout-rate of 20%, also the configuration is here quite straightforward.

Screen Shot 2018-08-04 at 20.43.28.png

We add then another dense-layer with 50 node, another dropout and the final layer with one node and the sigmoid activation function (binary classification: fraud or non-fraud).

The last layer of the network, the training data and the validation set are input to the Keras-Network-Learner Node.

 

Screen Shot 2018-08-04 at 20.45.08

 

We set the input, the target variable choose the loss function, optimisation method and number of epochs.

Screen Shot 2018-08-04 at 20.46.29Screen Shot 2018-08-04 at 20.46.36

Screen Shot 2018-08-04 at 20.47.02

We can specify a own loss function if we want or need to.

Screen Shot 2018-08-04 at 20.46.50

We can select an early stop strategy as well:

Screen Shot 2018-08-04 at 20.48.24.png

With the setting above the training will be stopped if the validation loss will no decrease more than 0.001 for at least 5 epochs.

During the training we can monitor the training process:

Screen Shot 2018-08-04 at 20.49.06Screen Shot 2018-08-04 at 20.49.11Screen Shot 2018-08-04 at 20.49.44Screen Shot 2018-08-04 at 20.50.26Screen Shot 2018-08-04 at 20.51.14

The trained model and the test data are the input for the DL-Network-Executer-Node which will use the trained network to classify the test set.

Screen Shot 2018-08-04 at 21.00.16

The results are plugged in a ROC-Curve-Node to asses the model quality.

Screen Shot 2018-08-04 at 21.01.28

And here the complete workflow:

Screen Shot 2018-08-05 at 07.49.47.png

Conclusion

It was very easy and fast to implement our previous model in KNIME without writing any line of code. Nevertheless the user still need to understand the concepts of  deep learning in order to build the network and understand the node configurations. I really liked the feature of the real time training monitor. In my view KNIME is a tool which can help to democratize data science within an organisation. Analyst who can not code in Python or R can have access to very good deep learning libraries for their data analytics without the burden to learn a new programming language, they can focus on understanding the underlying concepts of deep learning, understand their data and choosing the right model for the data and understand the drawbacks and limitations of their approach instead of spending hours learning programming in a new language. Of course you need to spent time to learn using KNIME, which is maybe for some people easier than learning programming.

I plan to spent some more time with KNIME in the future and I want to find out how to reuse parts of the workflow in new workflow (like the data preparation, which is almost exactly the same as in the previous example) and how to move model into production. I will report about it in same later posts.

So long…

 

Fooling around with KNIME

At the moment I am quite busy with preparing a two training course about ‘Programming and Quantitative Finance in Python’  and ‘Programming and Machine Learning in Python’ for internal trainings at my work, so I haven’t had much free time for my blog. But I spent the last two nights fooling around with KNIME, an open-source tool for data analytics / mining (Peter, I took inspiration from your blog to name today’s post) and I want to share my experience. In the beginning I was quite sceptical and my first thought was ‘I can write code faster then drag-n-drop a model’ (and I still believe it). But I wanted to give it a try and I migrated my logistic regression fraud detection sample from my previous blog posts into a graphical workflow.

Screen Shot 2018-07-19 at 21.03.13

In the beginning it was bit frustrating since I didn’t know which node to use and where to find all the settings. But the interface and the node names are quite self-explaining so after exploring some examples and watching one or two youtube videos I was able build my first fraud detection model in KNIME.

To work with a classifier we need to transfer the numerical variable Class (1=Fraud, 0=NonFraud) into a string variable. This was not obvious for me and coming from Python and SkLearn it felt a bit wired and unnecessary. After fixing that, I split the data in a training and test set with the Partioning node. With a right-click on the node we can adjust the configuration and can change the it to the common 80-20 split.

Screen Shot 2018-07-19 at 21.34.39

In the next step the data will be standardized (using a normalizer node). We can select which features / variable and how we want to scale them. We have plenty of settings to choose from, a nice feature is the online help (node description) on the right side of the UI, which describes the different parameter.

Screen Shot 2018-07-19 at 21.37.38

I have capsuled the model fitting, prediction and scoring into a meta-node to make the workflow look cleaner and more understandable. With a double-click on the meta-node we can open the sub-workflow.

Screen Shot 2018-07-19 at 21.12.52

I am fitting three different models (again encapsulated into meta-nodes) and combine the results (the AUC Score) into one table and write it and export it as a csv-file.

Lets have a detailed look into the logistic regression model meta-node.

Screen Shot 2018-07-19 at 21.16.06

The first node is the so-called learner node. It fits the model on the training data. In the configuration we can select the input features, target column, the solver algorithm, and advanced settings like regularizations.

Screen Shot 2018-07-19 at 21.45.22

To make predictions we use the predictor node. The inputs are the fitted model (square input) and the standardized test set. Into the normaliser (apply)  node, we feed the test set and the fitted normalizer (as far as I understand that is equivalent to use the transform method of a Scaler in Sklearn after fitting it before, please correct me if I am wrong). The prediction output will be used to calculate the AUC curve (in the configuration setting of the prediction node we have to add the predicted probabilities as an additional output, in the default settings is to output only the predicted class). We export the plot as an SVG file and the auc score (as a table with an extra column for the model name) is the output of our meta-node.

We can always investigate the output/result of one step, e.g. of the last node:

Screen Shot 2018-07-19 at 21.48.41

Screen Shot 2018-07-19 at 21.56.12.png

Or the interactive plot of the AUC node:

Screen Shot 2018-07-19 at 22.18.33

Or the model parameter output of the learner node:

Screen Shot 2018-07-19 at 21.57.15.png

The workflow for the other two models is quite similar. With copy and pasting the Logistic Regression meta-node, it was just replacing the learner and predictor node and adjusting the configurations.

To execute the complete workflow we just need to press the run/play button in the menu.

There is still much to discover and explore and try. For example there are node for cross-validation and feature selection which I haven’t tried yet and so many other nodes (e.g plotting, descriptive statistics and the Python and R integration nodes). And I haven’t tried to move a model into production, but I read that it should not be that difficult with KNIME (they promote it as a platform to create data science applications). I spent just a couple hours with it, so please forgive me if I didn’t use the right name for some of the nodes, setting, menus or features in KNIME.

What is my impression after playing with it for a couple hours?

I still believe that writing code is the faster option for me, but I have to admit that I like it more and more. And its not really fair comparison (years of Python programming vs couple hours experimenting with a new tool).  Its a nice tool for prototyping models without writing a line of code. If you are not familiar with a ML library yet, its a good and fast way to build models. But here is no free lunch either, instead of learning a syntax and the architecture of a library you have to learn to use the UI and find all the settings.

It’s an open-source software and so far I haven’t encountered any limitation (e.g. other tools limit the numbers of rows you can use in a free version) but I’ve just scratched the surface.

In my opinion one big advantage is the visualization of the model. The model is easy to understand and can easily be handed over to some other developers or engineer. Everyone knows that working with other people’s code can be a sometimes a pain and having a visual workflow can eliminate that pain. But I believe the workflows can become messy as well. Its a tool which can be used by analysts and business user who want to explore and analyse their data, generate insights and use  the power of standard machine learning and data mining algorithm without being forced to learn programming first.

The first impression is surprisingly good and I will continue playing with it and I want to figure out how to run my own Python script in a node and maybe even more important how to move a model into production.

I will report about it in a later post.

So long…

From Logistic Regression in SciKit-Learn to Deep Learning with TensorFlow – A fraud detection case study – Part III

After a series of posts about exotic option pricing (Asian, Barriers and Bermudans) with TensorFlow and finding optimal hedging strategies with deep learning (using a LSTM network to learn a delta hedge) I will come back to our credit card fraud detection case. In the previous part we have build a logistic regression classifier in TensorFlow to detect fraudulent transactions. We will see that our logistic regression classifier is equivalent to a very simple neural network with exactly one layer with one node and sigmoid activation function. We will extend this simple network to to a deep neural network by adding more hidden layers. We will use the low level API of TensorFlow to build the networks. At the end of the post we will use Keras’s high level API to build a same network with just a few lines of code.

We will continue to use the same data apply the same transformation which we are using since the first part of this series.

As usual you can find the notebook on my GitHub repository.

Deep learning / neural networks in a nutshell

An artificial neural network (ANN) is collection of connected nodes. In the first layer of the network the input of our nodes are the input features. In following layers the output of previous nodes are the input to the nodes in the current layer. If we have more than 1 hidden layer we can call the network a deep neural network.

neural-network

The picture is generated by a latex script written by Kjell Magne Fauske (http://www.texample.net/tikz/examples/neural-network/) released under Creative common license. Thanks for that.

The output of the node is the composition of the dot or (scalar) product of a weights vector and the input vector and an activation function. Let be X the vector of input features and w_i the weights vector of the node i, then the output of this node is given by

output_i =  \phi(X^Tw_i+ b_i),

with an activation function \phi and bias b_i.

If a layer consists more of one node the layer can be represented as a matrix multiplication. Such a layer is often called linear or dense layer. Typical choices for activation functions are tanh, relu, sigmoid function.

activation_functions

As we can see from this formula a dense layer with one node and sigmoid function as activation is our logisitc regression model. The matrix product will be the logit and the output of the activation function will be the probability as in a logistic regression model.

Lets review the logistic regression example in a neural network setting, lets start a function which constructs the computational graph for a dense (linear) layer given a input, activation function and number of nodes.

def add_layer(X, n_features, n_nodes, activation=None):
    """
    Build a dense layer with n_features-dimensional input X and n_nodes (output dimensional).

    Parameters:

        X : 2D Input Tensor (n_samples, n_features)
        n_features = number of features in the tensor 
        n_nodes = number of nodes in layer (output dimension)
        activation = None or callable activation function 

    Output:

        Operator which returns a 2D Tensor (n_samples, n_nodes)

    """
    weights = tf.Variable(initial_value=tf.random_normal((n_features,n_nodes), 0, 0.1, seed=42), dtype=tf.float32)
    bias = tf.Variable(initial_value=tf.random_normal((1,n_nodes), 0, 0.1, seed=42), dtype=tf.float32)
    layer = tf.add(tf.matmul(X, weights), bias)
    if activation is None:
        return layer
    else:
        return activation(layer)

We wrapping our training and prediction functions in a class. The constructor of this class builds the computational graph in TensorFlow. The the function create_logit will build the computational graph to compute the logits (in the logisitc regression case: one layer with one node and the identity as activation function). We will override this function at a later point to add more layers to our network.

class model(object):
    def __init__(self, n_features, output_every_n_epochs=1, name='model'):
        self.input = tf.placeholder(tf.float32, shape=(None, n_features))
        self.true_values = tf.placeholder(tf.float32, shape=(None,1))
        self.training = tf.placeholder(tf.bool)
        self.logit = self.create_logit()
        self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.true_values, 
                                                                           logits=self.logit))
        self.predicted_probs = tf.sigmoid(self.logit)
        self.output_every_n_epochs = output_every_n_epochs
        self.name = name
        self.saver = tf.train.Saver()    



    def create_logit(self):
        return  add_layer(self.input, 30, 1)

    def evaluate_loss_and_probs(self, sess, X, y, training=False, output=False):
        loss, probs = sess.run([self.loss, self.predicted_probs], {self.input : X, 
                                                                   self.true_values : y.reshape(-1,1),
                                                                   self.training : training})
        probs.reshape(-1)
        y_hat = (probs > 0.5).reshape(-1)*1
        auc = roc_auc_score(y, probs)
        precision = precision_score(y, y_hat)
        recall = recall_score(y, y_hat)
        fp = np.sum((y!=y_hat) & (y==0)) 
        fpr = fp / (y==0).sum()
        if output:
            print('Loss: %.6f \t AUC %.6f \t Precision %.6f%% \t Recall %.6f%% \t FPR %.6f%%' % (loss, auc, precision*100, recall*100, fpr*100))
        return loss, probs, y_hat, auc, precision, recall



    def train(self, sess, X, y, n_epochs, batch_size, learning_rate):
        init = tf.global_variables_initializer()
        sess.run(init)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate)
        train = optimizer.minimize(self.loss)
        n_samples = X.shape[0]
        n_iter = int(np.ceil(n_samples/batch_size))
        indices = np.arange(n_samples)
        training_losses = []
        training_aucs = []
        for epoch in range(0,n_epochs):
            np.random.shuffle(indices)
            for i in range(n_iter):
                idx = indices[i*batch_size:(i+1)*batch_size]
                x_i = X[idx,:]
                y_i = y[idx].reshape(-1,1)
                sess.run(train, {self.input : x_i, 
                                 self.true_values : y_i,
                                 self.training : True})
            output=False
            if (epoch % self.output_every_n_epochs)==0:
                print(epoch, 'th Epoch')
                output=True
            loss_train_epoch, predict_train_epoch, y_hat, auc_train_epoch, _, _ = self.evaluate_loss_and_probs(sess, X, y, False, output)
            training_losses.append(loss_train_epoch)
            training_aucs.append(auc_train_epoch)
        with plt.xkcd() as style:
            plt.figure(figsize=(7,7))
            plt.subplot(2,1,1)  
            plt.title('Loss')
            plt.plot(range(n_epochs), training_losses)
            plt.xlabel('# Epoch')
            plt.subplot(2,1,2)
            plt.title('AUC')
            plt.plot(range(n_epochs), training_aucs)
            plt.xlabel('# Epoch')
            plt.tight_layout()
            plt.savefig('training_loss_auc_%s.png' % self.name, dpi=300)
        self.saver.save(sess, "./%s/model.ckpt" % self.name)

    def restore(self, sess):
        self.saver.restore(sess, "./%s/model.ckpt" % self.name)

Apply this function to build our Logisitc Regression model

np.random.seed(42)
lr = model(30, 10, 'lr')
n_epochs = 11
batch_size = 100
with tf.Session() as sess:
    lr.train(sess, X_train, y_train, n_epochs, batch_size, 0.1)
    print('Validation set:')
    _, probs_lr, y_hat_lr, _, _, _ = lr.evaluate_loss_and_probs(sess, X_valid, y_valid, False, True)

training_loss_auc_lr

0 th Epoch
Loss: 0.007944   AUC 0.980217    Precision 86.538462%    Recall 56.675063%   FPR 0.015388%
10 th Epoch
Loss: 0.004231   AUC 0.984984    Precision 87.591241%    Recall 60.453401%   FPR 0.014948%
Validation set:
Loss: 0.003721   AUC 0.977169    Precision 89.041096%    Recall 68.421053%   FPR 0.014068%

Backpropagation

In the previous parts we have seen how we can learn the weights (parameter) of our logistic regression model. So we know how to train a network with one layer but how can we train a network with more than one layer?

The concept is called Backpropagation and is basically the application of the chain rule. In the first phase (feed forward phase) the the input is feed into the network through all layers and the loss is calculated. Then in the 2nd or backward phase, the weights are updated recursevly from the last layer to the first.

At the last layer the derivate of the loss is straight forward. For the calculation of the weights in the inner or hidden layers we need the previous calculated derivates.

With the calculated gradients we can apply again a gradient descent method to optimize our weights.

The power of TensorFlow or other deep learning libraries as PyTorch are again the auto gradients. We dont need to worry to calculate the gradients by ourself.

A detailed deriviation of the backpropagation algorithm with an example for a quadratic loss function can be found on wikipedia.

First deep network

Now its time for our first deep neural network. We will add 4 layers with 120, 60, 30 and 1 node.

class model2(model):

    def create_logit(self):
        layer1 = add_layer(self.input, 30, 120)
        layer2 = add_layer(layer1, 120, 60,)
        layer3 = add_layer(layer2, 60, 30)
        layer4 = add_layer(layer3, 30, 1)
        return layer4

np.random.seed(42)
dnn1 = model2(30, 10, 'model1')
n_epochs = 11
batch_size = 100
with tf.Session() as sess:
    dnn1.train(sess, X_train, y_train, n_epochs, batch_size, 0.1)
    print('Validation set')
    _, probs_dnn1, y_hat_dnn1, _, _, _ = dnn1.evaluate_loss_and_probs(sess, X_valid, y_valid, False, True)

The performance of this network is not really good. Actually is quite bad for the complexity of the model.
The AUC on the validation set is worse than the AUC from the logistic regression.

roc_auc_model1

roc_auc_model1_detail

For low FPRs the logistic regession almost always outperforms the deep neural network (DNN). A FPR of 0.1 % means that in we will have 1 false positive in 1000 transactions. If you have millions of transactions even such a low fpr can affect and your customers. In very low FPRs (less than 0.0001) the DNN have a slightly higher true positive rate (TPR).

The problem is that we use the identity as activation function. The logit is still a linear function of the input.
If we want to capture non linear dependencies we have to add a non-linear activation function.
Let’s try the RELU.

Time for non-linearity

class model2b(model):

    def create_logit(self):
        layer1 = add_layer(self.input, 30, 120, tf.nn.relu)
        layer2 = add_layer(layer1, 120, 60, tf.nn.relu)
        layer3 = add_layer(layer2, 60, 30, tf.nn.relu)
        layer4 = add_layer(layer3, 30, 1)
        return layer4

np.random.seed(42)
dnn1b = model2b(30, 10, 'model1b')
n_epochs = 31
batch_size = 100
with tf.Session() as sess:
    dnn1b.train(sess, X_train, y_train, n_epochs, batch_size, 0.1)
    print('Validation set')
    _, probs_dnn1b, y_hat_dnn1b, _, _, _= dnn1b.evaluate_loss_and_probs(sess, X_valid, y_valid, False, True)

Another popular choice is tanh. We compare both activation functions with the logistic regression:

roc_auc_model1c

roc_auc_model1c_detail

We see that both non linear models outperforms the logistic regression. For low FPRs the TPR is signifanct higher.
Assume we would accept a FPR of 0.01 %, then the Recall of our DNN is around 80% vs 50% for the logistic regression.
We can detect much more fraudulent transactions with the same rate of false alarms.

Using TensorFlow layers

Instead of building the computational graph our self (weights, bias tensor, etc) we can use TensorFlow Layers. The function tf.layers.dense build a linear or dense layer. We can specify the number of nodes, the input and the actication function (similar to our own function).

In the next layer we use the TensorFlow function and add on more layers.

class model3(model):

    def create_logit(self):
        layer1 = tf.layers.dense(self.input, 240, activation=tf.nn.tanh)
        layer2 = tf.layers.dense(layer1, 120, activation=tf.nn.tanh)
        layer3 = tf.layers.dense(layer2, 60, activation=tf.nn.tanh)
        layer4 = tf.layers.dense(layer3, 30, activation=tf.nn.tanh)
        layer5 = tf.layers.dense(layer4, 1)
        return layer5

np.random.seed(42)
dnn2 = model3(30, 10, 'model2')
n_epochs = 31
batch_size = 100
with tf.Session() as sess:
    dnn2.train(sess, X_train, y_train, n_epochs, batch_size, 0.1)
    print('Validation set')
    _, probs_dnn2, y_hat_dnn2, _, _, _= dnn2.evaluate_loss_and_probs(sess, X_valid, y_valid, False, True)
0 th Epoch
Loss: 0.003000   AUC 0.986239    Precision 82.428941%    Recall 80.352645%   FPR 0.029897%
10 th Epoch
Loss: 0.002036   AUC 0.992393    Precision 95.626822%    Recall 82.619647%   FPR 0.006595%
20 th Epoch
Loss: 0.001598   AUC 0.995232    Precision 93.989071%    Recall 86.649874%   FPR 0.009673%
30 th Epoch
Loss: 0.001273   AUC 0.996695    Precision 99.137931%    Recall 86.901763%   FPR 0.001319%
Validation set
Loss: 0.002425   AUC 0.980571    Precision 91.764706%    Recall 82.105263%   FPR 0.012309%

The model didn’t improve to the previous one. Maybe we are now overfitting. One way to prevent overfitting in DNN are dropouts. Dropouts deactive a proportion of nodes during training randomnly. So we prevent our neural network to memorize the training data. Lets add dropout layers to the previous model.

Lets use a dropout rate of 20%.

class model4(model):

    def create_logit(self):
        layer1 = tf.layers.dense(self.input, 120, activation=tf.nn.tanh)
        layer1 = tf.layers.dropout(layer1, 0.2, training=self.training)
        layer2 = tf.layers.dense(layer1, 60, activation=tf.nn.tanh)
        layer2 = tf.layers.dropout(layer2, 0.2, training=self.training)
        layer3 = tf.layers.dense(layer2, 30, activation=tf.nn.tanh)
        layer3 = tf.layers.dropout(layer3, 0.2, training=self.training)
        layer4 = tf.layers.dense(layer3, 1)
        return layer4

np.random.seed(42)
dnn3 = model4(30, 10, 'model3')
n_epochs = 31
batch_size = 100
with tf.Session() as sess:
    dnn3.train(sess, X_train, y_train, n_epochs, batch_size, 0.1)
    print('Validation set')
    _, probs_dnn3, y_hat_dnn3, _, _, _= dnn3.evaluate_loss_and_probs(sess, X_valid, y_valid, False, True)

auc_models_2

auc_models_2_detail

We see that all our deep learning model outperform the LR model on the validation set. The difference in AUC doesn’t seems very big, but especially for very low FPR the recall is much higher. Where the model with the dropout (DNN3) performs slightly better than the others.

Lets go for model 3 (4 layers with dropout) and let see the AUC Score of the model on the test data.

with tf.Session() as sess:
    dnn3.restore(sess)
    print('Test set')
    _, probs_dnn3_test, y_hat_dnn3_test, _, _, _= dnn3.evaluate_loss_and_probs(sess, X_test, y_test, False, True)
Test set
Loss: 0.001825   AUC 0.991294    Precision 97.619048%    Recall 83.673469%   FPR 0.003517%

This model performs very well on our test set. We have a high Recall with a very low FPR at a threshold of 50%.

Keras

The library Keras offers a very convinient API to TensorFlow (but it also supports other deep learning backends).

We can build the same model in just 6 lines of code. For many standard problems there are predefined loss functions, but we can also write our own loss functions in Keras.

For the model training and the prediction we only need one line of code each.

model = keras.Sequential()
model.add(keras.layers.Dense(120, input_shape=(30,), activation='tanh'))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(60, activation='tanh'))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(30, activation='tanh'))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit(X_train, y_train, epochs=31, batch_size=100)

probs_keras = model.predict(X_test)

Conclusion

In this part we saw how to build and train a deep neural network with TensorFlow using the low level and mid level API and as an outlook we saw how easy the model development is in a high level API like Keras.

For this fraud detection problem a very simple deep network can outperform a classical machine learning algorithm like logistic regression if we looking into the low false positive rate (FPR) regions. If we can accept higher false positive rates all models perform similar.

To decide for a final model, one need to specify the costs of a FP (a genuine transaction which we maybe block or at least investigate) and FN (a fraudulent transaction which we miss), so we can balance the trade-off between Recall (detection power) and FPR.

Our fraud detection problem is as we know a imbalance class problem. We can maybe improve the quality of the logistic regression with use of over-/undersampling of the majority class. Or we can use try other ‘classical’ machine learning methods like random forests or boosting trees, which often outperform a logisitc regression.

Another interesting unsupervised deep learning method to detect anomalies in transactions are auto-encoders.

I think I will cover these topics in later posts. So stay tuned.

As usual you can find the notebook on my GitHub repo, so please download the notebook and play with the model parameter, e.g one could change numbers of epochs we train, apply adaptive learning rates or add more layer or change the numbers of nodes in each layer and play with dropouts to find better models and please share your ideas and results.

So long…

Option hedging with Long-Short-Term-Memory Recurrent Neural Networks Part I

In the last two posts we priced exotic derivates with TensorFlow in Python. We implemented Monte-Carlo-Simulations to price Asian Options, Barrier Options and Bermudan Options. In this post we use deep learning to learn a optimal hedging strategy for Call Options from market prices of the underlying asset. This approach is purely data-driven and ‘model free’ as it doesnt make any assumptions of the underlying stochastic process.

We follow and adopt the ideas presented in the paper ‘Deep Hedging’ (https://arxiv.org/abs/1802.03042) by Hans Bühler, Lukas Gonon, Josef Teichmann, Ben Wood. The model can easily extended to incorporate transaction costs and trading limits.

In this part we will train a four layer Long-Short-Term-Memory (LSTM) Recurrent neural network (RNN) to learn a optimal hedging strategy given the individual risk aversion of the trader (we will minimize the Conditional Value at Risk also known as the Expected Shortfall of the hedging strategy) and derive an lower bound for a price which the risk-averse trader should charge if the trader follows the optimal hedging strategy.

For simplicty we use synthetic market prices of the underlying gernerated by a Black Scholes Process with a drift of zero (risk free interest is zero) and a volatility of 20%. We will train the network on one option on these simulated market values which matures in one month and has a moneyness of one (\frac{S}{K}=1).
This can easily adopted to more spohisticated models (see the reference paper for a Heston Model example) or even use real market data.

We will compare the effectivness of the hedging strategy and compare it to the delta hedging strategy using the delta from the Black Scholes model. We will evaluate the influence of the risk aversion of the trader on the hedging strategy and we will see how well the network can generalize the hedging strategy if we change the moneyness of the option, the drift (should have no effect in theory) and the volatility of the underlying process.

Start Spoiler This simple network is very good in pricing this particular option even on not observed market paths, but it will fail to generalize the strategy for options with different levels of moneyness or volatilities (but the Black Scholes strategy fails as well if we use a wrong volatility for pricing). End Spoiler

In the comming parts we will try to improve the network by changing the network architecture (e.g. simple MLP or Convolutional Networks) or using other and more features and try some approach to generate training data from market data or try transfer learning methods (since we dont have millions of training paths of one underlying) and include transaction costs.

Brief recap: Delta / Delta Hedge

The delta of an option is the sensitivity of the option price to a change of underlying stock’s price.

The idea of the delta hedging if to immunise a portfolio of option against changes in the market price of the underlying. If you have a long position of delta units of the underlying stock and you are one option short then your portfolio is not sensitive to changes of the market price (but its only a local approximation, if the price change the delta will change (2nd order greeks) and you need to adjust your position).

Black, Scholes and Merton used a (continuous time) self-financing delta hedge strategy to derive their famous pricing formula.

Setting

We are in Black Scholes setting. Current stock price is 100 and the volatility is 20%. The risk free interest rates is 0. We have a Call option with maturity in one month at a strike of 100. We adjust our hedge portfolio daily. Our training set will consists of 500,000 samples.

Our portfolio will consists of one short position of a call and \delta_{t_i} units of the underlying stock.

The PnL of our hedging / trading strategy is:

\sum_{I=0}^{n} S_{t_i} (\delta_{i-1} - \delta_{i}),

with $$\delta_{n}=\delta_{-1}=0.$ At time i we sell our previous position $\delta_{i-1}$ (cash inflow) and buy $\delta_{I}$ stocks (cash outflow). At the maturity $n$ we liquidate our position in stocks. The sum of all these transactions is our profit or loss.

The final value of our portfolio is given by the difference of the option payout and the PnL of our hedging strategy

\Pi = - max(S_T-K,0) + \sum_{I=0}^{n} S_{t_i} (\delta_{i-1} - \delta_{i}).

Under particular assumptions there exists a unique trading strategy and one fair price of the option that almost surely $\Pi + p_0 = 0$ (almost surely). One of the assumptions is continuous time trading.

But if we not hedge in continuous time, the quantity hold only in average.

In this example we sampled 10,000 paths from the Black Scholes process and applied the delta hedge strategy with different trading intervals (from once to twice a day).

hedging_error_bs1

hedging_error2.png

We follow the idea of the paper Deep Hedging and try to find a hedging strategy which minimize the CVaR given the risk aversion $\alpha$

E[-\Pi | \Pi \le -VaR_{\alpha}(\Pi)].

We will develop two trading strategies (alpha=0.5 and 0.99) and test them versus the black and Scholes delta hedge strategy.

For our test set we generate 100,000 paths from the same underlying process (not in the training set). We will test the hedging strategy for 3 different options (strike K=100, 95 and 105). Additionally we test the hedging strategies on market paths from a process with a shifted drift and from a process with shifted volatility.

Implementation

The code is as usual in my GitHub repository. Since the network needs approximately two and half hours for training, I also uploaded the pre-trained model. So one can skip the training step and directly restore the models.

We have a lot of helper function to generate the paths, calculate the Black Scholes prices and deltas and evaluate and compare the trading strategies, but the core class is our model.

<br />class RnnModel(object):
def __init__(self, time_steps, batch_size, features, nodes = [62,46,46,1], name='model'):
tf.reset_default_graph()
self.batch_size = batch_size
self.S_t_input = tf.placeholder(tf.float32, [time_steps, batch_size, features])
self.K = tf.placeholder(tf.float32, batch_size)
self.alpha = tf.placeholder(tf.float32)

S_T = self.S_t_input[-1,:,0]
dS = self.S_t_input[1:, :, 0] - self.S_t_input[0:-1, :, 0]
#dS = tf.reshape(dS, (time_steps, batch_size))

#Prepare S_t for the use in the RNN remove the last time step (at T the portfolio is zero)
S_t = tf.unstack(self.S_t_input[:-1, :,:], axis=0)

# Build the lstm
lstm = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.LSTMCell(n) for n in nodes])

self.strategy, state = tf.nn.static_rnn(lstm, S_t, initial_state=lstm.zero_state(batch_size, tf.float32), dtype=tf.float32)

self.strategy = tf.reshape(self.strategy, (time_steps-1, batch_size))
self.option = tf.maximum(S_T-self.K, 0)

self.Hedging_PnL = - self.option + tf.reduce_sum(dS*self.strategy, axis=0)
self.Hedging_PnL_Paths = - self.option + dS*self.strategy
# Calculate the CVaR for a given confidence level alpha
# Take the 1-alpha largest losses (top 1-alpha negative PnLs) and calculate the mean
CVaR, idx = tf.nn.top_k(-self.Hedging_PnL, tf.cast((1-self.alpha)*batch_size, tf.int32))
CVaR = tf.reduce_mean(CVaR)
self.train = tf.train.AdamOptimizer().minimize(CVaR)
self.saver = tf.train.Saver()
self.modelname = name

def _execute_graph_batchwise(self, paths, strikes, riskaversion, sess, epochs=1, train_flag=False):
sample_size = paths.shape[1]
batch_size=self.batch_size
idx = np.arange(sample_size)
start = dt.datetime.now()
for epoch in range(epochs):
# Save the hedging Pnl for each batch
pnls = []
strategies = []
if train_flag:
np.random.shuffle(idx)
for i in range(int(sample_size/batch_size)):
indices = idx[i*batch_size : (i+1)*batch_size]
batch = paths[:,indices,:]
if train_flag:
_, pnl, strategy = sess.run([self.train, self.Hedging_PnL, self.strategy], {self.S_t_input: batch,
self.K : strikes[indices],
self.alpha: riskaversion})
else:
pnl, strategy = sess.run([self.Hedging_PnL, self.strategy], {self.S_t_input: batch,
self.K : strikes[indices],
self.alpha: riskaversion})
pnls.append(pnl)
strategies.append(strategy)
#Calculate the option prive given the risk aversion level alpha
CVaR = np.mean(-np.sort(np.concatenate(pnls))[:int((1-riskaversion)*sample_size)])
if train_flag:
if epoch % 10 == 0:
print('Time elapsed:', dt.datetime.now()-start)
print('Epoch', epoch, 'CVaR', CVaR)
self.saver.save(sess, r"/Users/matthiasgroncki/models/%s/model.ckpt" % self.modelname)
self.saver.save(sess, r"/Users/matthiasgroncki/models/%s/model.ckpt" % self.modelname)
return CVaR, np.concatenate(pnls), np.concatenate(strategies,axis=1)

def training(self, paths, strikes, riskaversion, epochs, session, init=True):
if init:
sess.run(tf.global_variables_initializer())
self._execute_graph_batchwise(paths, strikes, riskaversion, session, epochs, train_flag=True)

def predict(self, paths, strikes, riskaversion, session):
return self._execute_graph_batchwise(paths, strikes, riskaversion,session, 1, train_flag=False)

def restore(self, session, checkpoint):
self.saver.restore(session, checkpoint)

The constructor creates the computational graph, we using the tf.contrib.rnn.MultiRNNCell() cell to stack the LSTM Cells tf.contrib.rnn.LSTMCell().

We can pass the timesteps, batch_size and number of nodes in each layer to the constructor.

At the moment the network is quite simple, we use in standard 4 Layers with 62, 46, 46, and 1 node. For a introduction in RNNs and LSTM I can recommend to read http://colah.github.io/posts/2015-08-Understanding-LSTMs/ or http://adventuresinmachinelearning.com/keras-lstm-tutorial/ but there are plenty of resources online.

Our class provides a function to train the model and to predict a trading strategy.

Results

We compare the average PnL and the CVaR of the trading strategies assuming we can charge the Black Scholes price for the option.

For the first test set (strike 100, same drift, same vola) the results looks quite good.

** alpha = 0.5 **

Screen Shot 2018-06-05 at 22.16.30

rnn_050_box_1

rnn_050_deltas_1

A trader with such a risk aversion should charge is 0.25 above the Black Scholes price.

** alpha = 0.99 **

Screen Shot 2018-06-05 at 22.20.09

rnn_050_box_1

rnn_050_deltas_1

A trader with such a risk aversion should charge about 0.99 above the Black Scholes price.

We see with a higher risk aversion extreme losses will be less likely. And the trader will need a higher compensation to take the risk to sell the option. Both strategies have a lower CVaR and a higher average profit compared to the Black Scholes strategy while the trading strategy of the RNN has a higher volatility as the Black Scholes one.

But what happen if we need a strategy for an option with a different strike:

Alpha = 0.5 and Strike @ 95:

We see that the PnL of the RNN strategy is significantly worse than then BS trading strategy. If we compare the deltas we see that the model assume a strike at 100.

We see a similar picture for higher strikes and different alphas.

If we change the drift of the observe market value, both hedges still hold (as expected). But its a different picture when we change the volatility.

In that case both models fails similar bad:

Conclusion

Overall it is a very interesting application of deep learning to option pricing and hedging and I am very curious about the future developments in this field.

The RNN is able to learn a hedging strategy for a particular option without any assumption of the underlying stochastic process. The hedging strategy outperforms the Black Scholes delta hedge strategy but the neural network fails at generalising the strategy for options at different strike levels. To be fair we have to admit that our training set consisted only of one option at one strike level. In the next part we will try to improve the model with a more diverse training set and add more features to it.

But the next post will be most likely the 3rd part of the fraud detection series (From Logistic Regression to Deep Learning – A fraud detection case study Part I, Part II).

So long…