## Dropout Neural Networks

### Introduction

The term "dropout" is used for a technique which drops out some nodes of the network. Dropping out can be seen as temporarily deactivating or ignoring neurons of the network. This technique is applied in the training phase to reduce overfitting effects. Overfitting is an error which occurs when a network is too closely fit to a limited set of input samples.

The basic idea behind dropout neural networks is to drop out nodes so that the network can concentrate on other features. Think about it like this: you watch lots of films with your favourite actor. At some point you listen to the radio and hear somebody in an interview. You don't recognize your favourite actor, because you have only seen the movies and you are a visual type. Now imagine that you can only listen to the audio tracks of the films. In this case you will have to learn to differentiate the voices of the actresses and actors. So by dropping out the visual part you are forced to focus on the sound features!

This technique is described in the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov, published in 2014.

In this chapter of our tutorial on machine learning in Python, we will implement a Python class which is capable of dropout.

At first, we will have a look at the weight array between the input and the hidden layer. We called this array 'wih' (weights between input and hidden layer).

Let's deactivate (drop out) the node $i_2$. We can see in the following diagram what's happening:

Each hidden node computes a weighted sum over all input values. With $i_2$ deactivated, we have to take out the second product of every one of these summations, which means that we have to delete the whole second column of the matrix. The second element of the input vector has to be deleted as well.
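The effect can be checked numerically. The following is a minimal sketch with made-up values (a 3x4 weight matrix and a 4-element input vector, both assumptions chosen for illustration). It shows that deleting the second column and the second input element gives the same weighted sums as keeping the shapes and setting $i_2$ to zero:

```python
import numpy as np

# Illustrative values only: 3 hidden nodes, 4 input nodes
wih = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])
x = np.array([1, 2, 3, 4])

# Drop input node i_2: remove column index 1 and input element index 1
wih_dropped = np.delete(wih, 1, axis=1)
x_dropped = np.delete(x, 1)

# Equivalent view: keep the shapes, but set the input i_2 to zero
x_zeroed = x.copy()
x_zeroed[1] = 0

reduced = wih_dropped @ x_dropped
zeroed = wih @ x_zeroed
print(reduced)   # [26 58 90]
print(zeroed)    # [26 58 90]
```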

Now we will examine what happens if we take out a hidden node. We take out the first hidden node, i.e. $h_1$.

In this case, we can remove the complete first row of our weight matrix:

Taking out a hidden node affects the next weight matrix as well. Let's have a look at what is happening in the network graph:

It is easy to see that this time the first column of the 'who' weight matrix has to be removed:
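As a hedged illustration of this step, the sketch below uses small random matrices (the sizes, the seed and the helper `sigmoid` are assumptions made for this example). It removes the first row of `wih` and the first column of `who` and compares the network output with the one obtained by forcing the output of $h_1$ to zero in the full network:

```python
import numpy as np

def sigmoid(z):
    # logistic activation, written out to keep the example self-contained
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
wih = rng.normal(size=(3, 4))   # 4 input nodes -> 3 hidden nodes
who = rng.normal(size=(2, 3))   # 3 hidden nodes -> 2 output nodes
x = rng.normal(size=4)

# Drop hidden node h_1: remove row 0 of wih and column 0 of who
wih_dropped = np.delete(wih, 0, axis=0)
who_dropped = np.delete(who, 0, axis=1)
out_reduced = who_dropped @ sigmoid(wih_dropped @ x)

# Equivalent view: full matrices, but the output of h_1 is forced to zero
hidden = sigmoid(wih @ x)
hidden[0] = 0.0
out_masked = who @ hidden

print(np.allclose(out_reduced, out_masked))   # True
```

Note that the pre-activation outputs differ from those of the complete network, because the contribution of $h_1$ is missing from every output sum.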

So far we have arbitrarily chosen one node to deactivate. In the dropout approach, we randomly choose a certain number of nodes from the input and the hidden layers which remain active, and turn off the other nodes of these layers. After this we can train the reduced network with a part of our training set. The next step consists in activating all the nodes again and randomly choosing other nodes. It is also possible to train the whole training set with each of the randomly created dropout networks.
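The procedure can be sketched as a schedule that pairs each part of the training set with a freshly drawn set of active nodes. The generator `dropout_schedule` and its parameters are hypothetical names introduced for this illustration. It only produces index sets and does not train anything:

```python
import random

def dropout_schedule(n_samples, n_input, n_hidden,
                     active_input_percentage=0.7,
                     active_hidden_percentage=0.7,
                     parts=3, seed=None):
    """Yield (sample_range, active_input_indices, active_hidden_indices)."""
    rng = random.Random(seed)
    n_active_in = int(n_input * active_input_percentage)
    n_active_hidden = int(n_hidden * active_hidden_percentage)
    part_len = max(1, n_samples // parts)
    for start in range(0, n_samples, part_len):
        stop = min(start + part_len, n_samples)
        # new random choice of active nodes for every part of the data
        active_in = sorted(rng.sample(range(n_input), n_active_in))
        active_hidden = sorted(rng.sample(range(n_hidden), n_active_hidden))
        yield range(start, stop), active_in, active_hidden

for samples, active_in, active_hidden in dropout_schedule(9, 10, 5, seed=42):
    print(samples, active_in, active_hidden)
```

Each yielded tuple corresponds to one dropout network: the samples in the range would be trained with only the listed input and hidden nodes, and afterwards all nodes would be activated again.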

We present three possible randomly chosen dropout networks in the following three diagrams:

Now it is time to think about a possible Python implementation.

We will start with the weight matrix between input and hidden layer. We will randomly create a weight matrix for 10 input nodes and 5 hidden nodes. We fill our matrix with random integers between -10 and 9 (the upper bound of `np.random.randint` is exclusive). These are not proper weight values, but this way we can see better what is going on:

```
import numpy as np
import random
input_nodes = 10
hidden_nodes = 5
output_nodes = 7
wih = np.random.randint(-10, 10, (hidden_nodes, input_nodes))
wih
```

We will now choose the active nodes for the input layer. We calculate random indices for the active nodes:

```
active_input_percentage = 0.7
active_input_nodes = int(input_nodes * active_input_percentage)
active_input_indices = sorted(random.sample(range(0, input_nodes),
                                            active_input_nodes))
active_input_indices
```

We learned above that we have to remove the column $j$ if the node $i_j$ is removed. We can easily accomplish this for all deactivated nodes by indexing the columns with the list of active nodes:

```
wih_old = wih.copy()
wih = wih[:, active_input_indices]
wih
```
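Keeping the active columns is the same as deleting the inactive ones. The following sketch uses a small illustrative matrix and a hand-picked index list (both assumptions for this example) to confirm this, and also shows that indexing with a list returns a copy rather than a view of the original array:

```python
import numpy as np

wih = np.arange(20).reshape(4, 5)       # illustrative 4x5 weight matrix
active_input_indices = [0, 2, 3]        # assumed to be the active nodes

# Keep the active columns (the approach used above)
kept = wih[:, active_input_indices]

# Same result: delete the columns of the inactive nodes
inactive = [j for j in range(wih.shape[1]) if j not in active_input_indices]
deleted = np.delete(wih, inactive, axis=1)

print(inactive)                       # [1, 4]
print(np.array_equal(kept, deleted))  # True
print(np.shares_memory(kept, wih))    # False: indexing with a list copies
```

Because the reduced matrix is a copy, changes made to it during training do not reach the full matrix automatically. They have to be written back explicitly.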

As we have mentioned before, we will have to modify both the 'wih' and the 'who' matrix:

```
who = np.random.randint(-10, 10, (output_nodes, hidden_nodes))
print(who)
active_hidden_percentage = 0.7
active_hidden_nodes = int(hidden_nodes * active_hidden_percentage)
active_hidden_indices = sorted(random.sample(range(0, hidden_nodes),
                                             active_hidden_nodes))
print(active_hidden_indices)
who_old = who.copy()
who = who[:, active_hidden_indices]
print(who)
```

We have to change wih accordingly:

```
wih = wih[active_hidden_indices]
wih
```
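The column selection followed by the row selection can also be written as a single indexing step with `np.ix_`. This is a sketch with illustrative values and hand-picked index lists (assumptions for this example), not a change to the approach used above:

```python
import numpy as np

wih_full = np.arange(50).reshape(5, 10)   # 5 hidden x 10 input, illustrative
active_input_indices = [0, 1, 3, 5, 6, 8, 9]
active_hidden_indices = [0, 2, 4]

# two steps: first the columns (input nodes), then the rows (hidden nodes)
two_steps = wih_full[:, active_input_indices][active_hidden_indices]

# one step: np.ix_ builds an open mesh of row and column indices
one_step = wih_full[np.ix_(active_hidden_indices, active_input_indices)]

print(one_step.shape)                       # (3, 7)
print(np.array_equal(two_steps, one_step))  # True
```

Indexing with both lists directly, i.e. `wih_full[active_hidden_indices, active_input_indices]`, would not select a sub-matrix. NumPy would try to pair the indices element by element, which fails here because the lists have different lengths.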

The following Python code summarizes the snippets from above:

```
import numpy as np
import random

input_nodes = 10
hidden_nodes = 5
output_nodes = 7

wih = np.random.randint(-10, 10, (hidden_nodes, input_nodes))
print("wih: \n", wih)
who = np.random.randint(-10, 10, (output_nodes, hidden_nodes))
print("who:\n", who)

active_input_percentage = 0.7
active_hidden_percentage = 0.7

# randomly choose the input nodes which remain active
active_input_nodes = int(input_nodes * active_input_percentage)
active_input_indices = sorted(random.sample(range(0, input_nodes),
                                            active_input_nodes))
print("\nactive input indices: ", active_input_indices)

# randomly choose the hidden nodes which remain active
active_hidden_nodes = int(hidden_nodes * active_hidden_percentage)
active_hidden_indices = sorted(random.sample(range(0, hidden_nodes),
                                             active_hidden_nodes))
print("active hidden indices: ", active_hidden_indices)

# keep only the columns of the active input nodes
wih_old = wih.copy()
wih = wih[:, active_input_indices]
print("\nwih after deactivating input nodes:\n", wih)

# keep only the rows of the active hidden nodes
wih = wih[active_hidden_indices]
print("\nwih after deactivating hidden nodes:\n", wih)

# keep only the columns of the active hidden nodes
who_old = who.copy()
who = who[:, active_hidden_indices]
print("\nwho after deactivating hidden nodes:\n", who)
```

We can now put these steps into a neural network class. The method `dropout_weight_matrices` reduces the weight matrices to the active nodes, and `weight_matrices_reset` writes the trained values back into the full matrices:

```
import numpy as np
import random
from scipy.special import expit as activation_function
from scipy.stats import truncnorm


def truncated_normal(mean=0, sd=1, low=0, upp=10):
    # normal distribution with the given mean and sd, truncated to [low, upp]
    return truncnorm(
        (low - mean) / sd, (upp - mean) / sd, loc=mean, scale=sd)


class NeuralNetwork:

    def __init__(self,
                 no_of_in_nodes,
                 no_of_out_nodes,
                 no_of_hidden_nodes,
                 learning_rate,
                 bias=None
                 ):
        self.no_of_in_nodes = no_of_in_nodes
        self.no_of_out_nodes = no_of_out_nodes
        self.no_of_hidden_nodes = no_of_hidden_nodes
        self.learning_rate = learning_rate
        self.bias = bias
        self.create_weight_matrices()

    def create_weight_matrices(self):
        # the initial weights lie in [-0.5, 0.5]; because the mean of the
        # underlying normal distribution is 2, they are skewed towards 0.5
        bias_node = 1 if self.bias else 0

        n = (self.no_of_in_nodes + bias_node) * self.no_of_hidden_nodes
        X = truncated_normal(mean=2, sd=1, low=-0.5, upp=0.5)
        self.wih = X.rvs(n).reshape((self.no_of_hidden_nodes,
                                     self.no_of_in_nodes + bias_node))

        n = (self.no_of_hidden_nodes + bias_node) * self.no_of_out_nodes
        X = truncated_normal(mean=2, sd=1, low=-0.5, upp=0.5)
        self.who = X.rvs(n).reshape((self.no_of_out_nodes,
                                     (self.no_of_hidden_nodes + bias_node)))

    def dropout_weight_matrices(self,
                                active_input_percentage=0.70,
                                active_hidden_percentage=0.70):
        # save the full matrices and node counts, so that they can be
        # restored by weight_matrices_reset after training
        self.wih_orig = self.wih.copy()
        self.no_of_in_nodes_orig = self.no_of_in_nodes
        self.no_of_hidden_nodes_orig = self.no_of_hidden_nodes
        self.who_orig = self.who.copy()

        # note: the indices are drawn from the regular nodes only, so the
        # bias column is dropped as well; this method therefore assumes
        # a network which was created without a bias node (bias=None)
        active_input_nodes = int(self.no_of_in_nodes * active_input_percentage)
        active_input_indices = sorted(random.sample(range(0, self.no_of_in_nodes),
                                                    active_input_nodes))
        active_hidden_nodes = int(self.no_of_hidden_nodes * active_hidden_percentage)
        active_hidden_indices = sorted(random.sample(range(0, self.no_of_hidden_nodes),
                                                     active_hidden_nodes))

        self.wih = self.wih[:, active_input_indices][active_hidden_indices]
        self.who = self.who[:, active_hidden_indices]

        self.no_of_hidden_nodes = active_hidden_nodes
        self.no_of_in_nodes = active_input_nodes
        return active_input_indices, active_hidden_indices

    def weight_matrices_reset(self,
                              active_input_indices,
                              active_hidden_indices):
        """
        self.wih and self.who contain the newly adapted values from the
        active nodes. We have to reconstruct the original weight matrices
        by assigning the new values from the active nodes.
        """
        temp = self.wih_orig.copy()[:, active_input_indices]
        temp[active_hidden_indices] = self.wih
        self.wih_orig[:, active_input_indices] = temp
        self.wih = self.wih_orig.copy()

        self.who_orig[:, active_hidden_indices] = self.who
        self.who = self.who_orig.copy()
        self.no_of_in_nodes = self.no_of_in_nodes_orig
        self.no_of_hidden_nodes = self.no_of_hidden_nodes_orig

    def train_single(self, input_vector, target_vector):
        """
        input_vector and target_vector can be tuple, list or ndarray
        """
        if self.bias:
            # adding bias node to the end of the input_vector
            input_vector = np.concatenate((input_vector, [self.bias]))

        input_vector = np.array(input_vector, ndmin=2).T
        target_vector = np.array(target_vector, ndmin=2).T

        output_vector1 = np.dot(self.wih, input_vector)
        output_vector_hidden = activation_function(output_vector1)

        if self.bias:
            output_vector_hidden = np.concatenate((output_vector_hidden, [[self.bias]]))

        output_vector2 = np.dot(self.who, output_vector_hidden)
        output_vector_network = activation_function(output_vector2)

        output_errors = target_vector - output_vector_network
        # update the weights:
        tmp = output_errors * output_vector_network * (1.0 - output_vector_network)
        tmp = self.learning_rate * np.dot(tmp, output_vector_hidden.T)
        self.who += tmp

        # calculate hidden errors:
        hidden_errors = np.dot(self.who.T, output_errors)
        # update the weights:
        tmp = hidden_errors * output_vector_hidden * (1.0 - output_vector_hidden)
        if self.bias:
            # the last row belongs to the bias node and is cut off
            x = np.dot(tmp, input_vector.T)[:-1, :]
        else:
            x = np.dot(tmp, input_vector.T)
        self.wih += self.learning_rate * x

    def train(self, data_array,
              labels_one_hot_array,
              epochs=1,
              active_input_percentage=0.70,
              active_hidden_percentage=0.70,
              no_of_dropout_tests=10):
        # data_array has to be an ndarray, because the samples are
        # indexed with the list of active input indices
        partition_length = max(1, int(len(data_array) / no_of_dropout_tests))

        for epoch in range(epochs):
            print("epoch: ", epoch)
            for start in range(0, len(data_array), partition_length):
                # a new random dropout network for every partition
                active_in_indices, active_hidden_indices = \
                    self.dropout_weight_matrices(active_input_percentage,
                                                 active_hidden_percentage)
                # the last partition may be shorter than partition_length
                stop = min(start + partition_length, len(data_array))
                for i in range(start, stop):
                    self.train_single(data_array[i][active_in_indices],
                                      labels_one_hot_array[i])

                self.weight_matrices_reset(active_in_indices, active_hidden_indices)

    def confusion_matrix(self, data_array, labels):
        # maps (target label, predicted label) to the number of occurrences
        cm = {}
        for i in range(len(data_array)):
            res = self.run(data_array[i])
            res_max = res.argmax()
            target = labels[i][0]
            if (target, res_max) in cm:
                cm[(target, res_max)] += 1
            else:
                cm[(target, res_max)] = 1
        return cm

    def run(self, input_vector):
        # input_vector can be tuple, list or ndarray
        if self.bias:
            # adding bias node to the end of the input_vector
            input_vector = np.concatenate((input_vector, [self.bias]))
        input_vector = np.array(input_vector, ndmin=2).T

        output_vector = np.dot(self.wih, input_vector)
        output_vector = activation_function(output_vector)

        if self.bias:
            output_vector = np.concatenate((output_vector, [[self.bias]]))

        output_vector = np.dot(self.who, output_vector)
        output_vector = activation_function(output_vector)

        return output_vector

    def evaluate(self, data, labels):
        corrects, wrongs = 0, 0
        for i in range(len(data)):
            res = self.run(data[i])
            res_max = res.argmax()
            if res_max == labels[i]:
                corrects += 1
            else:
                wrongs += 1
        return corrects, wrongs
```
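The write-back of the trained values into the full weight matrix can be illustrated in isolation. The sketch below uses a zero-filled 4x6 matrix and made-up "trained" values (assumptions chosen so that the written cells are easy to spot). It compares the three-step assignment used in `weight_matrices_reset` with a single assignment through `np.ix_`:

```python
import numpy as np

wih_orig = np.zeros((4, 6))            # full matrix before dropout (illustrative)
active_input_indices = [0, 2, 5]
active_hidden_indices = [1, 3]

# Reduced matrix as it might look after training (made-up values)
wih_trained = np.arange(1.0, 7.0).reshape(2, 3)

# Variant 1: the three steps used in weight_matrices_reset
restored_a = wih_orig.copy()
temp = restored_a.copy()[:, active_input_indices]
temp[active_hidden_indices] = wih_trained
restored_a[:, active_input_indices] = temp

# Variant 2: one assignment to the sub-matrix selected with np.ix_
restored_b = wih_orig.copy()
restored_b[np.ix_(active_hidden_indices, active_input_indices)] = wih_trained

print(restored_b)
print(np.array_equal(restored_a, restored_b))   # True
```

Only the cells at the intersection of active rows and active columns change. All other weights keep the values they had before the dropout step.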

```
import pickle

# load the pickled MNIST data set
with open("data/mnist/pickled_mnist.pkl", "br") as fh:
    data = pickle.load(fh)

train_imgs = data[0]
test_imgs = data[1]
train_labels = data[2]
test_labels = data[3]
train_labels_one_hot = data[4]
test_labels_one_hot = data[5]

image_size = 28  # width and height
no_of_different_labels = 10  # i.e. 0, 1, 2, 3, ..., 9
image_pixels = image_size * image_size
```

```
parts = 10
partition_length = int(len(train_imgs) / parts)
print(partition_length)

# start and end index of every partition
for start in range(0, len(train_imgs), partition_length):
    print(start, start + partition_length)
```
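If the number of samples is not a multiple of the number of parts, the last partition is shorter and its end index has to be clipped. The helper `partition_ranges` below is a hypothetical name introduced for this sketch, and the sample counts are example values:

```python
def partition_ranges(n_samples, parts):
    """Return (start, stop) pairs covering 0..n_samples without overrun."""
    partition_length = max(1, n_samples // parts)
    return [(start, min(start + partition_length, n_samples))
            for start in range(0, n_samples, partition_length)]

print(partition_ranges(60000, 10)[:2])   # [(0, 6000), (6000, 12000)]
print(partition_ranges(10, 3))           # [(0, 3), (3, 6), (6, 9), (9, 10)]
```

In the second call the integer division leaves a remainder, so a fourth, shorter partition is produced instead of an index beyond the end of the data.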

```
epochs = 3
simple_network = NeuralNetwork(no_of_in_nodes=image_pixels,
                               no_of_out_nodes=10,
                               no_of_hidden_nodes=100,
                               learning_rate=0.1)

simple_network.train(train_imgs,
                     train_labels_one_hot,
                     active_input_percentage=1,
                     active_hidden_percentage=1,
                     no_of_dropout_tests=100,
                     epochs=epochs)
```

```
corrects, wrongs = simple_network.evaluate(train_imgs, train_labels)
print("accuracy train: ", corrects / ( corrects + wrongs))
corrects, wrongs = simple_network.evaluate(test_imgs, test_labels)
print("accuracy test: ", corrects / ( corrects + wrongs))
```