In this post, we will develop a foundational understanding of deep learning for image classification. Then we will look at the classic neural network architectures that have been used for image processing.

**Image classification in deep learning refers to the process of getting a deep neural network to determine the class of an image on the basis of a set of predefined classes. Usually, this requires the network to detect the presence of certain objects or backgrounds without the need to locate these objects within the image.**

For example, a neural network trained to classify images by the type of pet they contain would be able to detect the presence of a cat in an image and distinguish a cat from other pets.

Early image processing techniques relied on understanding the raw pixels. The fundamental problem with this approach is that the same type of object can vary significantly across images. For example, two pictures might show cars of different colors, taken from different angles. It is extremely difficult for a method that relies on understanding single pixels to generalize across the large variety inherent in pictures of objects.

Furthermore, it is extremely inefficient to try to understand single pixels without the surrounding context.

This inefficiency and lack of context are some of the main reasons why traditional fully connected networks are inappropriate for most image classification tasks.

To feed an image to a fully connected network, you would have to stack all the pixels in an image in one column.

Individual pixels are taken out of the context of the surrounding pixels and are fed individually to a neuron that will perform a classification. If you had a grayscale image of 512×512 pixels, you would need 512×512 = 262144 neurons in your first layer, one for every pixel. Since you connect each neuron to every neuron in the next layer, your number of parameters explodes into the millions or billions just for a simple image classification model.
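To see how quickly the parameter count explodes, here is a back-of-the-envelope sketch (assuming, for illustration, a first fully connected layer with as many neurons as input pixels):

```python
pixels = 512 * 512               # grayscale input flattened into one column
hidden = 512 * 512               # a first fully connected layer of the same size
params = (pixels + 1) * hidden   # one weight per connection plus one bias per neuron
print(params)                    # 68719738880, i.e. roughly 69 billion parameters
```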

This approach works for simple tasks like distinguishing between 10 handwritten digits. But for anything more complex, you want a method that can incorporate the context, and that can also reuse information across pixels.

Instead of classifying individual pixels directly, convolutional layers slide filters over the image. These filters can understand pixels in the context of the other surrounding pixels to extract features such as edges. The network then only learns to detect the presence of the feature. Non-linear activation functions are used to make detection decisions resulting in a feature map that stores information about the presence of the feature in the image and its correspondence to the filter.

This process entails two main advantages. Firstly, filters are not limited to one group of pixels. They can be reused across the entire image to detect the presence of the same feature in other areas of the image. The result is that parameters are shared and reused.

Secondly, the output of a filter only connects to one field in the parameter map of the next layer as opposed to connecting every node in one layer to every node in the next layer. As a consequence, the number of parameters is dramatically reduced, resulting in sparse connections.

Note that convolutional neural networks do have fully connected layers. You usually find them in the latter part of the network because they ultimately lead into the final classification layer that produces the decision of what class the image belongs to. By the time you reach the later layers, the feature maps produced by the convolutional layers will have reduced the original image to a limited set of features so that a fully connected layer can reasonably be applied to distinguish between those features and decide whether they constitute the desired object.

LeNet is one of the earliest convolutional neural networks, trained in the 1990s to identify handwritten digits in the MNIST dataset.

It only consists of 5 layers (counting a pooling and a convolutional layer as one layer) and has been trained to classify grayscale images of size 32×32. Due to the comparatively small number of parameters and simple architecture, it is a great starting point for building an intuitive understanding of convolutional neural networks.

Here is an illustration of the network architecture from the original paper.

Let’s briefly walk through the architecture.

- We use a 32x32x1 image as input. Since the images are grayscale, we only have one channel. The first convolutional layer applies 6 5×5 filters. As a result of sliding the 6 5×5 filters over the input image, we end up with a feature map of 28x28x6. If you don’t understand why this is the case, I recommend reading my article on convolutional filters first. The convolutional layer uses a tanh activation function.
- After the first convolutional layer we have an average pooling layer with a filter size of 2×2 and a stride of 2. The resulting feature map has dimensions 14x14x6. Again, if you couldn’t follow that reasoning, check out my article on pooling.
- The network has another convolutional layer that applies 16 5×5 filters as well as a tanh activation function resulting in an output map of 10x10x16.
- The convolutional layer is followed by another pooling layer with a filter size of 2×2 and a stride of 2 producing a feature map of 5x5x16
- We have a final convolutional layer with 120 filters of size 5×5 and a tanh activation resulting in a feature map of 1x1x120, which connects neatly to a fully connected layer of 120 neurons.
- Next, we have another fully connected layer of 84 neurons. The fully connected layers both have tanh activation functions.
- Finally, we have the output layer that consists of 10 neurons and a softmax activation function since we have a multiclass classification problem with 10 different classes (one for each digit).
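The walkthrough above can be sketched as a Keras model. This is a minimal sketch assuming the modern tf.keras API, not the original implementation; in particular, the original paper's C3 layer used a sparse connection scheme that this sketch omits:

```python
import tensorflow as tf

# LeNet-5-style architecture following the walkthrough above:
# tanh activations and average pooling, as in the original design.
lenet = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),                     # 32x32 grayscale input
    tf.keras.layers.Conv2D(6, 5, activation='tanh'),       # -> 28x28x6
    tf.keras.layers.AveragePooling2D(pool_size=2),         # -> 14x14x6
    tf.keras.layers.Conv2D(16, 5, activation='tanh'),      # -> 10x10x16
    tf.keras.layers.AveragePooling2D(pool_size=2),         # -> 5x5x16
    tf.keras.layers.Conv2D(120, 5, activation='tanh'),     # -> 1x1x120
    tf.keras.layers.Flatten(),                             # -> 120 values
    tf.keras.layers.Dense(84, activation='tanh'),          # fully connected, 84 neurons
    tf.keras.layers.Dense(10, activation='softmax'),       # one output per digit
])
lenet.summary()
```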

As you can see, the network gradually reduces the size of the original input image by repeatedly applying a similar series of operations:

- Convolution with a 5×5 filter
- Tanh activation
- Average pooling with a 2×2 filter

These operations result in a series of feature maps until we end up with a 1x1x120 map. This allows us to make the transition to fully connected layers while the rest of the neural network operates like a traditional fully connected network.
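That size progression follows directly from the convolution and pooling arithmetic; here is a quick pure-Python check (the helper functions are my own):

```python
def conv_out(size, kernel, stride=1, padding=0):
    # output spatial size of a convolution ("valid" padding by default)
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel, stride):
    # output spatial size of a pooling layer
    return (size - kernel) // stride + 1

size = 32                    # LeNet input: 32x32 grayscale
size = conv_out(size, 5)     # 5x5 convolution -> 28
size = pool_out(size, 2, 2)  # 2x2 average pooling, stride 2 -> 14
size = conv_out(size, 5)     # 5x5 convolution -> 10
size = pool_out(size, 2, 2)  # 2x2 average pooling, stride 2 -> 5
size = conv_out(size, 5)     # 5x5 convolution -> 1
print(size)  # 1
```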

Note that the convention of using tanh as an activation function between layers and applying average pooling is not common in modern convolutional architectures. Instead, practitioners tend to use ReLU and max pooling.

AlexNet initiated the deep learning revolution in computer vision by pulverizing the competition in a 2012 computer vision contest. The network is significantly larger than LeNet, which partly explains why there was a gap of more than 20 years between LeNet and AlexNet. In the 1990s, computational power was so limited that training large-scale neural networks wasn’t feasible.

The number of classes AlexNet was able to handle compared to LeNet also increased significantly from a mere 10 to 1000. Consequently, it was also trained on a much larger dataset comprising millions of images.

AlexNet can process full RGB images (with three color channels) at a total size of 227x227x3.

AlexNet relies on similar architectural principles as LeNet. It uses 5 convolutional layers, interspersed with pooling layers, to gradually reduce the size of the feature maps along the x and y axes while increasing the filter dimension. Ultimately, the last feature map connects to a series of 3 fully connected layers resulting in an output layer that distinguishes between 1000 classes using a softmax activation.

Compared to LeNet, the AlexNet architecture also featured some innovations:

- ReLU activation function, which accelerated learning significantly compared to tanh
- Applying max pooling instead of average pooling
- Overlapping pooling filters, which reduced the size of the network and also decreased the error rate
- Dropout layers after the last pooling layer and the first fully connected layer to improve generalization and reduce overfitting.

The full feature map transformation across the layers looks as follows.

For training, the architecture was split between two GPUs, with half of the layers being trained on one GPU and the other half on the other GPU.

While AlexNet marked a quantum leap in the development of convolutional neural networks, it also suffered from drawbacks.

Firstly, the network had approx. 60 million parameters, which made it not only large but also extremely prone to overfitting.

Compared to modern neural network architectures, AlexNet was still relatively shallow. Modern networks achieve great performance through increasing depth (which comes with its own drawbacks, such as exploding and vanishing gradients).

AlexNet applied unusually large convolutional filters of up to 11×11 pixels. Most recent architectures tend to use smaller filter sizes such as 3×3 since most of the useful features are found in clusters of local pixels.

VGG 16 marked the next large advance in the construction of neural network architectures for image processing. Arriving in 2014, it achieved a 92.7% top-5 test accuracy in the same contest that AlexNet had won 2 years prior.

VGG 16 is based on AlexNet, but it made some critical changes. Most notable is the increase in depth to 16 layers and the simplification of the hyperparameter tuning process by using only 3×3 filters.

Why does depth improve the performance of neural networks? While there is no definitive answer to this question, several theories have been proposed.

The first explanation relies on the fact that neural networks attempt to model highly complex real-world phenomena. For example, when trying to recognize an object in an image, a convolutional neural network assumes that there must be some function that explains how the desired object differs from other objects and attempts to learn this function.

Each layer in a neural network represents a linear function followed by a non-linear activation to express non-linearities in the data. The more of these layers you combine, the more complex the relationships that the network can model.

Another argument stipulates that more layers force the network to break down the concepts it learns into more granular features. The more granular the features, the more the network can reuse them for other concepts. Thus, depth may improve generalizability. For example, when training a convolutional neural network to recognize cars, one layer in a relatively deep network may only learn abstract shapes such as circles, while a layer in a relatively shallow network may learn higher-level features that are specific to a car, such as its wheels.

For convolutional neural networks, another idea suggests that increasing depth results in a larger receptive field. Since you are stacking multiple small filters on top of each other, where each summarizes the content of the preceding filter, the layers later in the network have a broader, bird's-eye view of the image. In a shallow network, the layers would need larger filters to capture the entire image.
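This receptive-field growth is easy to quantify for a stack of stride-1 convolutions (a small sketch; the helper function is my own):

```python
def receptive_field(kernels):
    # receptive field of stacked stride-1 convolutions:
    # each k x k layer extends the field by k - 1 pixels
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf

print(receptive_field([3, 3]))     # 5: two 3x3 convs see a 5x5 patch
print(receptive_field([3, 3, 3]))  # 7: three 3x3 convs match a single 7x7 filter
print(receptive_field([11]))       # 11: one large AlexNet-style filter
```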

In practice, most neural network architectures rely on 3×3 or 5×5 filter sizes. Larger filter sizes such as the 11×11 used in AlexNet have fallen out of favor.

Firstly, they result in a much larger number of parameters. A filter of size 11×11 adds 11×11 = 121 parameters, whereas a 3×3 filter only adds 9. Stacking 2 or 3 3×3 filters results in a total of only 18 or 27 parameters, which is still much less than 121.
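The same comparison in code, now also accounting for input channels (the helper is my own; bias terms are omitted to match the counts in the text):

```python
def conv_weights(kernel, in_channels=1, out_channels=1):
    # weight count of a convolutional layer, ignoring biases
    return kernel * kernel * in_channels * out_channels

print(conv_weights(11))      # 121 weights for one 11x11 filter
print(conv_weights(3))       # 9 weights for one 3x3 filter
print(3 * conv_weights(3))   # 27 weights for three stacked 3x3 filters
```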

Secondly, each filter comes with a non-linear activation function of the corresponding layer. Stacking multiple filters with non-linear activation functions for each one enables the network to learn more complex relationships than a single larger filter with only one activation function.

The VGG 19 Network takes a picture of size 224x224x3 as input. The structure of the convolutional layers is the same throughout the network. They all use 3×3 filters and same padding with a stride of 1 to preserve the dimensions of the input.

The image is passed through 2 convolutional layers with 64 3×3 filters followed by a Max Pooling layer. The pattern is repeated another time with 2 convolutional layers with 128 filters each followed by a max-pooling layer.

Then we have 3 blocks with 4 convolutional layers and a max-pooling layer per block. The number of filters equals 256 in the first of these blocks and is increased to 512 in the latter two. Finally, we have a layer to flatten the input, followed by 3 fully connected layers. All convolutional and fully connected layers use the ReLU activation function except for the last one, which uses the softmax to distinguish between the 1000 classes.

The VGG16 Architecture is, in principle, the same as the VGG 19 architecture. The main difference is that the 3rd, 4th, and 5th blocks only have 3 convolutional layers.
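The block structure of both variants can be written down compactly (a sketch; the list notation is my own, with 'M' marking a max-pooling layer):

```python
# filters per 3x3 convolutional layer; 'M' marks a max-pooling layer
vgg16 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']
vgg19 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
         512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']

# 13 and 16 conv layers; adding the 3 fully connected layers
# gives the 16 and 19 in the network names.
print(sum(1 for c in vgg16 if c != 'M') + 3)  # 16
print(sum(1 for c in vgg19 if c != 'M') + 3)  # 19
```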

In this post, we will cover how to build a simple neural network in TensorFlow for a spreadsheet dataset. In addition to constructing a model, we will build a complete preprocessing and training pipeline in TensorFlow that will take the original dataset as an input and automatically transform it into the format necessary for model training.

*Note: If you already have a Jupyter notebook with TensorFlow and Python up and running, you can skip to the next section.*

Before we can build a model in TensorFlow, we need to set up our machine with Python and Tensorflow. I assume that you already have Python installed along with a package manager such as pip or conda. If that isn’t the case, you need to do that before you can continue with this tutorial. Here is a great tutorial on installing Python.

Navigate to the directory where you want to work and download the Titanic Dataset from Kaggle to your working directory. Unzip the package. Inside you’ll find three CSV files.

It is generally good practice to set up a new virtual Python environment and install Tensorflow and your other dependencies into that environment. That way, you can manage different projects that require different versions of your dependencies on the same machine. Furthermore, it is much easier to export the required dependencies.

I am using conda for managing my packages. Open a terminal window (I’m on macOS; on Windows, you can use PowerShell or the Anaconda Prompt) and navigate to your working directory.

cd path/to/your/working/directory

Next, I am telling conda to create an environment with Python 3.8.

conda create --name titanic python=3.8
conda activate titanic

Now we need to install our dependencies. Besides Tensorflow, we install some basic Python packages for data manipulation as well as jupyter to run Jupyter notebooks. The standard pip installation of Tensorflow will give you the latest stable version with both GPU and CPU support. For this tutorial, we won’t need GPU support, but once you are running larger workloads, I highly recommend it.

pip install jupyter pandas matplotlib scikit-learn seaborn
pip install tensorflow

I ran into an error while importing NumPy with the standard pip installation. At the time of this writing, the latest NumPy package seems incompatible with Python 3.8, so I did another pip install specifying an earlier version.

pip install numpy==1.21.5

Start the jupyter notebook server by typing the following command into your terminal.

jupyter notebook

Create a new jupyter notebook, open it, and import the libraries we’ve installed in the previous section.

import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf

We will just import the train.csv file as a pandas data frame from the titanic data directory that we’ve downloaded before.

data = pd.read_csv('titanic/train.csv')

We first look at the data to understand its shape.

print(data.shape)
data.head()

This is a record of 891 people who traveled on the Titanic when it sank. We want to predict who survived based on the remaining features in the dataset.

Let’s plot the variable “Survived” to see how it is distributed (how many people perished and how many survived).

sns.countplot(data=data,x='Survived')

It seems like more than 500 people died, and somewhat over 300 survived. It is important to keep in mind that the classes are not balanced: a model could achieve well over 60% accuracy by simply predicting that everyone died.
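The class counts make this baseline concrete (assuming the standard Kaggle train.csv, which contains 549 deaths and 342 survivors):

```python
# baseline accuracy of a model that always predicts "died"
died, survived = 549, 342
baseline = died / (died + survived)
print(round(baseline, 3))  # 0.616, i.e. about 62% accuracy for free
```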

Next, we will use our own intuition to filter the data and select the relevant features. When looking at the data, I’ve guessed that class, fare, age, and sex are probably important predictors. People who traveled in the first class and paid a higher fare were given priority over people in the lower classes. Women and children were probably also more likely to be saved. Of course, this is just my intuition.

*If your goal is to find the best model possible, you wouldn’t just rely on your intuition but do more preliminary analysis, tinker around with the data, and investigate whether there are correlations between survival and other features. But our goal here is to build a neural network and an input pipeline that automates the whole process. For this tutorial, we don’t care about achieving the highest possible accuracy.*

I’m going to filter the data for the selected variables.

data = data[['Survived', 'Pclass', 'Sex', 'Age', 'Fare']]

Next, we want to get rid of all null values in the data. Pandas makes this easy.

data = data.dropna()
print(data.shape) # (714, 5)

*If you want to achieve the highest possible accuracy, you might not want to simply drop the values that contain a null entry since you are throwing away a lot of useful information in the other features.*

Before we can start building the input pipeline, we want to separate the predicted variable from the predictors.

target = data.pop('Survived')

A large part of any machine learning project is spent on data manipulation and preprocessing before the data is suitable for training. If you want to deploy a model into production, you need to automate the process of preprocessing and transforming the data into a format suitable for model training.

If we look at the remaining predictors, we see that Pclass is a categorical feature expressed in integer values. There are a total of three classes, and a passenger falls into one of them. Sex is another categorical variable because passengers are either classified as male or female, and the classification is expressed in strings. The age and fare columns contain continuous numeric values.

Before we feed the values to a neural network, we should normalize the numeric values, and one-hot encode the categorical variables.

Normalization ensures that all numeric features are on a similar scale (zero mean and unit variance), which helps speed up gradient descent. If you are unclear why normalization helps gradient descent, check out my post on data normalization.
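For intuition, the standardization that Keras' Normalization layer performs per feature looks like this in plain Python (a sketch of the computation only, not the actual layer):

```python
def standardize(xs):
    # zero mean, unit variance: what Keras' Normalization layer
    # computes per feature after adapt()
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / var ** 0.5 for x in xs]

print(standardize([10.0, 20.0, 30.0]))  # roughly [-1.22, 0.0, 1.22]
```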

One hot encoding creates a new column for every category in a variable. It then fills all row entries of each column with zeros except for the rows that actually contain that category.

For example, if one row in the original column contained the entry “male” and a second contained “female”, we would end up with two columns named “male” and “female”, each containing a 1 in the rows that contained the original category.

This may look like an unnecessary complication. We one-hot encode data because neural networks need numeric data to learn. But why can’t we just map every string category to a corresponding integer? It is possible in cases where a natural ordering exists in the data. But the network may infer an order that doesn’t exist in this case. For example, we don’t want the network to conclude that males are preferable to females. Giving every category its own column essentially binarizes the problem, which removes the potential for implicit ordering.
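A minimal sketch of this mapping in plain Python (the helper is my own, not the TensorFlow implementation used below):

```python
def one_hot(values):
    # one column per sorted category; a 1 marks the row's category
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

print(one_hot(['male', 'female', 'female']))
# columns are ['female', 'male'] -> [[0, 1], [1, 0], [1, 0]]
```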

The survival status, our predicted variable, is another categorical feature expressed in integer values. If the person has survived, it is indicated by the value 1. Death is indicated by a 0. To be precise, survival is a binary feature, and since it only contains 0s and 1s, we don’t need to apply any transformation.

So we split our feature columns between categorical features, numeric features, and the predicted variable.

categorical_feature_names = ['Pclass', 'Sex']
numeric_feature_names = ['Fare', 'Age']
predicted_feature_name = ['Survived']

To feed the data frame to TensorFlow specific functions, we need to process it into a data object that Tensorflow can work with. To achieve this, we define a Keras tensor for every column in our dataset that acts as a placeholder for the actual data we pass at runtime.

In the following function we:

- iterate through the columns,
- identify the datatype of the column and assign a corresponding tensorflow datatype,
- create the tensor with the datatype and a yet undefined shape,
- pull the tensors into a dictionary and uniquely identify each by the name of the column.

def create_tensor_dict(data, categorical_feature_names):
    inputs = {}
    for name, column in data.items():
        if type(column[0]) == str:
            dtype = tf.string
        elif name in categorical_feature_names:
            dtype = tf.int64
        else:
            dtype = tf.float32
        inputs[name] = tf.keras.Input(shape=(), name=name, dtype=dtype)
    return inputs

inputs = create_tensor_dict(data, categorical_feature_names)
print(inputs)

Before we normalize the numeric features, we define a helper function that converts the numeric columns from our pandas’ data frame into tensors of floats and stacks them together into one large tensor.

def stack_dict(inputs, fun=tf.stack):
    values = []
    for key in sorted(inputs.keys()):
        values.append(tf.cast(inputs[key], tf.float32))
    return fun(values, axis=-1)

Next, we create the normalizer. We first filter the numeric feature columns from the dataframe. Then we instantiate Keras’ inbuilt normalizer, tell it to normalize along the last axis of our tensor, and adapt the normalizer to our features which we’ve converted to a dictionary, and then to a tensor using the helper function.

def create_normalizer(numeric_feature_names, data):
    numeric_features = data[numeric_feature_names]
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(stack_dict(dict(numeric_features)))
    return normalizer

Finally, we perform the actual normalization in a new function to which we pass the normalizer. In the previous function, we’ve told the normalizer to expect a stacked dictionary of tensors. Previously, we’ve defined a dictionary that contains placeholder tensors for the model. At runtime, the model will expect an input of that same format. Therefore, we need to bring the dictionary into a format that our normalizer can process.

If we filter the dictionary down to just the placeholders for the numeric features and stack them using the stackdict function, we can pass them to the normalizer.

def normalize_numeric_input(numeric_feature_names, inputs, normalizer):
    numeric_inputs = {}
    for name in numeric_feature_names:
        numeric_inputs[name] = inputs[name]
    numeric_inputs = stack_dict(numeric_inputs)
    numeric_normalized = normalizer(numeric_inputs)
    return numeric_normalized

Now we can create the normalizer and subsequently normalize the data using the two functions. This will result in a tensor in which the numeric features have been stacked along the last axis. Since our data has two numeric feature columns, that axis will have 2 entries.

normalizer = create_normalizer(numeric_feature_names, data)
numeric_normalized = normalize_numeric_input(numeric_feature_names, inputs, normalizer)
print(numeric_normalized)

Ultimately, we want to collect all preprocessed features together so that we can feed them to the model as one dataset. We start by creating an empty list of preprocessed features, to which we gradually add the features as we process them.

preprocessed = []
preprocessed.append(numeric_normalized)

We’ve processed the numerical features. As a next step, we want to one-hot encode the categorical features. Remember that we have categorical features that are encoded either as strings or as integers. Specifically, we need to distinguish between 3 classes and 2 sexes.

In the following function, we iterate through the columns in our dataframe that contain categorical features. We then check if the data contained is of type string or of type integer. In the first case, we define a lookup function to convert the column into a one-hot encoding on the basis of the contained string values. In the second case, we define an integer lookup function to achieve the same thing.

We then retrieve the placeholder from the inputs dictionary corresponding to the column, apply the previously defined lookup function to it, and retrieve the resulting one-hot encoding of the column.

Ultimately, we collect all one hot encodings in a list that we return from the function.

def one_hot_encode_categorical_features(categorical_feature_names, data, inputs):
    one_hot = []
    for name in categorical_feature_names:
        value = sorted(set(data[name]))
        if type(value[0]) is str:
            lookup = tf.keras.layers.StringLookup(vocabulary=value, output_mode='one_hot')
        else:
            lookup = tf.keras.layers.IntegerLookup(vocabulary=value, output_mode='one_hot')
        x = inputs[name][:, tf.newaxis]
        x = lookup(x)
        one_hot.append(x)
    return one_hot

The result is a list of tensors of categorical one-hot encoded features. We can therefore directly add it to the list of preprocessed tensors.

one_hot = one_hot_encode_categorical_features(categorical_feature_names, data, inputs)
preprocessed = preprocessed + one_hot
print(preprocessed)

We now have a list of 3 tensors. But we want to feed one large tensor to the model, which is why we concatenate the results along the last axis.

preprocesssed_result = tf.concat(preprocessed, axis=-1)
print(preprocesssed_result)

*TensorFlow automatically numbers the layers. Depending on how often you have executed the functions, you will see different numbers in the layer names.*

We are now in a position to build a model that automates the preprocessing on the basis of the functions we’ve defined before.

To instruct Keras to perform the preprocessing, we construct a new model. Then we have to pass the input in the expected format (the dictionary of tensor placeholders defined in the beginning) and the output in the expected format. The model now automatically applies all the intermediate functions defined before that are necessary to go from the input to the expected output.

preprocessor = tf.keras.Model(inputs, preprocesssed_result)

To test that the preprocessor works, we pass it the first row of our dataset converted to a dictionary. That way, it has the expected format as expressed by the placeholder input dictionary. We should get a 1×9 dimensional tensor: 1 because we only passed one row, and 9 because the preprocessed features span 9 dimensions (2 normalized numeric features, a 4-dimensional one-hot encoding for Pclass, and a 3-dimensional one for Sex; the lookup layers reserve an extra slot for out-of-vocabulary values).

preprocessor(dict(data.iloc[:1]))

We use Keras’ sequential API to define the neural network. Since the use case is spreadsheet data, a simple feedforward multilayer perceptron should be enough. We use two dense hidden layers with 10 neurons each and apply the conventional ReLU activation function. The output layer consists of a single neuron that produces a logit; since we have a binary classification problem, the sigmoid is applied implicitly by the loss function when we compile with from_logits=True below.

network = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
])

Next, we tie the preprocessor and the network together into one model.

x = preprocessor(inputs)
result = network(x)
model = tf.keras.Model(inputs, result)

Take note of the brilliant simplicity of the Keras API: The network and the preprocessor can each be defined as separate models. We can simply link them together into one unified model by passing the output from the preprocessor as an input to the network.

Finally, we compile the model using the Adam optimizer (we could also use another one, but Adam is generally a good default), binary cross-entropy as the loss function (because we have a binary classification problem), and accuracy as the evaluation metric.

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

We finally convert the data to a dictionary, so it has the expected format expressed by the placeholder dictionary and fit the model. Feel free to use a different batch size and number of epochs.

history = model.fit(dict(data), target, epochs=20, batch_size=8)

We’ve achieved an accuracy of 80%, which isn’t so bad. We could probably nudge that up a little bit by some more sophisticated feature engineering and playing with the model architecture.

You may have realized that we only trained the model on the training set. In the following function, we split the data between a training and a validation set based on a specified proportion. We choose an 80/20 split and fit the model.

def create_train_val_split(data, split):
    msk = np.random.rand(len(data)) < split
    train_data = data[msk]
    val_data = data[~msk]
    train_target = target[msk]
    val_target = target[~msk]
    return (train_data, val_data, train_target, val_target)

(train_data, val_data, train_target, val_target) = create_train_val_split(data, 0.8)

history = model.fit(dict(train_data), train_target,
                    validation_data=(dict(val_data), val_target),
                    epochs=20, batch_size=8)

The validation accuracy is a few percentage points below the training accuracy. This dataset is fairly small, which is why you should expect some fluctuation in the results when you randomly split the data.

Instead of passing data and outcome separately, you can also pull them together into a TensorFlow dataset. This has the advantage that your model fit function is much more concise as you can specify the batch size and shuffle the data directly on the TensorFlow dataset before you call model fit.

# Alternatively
train_ds = tf.data.Dataset.from_tensor_slices((dict(train_data), train_target))
val_ds = tf.data.Dataset.from_tensor_slices((dict(val_data), val_target))

train_ds = train_ds.batch(8)
val_ds = val_ds.batch(8)

history = model.fit(train_ds, validation_data=val_ds, epochs=20)

As part of process automation, we also summarize the data manipulation we’ve performed in disparate steps during the initial analysis in a function. We essentially select the columns on the basis of the selected feature names, remove the columns containing null, and split the predicted variable from the predictors. The following functions are those that we have already defined before.

def preprocess_dataframe(data, categorical_feature_names, numeric_feature_names, predicted_feature_name):
    all_features_names = categorical_feature_names + numeric_feature_names + predicted_feature_name
    data = data[all_features_names]
    data = data.dropna()
    target = data.pop(predicted_feature_name[0])
    return (data, target)

def create_tensor_dict(data, categorical_feature_names):
    inputs = {}
    for name, column in data.items():
        if type(column[0]) == str:
            dtype = tf.string
        elif name in categorical_feature_names:
            dtype = tf.int64
        else:
            dtype = tf.float32
        inputs[name] = tf.keras.Input(shape=(), name=name, dtype=dtype)
    return inputs

def stack_dict(inputs, fun=tf.stack):
    values = []
    for key in sorted(inputs.keys()):
        values.append(tf.cast(inputs[key], tf.float32))
    return fun(values, axis=-1)

def create_normalizer(numeric_feature_names, data):
    numeric_features = data[numeric_feature_names]
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(stack_dict(dict(numeric_features)))
    return normalizer

def normalize_numeric_input(numeric_feature_names, inputs, normalizer):
    numeric_inputs = {}
    for name in numeric_feature_names:
        numeric_inputs[name] = inputs[name]
    numeric_inputs = stack_dict(numeric_inputs)
    numeric_normalized = normalizer(numeric_inputs)
    return numeric_normalized

def one_hot_encode_categorical_features(categorical_feature_names, data, inputs):
    one_hot = []
    for name in categorical_feature_names:
        value = sorted(set(data[name]))
        if type(value[0]) is str:
            lookup = tf.keras.layers.StringLookup(vocabulary=value, output_mode='one_hot')
        else:
            lookup = tf.keras.layers.IntegerLookup(vocabulary=value, output_mode='one_hot')
        x = inputs[name][:, tf.newaxis]
        x = lookup(x)
        one_hot.append(x)
    return one_hot

def create_train_val_split(data, split):
    msk = np.random.rand(len(data)) < split
    train_data = data[msk]
    val_data = data[~msk]
    train_target = target[msk]
    val_target = target[~msk]
    return (train_data, val_data, train_target, val_target)

Using these functions, we can load our data and build our training pipeline that ends with the neural network.

# Load data and specify desired columns
data = pd.read_csv('titanic/train.csv')
categorical_feature_names = ['Pclass', 'Sex']
numeric_feature_names = ['Fare', 'Age']
predicted_feature_name = ['Survived']

# Preprocess dataframe
(data, target) = preprocess_dataframe(data, categorical_feature_names, numeric_feature_names, predicted_feature_name)

# Create tensorflow preprocessing head
inputs = create_tensor_dict(data, categorical_feature_names)
preprocessed = []
normalizer = create_normalizer(numeric_feature_names, data)
numeric_normalized = normalize_numeric_input(numeric_feature_names, inputs, normalizer)
preprocessed.append(numeric_normalized)
one_hot = one_hot_encode_categorical_features(categorical_feature_names, data, inputs)
preprocessed = preprocessed + one_hot
preprocessed_result = tf.concat(preprocessed, axis=-1)
preprocessor = tf.keras.Model(inputs, preprocessed_result)

# Define the model
network = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
])

x = preprocessor(inputs)
result = network(x)
model = tf.keras.Model(inputs, result)

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

Now we can split the data frame into a training and a validation set and directly fit the model on the data.

(train_data, val_data, train_target, val_target) = create_train_val_split(data, 0.8)

history = model.fit(dict(train_data), train_target,
                    validation_data=(dict(val_data), val_target),
                    epochs=20, batch_size=8)

In this post, we will develop an understanding of the basic building blocks of convolutional neural networks and how they are combined into powerful neural network architectures for computer vision. We start by looking at convolutional layers, pooling layers, and fully connected layers. Then, we take a step-by-step walk through a simple CNN architecture.

Layers are the basic building block of neural network architectures. Convolutional neural networks primarily rely on three types of layers. These are convolutional layers, pooling layers, and fully connected layers.

Let’s have a look at each of them.

Fully connected layers are the most elementary layers. They consist of a string of neurons stacked on top of each other.

Each neuron takes x as an input, multiplies it with a weight, and adds a bias.

wx + b

The result is sent through a non-linear activation function such as the ReLU to calculate an output that is sent to the next layer.

relu(wx+b)
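To make this computation concrete, here is a minimal NumPy sketch of a fully connected layer's forward pass (the weight and input values are made up for illustration):

```python
import numpy as np

def dense_layer(x, W, b):
    """Fully connected layer: weighted sum plus bias, then ReLU."""
    z = W @ x + b            # wx + b for every neuron at once
    return np.maximum(z, 0)  # relu(wx + b)

# A layer with 3 neurons receiving a 4-pixel input (values are arbitrary)
x = np.array([0.5, -1.0, 2.0, 0.0])
W = np.random.randn(3, 4)
b = np.zeros(3)

output = dense_layer(x, W, b)  # shape (3,), all values >= 0 due to ReLU
```

Note that `W @ x` computes the weighted sum for all three neurons in one matrix-vector product, which is how deep learning libraries implement dense layers internally.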

**A convolutional layer is a layer in a neural network that applies filters to detect edges and structures in images. By using multiple convolutional layers in succession, a neural network can detect higher-level objects, people, and even facial expressions.**

The main reason for using convolutional layers is their computational efficiency on higher-dimensional inputs such as images. If we wanted to train a fully connected layer to classify images, we would have to roll out the 2D image into a one-dimensional stack of pixels. For example, an image with a dimension of 200×200 would become a stack of 200×200 = 40,000 pixels. Our fully connected layer would have to contain 40,000 units to learn each pixel value.

In a convolutional layer, we replace the multiplication of x with a weight w with a convolution operation. Instead of w, we use a 2-dimensional filter k that we convolve with our input image x.

For an explanation of how the convolution operation works, check out the post on convolutional filters.

Then, like in a fully connected layer, we add a bias to the result of the convolution c and send the final result through a non-linear activation function such as the ReLU.

a = relu(c + b)

You may have realized that the original image we fed to the convolutional layer had dimensions of 6×6, but the output was 4×4. Understanding how the convolution operation changes the dimensions of your input is crucial to getting your convolutional neural networks to work. The change of dimensions in a convolutional layer depends on the size of the filter, the stride, and the padding. The filter changes the image dimensions as follows:

(m\times m) * (f\times f) = (m-f+1)\times(m-f+1)

where m is the image length, and f is the filter length. The stride manipulates the output size by changing the step size with which we move the filter over the image, while the padding enables us to add a border around the image to prevent shrinkage due to the filter size. To understand exactly how these operations influence the size of the output, check out this post on padding, stride, and kernel sizes.
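The combined effect of filter size, stride, and padding on the output size can be sketched with a small helper function (the generalized formula ⌊(m + 2p − f) / s⌋ + 1 reduces to m − f + 1 for a stride of 1 and no padding):

```python
def conv_output_size(m, f, stride=1, padding=0):
    """Output length of a convolution over an input of length m
    with a filter of length f, given stride and padding per side."""
    return (m + 2 * padding - f) // stride + 1

# A 3x3 filter on a 6x6 image shrinks it to 4x4 ...
assert conv_output_size(6, 3) == 4
# ... unless we pad with one pixel per side, which preserves the size.
assert conv_output_size(6, 3, padding=1) == 6
# A larger stride shrinks the output further.
assert conv_output_size(7, 3, stride=2) == 3
```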

Furthermore, convolutional layers usually slide multiple filters over the image. As a result, their output will contain three dimensions even if you only feed it a 2-d image.

**A pooling layer is a layer in a convolutional neural network that abstracts the features extracted by a convolutional layer and helps make the features invariant to translations. To achieve this, it applies the pooling operation on the output produced by the convolutional layer.**

To learn more about the pooling operation, check out this post on pooling.

The pooling layer is fairly simple. It only slides a filter over the output of the convolutional layer, selects the highest pixel value under the filter (provided we use max-pooling), and produces the output in a multidimensional map p.

The pooling layer also shrinks the output depending on the size of the pooling filter and the stride or step size with which we move the filter. Commonly, the pooling filter has a size of 2 and a stride of 2, resulting in shrinking the image by 50%. So a 6×6 input will result in a 3×3 output.

A convolutional neural network architecture usually consists of a couple of convolutional layers, each of which is followed by a pooling layer. These layers have the purpose of extracting features and shrinking the dimensionality of the output. Towards the end, you will usually find a fully connected layer that rolls out the multidimensional input into one dimension and finally feeds it to the final output layer that is used to return the final classification.

To make this a bit more concrete, let’s have a look at one of the earliest convolutional neural network architectures, the LeNet5.

Compared to modern architectures, the LeNet5 was fairly simple, using 2 convolutional layers, 2 pooling layers, 2 fully connected layers, and an output layer.

LeNet takes a 32×32 grayscale image as an input, which means it does not have multiple color channels. The network was trained to recognize handwritten digits on the MNIST dataset. The image is passed to a convolutional layer with 6 filters, a filter size of 5×5, a stride of 1, and no padding. Due to the six 5×5 filters, the result is a feature map with dimensions equal to 28x28x6. We also call these multidimensional data objects that are passed between layers tensors.

The feature map is passed to a pooling layer. The pooling filter has a dimension of 2×2 and is slid across the 6 channels produced by the convolutional layer using a stride of 2. Accordingly, the feature map shrinks by half along the vertical and horizontal dimensions to 14x14x6. LeNet applied average pooling, while most modern implementations mainly rely on max pooling.

The next convolutional layer applies 16 filters with a size of 5×5, a stride of 1, and no padding. The output along the horizontal and vertical dimensions shrinks by 5 − 1 = 4 pixels to 10×10, while the number of channels of the output map goes up to 16.

The convolutional layer is immediately followed by another average pooling layer using a filter size of 2×2 and a stride of 2. The result is a feature map with a dimensionality of 5x5x16.

The pooling layer is followed by a fully connected layer of 120 neurons. Each of these neurons connects to each of the 5x5x16 = 400 nodes in the previous layer.

Next, we have another traditional fully connected layer that reduces down to 84 nodes.

The fully connected layer finally connects to the output layer containing 10 nodes. Remember that LeNet was trained to distinguish between 10 different digits, which is why we have 10 output nodes.
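The architecture described above can be sketched in Keras. This is a modernized sketch rather than the original 1998 implementation: the layer sizes follow the walkthrough, but the activation choices are illustrative assumptions.

```python
import tensorflow as tf

# LeNet-5-style architecture following the description above.
lenet = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),                        # 32x32 grayscale
    tf.keras.layers.Conv2D(6, kernel_size=5, strides=1,
                           activation='tanh'),                # -> 28x28x6
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2), # -> 14x14x6
    tf.keras.layers.Conv2D(16, kernel_size=5, strides=1,
                           activation='tanh'),                # -> 10x10x16
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2), # -> 5x5x16
    tf.keras.layers.Flatten(),                                # -> 400 nodes
    tf.keras.layers.Dense(120, activation='tanh'),
    tf.keras.layers.Dense(84, activation='tanh'),
    tf.keras.layers.Dense(10, activation='softmax'),          # 10 digits
])
```

Tracing the comments from top to bottom reproduces exactly the dimension changes discussed in the walkthrough.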

**Pooling in convolutional neural networks is a technique for generalizing features extracted by convolutional filters and helping the network recognize features independent of their location in the image.**

Convolutional layers are the basic building blocks of a convolutional neural network used for computer vision applications such as image recognition. A convolutional layer slides a filter over the image and extracts features resulting in a feature map that can be fed to the next convolutional layer to extract higher-level features. Thus, stacking multiple convolutional layers allows CNNs to recognize increasingly complex structures and objects in an image.

A major problem with convolutional layers is that the feature map produced by the filter is location-dependent. This means that during training, convolutional neural networks learn to associate the presence of a certain feature with a specific location in the input image. This can severely depress performance. Instead, we want the feature map and the network to be translation invariant (a fancy expression that means that the location of the feature should not matter).

In the post on padding and stride, we discussed how a larger stride in convolution operations could help focus the image on higher-level features. Focusing on the higher-level structures makes the network less dependent on granular details that are tied to the location of the feature. Pooling is another approach for getting the network to focus on higher-level features.

In a convolutional neural network, pooling is usually applied on the feature map produced by a preceding convolutional layer and a non-linear activation function.

The basic procedure of pooling is very similar to the convolution operation. You select a filter and slide it over the output feature map of the preceding convolutional layer. The most commonly used filter size is 2×2 and it is slid over the input using a stride of 2. Based on the type of pooling operation you’ve selected, the pooling filter calculates an output on the receptive field (the part of the feature map under the filter).

There are several approaches to pooling. The most commonly used approaches are max-pooling and average pooling.

In max pooling, the filter simply selects the maximum pixel value in the receptive field. For example, if you have 4 pixels in the field with values 3, 9, 0, and 6, you select 9.

Average pooling works by calculating the average value of the pixel values in the receptive field. Given 4 pixels with the values 3, 9, 0, and 6, the average pooling layer would produce an output of 4.5.
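Both variants can be sketched with a naive NumPy implementation; the top-left window of the example feature map below contains exactly the values 3, 9, 0, and 6 from the text:

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode='max'):
    """Slide a pooling window over a 2D feature map (no padding)."""
    m, n = feature_map.shape
    out_m = (m - size) // stride + 1
    out_n = (n - size) // stride + 1
    out = np.zeros((out_m, out_n))
    for i in range(out_m):
        for j in range(out_n):
            window = feature_map[i*stride:i*stride+size,
                                 j*stride:j*stride+size]
            out[i, j] = window.max() if mode == 'max' else window.mean()
    return out

fm = np.array([[3, 9, 1, 2],
               [0, 6, 4, 4],
               [5, 5, 8, 0],
               [5, 5, 0, 0]])

pool2d(fm, mode='max')      # [[9, 4], [5, 8]]
pool2d(fm, mode='average')  # [[4.5, 2.75], [5.0, 2.0]]
```

With a 2×2 window and a stride of 2, the 4×4 map shrinks to 2×2, i.e. by 50% along each dimension, as described above.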

You can think of the numbers that are calculated and preserved by the pooling layers as indicating the presence of a particular feature. If the neural network only relied on the original feature map, its ability to detect the feature would depend on the location in the map. For example, if the number 9 was found only in the upper left quadrant, the network would learn to associate the feature connected to the number 9 with the upper left quadrant.

By applying pooling, we pull this feature out into a smaller, more general map that only indicates whether a feature is present in that particular quadrant or not. With every additional layer, the map shrinks, preserving only the important information about the presence of the features of interest. As the map becomes small, it becomes increasingly independent of the location of the feature. As long as the feature has been detected in the approximate vicinity of the original location, it should be similarly reflected in the map produced by the pooling layers.

Due to its focus on extreme values, max pooling is attentive to the more prominent features and edges in the receptive field. Average pooling, on the other hand, creates a smoother feature map because it produces averages instead of selecting the extreme values. In practice, max pooling is applied more often because it is generally better at identifying prominent features. Average pooling is mainly used to collapse feature maps to a particular size.

Due to its ability to collapse feature maps, pooling can also help classify images of varying sizes. The classification layer in a neural network expects to receive inputs in the same format. Accordingly, we normally feed images in the same standard size. By varying the offsets during the pooling operation, we can summarize differently sized images and still produce similarly sized feature maps.

In general, pooling is especially helpful when you have an image classification task where you just need to detect the presence of a certain object in an image, but you don’t care where exactly it is located.

The fact that pooling filters use a larger stride than convolutional filters and result in smaller outputs also supports the efficiency of the network and leads to faster training. In other words, location invariance can greatly improve the statistical efficiency of the network.

**Padding describes the addition of empty pixels around the edges of an image. The purpose of padding is to preserve the original size of an image when applying a convolutional filter and enable the filter to perform full convolutions on the edge pixels.**

**Stride in the context of convolutional neural networks describes the process of increasing the step size by which you slide a filter over an input image. With a stride of 2, you advance the filter by two pixels at each step.**

In this post, we will learn how padding and stride work in practice and why we apply them in the first place.

When performing a standard convolution operation, the image shrinks by the filter size minus one. If we take an image of width and height 6, and a filter of width and height 3, the image dimensions shrink as follows.

6 - 3 + 1 = 4

The reason for the shrinking image is that a 3×3 filter cannot slide all three of its columns over the first two horizontal pixels in the image. The same problem exists with regard to the rows and the vertical pixels.

There are only 4 steps left for the filter until it reaches the end of the image, both vertically and horizontally. As a consequence, the resulting image will only have 4×4 dimensions instead of 6×6. The output dimensions for an image of size m × m and a kernel of size f × f can be calculated as follows:

(m\times m) * (f\times f) = (m-f+1)\times(m-f+1)

This immediately entails two problems:

- If you perform multiple convolution operations consecutively, the final image might become vanishingly small because the image will shrink with every operation.
- Because you cannot slide the full filter over the edge pixels, you cannot perform full convolutions. As a result you will lose some information at the edges.

The problem becomes more pronounced as the size of the filter increases. If we use a 5×5 filter on the 6×6 image, we only have space for 2 filter positions per dimension, resulting in a 2×2 output.

To address these problems, we can apply padding.

To mitigate the problems mentioned above, we can **pad** our images with additional empty pixels around the edges.

If we pad the 6×6 image with one pixel on each side and then apply a 3×3 filter, we can slide it over 6 positions in every direction. The resulting feature map of the convolution operation preserves the 6×6 dimensions of the original image.

**Same padding is the procedure of adding enough pixels at the edges so that the resulting feature map has the same dimensions as the input image to the convolution operation.**

In the case of a 3×3 filter, we pad each edge with one row or column of pixels. If we had a 5×5 filter, we would have to pad each edge with two rows/columns of pixels.

In summary, how many pixels you use for same padding depends entirely on the size of the filter. The most commonly used filter sizes are 3×3, 5×5, and 7×7. If your filter size is odd, you can calculate the pixels you need on each side by subtracting 1 from the filter size and dividing the result by 2. The division by 2 is necessary because you want to distribute the pixels evenly on both sides of the image.

padding = \frac{f-1}{2}

**Valid padding means that we only apply a convolutional filter to valid pixels of the input. Since only the pixels of the original image are valid, valid padding is equivalent to no padding at all.**
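The effect of same padding can be illustrated in NumPy, using an arbitrary 6×6 image and the padding formula from above:

```python
import numpy as np

image = np.ones((6, 6))  # an arbitrary 6x6 grayscale image
f = 3                    # filter size

# Same padding: (f - 1) / 2 pixels on each side
p = (f - 1) // 2
padded = np.pad(image, p)  # surround the image with zeros ("empty" pixels)

assert padded.shape == (8, 8)
# A 3x3 filter over the padded image has 8 - 3 + 1 = 6 positions per
# dimension, so the feature map keeps the original 6x6 dimensions.
assert padded.shape[0] - f + 1 == image.shape[0]
```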

The stride simply describes the step size when sliding the convolutional filter over the input image. In the previous examples, we’ve always slid the filter by one pixel rightwards or downwards. We’ve used a stride of 1.

With a stride of 2, we would slide the window by two pixels on each step.

Since we are taking larger steps, we will reach the end of the image in fewer steps. As a consequence, the resulting feature map will be smaller since the feature map directly depends on the number of steps we take.

If we slide a 3×3 filter over a 7×7 image with a stride of 2, we can only take two steps until we reach the end of the image. Counting the initial position of the filter as another step, we get 3 steps, resulting in a 3×3 output map. As demonstrated in the post on convolutional filters, we multiply each pixel value with its corresponding filter value and sum up the products. In the following image, we have a sharp transition from white to black pixels running vertically through the image, indicating that there must be an edge.

We can calculate the length of the output feature map o depending on the filter length f, the length of the original image m, and the stride s as follows.

o = \frac{m-f}{s} + 1
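This formula translates directly into code and confirms the 7×7 example above:

```python
def output_length(m, f, s):
    """Length of the feature map when sliding a filter of length f
    over an input of length m with stride s (no padding)."""
    return (m - f) // s + 1

# A 3x3 filter over a 7x7 image with a stride of 2 yields a 3x3 map.
assert output_length(7, 3, 2) == 3
# With a stride of 1, the same setup would produce a 5x5 map.
assert output_length(7, 3, 1) == 5
```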

Generally speaking, the smaller the steps you take when sliding the filter over an image, the more details will be reflected in the resulting feature map. It also means that adjacent outputs will share more information, since large portions of the filter's receptive fields will overlap.

For example, when applying a 3×3 filter that is always moved by 1 pixel, a filter will share 2/3 of the input pixels with each adjacent filter.

If we increase the step size, fewer pixels are shared between adjacent filter positions, and the feature map becomes smaller. Applying a larger stride basically has the effect of downsampling the image so that lower-level details are obscured.

Furthermore, the more filter operations we want to calculate, the more computational power we need. If our neural network only requires an understanding of higher level features, we can make the learning process computationally more efficient by choosing a larger stride.

These considerations stem from the early days of deep learning when computational power was a major obstacle to efficient neural network training. With the ability to train neural networks on large and extremely potent GPU clusters in the cloud, increasing the stride to improve computational efficiency has become largely unnecessary. In practice, many modern deep learning practitioners use a stride of 1.

Padding and stride are two techniques used to improve convolution operations and make them more efficient. Same padding is especially important in the training of very deep neural networks. If you have many layers, it becomes increasingly difficult to keep track of the dimensionality of the outputs if the dimensions change in every layer. Furthermore, the size of the feature maps would be reduced at every layer, resulting in information loss at the borders. This is likely to depress the performance of your neural network.

Stride, on the other hand, has lost its importance in practical applications due to the increase in computational power available to deep learning practitioners.

This post will introduce convolutional kernels and discuss how they are used to perform 2D and 3D convolution operations. We also look at the most common kernel operations, including edge detection, blurring, and sharpening.

**A convolutional filter is a filter that is applied to manipulate images or extract structures and features from an image. Convolutional filters are typically used to blur or sharpen sections of an image or to detect edges in them.**

In the post on the convolution operation, I introduced the convolutional kernel as a grid containing numbers that we slide over another number grid to generate an output.

The convolutional filter is a multidimensional version of the convolutional kernel, although the two terms are often used interchangeably in the computer vision community.

2D convolutions are essential for processing 2D data such as images. An image is basically a 2-dimensional grid of pixel values. Standard RGB images have pixel values ranging from 0 to 255 and three channels (red, green, and blue), which adds a third dimension. But to simplify things a bit, we only look at one channel, leaving us with a 2D grid of pixels, which is enough to represent grayscale images.

To manipulate images, we can convolve the image with a 2-dimensional kernel.

As you see in the image, the kernel, in this case, is a smaller 2D grid. To compute the convolution, we slide the kernel over the image and calculate the convolution across two dimensions.

Starting in the upper-left corner, we slide the kernel over the image and perform an element-wise multiplication with the image followed by a summation.

1\times255 + 0\times255 +(-1)\times255 \\ + 1\times255 + 0\times255 +(-1)\times255 \\ +1\times255 + 0\times255 +(-1)\times255 \\ = 0

Next, we slide the kernel to the right and repeat the convolution operation.

1\times255 + 0\times255 +(-1)\times0 \\ + 1\times255 + 0\times255 +(-1)\times0 \\ +1\times255 + 0\times255 +(-1)\times0 \\ = 765

You continue this process, sliding the kernel to the right and downwards until you reach the lower-right corner. In each step, you convolve the kernel with the part of the image that lies under it.

The results of the convolution operations can be neatly represented in a 4×4 matrix.

As you can see, the two columns in the center contain very high numbers, whereas the pixel values on the margins contain zeros. This indicates that there is a bright vertical edge running through the center. This particular kernel that we have used performs vertical edge detection. What type of operation the kernel performs depends on the numbers used in the kernel and their ordering. We will discuss different types of convolutional kernels later in this article.
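The whole procedure can be reproduced with a naive NumPy sketch. The image below is an assumed 6×6 example with white (255) pixels on the left half and black (0) pixels on the right, matching the step-by-step calculations above:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2D cross-correlation (what deep learning libraries call
    'convolution'): slide the kernel and sum the element-wise products."""
    m, n = image.shape
    f = kernel.shape[0]
    out = np.zeros((m - f + 1, n - f + 1))
    for i in range(m - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

# A 6x6 image: white (255) on the left, black (0) on the right
image = np.zeros((6, 6))
image[:, :3] = 255

# The vertical edge detection kernel used in the text
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

result = convolve2d(image, kernel)  # 4x4 map; each row is [0, 765, 765, 0]
```

The two center columns of the result contain the high value 765 while the margins are zero, reproducing the vertical edge described above.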

Mathematically, we can represent the 2D convolution as follows:

(I * K) (i, j)= \sum_m \sum_n I(m,n)K(i-m,j-n)

This operation is commutative. As a consequence, we can flip the kernel and write it like this.

(K * I) (i, j)= \sum_m \sum_n I(i-m,j-n)K(m,n)

In many practical applications, cross-correlation is used instead of the convolution operation.

(K \star I) (i, j)= \sum_m \sum_n I(i+m,j+n)K(m,n)

The cross-correlation is not commutative. In purely mathematical terms, this is an important distinction. But in practice, the distinction doesn’t really matter, which is why the term convolution is often used when referring to cross-correlation.

When performing 3D convolution, you are sliding a 3-dimensional kernel over a 3-dimensional input. The kernel needs to have the same depth as the input. You calculate the convolution of each channel in the kernel with each corresponding channel in the image.

Essentially, you need to perform the 2D convolution operation three times over, and then you sum up the results to get the final kernel output.
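This per-channel procedure can be sketched in NumPy; the input and kernel values are random placeholders, and the loops are written for clarity rather than speed:

```python
import numpy as np

def convolve3d(volume, kernel):
    """Convolve a multi-channel input with a kernel of the same depth:
    one 2D convolution per channel, summed into a single output map."""
    m, _, channels = volume.shape
    f = kernel.shape[0]
    out = np.zeros((m - f + 1, m - f + 1))
    for c in range(channels):  # one 2D convolution per channel, summed
        for i in range(m - f + 1):
            for j in range(m - f + 1):
                out[i, j] += np.sum(volume[i:i+f, j:j+f, c] * kernel[:, :, c])
    return out

rgb = np.random.rand(6, 6, 3)     # a 6x6 image with 3 channels
kernel = np.random.rand(3, 3, 3)  # the kernel depth matches the input depth

feature_map = convolve3d(rgb, kernel)  # shape (4, 4): a single 2D map
```

Note that the output is a single 2D map: the three per-channel results are summed, which is why one 3D kernel produces one output channel.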

In the previous examples, we’ve used 3×3 kernels. While other kernel sizes are used, the size is almost always odd. The reason for using odd-sized kernels is symmetry around the origin. If you use an even-sized kernel, there is no clear center point.

Sliding convolutional filters over an image allows you to manipulate an image in various ways. In the remainder of this post, we will go through some of the more commonly used convolutional filters and their effects.

The kernel we’ve used above is a simple vertical edge detector known as the Prewitt Operator.

The Prewitt operator for vertical edge detection appears in the form of the following matrix.

\begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \\ \end{bmatrix}

If we apply the vertical Prewitt operator to a real image, the result looks like this.

There is a strong vertical color contrast between the river and the cliffs, which is prominently visible.

To apply horizontal edge detection, we can rotate the kernel by 90 degrees.

\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \\ \end{bmatrix}

Now, the horizontal edges are more visible.

The Sobel operator emphasizes the edges more than the Prewitt operator by replacing the 1’s at the center of the outer columns (or rows, for horizontal detection) with 2’s.

Here is the Sobel operator for vertical edge detection.

\begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \\ \end{bmatrix}

For horizontal edge detection, you can use the following kernel.

\begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \\ \end{bmatrix}

The Laplacian filter is an approximation to the 2nd spatial derivative of the image. If that sounds confusing, don’t worry. In practice, it basically means that the Laplacian filter highlights areas where the intensity of the pixel values is changing drastically. Consequently, it is a very popular filter for detecting both horizontal and vertical edges at once.

The Laplacian is most commonly approximated with the following filter.

\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0\\ \end{bmatrix}

The filter is frequently combined with Gaussian blurring or smoothing because the Laplacian is very sensitive to noise, which it would otherwise amplify along with the edges.

Blurring is an important technique in image processing that makes the transition between different pixel values smooth rather than sharp. Therefore, the technique is also called smoothing. It is especially useful when you want to shrink the size of an image. Some sharp details will inevitably be lost. With smoothing, you can distribute the color transition over more pixels which preserves the edges even if the image is smaller overall.

The Gaussian filter weighs intensities according to a normal or Gaussian distribution. A Gaussian distribution has the characteristic form of a bell curve. The curve peaks at the center and flattens out the further you get away from the center. Thus, the center of the filter contains the highest value while the values further away are smaller.

The following kernel is a discrete approximation to the Gaussian distribution.

\frac{1}{16} \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1\\ \end{bmatrix}

The box kernel is a simple filter that calculates the mean of the pixels in the area covered by the filter. This also has a smoothing effect.

Contrary to the Gaussian filter, which weighs pixels according to a normal distribution, the box filter weighs all pixels equally. The box filter is faster and easier to calculate than the Gaussian filter.

\frac{1}{9} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1\\ \end{bmatrix}

Filters to sharpen images accentuate edges. They essentially do the opposite of blurring. A common kernel for sharpening images is the following one.

\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0\\ \end{bmatrix}

Convolutional Kernels and Filters are the building blocks of many computer vision applications. More advanced algorithms such as Canny edge detection build on combining several convolutional kernel types such as those used for smoothing and edge detection. Kernels are also at the heart of the most advanced computer vision technologies, such as convolutional neural networks used in deep learning.

A convolution describes a mathematical operation that blends one function with another function known as a kernel to produce an output that is often more interpretable. For example, the convolution operation in a neural network blends an image with a kernel to extract features from an image.

As the name implies, you are sort of wrapping one function, the kernel, around another function. Let’s gain an intuitive understanding of how this works with a simple example.

Suppose you are a tour operator offering a tour that takes 3 days. Guests can start the tour on any day, which means that on any given day, you need to take care of the guests starting on that day as well as those who are on their second and third days. On day 1 of their tour, people take two meals at their hotel, and you need to provide them with one meal for the trip. On day 2, it is 2 meals, and on day 3, they take a full-day trip, so you need to get them 3 meals.

Let’s say 10 people start on day 1, 8 people on day 2, 5 people on day 3, and 4 people on day 4. How do you keep track of the number of meals you need to prepare each day?

The first option is to create a separate calculation for every day. On the first day, you have 10 people requiring 1 meal each. On the second day, those 10 people are on their second tour day, which requires 2 lunch packages each, and you have 8 people who are starting their first day, and so on.

Another more convenient and intuitive way is to represent the number of people arriving per day in a grid and the number of meals per day in another grid.

As you may have recognized, we reversed the grid representing the people arriving per day. So the last day, which has 4 people arriving is listed first, while the first day with 10 people is listed last. You’ll see why we do this in a minute. Now, we can slide the meals per day as a window over the grid that contains the people arriving each day in reverse and calculate the total.

On the first day, the first 10 people go on the first trip requiring 1 meal. We move the first position in the green grid over the last position in the blue grid. Then we multiply the blue entry by the green entry right below it.

On subsequent days we shift the green grid to the left by one position, multiply each blue entry by the green entry below it, and add up the results.

Voilà, we have just convolved the blue grid with the green grid. The green grid that contains the meal plan per day is our **kernel** which we convolve with the blue grid containing the number of people. The result is the number of meals we need to prepare for each day to accommodate all of our clients.

To translate the calculations we have done into mathematical notation, we need to translate the grids into mathematical functions.

Our tour plan has three days which we denote as d. On each day, we need to provide a different number of meals. Let’s call the function that calculates the number of meals for a given day f. That is, the number of meals is a function f of d. This function is equivalent to the green grid or the kernel in the previous example.

\color{green} f(d)

Then we define another function g that tracks the number of people participating in the tour that day. This function is equivalent to the blue grid.

\color{blue} g(d)

You might remember that to calculate the convolution, we reversed the blue grid. This means we need to define g as a function of negative d.

\color{blue} g(-d)

Lastly, we also need to track what day it is in absolute terms to know how many people arrive that day. We introduce the variable t, which is a running count of the days that helps us track how many people arrive each day. **The variable t is not the same as d. Variable d tells us on what day of the tour plan we are, whereas t tells us what day it is in terms of the arriving visitors.**

To find the correct number of people arriving that day as defined in the blue grid, we need to input the current day t into g of negative d.

\color{blue} g(-d + t)

If this is a bit confusing, here is what this function tells us in plain English:

**On day t, we have g(-d + t) people who are on day d of the tour. For example, on day 3 we have 8 people who are on day 2 of their tour.**

**The function f(d) tells us that on day d of the tour we need to prepare f(d) meals per person. For example, on day 2 of the tour, we need to prepare 2 meals per person.**

Multiplying the two functions gives us the number of meals needed for the people who are on day d of their tour. Since we want the total number of meals to prepare on day t, we have to sum over all days d of the tour.

This gives us the convolution operation for discrete variables.

\sum_{d=1}^{D} \color{green} f(d) \color{blue} g(-d + t)

In the previous example, we used the summation sign because days are discrete, countable instances.

In most textbooks, the convolution operation is defined for continuous functions. Therefore, we need to integrate over the two functions instead of calculating the sum.

\int \color{green} f(d) \color{blue} g(-d + t) \; dd

In most textbook definitions of the convolution operation, the order of d and t in the function g is reversed.

\int \color{green} f(d) \color{blue} g(t-d) \; dd

In mathematical notation, we write the convolution of two functions or signals like this.

f * g

The convolution operation is commutative, which means that we can change the order of the functions.

f * g = g*f
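To make the discrete convolution concrete, here is a minimal sketch in Python. The meal and arrival numbers below are illustrative stand-ins, not the figures from the grids above, but the `convolve` function implements exactly the sum from the formula.

```python
# Discrete convolution: out[t] = sum over d of f[d] * g[t - d].
# f is the kernel (meals per person on each tour day),
# g is the signal (people arriving each day).
def convolve(f, g):
    out = [0] * (len(f) + len(g) - 1)
    for t in range(len(out)):
        for d in range(len(f)):
            if 0 <= t - d < len(g):
                out[t] += f[d] * g[t - d]
    return out

meals_per_person = [3, 2, 1]  # kernel f: meals on tour days 1, 2, 3
arrivals = [4, 8, 2]          # signal g: arrivals on days 1, 2, 3

print(convolve(meals_per_person, arrivals))  # [12, 32, 26, 12, 2]
```

Because convolution is commutative, `convolve(arrivals, meals_per_person)` returns the same result.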

**Batch normalization is a technique for standardizing the inputs to layers in a neural network. Batch normalization was designed to address the problem of internal covariate shift, which arises as a consequence of updating multiple-layer inputs simultaneously in deep neural networks.**

When training a neural network, it will speed up learning if you standardize your data to a similar scale and normalize them to have a similar mean and variance. Otherwise, gradient descent might take much more time to move along one dimension where data is on a larger scale as compared to another dimension with a smaller scale. For more information, check out my post on feature scaling and data normalization.

But even with standardized/normalized datasets, your distribution will shift as you propagate the data through the layers in a neural network. The reason is that in each iteration, we update the weights for multiple connected functions computed by the layers simultaneously. But updates to the weights for a particular function happen under the assumption that all other functions are held constant. Over many layers and iterations, this gradually leads to an accumulation of unexpected changes resulting in shifts in the data.

For example, if you have normalized your data to mean 0 and variance 1, the shape of the distribution may change as you propagate through the layers of your neural network.

Mathematically, Goodfellow et al. illustrate this phenomenon with a simple example in their book “Deep Learning”. Ignoring the non-linear activation functions and the bias term, every layer multiplies the input by a weight w. Moving through 3 layers, you get the following output.

\hat y = xw_1w_2w_3

During backpropagation, we subtract the gradient times a learning rate from every weight.

\hat y = x(w_1-\alpha g_1)(w_2 - \alpha g_2)(w_3 - \alpha g_3)

If the weights are larger than 1, the computed term will grow exponentially. Even the subtraction of a small term in earlier layers has the potential to significantly affect updates to the later layers. This makes it uniquely hard to choose an appropriate learning rate.

Batch norm addresses the problem of internal covariate shift by correcting the shift in parameters through data normalization. The procedure works as follows.

You take the output a^[i-1] from the preceding layer, multiply it by the weights W, and add the bias b of the current layer. The variable i denotes the current layer.

z^{[i]} = W^{[i]} a^{[i-1]} + b^{[i]}

Next, you would usually apply the non-linear activation function that results in the output a^[i] of the current layer. When applying batch norm, you correct your data before feeding it to the activation function.

*Note that some researchers apply batch normalization after the non-linear activation function, but the convention is to do it before. We stick with the conventional use.*

To apply batch norm, you calculate the mean as well as the variance of your current z.

\mu = \frac{1}{m} \sum_{j=1}^m z_j

When calculating the variance, we add a small constant to the variance to prevent potential divisions by zero.

\sigma^2 = \frac{1}{m} \sum_{j=1}^m (z_j-\mu)^2 + \epsilon

To normalize the data, we subtract the mean and divide the expression by the standard deviation (the square root of the variance).

z^{[i]} = \frac{z^{[i]}-\mu}{\sqrt{\sigma^2}}

This operation scales the inputs to have a mean of 0 and a standard deviation of 1.

An important consequence of the batch normalization operation is that it neutralizes the bias term b. Since you are setting the mean equal to 0, the effect of any constant that has been added to the input prior to batch normalization will essentially be eliminated.

If we want to change the mean of the input, we can add a constant term β to all observations after batch normalization.

z^{[i]} = z^{[i]} + \beta

To change the standard deviation, we similarly multiply each observation with another constant γ.

z^{[i]} = \gamma z^{[i]}
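The normalize-then-rescale procedure above can be sketched in a few lines of NumPy. This is a simplified forward pass for a single mini-batch; γ and β are passed in as fixed values for illustration, whereas in a real network they would be learned.

```python
import numpy as np

def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-8):
    # mean and variance per feature over the mini-batch
    mu = z.mean(axis=0)
    var = z.var(axis=0) + eps      # epsilon guards against division by zero
    z_norm = (z - mu) / np.sqrt(var)
    return gamma * z_norm + beta   # rescale and shift

# a mini-batch of 3 examples with 2 features on very different scales
z = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
out = batch_norm(z)
print(out.mean(axis=0))  # per-feature means close to 0
print(out.std(axis=0))   # per-feature standard deviations close to 1
```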

In programming frameworks like TensorFlow, γ and β are learnable parameters that the BatchNormalization layer optimizes during training alongside the network’s weights.

In practice, we commonly use mini-batches for training neural networks. This implies that we are calculating the mean and variance for each mini-batch when applying batch normalization. Depending on the size of your mini-batch, your mean and variance for single mini-batches may differ significantly from the global mean and variance. For example, if you are using a mini-batch size of 8 observations, it is possible that you randomly pick 8 observations that are far apart and thus give you a higher variance.

This doesn’t matter so much at training time since you are using many mini-batches whose statistical deviations from the global mean and variance average each other out.

At test and inference time, you are typically feeding single observations to the model to make predictions. It doesn’t make sense to calculate the mean and variance for single observations. Due to the problems described in the previous paragraph, you also cannot simply take a mini-batch from the test set and calculate its mean and variance. Furthermore, you have to assume that you only get single examples at inference time. So calculating the mean and variance of the entire test dataset is also not an option.

Instead, people commonly either calculate the mean and variance on the entire training set or a weighted average across the mini-batches and use that at test time. If the training set is large enough, the statistics should be representative of the data the model will encounter at test and inference time (otherwise, the whole training process wouldn’t make much sense).

As stated previously, deep neural networks suffer from internal covariate shifts where the distribution of data changes. By normalizing the data after each layer, we effectively rescale the data back to a standard normal distribution (or a distribution with the mean and variance set in the hyperparameters).

In practice, adding batch normalization has been demonstrated to speed up learning by requiring fewer training steps and a larger learning rate. It also has a regularizing effect that, in some cases, makes dropout redundant.

Bringing the shift in the data distribution under control is believed to be the main factor behind batch normalization’s success. However, some researchers, such as the authors of the following paper, argue that this is a misunderstanding.

In this post, we will look at momentum and Nesterov momentum, two techniques that help gradient descent converge faster.

**Momentum is an optimization technique that helps the gradient descent algorithm converge. Momentum works by incorporating exponentially moving averages of recent observations into the gradient descent update, which helps overcome saddle points in the function space and reduce oscillations due to individual noisy gradients.**

**Nesterov momentum is an extension to the momentum algorithm for gradient descent that corrects the trajectory of momentum by incorporating the direction from previously accumulated gradients into the calculation of the current gradient.**

The mechanism behind the momentum algorithm is called an exponentially weighted moving average. As the name implies, you calculate an average of several observations rather than focusing solely on the most recent observation. For example, suppose you wanted to calculate the moving average of the price p of a certain stock over the past five days.

p = [12.30, 11.00, 10.20, 10.50, 11.10 ]

You take the sum of the stock prices over the last five days and divide it by 5.

p_{average} = \frac{12.30+ 11.00+ 10.20+ 10.50+ 11.10}{5} = 11.02

**Exponentially weighted averages incorporate into our estimates, in addition to the current observation, previous observations with exponentially declining importance.**

You need to choose a smoothing parameter β that determines how much weight you give to the previous observations compared to the current one. Then you can calculate the exponentially weighted average according to the following formula.

v_{current} = \beta v_{current-1} + (1-\beta) p_{current}

The parameter v is the weighted average from the previous step. In our stock price example, we can calculate the moving average from day one and day two. Let’s set β equal to 0.2. Since we are only on day two, the moving average of the first day is equivalent to the price of the first day.

v_{current} = 0.2 \times 12.30 + (1-0.2) \times 11 = 11.26

The value v_current is now the moving average of the first two observations. In the next step, we add the third observation and use v_current from the previous step.

v_{current} = 0.2 \times 11.26 + (1-0.2) \times 10.20 = 10.41

In summary, this process of recursively updating the current v is equivalent to a series in which the contribution of each point p declines exponentially with every timestep.

- On the first day the contribution of p1 to v1 is equivalent to 1.
- On the second day, the contribution of p1 to v2 is equivalent to β
- On the third day, the contribution of p1 to v3 is equivalent to β^2
- On the fourth day, the contribution of p1 to v4 is equivalent to β^3

Since β < 1, we have exponential decline rather than exponential growth.
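The recursive update above can be sketched in a few lines of Python, using the five stock prices and β = 0.2 from the example (β weights the previous average, matching the convention used in this post):

```python
def exp_weighted_average(prices, beta=0.2):
    # beta weighs the previous average, (1 - beta) the current observation
    v = prices[0]        # on day one, the average is just the first price
    averages = [v]
    for p in prices[1:]:
        v = beta * v + (1 - beta) * p
        averages.append(v)
    return averages

prices = [12.30, 11.00, 10.20, 10.50, 11.10]
print([round(v, 2) for v in exp_weighted_average(prices)])
# [12.3, 11.26, 10.41, 10.48, 10.98]
```

The second and third values match the 11.26 and 10.41 calculated by hand above.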

If we plot the moving average vs. the stock prices over the five days, we see that the average smoothes out the fluctuation in the price.

Here we are using a β value of 0.2, which means we give a weight of 80% to the current value while the previous values cumulatively only have a weight of 20%. The smoothing is even more pronounced if we increase the β value, thus giving more weight to the previous values. Here is the plot if we give the previous values a weight of 50%.

*Note that other explanations might multiply the weighted average v with (1-β) and the current value p with β. In this case, β denotes the weight you give to the current value, while in our version, it denotes the weight you give to the weighted average of the previous value. But the principle is the same.*

Stochastic gradient descent and mini-batch gradient descent tend to oscillate due to the stochasticity that comes from selecting only a small subset of the whole training dataset. The oscillations become especially problematic around local optima, where the cost function tends to have an elongated form.

Gradient descent takes much longer to converge and may end up jumping around the local optimum without fully converging. We need a mechanism that makes gradient descent move faster along the elongated side of the cost function and prevents it from oscillating too much. An approach to dampen out the oscillations and make gradient descent move faster along the longer x-axis is to add momentum to the gradient.

It is a bit like pushing a ball down an inclined and slightly bumpy surface. If you let the ball roll down and accelerate on its own, it might jump around a lot more due to all the bumps than if it arrived with some initial velocity. In the latter case, it would be less likely to be thrown off course due to more momentum based on the initial velocity.

We add momentum by adding an exponentially weighted average to the gradient based on the previous gradient values.

Recall the original gradient update rule. You take the derivative of the cost function with respect to the model parameters θ, multiply it by a learning rate, and subtract it from θ.

*In the context of a neural network, theta usually represents the weight and the bias.*

\theta_{new} = \theta - \alpha\frac{dJ}{d\theta}

We replace the current gradient with a velocity term called v, which represents a sum of the current gradient multiplied by its learning rate and the moving average over the previous gradients multiplied by the parameter *β* that determines how much weight you assign to the weighted average.

v_{current} = \beta v_{current-1} + \alpha\frac{dJ}{d\theta}

Then, we update the parameters θ with the current velocity term instead of the current gradient.

\theta_{current} = \theta _{current-1}- v_{current}
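As a minimal sketch, here is the momentum update applied to the toy cost J(θ) = θ², whose gradient is 2θ. The hyperparameter values are illustrative choices, not prescriptions.

```python
def momentum_step(theta, grad, v_prev, alpha=0.1, beta=0.9):
    # velocity: weighted average of past gradients plus the current gradient
    v = beta * v_prev + alpha * grad
    return theta - v, v

# minimize J(theta) = theta^2, with gradient dJ/dtheta = 2 * theta
theta, v = 5.0, 0.0
for _ in range(100):
    theta, v = momentum_step(theta, 2 * theta, v)
print(theta)  # oscillates at first, then settles near the minimum at 0
```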

Now stochastic gradient descent is more likely to look like the following illustration, with fewer oscillations and a smoother path towards the minimum.

The more successive gradients point in the same direction, the more momentum gradient descent develops in that direction. If the gradients point in different directions, you will run into more oscillations.

Nesterov momentum is an enhancement to the original momentum algorithm that allows gradient descent to look at the next gradient and apply a course correction.

How does this work?

In the original momentum algorithm, we first calculate the gradient with respect to the current θ, and then we add the velocity accumulated from the previous steps.

v_{current} = \beta v_{current-1} + \alpha\frac{dJ}{d\theta}

But we know the velocity from the previous time steps **before** we calculate the gradient of the current time step. Since the next step is a combination of the previous velocity and the current gradient, we already know a large part of what the next step is going to look like. The basic idea of Nesterov momentum is to incorporate this knowledge into the calculation of the next gradient.

Recall that the cost J is a function of the model parameter θ, which is why we can calculate the gradient with respect to θ. In the Nesterov momentum update, we subtract the accumulated velocity from the parameter θ before calculating the gradient.

v_{current} = \beta v_{current-1} + \alpha\frac{dJ(\theta - \beta v_{current-1})}{d\theta}

Then we apply the standard update.

\theta_{current} = \theta _{current-1}- v_{current}

This has the effect of reining in the gradient if it veers too much off its trajectory.
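The look-ahead step translates into a small change to the momentum update: we evaluate the gradient at θ − βv before accumulating the velocity. Again a toy sketch on J(θ) = θ² with illustrative hyperparameters:

```python
def nesterov_step(theta, grad_fn, v_prev, alpha=0.1, beta=0.9):
    # evaluate the gradient at the look-ahead point theta - beta * v_prev
    lookahead = theta - beta * v_prev
    v = beta * v_prev + alpha * grad_fn(lookahead)
    return theta - v, v

# minimize J(theta) = theta^2, with gradient dJ/dtheta = 2 * theta
theta, v = 5.0, 0.0
for _ in range(100):
    theta, v = nesterov_step(theta, lambda t: 2 * t, v)
print(theta)  # converges towards the minimum at 0
```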

In this post, we will discuss the three main variants of gradient descent and their differences. We look at the advantages and disadvantages of each variant and how they are used in practice.

**Batch gradient descent uses the whole dataset, known as the batch, to compute the gradient. Utilizing the whole dataset returns a gradient that points directly towards the local or global optimum, resulting in a smooth descent. Due to its computational inefficiency, batch gradient descent is rarely used in practice.**

**Stochastic gradient descent refers to the practice of performing gradient descent using single samples. Since the samples are randomly selected, the gradients will oscillate, making stochastic gradient descent less likely to get stuck in local minima than batch gradient descent. Furthermore, we don’t have to wait until the gradient on the entire dataset has been calculated, which makes it much more suitable for practical applications with big datasets.**

**Mini-batch gradient descent is the practice of performing gradient descent on small subsets of the training data. Using several samples reduces the oscillations inherent in stochastic gradient descent and leads to a smoother learning curve. At the same time, it is more efficient than batch gradient descent in practice since we don’t have to calculate the gradient of the entire dataset at once.**

The aim of all variants of gradient descent is to optimize the parameters of a function with the goal of reducing the cost. The algorithm attempts to achieve this by calculating the gradient of the cost function with respect to the parameters. The gradient will point in the direction of the steepest ascent.

By going in the opposite direction and subtracting the gradient from the current parameter values, we approach the optimal parameter values that minimize the cost function.

As soon as we move one step in the direction of the steepest descent, the gradient will change. Therefore, we need to iteratively recalculate the gradient after every couple of steps. If we take too many steps in one direction or our step size is too big, we may overshoot the minimum.

*If you want a more detailed introduction to gradient descent, check out my posts on the math behind gradient descent and a step-by-step explanation of backpropagation with gradient descent.*

Assume we want to learn a function f that takes our input data x and a parameter vector theta as inputs and aims to produce an output y.

\hat y = f(x, \theta)

The inputs x and expected outputs y represent our dataset, each containing I samples.

x = [x_1, x_2, ...,x_I] \\ y = [y_1, y_2, ...,y_I]

We want to optimize the parameters theta using gradient descent. So we define a cost function J that quantifies the difference between the output produced by f and the actual value y.

J(f(x, \theta) ,\;y)

In batch gradient descent, we calculate the gradient of the cost function over the entire dataset.

\nabla J(f(x, \theta) ,\;y)

Then we subtract the gradient from our current value of theta multiplied by a step size alpha (we use alpha to scale down the gradient and prevent us from overshooting the minimum).

\theta = \theta - \alpha \nabla J(f(x, \theta) ,\;y)

In practice, batch gradient descent is not a viable option due to two reasons.

Since we are calculating the gradient of the entire training dataset, gradient descent leads us smoothly to the nearest minimum. If the training problem is convex, meaning that there is only one minimum, this isn’t a problem. But imagine we had a function with multiple minima like the following.

Batch gradient descent might get stuck in the nearest local minimum without finding the global minimum. This is equivalent to getting stuck between two hills in the dark on the way into the valley. To get to the valley, you first have to climb over another hill. The strategy of always going downhill until you reach the valley breaks down at this point.

As we will see, stochastic gradient descent and mini-batch gradient descent are better equipped to get gradient descent out of local minima.

Now imagine our dataset consisted of 100 000 training examples with 5 features. Calculating the gradient across all of these examples and features requires 500 000 calculations at every training iteration.

Remember that we have to recalculate the gradient after every step we take. When training neural networks, we commonly take several hundred or thousands of steps called training iterations.

Let’s say we use 1000 iterations. In total, we would have to perform 500 million calculations.

5 \times 100\;000 \times 1000 = 500 \;000\;000

This approach would take a very long time to converge. Furthermore, you are likely to run into memory limitations depending on the hardware you are using since the entire dataset has to be loaded into memory and used for computation.

Stochastic gradient descent calculates gradients on single examples rather than the whole batch. So instead of using the entire data vectors x and y, you pick a random sample consisting of x_i and y_i, calculate the gradient, and perform the update.

\theta = \theta - \alpha \nabla J(f(x_i, \theta) ,\;y_i)

Instead of 500 million calculations, we only have to perform 5000 calculations, which is much faster.

Furthermore, the random choice of a training example introduces randomness into the training process. Since you are not calculating the gradient over the entire dataset, you won’t get a gradient that smoothly leads you straight to the next minimum. Single examples may have different gradients, so the loss might even increase temporarily, and the learning graph showing the loss over several iterations will oscillate much more.

In aggregate, you will still move towards the minimum because the average of the steps you take will point you towards the minimum. Furthermore, the oscillations in the gradient values will help you get out of local minima.

If you were walking down a mountain at night and you had a device that calculated the slope across the entire trail from the peak to the valley, it would always point you straight downwards. If you followed the guidance of the device and you got stuck between two hills, you would be stuck there forever. According to your device, you would have to go straight down rather than first climbing over the next hill. That is the problem with batch gradient descent.

Now imagine your device always calculated the slope based on some random section of the trail. Since some sections of the trail will temporarily lead uphill, the device will sometimes tell you to go uphill. Thus, it will help you get out of the situation where you are stuck between two hills and eventually lead you to the valley.

Thus, stochastic gradient descent effectively addresses the computational inefficiency and tendency to get stuck in local minima that we encountered with batch gradient descent. Because it calculates gradients on single samples, it is also appropriate for online learning in a production system, where you continuously have new data flowing in.

But due to the stochasticity and the resulting fluctuations in the cost, stochastic gradient descent has a harder time converging to the exact minimum. Instead, it is likely to overshoot the minimum repeatedly without properly settling down.

An approach to address this problem is to gradually reduce the learning rate or step size as gradient descent converges on the minimum. That way, the steps become smaller the closer the algorithm gets to the minimum, making it less likely to overshoot.

Mini batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent that avoids the computational inefficiency and tendency to get stuck in the local minima of the former while reducing the stochasticity inherent in the latter. Accordingly, it is most commonly used in practical applications.

Rather than performing gradient descent on every single example, we randomly pick a subset of size n of the batch (without replacement) and perform gradient descent.

\theta = \theta - \alpha \nabla J(f(x_{i:i+n}, \theta) ,\;y_{i:i+n})
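As a sketch of the update above, here is mini-batch gradient descent fitting a single slope parameter θ to synthetic data y ≈ 3x under a mean squared error cost. The dataset, batch size, and learning rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic dataset: y = 3x + noise; gradient descent should recover theta close to 3
x = rng.uniform(-1, 1, size=1000)
y = 3.0 * x + 0.1 * rng.normal(size=1000)

theta, alpha, batch_size = 0.0, 0.1, 32
for epoch in range(20):
    perm = rng.permutation(len(x))          # shuffle: sample without replacement
    for i in range(0, len(x), batch_size):
        idx = perm[i:i + batch_size]
        xb, yb = x[idx], y[idx]
        grad = 2 * np.mean((theta * xb - yb) * xb)  # dJ/dtheta for the MSE cost
        theta -= alpha * grad
print(theta)  # close to 3
```

Plotting the loss per mini-batch for such a run shows the oscillations discussed above: individual batches occasionally increase the loss, but the trend moves towards the minimum.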

The size of the mini-batch is usually chosen as a power of two, such as 32, 64, 128, 256, or 512. The more memory you have available on your hardware, the larger the mini-batches you can pick, although the ability of the model to generalize might suffer with larger mini-batches.

Since we are using several training examples to calculate the gradient, the stochasticity is reduced, and the algorithm is less likely to overshoot. Still, overshooting can occur if the learning rate is too big, while learning will become slow if it is too small. Choosing an appropriate learning rate remains a challenge with mini-batch gradient descent, which is more art than science and mainly comes down to experience.

In this post, we develop an understanding of why gradients can vanish or explode when training deep neural networks. Furthermore, we look at some strategies for avoiding exploding and vanishing gradients.

**The vanishing gradient problem describes a situation encountered in the training of neural networks where the gradients used to update the weights shrink exponentially. As a consequence, the weights are not updated anymore, and learning stalls.**

**The exploding gradient problem describes a situation in the training of neural networks where the gradients used to update the weights grow exponentially. This prevents the backpropagation algorithm from making reasonable updates to the weights, and learning becomes unstable.**

Suppose you have a very deep network with several dozens of layers. Each layer multiplies the input x with a weight w and sends the result through an activation function. For the sake of simplicity, we will ignore the bias term b.

a_1 = \sigma(w_1x) \rightarrow a_2 = \sigma(w_2a_1) \rightarrow a_3 = \sigma(w_3a_2)

Further, suppose the activation function just passes through the term without performing any non-linear transformations. Essentially, every layer in the neural network just multiplies another weight to the current term.

Assume the weights are all equal to 0.6. This means that with every additional layer, we multiply the weight by itself.

By the time you get to the third layer, you need to take the weight to the power of 3.

w^3 = 0.6^3 = 0.21

Now assume you have a network with 15 layers. The accumulated term now equals w to the power of 15.

w^{15} = 0.6^{15} = 0.00047

As you can see, the weight is now vanishingly small, and it further shrinks exponentially with every additional layer.

The reverse is true if your initial weight is larger than 1, let’s say 1.6.

w^{15} = 1.6^{15} = 1152.92

Now, you have a problem with exploding gradients.

Every weight is actually a matrix of weights that is randomly initialized. A common procedure for weight initialization is to draw the weights randomly from a Gaussian distribution with mean 0 and variance 1. This means roughly 2/3 of the weights will have absolute values smaller than 1 while 1/3 will be larger than 1.

In summary, the further your values are from 1 or 0, the more quickly you run into either vanishing or exploding gradients when you have long computational graphs, such as in deep neural networks.

The previous example was mainly intended to illustrate the principle behind vanishing and exploding gradients and how it depends on the number of layers using forward propagation. In practice, it affects the gradients of the non-linear activation functions that are calculated during backpropagation.

Let’s have a look at an example using a highly simplified neural network with two layers.

During backpropagation, we move backward through the network, calculating the derivative of the cost function J with respect to the weights in every layer. We use these derivatives to update the weights at every step. For more detailed coverage of the backpropagation algorithm, check out my post on it.

The derivatives are calculated and backpropagated using the chain rule of calculus. To obtain the gradient of the weight w2 in the illustration above, we first need to calculate the derivative of the cost function J with respect to the predicted value y hat.

\frac{dJ}{d\hat y}

Y hat results from the activation function in the last layer, which takes z2 as an input. So we need to calculate the derivative of Y hat with respect to z2.

\frac{d\hat y}{ dz_2}

Z2 is a product of the weight w2 and the output from the previous layer a1. Accordingly, we calculate the derivative of z2 with respect to w2.

\frac{ dz_2} {dw_2}

Now we can calculate the gradient needed for adjusting w2, the derivative of the cost with respect to the w2, by chaining the previous three derivatives together using the chain rule.

\frac{dJ}{dw_2} = \frac{dJ}{d\hat y} \frac{d\hat y}{ dz_2} \frac{ dz_2} {dw_2}

Remember that the second gradient in the chain is the derivative of the activation function. Let’s assume we use the logistic sigmoid as an activation function. If we evaluate the derivative, we get the following result:

\frac{d\hat y}{ dz_2} = \sigma(z_2)(1-\sigma(z_2)) = \frac{1}{1 + e^{-z_2}} \left( 1 - \frac{1}{1 + e^{-z_2}} \right)

Let’s assume the value z2 equals 5. Evaluating the expression, we get the following:

\frac{d\hat y}{ dz_2} = \sigma(5)(1-\sigma(5)) \approx 0.0066

As you see, the gradient is pretty small. If we have several of these expressions in our chain of derivatives, the gradient quickly shrinks to a number so small that it isn’t very useful for learning anymore.
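We can verify this number with a few lines of Python:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)); its maximum is 0.25 at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

print(round(sigmoid_derivative(5), 4))  # 0.0066, matching the text
print(sigmoid_derivative(0))            # 0.25, the largest possible value
```

Since every factor in the chain is at most 0.25, chaining n sigmoid derivatives shrinks the gradient at least as fast as 0.25 to the power of n.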

To obtain the gradient of the first weight, we need to go even further back with the chain rule.

\frac{dJ}{dw_1} = \frac{dJ}{d\hat y} \frac{d\hat y}{ dz_2} \frac{ dz_2} {da_1}\frac {da_1}{ dz_1} \frac {dz_1}{ dw_1}

Now we need to evaluate the derivatives of two activation functions. You can easily imagine what our gradient would look like if we had 15, 20, or even more layers and had to differentiate through all their respective activation functions.

We’ve seen in the previous example that the sigmoid activation function is prone to creating vanishing gradients, especially when several of them are chained together. This is due to the fact that the sigmoid function saturates towards 0 for large negative or towards 1 for large positive values.

As you may know, most neural network architectures nowadays use the rectified linear unit (ReLU) rather than the logistic sigmoid function as an activation function in the hidden layers.

The ReLU function returns simply the input value if the input is positive and 0 if the input is negative.

f(z) = \begin{dcases} 0 \;for \; z < 0\\ z \;for \; z \geq 0\\ \end{dcases}

The derivative of the ReLU is 1 for values larger than 0. This effectively addresses the vanishing gradient problem because multiplying 1 by itself many times over still gives you 1. For negative inputs, the ReLU is constant at 0, so its derivative there is 0. Strictly speaking, the ReLU is not differentiable at z = 0; as a convention, the derivative at that point is simply set to one of the one-sided values.

\frac{df(z)}{dz} = \begin{dcases} 0 \;for \; z < 0\\ 1 \;for \; z \geq 0\\ \end{dcases}
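In code, the ReLU and its (conventional) derivative look like this; the toy loop shows that chaining many layers leaves the gradient factor at 1 for positive inputs:

```python
def relu(z):
    return z if z >= 0 else 0.0

def relu_derivative(z):
    # convention from the definition above: 1 for z >= 0, else 0
    return 1.0 if z >= 0 else 0.0

# backpropagating through 15 layers that all received positive inputs
grad = 1.0
for _ in range(15):
    grad *= relu_derivative(2.0)
print(grad)  # 1.0 -- the gradient does not vanish
```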

If you need a nonzero gradient for negative inputs, you can use the leaky ReLU.

Another way to address vanishing and exploding gradients is through weight initialization techniques. In a neural network, we initialize the weights randomly. Techniques such as He initialization and Xavier initialization scale the randomly drawn weights to the size of the adjacent layers so that the variance of activations and gradients stays roughly constant across layers.

For every layer, weights are commonly sampled from a normal distribution with a mean of 0 and a standard deviation equal to 1.

\mu = 0 \;and\; \sigma = 1

This results in weights that are all over the place in a range from roughly -3 to 3.

He initialization samples weights from a normal distribution with the following parameters for mean and standard deviation, where n is the number of neurons in the preceding layer.

\mu = 0 \;and\; \sigma = \sqrt{\frac{1}{n_{l}}} \;or\; \sigma = \sqrt{\frac{2}{n_{l}}}

Xavier initialization uses the following parameter for weight initialization:

\mu = 0 \;and\; \sigma = \sqrt{\frac{2}{n_{l} + n_{l+1}}}

The adjusted standard deviation helps constrain the weights to a range that makes exploding and vanishing gradients less likely.
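A quick NumPy sketch of the sampling schemes above; the layer sizes are arbitrary, and we use the sqrt(2/n) variant of He initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_init(n_in, n_out):
    # standard normal: sigma = 1, weights spread roughly from -3 to 3
    return rng.normal(0.0, 1.0, size=(n_in, n_out))

def he_init(n_in, n_out):
    # sigma = sqrt(2 / n_in), scaled to the preceding layer
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def xavier_init(n_in, n_out):
    # sigma = sqrt(2 / (n_in + n_out)), scaled to both adjacent layers
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

w = he_init(512, 512)
print(w.std())  # close to sqrt(2 / 512), i.e. about 0.0625, instead of 1
```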

Another simple option is gradient clipping. Here, you define an interval within which you expect the gradients to fall. If a gradient exceeds the permissible maximum, you automatically set it to the upper bound of your interval. Similarly, if it falls below the permissible minimum, you set it to the lower bound.
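Element-wise gradient clipping is a one-liner with NumPy; the threshold of 1.0 below is an arbitrary illustrative choice:

```python
import numpy as np

def clip_gradients(grads, clip_value=1.0):
    # values above clip_value become clip_value; below -clip_value, -clip_value
    return np.clip(grads, -clip_value, clip_value)

grads = np.array([0.3, -5.2, 12.0, -0.7])
print(clip_gradients(grads))  # the large entries are capped at -1 and 1
```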

In this post, we will introduce dropout regularization for neural networks. We first look at the background and motivation for introducing dropout, followed by an explanation of how dropout works conceptually and how to implement it in TensorFlow. Lastly, we briefly discuss when dropout is appropriate.

Dropout regularization is a technique to prevent neural networks from overfitting. Dropout works by randomly disabling neurons and their corresponding connections. This prevents the network from relying too much on single neurons and forces all neurons to learn to generalize better.

Deep neural networks are arguably the most powerful machine learning models available to us today. Due to a large number of parameters, they can learn extremely complex functions. But this also makes them very prone to overfitting the training data.

Compared to other regularization methods such as weight decay or early stopping, dropout also makes the network more robust. This is because when applying dropout, you remove different neurons on every pass through the network. Thus, you are effectively training multiple networks with different compositions of neurons and averaging their results.

One common way of achieving model robustness in machine learning is to train a collection of models and average their results. This approach, known as ensemble learning, helps correct the mistakes produced by single models. Ensemble methods work best when the models differ in their architectures and are trained on different subsets of the training data.

In deep learning, this approach would become prohibitively expensive since training a single neural network already takes lots of time and computational power. This is especially true for applications in computer vision and natural language processing, where datasets commonly consist of many millions of training examples. Furthermore, there may not be enough labeled training data to train different models on different subsets.

Dropout mitigates these problems. Since the model drops random neurons with every pass through the network, it essentially creates a new network on every pass. But weights are still shared between these networks contrary to ensemble methods, where each model needs to be trained from scratch.

The authors who first proposed dropout (Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov) explain the main benefit of dropout as reducing the occurrence of co-adaptations between neurons.

Co-adaptations occur when neurons learn to fix the mistakes made by other neurons on the training data. The network thus becomes very good at fitting the training data. But it also becomes more brittle because the co-adaptations are so attuned to the peculiarities of the training data that they won’t generalize to the test data.

Here is the original article on dropout regularization if you are interested in learning more details. It is definitely worth a read!

To apply dropout, you need to set a retention probability for each layer. The retention probability specifies the probability that a unit is not dropped. For example, if you set the retention probability to 0.8, the units in that layer have an 80% chance of remaining active and a 20% chance of being dropped.

Standard practice is to set the retention probability to 0.5 for hidden layers and to something close to 1, like 0.8 or 0.9 on the input layer. Output layers generally do not apply dropout.

In practice, dropout is applied by creating a mask for each layer and filling it with values between 0 and 1 produced by a random number generator. Each neuron whose corresponding mask value falls below the retention probability is kept, while the others are removed. For example, for the first hidden layer in the network above, we would create a mask with four entries.

Alternatively, we could also fill the mask with random boolean values according to the retention probability. Neurons with a corresponding “True” entry are kept while those with a “False” value are discarded.

Dropout is only used during training to make the network more robust to fluctuations in the training data. At test time, however, you want to use the full network in all its glory. In other words, you do not apply dropout with the test data and during inference in production.

But that means your neurons will receive more connections and therefore more activations during inference than what they were used to during training. For example, if you use a dropout rate of 50%, dropping two out of four neurons in a layer during training, the neurons in the next layer will receive roughly twice the total activation during inference and thus become overexcited. Accordingly, the values produced by these neurons will, on average, be about twice as large as during training. To correct this overactivation at test and inference time, you multiply the weights of the overexcited neurons by the retention probability (1 – dropout rate) and thus scale them down.

The following graphic by the user Dmytro Prylipko on Data Science Stack Exchange nicely illustrates how this works in practice.

An alternative to scaling the activations at test and inference time by the retention probability is to scale them at training time.

You do this by dropping out the neurons and immediately afterward scaling them by the inverse retention probability.

activation \times \frac{1}{retention\, probability}

This operation scales the activations of the remaining neurons up to make up for the signal from the other neurons that were dropped.

This corrects the activations right at training time. Accordingly, it is often the preferred option.
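Here is a minimal NumPy sketch of this inverted dropout; the layer size and retention probability are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def inverted_dropout(activations, retention_prob):
    # Keep each neuron with probability retention_prob...
    mask = rng.random(activations.shape) < retention_prob
    # ...and scale the survivors by 1 / retention_prob so the
    # expected total activation matches the full network.
    return activations * mask / retention_prob

a = np.ones(10000)
dropped = inverted_dropout(a, 0.8)
# Roughly 20% of entries are zeroed; survivors are scaled from 1.0 up to 1.25.
```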

When it comes to applying dropout in practice, you are most likely going to use it in the context of a deep learning framework. In deep learning frameworks, you usually add an explicit dropout layer after the hidden layer to which you want to apply dropout with the dropout rate (1 – retention probability) set as an argument on the layer. The framework will take care of the underlying details, such as creating the mask.

Using TensorFlow, we start by importing the dropout layer, along with the dense layer and the Sequential API from Tensorflow in Python. If you don’t have TensorFlow installed, head over to the TensorFlow documentation for instructions on how to install it.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

In a simple neural network that consists of a sequence of dense layers, you add dropout to a dense layer by adding a “Dropout” layer right after it. The following code creates a neural network of two dense layers. We add dropout with a rate of 0.2 after the first dense layer and dropout with a rate of 0.5 after the second dense layer. We assume that our dataset has six dimensions, which is why we set the input shape parameter equal to 6.

model = Sequential([
    Dense(64, activation='relu', input_shape=(6,)),
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(3, activation='softmax')
])

Dropout is an extremely versatile technique that can be applied to most neural network architectures. It shines especially when your network is very big or when you train for a very long time, both of which put a network at a higher risk of overfitting.

When you have very large training data sets, the utility of regularization techniques, including dropout, declines because the network has more data available to learn to generalize better. When the number of training examples is very limited (<5000 according to the original dropout article linked above), other techniques are more effective. Here, again, I suggest reading the original dropout article for more information.

Dropout is especially popular in computer vision applications because vision systems almost never have enough training data. The most commonly applied deep learning models in computer vision are convolutional neural networks. However, dropout is not particularly useful on convolutional layers. The reason for this is that dropout aims to build robustness by making neurons redundant. A model should learn parameters without relying on single neurons. This is especially useful when your layer has a large number of parameters. Convolutional layers have far fewer parameters and therefore generally need less regularization.

Accordingly, in convolutional neural networks, you will mostly find dropout layers after fully connected layers but not after convolutional layers. More recently, dropout has largely been replaced by other regularizing techniques such as batch normalization in convolutional architectures.

Weight decay is a regularization technique in deep learning. Weight decay works by adding a penalty term to the cost function of a neural network which has the effect of shrinking the weights during backpropagation. This helps prevent the network from overfitting the training data as well as the exploding gradient problem.

In neural networks, there are two types of parameters that can be regularized: the weights and the biases. The weights directly shape the relationship between the inputs and the outputs learned by the neural network because they are multiplied by the inputs. Mathematically, the biases merely offset that relationship by a constant, much like an intercept. Therefore, we usually only regularize the weights.

The L2 penalty is the most commonly used regularization term for neural networks. You apply L2 regularization by adding the sum of the squared weights, multiplied by a hyperparameter lambda that you pick manually, to the error term E.

C = E + \frac{\lambda}{2n} \sum_{i=1}^{n}w_i^2

The full equation for a cost function would look like this, where the function L represents a loss function such as cross-entropy or mean squared error.

C(w, b) = \frac{1}{n} \sum_{i=1}^{n} L(\hat y_i, y_i) + \frac{\lambda}{2n} \sum_{i=1}^{n}w_i^2

But how does this term shrink the weights during backpropagation? To understand this, let’s look at the cost function for a single training example. This allows us to drop the summation and the averaging over n examples. For the sake of the demonstration, we also ignore b since it is not regularized.

During backpropagation, we calculate the gradient of the cost function with respect to w.

\frac{\partial C}{\partial w_i} = \frac{\partial E}{\partial w_i} + \frac{\partial}{\partial w_i} (\frac{\lambda}{2} w_i^2)

Differentiating the penalty term with respect to the weight gives us the following outcome (we originally divided λ by 2 so it would cancel out the 2 from the differentiation)

\frac{\partial C}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i

When performing updates to the weight during gradient descent, we subtract the gradient multiplied by the learning rate α from the weight.

w_{i_{updated}} = w_i - \alpha \frac{\partial C}{\partial w_i}

w_{i_{updated}} = w_i - \alpha (\frac{\partial E}{\partial w_i} + \lambda w_i)

So the regularized weight update shrinks the weights by the gradient plus a penalty term rather than by the gradient alone (in practice, you scale both by a learning rate α to get gradient descent to converge).
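As a small numeric sketch of this update (all values are made up for illustration):

```python
def l2_weight_update(w, grad_e, lambd, alpha):
    # Weight decay update: subtract the data gradient plus the
    # penalty term lambda * w, both scaled by the learning rate alpha.
    return w - alpha * (grad_e + lambd * w)

w, grad_e, lambd, alpha = 2.0, 0.5, 0.1, 0.1
plain = w - alpha * grad_e          # update without regularization
decayed = l2_weight_update(w, grad_e, lambd, alpha)
print(round(plain, 4), round(decayed, 4))  # 1.95 1.93
```

The regularized update pulls the weight slightly further toward zero than the plain gradient step.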

When applying weight decay with the L1 norm, you use the sum of the absolute values of the weights rather than the sum of their squares.

C = E + \frac{\lambda}{n} \sum_{i=1}^{n} |w_i|

The crucial difference between L1 and L2 regularization is that the L1 norm can completely eliminate parameters by setting the weights equal to zero.

For an intuitive understanding of why this is the case, consider that squaring values that are smaller than one, such as the weights, makes them even smaller. The closer the weight gets to zero, the more the penalty term shrinks relative to the weight, so the L2 update never quite reaches zero. The L1 penalty is not squared. Instead, it subtracts a constant amount from the weight in every iteration, which can drive the weight all the way to zero.

As a consequence, L1 regularized weight matrices end up becoming sparse, which means many of their entries equal zero.
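A tiny sketch (the learning rate and lambda are arbitrary) shows why L1 can reach exactly zero while L2 cannot:

```python
import math

def l2_step(w, lambd, alpha):
    # L2 shrinks the weight by an amount proportional to the weight itself
    return w - alpha * lambd * w

def l1_step(w, lambd, alpha):
    # L1 subtracts a constant amount (lambda times the sign of w)
    return w - alpha * lambd * math.copysign(1.0, w)

w_l2 = w_l1 = 0.05
for _ in range(100):
    w_l2 = l2_step(w_l2, 0.5, 0.1)
    w_l1 = max(0.0, l1_step(w_l1, 0.5, 0.1))  # clamp at zero once crossed

print(w_l1, w_l2 > 0)  # 0.0 True
```

After 100 steps, the L1-penalized weight is exactly zero, while the L2-penalized weight is tiny but still nonzero.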

In practice, the dataset you use to train a neural network will usually contain patterns that are reflective of the problem for which the network is being trained as well as random fluctuations that have no explanatory value.

Our goal is to learn as much about the patterns while getting the network to ignore the random fluctuations.

The more big weights we have, the more active our neurons will be. They will use that additional power to fit the training data as closely as possible. As a consequence, they are more likely to pick up more of the random noise.

Shrinking the weights has the practical effect of effectively deactivating some of the neurons by shrinking their weights close to zero. The larger you set the regularization parameter lambda, the smaller the weights will become. But note that contrary to L1 regularization, L2 regularization doesn’t completely set the weights equal to zero. So when using L2 regularization, the neurons technically are still active, but their impact on the overall learning process will be very small.

With its reduced power, the network needs to focus more on patterns that frequently occur throughout the dataset and are thus more likely to be a manifestation of the actual problem the network is trying to model. As a result, the model will also become smoother, meaning that the outputs change more slowly in response to changing inputs.

First, we have to import the layers and regularizers from TensorFlow.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2

Regularization in TensorFlow is fairly straightforward. You only need to set the parameter “kernel_regularizer” on the layer that you want to regularize to the chosen penalty (l1 or l2) along with the regularization parameter.

Here, we are creating a Dense layer with L2 regularization and a regularization parameter of 0.1:

Dense(128, kernel_regularizer=l2(0.1), activation="relu")

Similarly, we can apply L1 regularization:

Dense(128, kernel_regularizer=l1(0.1), activation="relu")

We can string several regularized layers together into a simple neural network using the Sequential API. Here, our input data has 3 dimensions, which is why we set the input_shape to 3 in the first dense layer. We call the regularization parameter “lambd” since lambda is a reserved keyword in Python.

lambd = 0.1
model = Sequential([
    Dense(64, kernel_regularizer=l2(lambd), activation="relu", input_shape=(3,)),
    Dense(128, kernel_regularizer=l2(lambd), activation="relu"),
    Dense(128, kernel_regularizer=l2(lambd), activation="relu"),
    Dense(128, kernel_regularizer=l2(lambd), activation="relu"),
    Dense(64, kernel_regularizer=l2(lambd), activation="relu"),
    Dense(64, kernel_regularizer=l2(lambd), activation="relu"),
    Dense(64, kernel_regularizer=l2(lambd), activation="relu"),
    Dense(3, activation='softmax')
])

Before training a neural network, there are several things we should do to prepare our data for learning. Normalizing the data by performing some kind of feature scaling is a step that can dramatically boost the performance of your neural network. In this post, we look at the most common methods for normalizing data and how to do it in Tensorflow. We also briefly discuss the most important steps to take in addition to normalization before you even think about training a neural network.

**Normalization in deep learning refers to the practice of transforming your data so that all features are on a similar scale, usually ranging from 0 to 1. This is especially useful when the features in a dataset are on very different scales.**

*Note that the term data normalization also refers to the restructuring of databases to bring tables into a normal form. Always consider the context!*

In a dataset, features usually have different units and, therefore, different scales. For example, a dataset used to predict housing prices may contain the size in sqft or square meters, the number of bedrooms and bathrooms on a simple numeric scale, the age of the property in years, etc.

The size of the properties available in a particular area might range from 500 to 10,000 square feet, while the number of bathrooms ranges from 1 to 5. This means you have not only very different scales but also very different variances.

Since the update to the weights depends on the input values, gradient descent will update some weights much faster than others. This makes it harder to converge on an optimal value.

As a general rule, I would always normalize the data. Doing so can dramatically improve the performance of your model, while not normalizing will almost never hurt your model. It is especially important when the algorithm you apply involves gradient descent.

The most commonly applied type of normalization transforms all features to have a mean of 0 and a standard deviation of 1.

*In machine learning, we usually operate under the assumption that features are distributed according to a Gaussian distribution. The standard Gaussian bell curve, also known as the standard normal distribution, has a mean of 0 and a standard deviation of 1.*

Setting the mean to 0 is achieved by calculating the current mean for each variable x (each of which has n entries).

\mu = \frac{1}{n} \sum ^n_{i=1} x_i

Then you subsequently subtract the mean from each variable x to obtain your rescaled x with a new mean of 0.

x = x - \mu

Next, you need to calculate the standard deviation.

\sigma = \sqrt{\frac{1}{n} \sum ^n_{i=1} (x_i- \mu)^2 }

*Note that once the mean has been subtracted, it equals zero, so the (x_i − μ) term in this formula reduces to x_i.*

Lastly, you divide your variables (features) by the standard deviation.

x = \frac{x}{\sigma}

If you have a background in statistics, you might recognize that this process is equivalent to calculating z scores that are commonly used for constructing confidence intervals.

z = \frac{x-\mu}{\sigma}

It is, therefore, sometimes referred to as z-score normalization.
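Putting the three steps together in NumPy (the feature values here are made up for illustration):

```python
import numpy as np

x = np.array([500.0, 1200.0, 3400.0, 9800.0])  # e.g. property sizes in sqft

mu = x.mean()         # current mean
sigma = x.std()       # current standard deviation
z = (x - mu) / sigma  # z-score normalization

print(round(float(abs(z.mean())), 6), round(float(z.std()), 6))  # 0.0 1.0
```

After the transformation, the feature has a mean of 0 and a standard deviation of 1, regardless of its original scale.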

In Tensorflow, you can normalize your data by adding a normalization layer.

import tensorflow as tf
import numpy as np

norm = tf.keras.layers.experimental.preprocessing.Normalization()

When you pass your training data to the normalization layer, using the adapt method, the layer will calculate the mean and standard deviation of the training set and store them as weights. It will apply normalization to all subsequent inputs based on these weights. So if we use the same dataset, it will perform normalization as described above.

norm.adapt(data)  # compute the mean and standard deviation of the data
norm(data)        # passing the same data used for adaptation normalizes it

If we pass different data as input, it will apply normalization based on the mean and standard deviation of the data passed for adaption.

adapt_data = np.array([2., 3., 12.], dtype='float32')
input_data = np.array([5., 6., 7.], dtype='float32')

norm = tf.keras.layers.experimental.preprocessing.Normalization()
norm.adapt(adapt_data)
norm(input_data)

Tensorflow thus makes it easy to normalize your data as part of the model by simply passing in a normalization layer at the appropriate locations.

Sometimes, min-max scaling is applied as an alternative to normalization. Note that min-max scaling is also often referred to as normalization. You perform min-max scaling according to the following formula.

x = \frac{x-x_{min}}{x_{max}-x_{min}}

Since you are dividing by the maximum difference between values in your dataset, all of your values should fall between 0 and 1.
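A quick NumPy sketch with the same made-up feature values as above:

```python
import numpy as np

x = np.array([500.0, 1200.0, 3400.0, 9800.0])

# Min-max scaling: subtract the minimum, divide by the range
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled.min(), x_scaled.max())  # 0.0 1.0
```

The smallest value maps to exactly 0 and the largest to exactly 1, with everything else in between.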

This is different from the standard normalization discussed above, where you merely ensure that your values have a standard deviation equal to one. This means that with standard normalization, you will generally have larger spreads and values greater than 1 in your data.

In neural networks, you generally should use data where observations lie in a range between 0 and 1. In the context of deep learning, min-max normalization should therefore be your first choice.

For example, images usually have three color channels with pixel values ranging from 0 to 255. So if you are training a neural network on an image classification task, it is good practice to scale your pixels to a value between 0 and 1 by dividing by 255.

In Tensorflow, this can easily be done by including a rescaling layer as the first layer in your network.

import numpy as np
import tensorflow as tf

data = np.array([5., 6., 7.], dtype='float32')
min_max = tf.keras.layers.experimental.preprocessing.Rescaling(1./255)
min_max(data)

When dealing with images, you should also pass the input shape of your image, usually consisting of its width, height, and the number of channels (3 for RGB).

min_max = tf.keras.layers.experimental.preprocessing.Rescaling(1./255, input_shape=(width, height, 3))

Before you standardize/normalize your data and train a neural network, you need to make sure that the following requirements are met. These steps are an essential part of any kind of data preprocessing you are likely to perform in the context of a deep learning project.

Real-world datasets are rarely complete. Therefore, you need a strategy for handling missing values, such as removing or imputing them.

Often your data will come in the form of text strings. For example, categories might be described using words rather than ordinal numbers. You need to convert those features consisting of non-numeric data to numeric entries.

Neural networks expect a standard size and format. For instance, in the context of image processing, this means all images should have standard width, height, and number of channels.

This post introduces the most common loss functions used in deep learning.

The loss function in a neural network quantifies the difference between the expected outcome and the outcome produced by the machine learning model. From the loss function, we can derive the gradients which are used to update the weights. The average over all losses constitutes the cost.

A machine learning model such as a neural network attempts to learn the probability distribution underlying the given data observations. In machine learning, we commonly use the statistical framework of maximum likelihood estimation as a basis for model construction. This basically means we try to find a set of parameters and a prior probability distribution such as the normal distribution to construct the model that represents the distribution over our data. If you are interested in learning more, I suggest you check out my post on maximum likelihood estimation.

Cross-entropy-based loss functions are commonly used in classification scenarios. Cross entropy is a measure of the difference between two probability distributions. In a machine learning setting using maximum likelihood estimation, we want to calculate the difference between the probability distribution produced by the data generating process (the expected outcome) and the distribution represented by our model of that process.

The resulting difference is called the loss. It grows rapidly as the prediction diverges from the actual outcome.

If the actual outcome is 1, the model should produce a probability estimate that is as close as possible to 1 to reduce the loss as much as possible.

If the actual outcome is 0, the model should produce a probability estimate that is as close as possible to 0.

As you can see on the plot, the loss grows without bound as the prediction approaches absolute certainty in the wrong value. Conversely, the closer the estimate gets to the actual outcome, the more the returns diminish.

Cross entropy is also referred to as the negative log-likelihood.

As the name implies, the binary cross-entropy is appropriate in binary classification settings with exactly two potential outcomes. The loss is calculated according to the following formula, where y represents the expected outcome, and y hat represents the outcome produced by our model.

L = -(y_i \; log(\hat y_i) + (1-y_i)log(1-\hat y_i))

To make this concrete, let’s go through an example.

Let’s say you are training a neural network to determine whether a picture contains a cat. The outcome is either 1 (there is a cat) or 0 (there is no cat). You have two pictures; the first contains a cat, while the second does not. The neural network is 80% confident that the first image contains a cat.

y = 1 \\ \hat y = 0.8

If we plug the first estimate and the expected outcome into our cross-entropy loss formula (using the natural logarithm), we get the following:

L = -(1 \; log( 0.8) + (1-1)log(1-0.8)) \approx 0.22

The neural network is 10% confident that the second image contains a cat. In other words, the neural network gives you a 90% probability that the second image does not contain a cat.

y = 0 \\ \hat y = 0.1

If we plug the last estimate and expected outcome into the formula, we get the following.

L = -(0 \; log( 0.1) + (1-0)log(1-0.1)) \approx 0.11

The function is designed so that either the first or the second term equals zero. You obtain the cost by averaging the loss over all examples.

C = -\frac{1}{N} \sum_{i=1}^{N}(y_i \; log(\hat y_i) + (1-y_i)log(1-\hat y_i))
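The cat example above can be reproduced in a few lines of NumPy (using the natural logarithm):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # Per-example loss; one of the two terms is always zero
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0])      # first image: cat, second image: no cat
y_hat = np.array([0.8, 0.1])  # model's predicted probability of "cat"

losses = binary_cross_entropy(y, y_hat)
cost = losses.mean()
print(losses.round(2), round(float(cost), 2))  # [0.22 0.11] 0.16
```

Note that this naive version blows up if a prediction is exactly 0 or 1; framework implementations clip the probabilities to avoid taking the log of zero.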

The binary cross-entropy is appropriate in conjunction with activation functions such as the logistic sigmoid that produce a probability relating to a binary outcome.

The categorical cross-entropy is applied in multiclass classification scenarios. In the formula for the binary cross-entropy, we multiply the actual outcome with the logarithm of the outcome produced by the model for each of the two classes and then sum them up. For categorical cross-entropy, the same principle applies, but now we sum over more than two classes. Given that M is the number of classes, the formula is as follows.

L = -\sum^M_{j=1} y_j\log(\hat y_j)

Assume that we have a neural network that learns to classify pictures into three classes: whether they contain a rabbit, a cat, or a dog. To represent each sample, your expected outcome y is a vector with one entry per class. The entry that corresponds to the actual outcome is 1, while all others are zero. Let’s say the first image contains a dog.

y = \begin{bmatrix} 1\\ 0\\ 0 \end{bmatrix}

The vector of predictions contains probabilities for each outcome that need to sum to 1.

\hat y = \begin{bmatrix} 0.7\\ 0.2\\ 0.1 \end{bmatrix}

To calculate the loss for this particular image of a dog, we plug these values into the formula.

L = -(1\log(0.7) + 0\log(0.2)+0\log(0.1)) \approx 0.36

For the cost function, we average the loss over all the individual training examples.

C = -\frac{1}{N} \sum_{i=1}^{N}\sum^M_{j=1} y_{ij}\log(\hat y_{ij})
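The dog example above, as a NumPy sketch (natural logarithm):

```python
import numpy as np

def categorical_cross_entropy(y, y_hat):
    # y is one-hot, so only the term for the true class survives
    return -np.sum(y * np.log(y_hat))

y = np.array([1.0, 0.0, 0.0])      # one-hot label: the image shows a dog
y_hat = np.array([0.7, 0.2, 0.1])  # predicted class probabilities

print(round(float(categorical_cross_entropy(y, y_hat)), 2))  # 0.36
```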

The categorical cross-entropy is appropriate in combination with an activation function such as the softmax that can produce several probabilities for the number of classes that sum up to 1.

In deep learning frameworks such as TensorFlow or PyTorch, you may come across the option to choose sparse categorical cross-entropy when training a neural network.

Sparse categorical cross-entropy has the same loss function as categorical cross-entropy. The only difference is how you present the expected output y. If your y’s are in the same format as above, where every entry is expressed as a vector with 1 for the outcome and zeros everywhere else, you use the categorical cross-entropy. This is known as one-hot encoding.

If your y’s are encoded in an integer format, you would use sparse categorical cross-entropy. In the example above, a dog could be represented by 0, a cat by 1, and a rabbit by 2 in integer format.

Mean squared error is used in regression settings where your expected and your predicted outcomes are real-number values.

The formula for the loss is fairly straightforward. It is just the squared difference between the expected value and the predicted value.

L = (y_i - \hat y_i)^2

Suppose you have a model that helps you predict the price of oil per gallon. If the actual price is $2.89 and the model predicts $3.07, you can calculate the loss.

L = (2.89 - 3.07)^2 = 0.032

The cost is again calculated as the average over all losses for the individual examples.

C = \frac{1}{N} \sum_{i=1}^N (y_i - \hat y_i)^2
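The oil-price example as a NumPy sketch:

```python
import numpy as np

y = np.array([2.89])      # actual price per gallon
y_hat = np.array([3.07])  # predicted price

# Mean squared error: average of the squared differences
mse = np.mean((y - y_hat) ** 2)
print(round(float(mse), 3))  # 0.032
```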

This post will introduce the basic architecture of a neural network and explain how input layers, hidden layers, and output layers work. We will discuss common considerations when architecting deep neural networks, such as the number of hidden layers, the number of units in a layer, and which activation functions to use.

In our technical discussion, we will focus exclusively on simple feed-forward neural networks for the scope of this post.

The input layer accepts the input data and passes it to the first hidden layer. The input layer doesn’t perform any transformations on the data, so it usually isn’t counted towards the total number of layers in a neural network.

The number of neurons equals the number of features in the input dataset.

If you have multidimensional inputs, the input layer will flatten the images into one dimension. A network used for image classification requires images as input. A standard color RGB image usually has three dimensions: the width, the height, and three color channels. A greyscale image doesn’t need multiple color channels. Accordingly, two dimensions are sufficient. To feed a grayscale image to a neural network, you could transform every column of pixels into a vector and stack them on top of each other. A 4×4 grayscale image would thus require an input layer of 16 neurons.

You could further flatten across the color channels given a color image. Modern deep learning frameworks usually take care of the flattening for you. You just need to pass the image to the input layer, specify its dimensions, and the framework will handle the rest.
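A minimal NumPy sketch of flattening (the 4×4 image is a made-up example):

```python
import numpy as np

img = np.arange(16).reshape(4, 4)  # a 4x4 grayscale "image"

flat = img.flatten()  # one value per input neuron
print(flat.shape)     # (16,)

# A color image adds a channel dimension and flattens the same way
rgb = np.zeros((4, 4, 3))
print(rgb.flatten().shape)  # (48,)
```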

The goal of hidden layers is to perform one or more transformations on the input data that will ultimately produce an output that is close enough to the expected output. Hidden layers are where most of the magic happens that has put neural networks and deep learning at the cutting edge of modern artificial intelligence.

The transformations performed by hidden layers can be fairly complex such as mapping from a piece of text in one language to its translation in another language. How would you represent the abstract relationship between a piece of text in English and its equivalent in Chinese in a mathematical function that captures the semantic meaning and context, grammatical rules, and cultural nuance?

More traditional machine learning algorithms perform rather poorly at tasks such as language translation because they cannot adequately represent the complexity of the relationship.

Neural networks excel at this type of task because you can get them to learn mappings of almost arbitrary complexity by adding more hidden layers and varying the number of neurons.

A neural network also has the advantage over most other machine learning algorithms in that it can extract complex features during the learning process without the need to explicitly represent those features. This allows a network to learn to recognize objects in images or structures in language. The hidden layers act as feature extractors. For example, in a deep learning-based image recognition system, the earlier layers extract low-level features such as horizontal and vertical lines. The later layers build on these extracted features to construct higher-level ones. Once you reach the output layer, recognizable objects should have been extracted so that the output layer can determine whether the desired object is present or not. The number of hidden layers depends on the complexity of the task and is usually found through experimentation.

Hidden layers accept a vector of inputs from the preceding layer. Then they perform an affine transformation by multiplying with a weight term and adding a bias term.

z = W^Tx + b

Strictly speaking, the weights are attached to the connections between the preceding layer and the current layer, while the bias is added by the neuron itself.

To capture non-linear relationships in a mapping, they then push the output z through a non-linear activation function.

a = f(z)

The deep learning research community has come up with several activation functions such as the rectified linear unit (ReLU), leaky ReLU, or the hyperbolic tangent function (Tanh). In the overwhelming majority of cases, the ReLU is a great default choice.

By chaining together several of these operations through multiple hidden layers, you can represent highly complex relationships.
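A single hidden layer can be sketched in NumPy as follows; the weights, bias, and input are arbitrary illustrative values, and we use the ReLU as the non-linearity.

```python
import numpy as np

def relu(z):
    # Rectified linear unit: max(0, z) applied element-wise
    return np.maximum(0.0, z)

def hidden_layer(x, W, b):
    # Affine transformation z = W^T x + b, then the non-linearity
    return relu(W.T @ x + b)

x = np.array([1.0, -2.0])     # input vector
W = np.array([[0.5, -1.0],
              [0.25, 0.75]])  # shape (inputs, neurons)
b = np.array([0.1, 0.0])      # one bias per neuron

print(hidden_layer(x, W, b))  # [0.1 0. ]
```

The second neuron's pre-activation is negative, so the ReLU zeroes it out.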

The ReLU activation simply returns the input when it is positive and zero when it is negative.

a = max(0, z)

This looks very simple at first sight, especially compared to other activation functions like the logistic sigmoid. But in the context of neural networks, the simplicity of the ReLU has several advantages.

- It is less computationally expensive to evaluate since there is no need for calculating exponentials.
- The previously popular sigmoid and tanh functions saturate when the input is a huge positive or huge negative number. In the saturated regions, the derivative is close to zero, so when you differentiate through these functions and through several layers, the gradient shrinks until it hinders the convergence of gradient descent. This is known as the vanishing gradient problem. The ReLU is piecewise linear and does not saturate for positive inputs. This keeps gradients from vanishing and speeds up gradient descent.

Generally speaking, more complex functions tend to require more layers to represent them appropriately. If you are dealing with a machine translation or image recognition task, you’ll need more layers than if you were classifying a patient as at risk for heart disease based on eating habits, age, and body mass index. The latter example is a simple classification task for which a neural network with one layer (logistic regression) would be sufficient. The former examples require multiple stages of hidden feature extraction and data transformation.

Unfortunately, there is no exact formula to determine the number of hidden layers in a neural network.

Your best bet is to study the standard networks implemented by the research community in your domain that achieve the best performance. This should give you a decent idea of how many layers and how many neurons are appropriate. Beyond that, you need to experiment systematically by tweaking network architectures and figure out what works best for your specific problem.

In a simple multilayer perceptron, the hidden layers usually consist of so-called fully connected layers. They are called fully connected because each neuron in the preceding layer is connected to each neuron in the current layer.

In more advanced neural network architectures, you will find different types of layers.

The deep learning community has brought forth various layers for different purposes, such as convolutional layers and pooling layers in convolutional neural networks (primarily used in computer vision) or recurrent layers and attention layers in recurrent neural networks and transformers (mainly used in natural language processing). In future posts about more advanced neural network architectures for computer vision and natural language processing, I will discuss these layers.

The output layer produces the final output computed by the neural network, which you compare against the expected output. The number of neurons in the output layer equals the number of classes the prediction can fall into.

For example, if your task was to classify whether an image contained a cat, a dog, or a rabbit, you would have three output classes and thus three neurons.

Much like the hidden layers, the output layer computes an affine transformation based on the weights and biases from the incoming connections.

z = W^Tx + b

Next, it applies a non-linear activation function that, in a classification setting, converts the result into probability values. This implies that each individual value needs to be larger than zero but smaller than one, and in the multiclass case, all individual probabilities need to sum to one.

The most commonly used activation function in a binary classification setting is the logistic sigmoid, while in multiclass settings, the softmax is most frequently used.

The logistic sigmoid is an s-shaped function that asymptotically approaches 0 as the input z becomes very negative, grows quickly towards 1 around z = 0, and asymptotically approaches 1 as z becomes very positive.

\sigma(z) = \frac{1}{1+e^{-z}}

This makes it an ideal function for binary classification problems where each output can be assigned to one of two classes encoded as 0 and 1.

The softmax function generalizes the sigmoid to problems with an arbitrary number k of classes. The i-th output is the exponential of z_i normalized by the sum of the exponentials over all classes.

softmax(z)_i = \frac{e^{z_i}}{\sum^k_{j=1} e^{z_j}}
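A minimal sketch of both activation functions in NumPy (subtracting the maximum inside softmax is a common numerical-stability trick; it does not change the result):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the max before exponentiating for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p.sum())  # the probabilities sum to 1
```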

In this post, we develop a thorough understanding of the backpropagation algorithm and how it helps a neural network learn new information. After a conceptual overview of what backpropagation aims to achieve, we go through a brief recap of the relevant concepts from calculus. Next, we perform a step-by-step walkthrough of backpropagation using an example and understand how backpropagation and gradient descent work together to help a neural network learn.

**Backpropagation is an algorithm used in machine learning that calculates the gradient of the loss function with respect to the network's parameters. The negative gradient points us in the direction of values that reduce the loss. Backpropagation relies on the chain rule of calculus to calculate the gradient backward through the layers of a neural network. Using gradient descent, we can iteratively move closer to the minimum value by taking small steps in the direction of the negative gradient.**

In other words, backpropagation and gradient descent are two different methods that form a powerful combination in the learning process of neural networks.

For the rest of this post, I assume that you know how forward propagation in a neural network works and have a basic understanding of matrix multiplication. If you don’t have these prerequisites, I suggest reading my article on how neural networks learn.

During forward propagation, we use weights, biases, and nonlinear activation functions to calculate a prediction y hat from the input x that should match the expected output y as closely as possible (which is given together with the input data x). We use a cost function to quantify the difference between the expected output y and the calculated output y hat.

J = \frac{1}{n} \sum_{i=1}^{n} L(\hat y_i, y_i)

The goal of backpropagation is to adjust the weights and biases throughout the neural network based on the calculated cost so that the cost will be lower in the next iteration. Ultimately, we want to find a minimum value for the cost function. But how exactly does that work?

The adjustment works by finding the gradient of the cost function through the chain rule of calculus.

With calculus, we can calculate how much the value of one variable changes depending on the change in another variable. If we want to find out how a small change dx in a variable x affects a related variable y, we can use calculus to do that. The change dx in x would change y by dy.

In Calculus notation, we express this relationship as follows.

\frac{dy}{dx}

This is known as the derivative of y with respect to x.

The first derivative of a function gives you the slope of that function at the evaluated coordinate. If you have a function of several variables, you can take partial derivatives with respect to every variable and stack them in a vector. This gives you a vector that contains the slopes with respect to every variable. Collectively, these slopes point in the direction of the steepest ascent along the function. This vector is known as the gradient of the function. Going in the direction of the negative gradient gives us the direction of the steepest descent. Following the direction of steepest descent, we will eventually end up at a minimum value of the function.
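To make this concrete, here is a toy gradient descent on the function f(x, y) = x² + y², whose gradient is (2x, 2y). Repeatedly stepping along the negative gradient converges to the minimum at the origin; the starting point and learning rate below are arbitrary choices for illustration:

```python
import numpy as np

def grad(p):
    # gradient of f(x, y) = x^2 + y^2 is (2x, 2y)
    return 2 * p

p = np.array([3.0, -4.0])  # arbitrary starting point
lr = 0.1                   # step size (learning rate)

for _ in range(100):
    p = p - lr * grad(p)   # step in the direction of steepest descent

print(p)  # very close to the minimum at (0, 0)
```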

For a more thorough explanation of derivatives as slopes, partial derivatives, and gradients, check out my series of posts on calculus for machine learning.

In a neural network, we are interested in how the change in weights affects the error expressed by the cost J. Our ultimate goal is to find the set of weights that minimizes the cost and thus the error. Thus, our intermediate goal is to find the negative gradient of the cost function with respect to the weights, which points us in the direction of the desired minimum.

-\frac{dJ}{dW}

Let’s look at a concrete example with two layers, where each layer has one neuron. For the sake of simplicity, we will ignore the bias b for now.

We need to calculate the derivatives of the cost J with respect to w2 and w1 and update the respective weights by the derivative term by going back through the network.

But w2 and w1 are not directly related to the cost. To get to them, we have to differentiate through intermediate functions such as the activations computed by the neurons.

J is related to w2 through the following equations in the forward pass.

z_2 = w_2 a_1 \; \rightarrow \; \hat y = \sigma(z_2) \; \rightarrow \; J = \frac{1}{n} \sum_{i=1}^{n} L(\hat y, y)

The backward differentiation through several intermediate functions is done using the chain rule.

\frac{dJ}{dw_2} = \frac{dJ}{d\hat y} \frac{d\hat y}{ dz_2} \frac{ dz_2} {dw_2}

The derivative of J with respect to w2 gives us the value by which to adjust the weight w2.

*Note: You might also see versions of this formula that use the partial differentiation symbol instead of the d.*

\frac{\partial J}{\partial w_2} = \frac{\partial J}{\partial \hat y} \frac{\partial \hat y}{ \partial z_2} \frac{ \partial z_2} {\partial w_2}

*The partial derivative is the correct symbol in functions where you have multiple variables that can vary. For example, if you have a vector of many weights, you would use the partial derivative symbol. But to avoid unnecessary confusion, we will stick with the d notation throughout this post.*

Conceptually we now know what derivatives to evaluate and how to chain them together to get the desired weight adjustment. In this subsection, we will evaluate the derivatives. For implementing neural networks with a framework like TensorFlow or Pytorch, the conceptual understanding is sufficient. So feel free to skip ahead to the next section if you are not interested in the nitty-gritty math.

First, we have to evaluate the derivative of the cost J with respect to y hat.

The cost is the average loss over all training examples.

J(W, b) = \frac{1}{n} \sum_{i=1}^{n} L(\hat y_i, y_i)

We use the cross-entropy loss to calculate the difference between predicted and expected values.

L(\hat y, y) = -(y \; log(\hat y) + (1-y)log(1-\hat y))
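A quick sketch of the per-example cross-entropy loss in Python, with made-up predictions, shows how a confident wrong prediction is penalized much more heavily than a confident correct one:

```python
import numpy as np

def cross_entropy(y_hat, y):
    # per-example cross-entropy (negative log-likelihood)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# a confident correct prediction incurs a small loss ...
low = cross_entropy(0.9, 1)
# ... while a confident wrong prediction incurs a large one
high = cross_entropy(0.1, 1)
print(low, high)
```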

So the derivative of the cost looks as follows:

\frac{dJ}{d \hat y} = \frac{d}{d \hat y} \left( \frac{1}{n} \sum_{i=1}^{n} -(y \; log(\hat y) + (1-y)log(1-\hat y)) \right)

For a single training example, this reduces to:

= - \frac{d}{d \hat y} \left[ y \; log(\hat y) + (1-y)log(1-\hat y) \right]

\frac{dJ}{d \hat y} = -\frac{y}{\hat y} + \frac{1-y}{1- \hat y}

Next, we need to evaluate the derivative of the predicted outcome y hat with respect to z_2, the value fed to the last sigmoid activation.

\frac{d\hat y}{ dz_2} = \frac{d}{ dz_2} \sigma(z_2) = \frac{d}{ dz_2} \left(\frac{1}{1 + e^{-z_2}}\right)

The derivative of the sigmoid takes the well-known form:

\frac{d\hat y}{ dz_2} = \sigma(z_2)(1-\sigma(z_2)) = \hat y(1-\hat y)

Finally, we evaluate the derivative of z_2 with respect to the weight w_2.

\frac{ dz_2} {dw_2} = \frac{ d} {dw_2} (w_2a_1) = a_1

Combining it all, we get the following expression for the derivative of the cost with respect to w_2:

\frac{dJ}{dw_2} = \frac{dJ}{d\hat y} \frac{d\hat y}{ dz_2} \frac{ dz_2} {dw_2} = \left(-\frac{y}{\hat y} + \frac{1-y}{1- \hat y}\right) \hat y(1-\hat y) \, a_1
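We can sanity-check the chain-rule result numerically by comparing it to a finite-difference approximation of dJ/dw2 on the per-example loss. The values for a1, y, and w2 below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w2, a1, y):
    # forward pass for the single-example toy network: z2 -> y_hat -> loss
    y_hat = sigmoid(w2 * a1)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

a1, y, w2 = 0.7, 1.0, 0.5  # made-up values for illustration

# analytic gradient via the chain rule
y_hat = sigmoid(w2 * a1)
dJ_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)
dyhat_dz2 = y_hat * (1 - y_hat)
dz2_dw2 = a1
analytic = dJ_dyhat * dyhat_dz2 * dz2_dw2

# numerical gradient via central finite differences
eps = 1e-6
numerical = (cost(w2 + eps, a1, y) - cost(w2 - eps, a1, y)) / (2 * eps)

print(analytic, numerical)  # the two values should agree closely
```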

To adjust w1, we can follow the same principle, but we have to go back even further.

z_1=w_1x_1 \; \rightarrow \; a_1=\sigma(z_1)\; \rightarrow \; z_2 = w_2 a_1 \; \rightarrow \; \hat y = \sigma(z_2) \; \rightarrow \; J = \frac{1}{n} \sum_{i=1}^{n} L(\hat y, y)

This gives us the following backward pass.

\frac{dJ}{dw_1} = \frac{dJ}{d\hat y} \frac{d\hat y}{ dz_2} \frac{ dz_2} {da_1}\frac {da_1}{ dz_1} \frac {dz_1}{ dw_1}

I’ll leave the differentiation as an exercise to you. We’ve already evaluated the first two derivatives in the chain in the previous section. So you just have to evaluate the last three elements in the chain based on the equations above.

Once we have calculated the gradient of J with respect to w2 using the chain rule, we can subtract it from the original weight w2 to move in the direction of the minimum value of the cost function. But in a non-linear function, the gradient will be different at every point along the function. Therefore, we can’t just calculate the gradient once and expect it to lead us straight to the minimum value. Instead, we need to take a very small step in the direction of the current gradient, recalculate the gradient based on the new location, take a step in that direction, and repeat the process.

In fact, subtracting the gradient as-is from the weight would likely result in a step that is too big. Before subtracting, we therefore multiply the derivative by a small value α called the learning rate. Without the learning rate, the weights change too quickly and the network won't learn properly.

w_{2new} = w_2 - \alpha\frac{dJ}{dw_2}

Similarly, we perform the weight adjustment on w_1:

w_{1new} = w_1 - \alpha\frac{dJ}{dw_1}

We repeat this process many times over until we find a local minimum.
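Putting backpropagation and gradient descent together, here is a toy training loop for the two-neuron example (bias omitted; input, label, initial weights, and learning rate are all made up). It uses the standard simplification dJ/dz2 = ŷ − y for a sigmoid output paired with the cross-entropy loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up single training example and initial weights
x, y = 1.5, 1.0
w1, w2 = 0.1, 0.1
alpha = 0.5  # learning rate

def forward(w1, w2, x, y):
    z1 = w1 * x
    a1 = sigmoid(z1)
    z2 = w2 * a1
    y_hat = sigmoid(z2)
    cost = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return z1, a1, z2, y_hat, cost

_, _, _, _, cost_before = forward(w1, w2, x, y)

for _ in range(50):
    z1, a1, z2, y_hat, cost = forward(w1, w2, x, y)
    # backward pass: dJ/dz2 = y_hat - y for sigmoid + cross-entropy
    dz2 = y_hat - y
    dw2 = dz2 * a1
    da1 = dz2 * w2
    dz1 = da1 * a1 * (1 - a1)
    dw1 = dz1 * x
    # gradient descent updates
    w2 -= alpha * dw2
    w1 -= alpha * dw1

_, _, _, _, cost_after = forward(w1, w2, x, y)
print(cost_before, cost_after)  # the cost decreases over the iterations
```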

Note that this is a highly simplified calculation. In a fully connected neural network, you would have several neurons in each layer, and each neuron would connect to each neuron in the next layer. When going back, you have to calculate the derivative along each connection to the next layer and add them all up to obtain the correct weight adjustment term. For example, if the activation a_1 feeds into two neurons computing z_3 and z_4, the contributions are summed:

\frac{dJ}{da_1} = \frac{dJ}{dz_3}\frac{d z_3}{d a_1} + \frac{dJ}{dz_4}\frac{d z_4}{d a_1}

Luckily, you won’t ever have to do this by hand. Deep learning frameworks take care of it for you. Understanding the general principles of backpropagation at the level of detail discussed here is sufficient for a practitioner.

In this post, we develop an understanding of how neural networks learn new information.

**Neural networks learn by propagating information through one or more layers of neurons. Each neuron processes information using a non-linear activation function. Outputs are gradually nudged towards the expected outcome by combining input information with a set of weights that are iteratively adjusted through a process called backpropagation.**

The combination of many layers of neurons performing nonlinear data transformations allows the network to learn complex nonlinear relationships in the input data.

Neural networks can be regarded as algorithms that approximate a mathematical function f*(x). Given a set of training data x and the corresponding output y, we hypothesize that some function exists that takes x as input and produces y.

f^\ast(x) = y

The basic idea of a neural network (and supervised machine learning in general) is to let the algorithm find an approximation to that function through training.

In deep learning, neural networks usually consist of multiple layers. We can treat each layer as a separate function that applies some transformation to the data. Depending on the number of layers, our neural network attempts to approximate f*(x) through a series of nested functions.

f^1(f^2(f^3(x))) = f^\ast(x) = y

How does a neural network construct a function that allows it to determine whether an image contains a cat or to represent an English sentence in French?

To understand how this works, let’s have a high-level look at a single neuron processing a single data point.

First, we multiply our data with a randomly chosen weight and add a randomly chosen bias. The result is our input to the neuron. We need the bias and the weight so that we can influence the result produced by the neuron. This is crucial to learning.

weight \times data + bias = input

Next, we send the result through a neuron, which represents a non-linear activation function. The activation function transforms our input into a value in a specific range (usually between 0 and 1).

Now, we compare the output produced by the nonlinear activation function to the expected output. We measure the difference using a cost function. The larger the discrepancy between the produced and the expected output, the larger the imposed cost.

Finally, we send the information on how far off the produced result is from the expected back to the beginning. We use this information to slightly adjust the weight and the bias. Through a process called backpropagation, which is based on the chain rule of differentiation, we can gradually adjust the weights and biases in the desired direction. Then we repeat the process with the updated weights and biases until our produced output is reasonably close to the expected output.

In a deep neural network, you have thousands of these neurons stacked in multiple layers. How do these neurons interact to produce an output? And why do we need a non-linear activation function?

Now that we have a high-level understanding of the basic principles by which a neural network learns, we can dive into the details and address these questions.

If you are familiar with logistic regression, you are in a very good position to understand neural networks. A logistic regression model is basically a neural network with only one layer and one neuron.

We still multiply our input data with a weight and add a bias. From here on, we will adopt the convention of calling the input data x, the weight w, and the bias b. Together they produce the output z.

z = wx + b

Since we are dealing with vectors of values rather than single values, w, x, b, and z all become vectors or matrices. For reasons that will become clear when we deal with neural networks, we structure our weight vector so that its transpose will be multiplied by x.

z = w^Tx + b

Next, we send the output z through a sigmoid activation function, which is a non-linear function that transforms an input to a value between 0 and 1.

\sigma(z) = \frac{1}{1+e^{-z}}

As you can see on the graph, very large positive values result in an output close to 1, while very negative values result in an output close to 0.

Although the analogy is imperfect and neural networks are only loosely inspired by the brain, you can think of it as a biological brain cell. Once a brain cell receives enough stimulus, it will fire off an electric signal. Here, z is like the stimulus that is supplied to the cell, while the output produced by the cell is like the signal that is further transmitted. Therefore, the sigmoid function is called an activation function. It decides whether the cell will activate and fire off a signal or not.

The neuron accepts the data as input, multiplies it with a weight and adds a bias, and finally transforms it to a value between 0 and 1 by sending it through a sigmoid activation function. The resulting signal y hat is compared to the expected outcome y.

With that knowledge in place, we can move on to constructing a neural network.

If you want to learn more about logistic regression and how the whole learning process works, check out my post on logistic regression.

In a neural network, you have several of the previously described neurons stacked on top of each other in layers. A network may consist of several layers.

For the sake of simplicity, let’s look at a very simple neural network with two layers and three neurons in the first layer.

Each observation in the dataset is sent through all the neurons in the first layer. This means each neuron needs a vector of weights whose length equals the number of entries in the input vector. Otherwise, we wouldn't be able to multiply the weights with the input.

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}

But we don’t just have one neuron but three. We can stack all of them in a matrix W.

W = \begin{bmatrix} w_1^1 & w_1^2 & w_1^3 \\ w_2^1 & w_2^2 & w_2^3 \end{bmatrix}

**The superscript represents the neuron the weight belongs to**. *I’m violating mathematical convention here because the superscript usually represents a power operation. But for the rest of this post, we will use it to note the neuron in a given layer.*

To perform the multiplication of the input data with the weights across all observations and neurons in the layer, we transpose the matrix.

W^Tx = \begin{bmatrix} w_1^1 & w_2^1 \\ w_1^2 & w_2^2 \\ w_1^3 & w_2^3 \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} w_1^1x_1 + w^1_2x_2\\ w_1^2x_1 + w^2_2x_2\\ w_1^3x_1 + w^3_2x_2\\ \end{bmatrix}

If you don’t know how matrix multiplication works, check out my post on the topic.

Lastly, we can also add the bias term, which gives us the output z as a vector with 3 entries.

W^Tx + b = \begin{bmatrix} w_1^1 & w_2^1 \\ w_1^2 & w_2^2 \\ w_1^3 & w_2^3 \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b^1 \\ b^2 \\ b^3 \end{bmatrix} = \begin{bmatrix} (w_1^1x_1 + w^1_2x_2) + b^1\\ (w_1^2x_1 + w^2_2x_2) + b^2\\ (w_1^3x_1 + w^3_2x_2) + b^3\\ \end{bmatrix} = \begin{bmatrix} z^1\\ z^2\\ z^3\\ \end{bmatrix} = z

So now we have a z value for each neuron. From logistic regression, we know that we need to send each z value through a sigmoid function to obtain the predicted value y hat. But in this case, we have not yet arrived at y hat because there is still another layer. Instead, we use an intermediate value a.

a = \begin{bmatrix} a_1\\ a_2\\ a_3\\ \end{bmatrix} = \begin{bmatrix} \sigma(z^1)\\ \sigma(z^2)\\ \sigma(z^3)\\ \end{bmatrix}

*Now that we have arrived at a, we no longer need to keep track of neurons and observations in the input data, so we simply denote a as a vector with three entries marked by a subscript and dispense with the superscript notation.*

To represent the operations mathematically so far, we relied on the subscript to represent the entry in the input vector and the superscript to represent the neuron in the current layer.

But since we have more than two layers, we have to introduce a third dimension to represent the layer. We denote the layer in square brackets as follows.

a^{[1]} = \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]}\\ a_3^{[1]}\\ \end{bmatrix}

This means the vector a represents the intermediate output of the first layer.

To arrive at our final prediction y hat, we need to pass the vector through the same procedure for the next layer, which only consists of one neuron:

- multiply a with a vector of weights associated with the neuron
- add a bias associated with the neuron
- pass the resulting term through a non-linear activation function

With the layer notation in place, we can succinctly describe these operations mathematically as follows:

z^{[2]} = W^{[2]t} a^{[1]} + b^{[2]}

\hat y = \sigma(z^{[2]})

*I’ve adopted this convention from Andrew Ng’s Deep Learning course on Coursera.*

We can equally express the full neural network operation from start to finish.

First, we send the input data through the first layer, multiplying with the weights, adding a bias, and passing the whole expression through the sigmoid function.

z^{[1]} = W^{[1]t} x + b^{[1]}

a^{[1]} = \sigma({z^{[1]}})

Then we pass the result from the first layer through the second layer, which gives us the output y hat.

z^{[2]} = W^{[2]t} a^{[1]} + b^{[2]}

\hat y = \sigma(z^{[2]})
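The full forward pass can be sketched in NumPy as follows. The weight values are made up; the shapes match the 2-input, 3-neuron, 1-neuron architecture from this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])  # two input values

# layer 1: three neurons; one column of weights per neuron
W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])  # shape (2, 3)
b1 = np.array([0.0, 0.0, 0.0])

# layer 2: a single output neuron
W2 = np.array([[0.7], [0.8], [0.9]])  # shape (3, 1)
b2 = np.array([0.0])

z1 = W1.T @ x + b1   # shape (3,)
a1 = sigmoid(z1)
z2 = W2.T @ a1 + b2  # shape (1,)
y_hat = sigmoid(z2)

print(y_hat)  # a single probability between 0 and 1
```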

In this example, we’ve used a 1-dimensional vector for x rather than a 2-dimensional matrix. As a consequence, the math was a bit easier to follow. But a real training dataset will consist of many training examples and multiple features. Accordingly, the input becomes higher-dimensional. For example, if your input consists of RGB images, then you will have 4 dimensions representing the number of images, the pixels on the x-axis, the pixels on the y-axis, and the color channels.

Your input layer will always have the same dimension as the input dataset. Your weight matrix needs to have a dimensionality that allows you to multiply it with the input data and to pass it through the next layer.

In practice, this means your weight matrix's number of rows has to be equivalent to the number of units in the next layer, while its number of columns needs to be equivalent to the number of rows in the input dataset.

While modern deep learning frameworks like Tensorflow will take care of the internal calculation, you still need to be aware of how the calculations affect the dimensionality of your data as it passes through the layers.

In the previous sections, I’ve referred to non-linear activation functions. But why do we actually need non-linear functions?

If you only apply linear functions, your network's output will always be a linear function of the input, no matter how many layers you stack. In a machine learning context, this means you won't be able to represent non-linear decision boundaries. Our goal in training a neural network is to model a function that represents a non-linear decision boundary.

The combination of multiple non-linear activation functions allows a neural network to learn extremely complex non-linear relationships in data.

In previous sections, we’ve always used the sigmoid function as an example for a non-linear activation function. In fact, there are several more functions, such as the hyperbolic tangent, the rectified linear unit, and the leaky rectified linear unit. In most cases, these functions are a better fit for neural network activation than the sigmoid function. I will introduce these functions and discuss their merits in a future post.

Once you’ve sent your data through the neural network in the forward pass and calculated the output y hat, you need to measure how far the predictions diverge from the expected output y. To force the neural network to improve its predictions, you impose a cost on it that is a function of the difference between the predicted output y hat and the expected output y.

During neural network training, this is done using a cost function. Since you want to adjust the weights and the biases to improve the learning outcome, the cost is defined as a function of the weights and the biases.

J(W, b)

The cost is calculated from the differences between all predicted and expected outcomes. The difference between one predicted and one expected outcome is known as the loss.

L(\hat y, y)

To get to the cost, we need to calculate the average of all losses in our dataset.

J(W, b) = \frac{1}{n} \sum_{i=1}^{n} L(\hat y_i, y_i)

But how do you actually measure the difference between predicted and expected outcomes?

In neural networks for binary classification, a common measure for the loss is the cross-entropy, which is also known as the negative log-likelihood.

L = -(y \; log(\hat y) + (1-y)log(1-\hat y))

There are several different loss functions and which one you choose depends on the type of machine learning problem you are facing. Here is a great overview.

The ultimate step in neural network learning is backpropagation. I will discuss backpropagation on a conceptual level without going into the mathematical details since it involves advanced calculus. You don’t need to understand these details when building deep learning systems in practice. Frameworks like Tensorflow will take care of it for you.

The goal of backpropagation is to adjust the parameters (weights and biases) of your neural network to minimize the cost function. This means you need to communicate the cost that you calculate at the end of forward propagation back to the very beginning so you can adjust the original weights and biases.

Using derivatives from calculus, we can calculate how the change in one variable affects a change in another one.

For example, to figure out how a change in a weight w affects the cost, we can take the derivative of the cost with respect to w (the procedure is similar for the biases).

\frac{dJ}{dw}

This expression is known as the gradient. The weight vector w can be visualized as a vector in a multidimensional space. The gradient points in the direction of the steepest ascent; its negative tells us in what direction we have to adjust the values of w in order to reduce the cost J. To learn more about this procedure, check out my post on gradient descent.

With gradient descent, we repeatedly calculate the gradient and subtract it from the weight vector. Then we move through forward propagation with the adjusted weights, which should result in a reduced cost. This process is repeated until we reach a small enough cost.

w_{new} = w - l \frac{dJ}{dw}

The term l is the learning rate that determines the speed with which the neural network learns. It is a very small term since subtracting the full gradient will lead the network to overshoot.

The original weights and biases that we want to adjust are only related to the cost through a series of intermediate transformations defined by the layers in the neural network. For example, in the 2-layer neural network we defined above, w is transformed through the following equations, which define forward propagation followed by the cost function.

z^{[1]} = W^{[1]t} x + b^{[1]}

a^{[1]} = \sigma({z^{[1]}})

z^{[2]} = W^{[2]t} a^{[1]} + b^{[2]}

\hat y = \sigma(z^{[2]})

J = \frac{1}{n} \sum_{i=1}^{n} L(\hat y, y)

We have to backpropagate the error captured by the cost function through all the intermediate steps to correctly adjust the initial weights and biases. This is where the chain rule from calculus comes into play. The chain rule allows us to take the derivative of the cost with respect to the original W by chaining derivatives of the intermediate values together.

\frac{dJ}{dW^{[1]}} = \frac{dJ}{d\hat y} \frac{d\hat y}{ dz^{[2]} } \frac{ dz^{[2]} }{da^{[1]}} \frac{da^{[1]}}{ dz^{[1]} } \frac{ dz^{[1]} }{dW^{[1]}}

The chain rule is at the heart of backpropagation. Now, you have a conceptual understanding of how a neural network learns information.

In this post, we develop an understanding of the hinge loss and how it is used in the cost function of support vector machines.

**The hinge loss is a specific type of cost function that incorporates a margin or distance from the classification boundary into the cost calculation. Even if new observations are classified correctly, they can incur a penalty if the margin from the decision boundary is not large enough. The hinge loss increases linearly.**

The hinge loss is mostly associated with soft-margin support vector machines.

If you are familiar with the construction of hyperplanes and their margins in support vector machines, you probably know that margins are often defined as having a distance equal to 1 from the data-separating hyperplane. We want data points to not only fall on the correct side of the hyperplane but also to be located beyond the margin.

Support vector machines address a classification problem where observations have an outcome of either +1 or -1. The support vector machine produces a real-valued output that is negative or positive depending on which side of the decision boundary the observation falls. Only if an observation is classified correctly and its distance from the plane is larger than the margin will it incur no penalty. The distance from the hyperplane can be regarded as a measure of confidence: the further an observation lies from the plane, the more confident the classification.

For example, if an observation was associated with an actual outcome of +1, and the SVM produced an output of 1.5, the loss would equal 0.

An observation that is located directly on the boundary would incur a loss of 1 regardless of whether the real outcome was +1 or -1.

Observations that fall on the correct side of the decision boundary (hyperplane) but are within the margin incur a cost between 0 and 1. For example, if the actual outcome was 1 and the classifier predicted 0.5, the corresponding loss would be 0.5 even though the classification is correct.

All observations that end up on the wrong side of the hyperplane incur a loss greater than 1, which increases linearly with the distance from the margin.

Now that we have a strong intuitive understanding of the hinge loss, understanding the math will be a breeze.

The loss is defined according to the following formula, where t is the actual outcome (either 1 or -1), and y is the output of the classifier.

l(y) = max(0, 1 -t \cdot y)

Let’s plug in the values from our last example. The outcome was 1, and the prediction was 0.5.

l(y) = \max(0, 1 - 1 \cdot 0.5) = 0.5

If, on the other hand, the outcome was -1, the loss would be higher since we’ve misclassified our example.

l(y) = \max(0, 1 - (-1) \cdot 0.5) = 1.5
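The two worked examples above are easy to verify in a few lines of Python. This is a minimal sketch; the function name `hinge_loss` is chosen here for illustration:

```python
def hinge_loss(t, y):
    """Hinge loss for a true label t (+1 or -1) and a raw classifier output y."""
    return max(0.0, 1.0 - t * y)

print(hinge_loss(1, 1.5))   # 0.0 – correct and beyond the margin, no penalty
print(hinge_loss(1, 0.5))   # 0.5 – correct but inside the margin
print(hinge_loss(-1, 0.5))  # 1.5 – misclassified, loss grows linearly
```

An observation sitting exactly on the decision boundary (`y = 0`) yields a loss of 1 for either label, matching the intuition above.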

Instead of using a labelling convention of -1 and 1, we could also use 0 and 1 and apply the cross-entropy formula, in which one of the two terms becomes zero. But the math works out more elegantly in the former case.

With the hinge loss defined, we are now in a position to understand the loss function for the support vector machine. But before we do this, we’ll briefly discuss why and when we actually need a cost function.

In a hard margin SVM, we want to linearly separate the data without misclassification. This implies that the data actually has to be linearly separable.

If the data is not linearly separable, hard margin classification is not applicable.

Furthermore, if the margin of the SVM is very small, the model is more likely to overfit. In these cases, we can choose to cut the model some slack by allowing for misclassifications. We call this a soft margin support vector machine. But if the model produces too many misclassifications, its utility declines. Therefore, we need to penalize the misclassified samples by introducing a cost function.

In summary, the soft margin support vector machine requires a cost function while the hard margin SVM does not.

In the post on support vectors, we’ve established that the optimization objective of the support vector classifier is to minimize the squared norm of w, a vector orthogonal to the data-separating hyperplane onto which we project our data points.

\min_{w} \frac{1}{2} \sum^n_{i=1}w_i^2

This minimization problem represents the primal form of the hard margin SVM, which doesn’t account for classification errors.

For the soft-margin SVM, we combine the minimization objective with a loss function such as the hinge loss.

\min_{w} \frac{1}{2} \sum^n_{i=1}w_i^2 + \sum^m_{j=1} \max(0, 1 - t_j \cdot y_j)

The first term sums over the number of features (n), while the second term sums over the number of samples in the data (m).

The variable t_j is the actual outcome for sample j, while y_j is the output produced by the model as a product of the weight vector w and the data input x_j.

y_j = w^Tx_j

To understand how the model generates this output, refer to the post on support vectors.

The loss term has a regularizing effect on the model. But how can we control the regularization? That is, how can we control how aggressively the model should try to avoid misclassifications? To control the penalty for misclassifications during training, we introduce an additional parameter, C, which we multiply with the loss term.

\min_{w} \frac{1}{2} \sum^n_{i=1}w_i^2 + C\sum^m_{j=1} \max(0, 1 - t_j \cdot y_j)

The smaller C is, the stronger the regularization. Accordingly, the model will attempt to maximize the margin and be more tolerant towards misclassifications.

If we set C to a large number, then the SVM will pursue outliers more aggressively, which potentially comes at the cost of a smaller margin and may lead to overfitting on the training data. The classifier might be less robust on unseen data.
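To make the role of C concrete, here is a small sketch in plain Python (the data points and the function name are made up for illustration) that evaluates the soft-margin objective for a fixed weight vector under two settings of C:

```python
def soft_margin_objective(w, b, X, t, C):
    """0.5 * ||w||^2 plus C times the sum of hinge losses over all samples."""
    reg = 0.5 * sum(w_i ** 2 for w_i in w)
    loss = sum(
        max(0.0, 1.0 - t_j * (sum(w_i * x_i for w_i, x_i in zip(w, x)) + b))
        for x, t_j in zip(X, t)
    )
    return reg + C * loss

X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]  # the last point lies inside the margin
t = [1, -1, 1]
w, b = [1.0, 0.0], 0.0

print(soft_margin_objective(w, b, X, t, C=0.1))   # ≈ 0.55 – weak penalty
print(soft_margin_objective(w, b, X, t, C=10.0))  # ≈ 5.5  – strong penalty
```

With a small C, the margin violation of the third point barely affects the objective; with a large C, the same violation dominates it, pushing the optimizer toward a hyperplane that chases that point.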

The hinge loss is a special type of cost function that not only penalizes misclassified samples but also correctly classified ones that are within a defined margin from the decision boundary.

The hinge loss function is most commonly employed to regularize soft margin support vector machines. The degree of regularization determines how aggressively the classifier tries to prevent misclassifications and can be controlled with an additional parameter C. Hard margin SVMs do not allow for misclassifications and do not require regularization.

In this post, we will develop an understanding of support vectors, discuss why we need them, how to construct them, and how they fit into the optimization objective of support vector machines.

A support vector machine classifies observations by constructing a hyperplane that separates these observations.

**Support vectors are observations that lie on the margin surrounding the data-separating hyperplane. Since the margin defines the minimum distance observations should have from the plane, the observations that lie on the margin impact the orientation and position of the hyperplane.**

When data is linearly separable, there are several possible orientations and positions for the hyperplane.

In the illustration above, all of the hyperplanes perfectly separate the observations. Which of them is the best choice?

In addition to separating the training data, we also want the classifier to maximize the overall distance between the hyperplane and the points. This gives us a certain margin that maximizes our confidence in the prediction. If all the training points maintain a certain minimum distance from the hyperplane, we can be more confident that any new observations that maintain that minimum distance are classified correctly.

If, on the other hand, some observations are very close to the plane, new observations with similar characteristics that deviate slightly could well end up on the other side of the hyperplane. Therefore, the red hyperplane is a better choice than the green hyperplanes.

The observations that are closest to the hyperplane are especially important because they lie directly on the margin. They influence the orientation and position of the hyperplane the most and determine how wide the margin is.

If we add just one observation that lies closer to the hyperplane than the current support vectors, the hyperplane may change significantly.

The observations closest to the plane support the plane and hold it in place. It is a bit like the essential pillars that support a roof.

You construct hyperplanes by maximizing a margin around the hyperplane. You find the margin and, by extension, the position of the hyperplane by finding the minimum distance between the plane and the closest examples.

How do you find the minimum distance to the closest observation?

The minimum distance between an observation x_1 and the hyperplane can be measured along a line that is orthogonal to the plane and goes through x_1. We call this orthogonal line w. To simplify the example, we assume we’ve already determined that x_1 and x_2 are the support vectors (the points closest to the hyperplane) on each side and that the hyperplane also goes through the origin.

The point x_1’s coordinates constitute a vector from the origin. For example, if the point x_1 has coordinates [3,5], the associated vector will equal [3,5].

x_1 = [3,5]

Assuming that your hyperplane also goes through the origin, you can find the shortest distance between x_1 and the plane by performing a vector projection p of x1 onto the orthogonal vector w.

If you are not familiar with vector projections, you can check out my blog post on them. You calculate the projection p by taking the dot product between the transpose of w and x_1 and dividing by the norm of w.

p = \frac{w^Tx_1}{||w||}
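As a quick numeric check of this projection formula, here is a sketch in Python. The vector w = [3, 4] is chosen for illustration because its norm is exactly 5:

```python
import math

def project_onto(w, x):
    """Signed length of the projection of x onto w: (w . x) / ||w||."""
    dot = sum(w_i * x_i for w_i, x_i in zip(w, x))
    norm = math.sqrt(sum(w_i ** 2 for w_i in w))
    return dot / norm

w = [3.0, 4.0]   # ||w|| = sqrt(9 + 16) = 5
x1 = [3.0, 5.0]  # the example point from above
print(project_onto(w, x1))  # (9 + 20) / 5 = 5.8
```

The sign of the result tells us which side of the hyperplane the point falls on, which is exactly why the projection for x_2 below picks up a negative sign.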

We need to repeat the same procedure for the support vector x_2 on the other side of the hyperplane. Essentially, that gives us the following expression.

-p = \frac{w^Tx_2}{||w||}

Note that we can arbitrarily scale the vector w. This allows us to get rid of the division by simply scaling w to unit length. That way, the norm of w equals 1, and the division becomes obsolete.

p = \frac{w^Tx_1}{1} = w^Tx_1

To remove the constraint that w has to go through the origin, we can add an intercept term b. Now we end up with the characteristic equation for a line that defines our margin m.

Pos\; Margin:\; m = w^Tx_1 + b

Accordingly, the margin on the other side of the hyperplane is defined by

Neg\; Margin:\; -m = w^Tx_2 + b

We can use the same equation to find the hyperplane. The hyperplane lies exactly in the middle between the two margins. Therefore, its equation equals zero (you can verify this by adding up the two margin equations).

Hyperplane:\; w^Tx + b = 0

We now know how the support vectors help construct the margin by finding a projection onto a vector that is perpendicular to the separating hyperplane. We also know how to find the equation that defines our margins and our hyperplane. But how do we find the optimal margin and the corresponding support vectors?

To increase the confidence in our predictions, we want to position the hyperplane so that the margin around the plane is maximized while the overall distance of the points from the plane is minimized. Essentially, we need to set this up as an optimization problem. Unfortunately, an optimization that involves both maximizing one term and minimizing another is very complex.

To make optimization easier, a common approach is to scale the equation defining our margin so that the margin equals 1 and -1, respectively. To achieve this goal, we scale the vector w and the entire vector space until the margin m equals 1.

w^Tx_1 + b = 1

Scaling w implies that the length of w (||w||), which we previously eliminated by setting it equal to 1, will change as well. To maintain the original form of the equation, we therefore have to correct our estimates by a factor of

\frac{1}{||w||}

Furthermore, all future data points x_n should lie outside the margin on the side corresponding to their class, which results in the following constraints:

w^Tx_n + b \geq 1 \quad or \quad w^Tx_n + b \leq -1
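These constraints are straightforward to check in code. The following sketch (with hypothetical weights and points) tests whether a labelled point lies beyond the margin on the side matching its class:

```python
def satisfies_margin(w, b, x, label):
    """True if the point lies beyond the margin on the side matching its label."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return score >= 1 if label == 1 else score <= -1

w, b = [1.0, 0.0], 0.0
print(satisfies_margin(w, b, [2.0, 1.0], 1))    # True:  score 2 >= 1
print(satisfies_margin(w, b, [0.5, 1.0], 1))    # False: inside the margin
print(satisfies_margin(w, b, [-3.0, 0.0], -1))  # True:  score -3 <= -1
```

Points for which this check fails are exactly the ones the soft-margin cost function penalizes.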

By setting the margin to a fixed value and constraining the data points to simply lie beyond that margin, we have simplified the optimization objective significantly. To arrive at our optimal hyperplane, all we have to do is maximize the correction factor.

\max\left(\frac{1}{||w||}\right)

Maximizing the inverse of the norm is equivalent to minimizing the norm. Remember that the norm is a quadratic expression. In a two-dimensional vector space, the norm or length of a vector can be calculated according to the Pythagorean theorem using the coordinates w_1 and w_2.

\sqrt{w_1^2 + w_2^2} = ||w||

To save ourselves the calculation of the square root of a potentially very large term, we can also optimize the squared norm. More generally, we then want to optimize the following term.

\min_{w} \frac{1}{2} \sum^n_{i=1}w_i^2

The 1/2 is a convenience term introduced to simplify the calculation when taking the derivative.

I will cover further details of the optimization algorithm and the associated cost function in my next post.

Support vectors are observations that lie closest to a separating hyperplane applied by support vector machines. They help determine the position and orientation of the plane to maximize the margin around the plane.

Support vectors and their associated distances from the hyperplane are found by projecting the observations closest to the plane onto a vector perpendicular to the plane. The overall positioning is achieved by mathematical optimization.
