Summary

This post covers the first part of my project on the fundamentals of deep learning. Here I’ll discuss the basic components of deep neural networks, along with implementations of each component in Python. These will include basic methods for model initialization, forward propagation, backpropagation, and parameter optimization. In later posts, I plan on discussing and implementing recent methods from deep learning research. The primary goal of this project is to formalize fundamental concepts in deep learning and create a framework for implementing and testing ideas from machine learning research.

Note: Project code can be accessed here: https://github.com/AlTheEngineer/Deep-Learning-Fundamentals

Citation: The majority of the work presented here is based on what I learned from Andrew Ng’s Deep Learning Specialization. Any apparent similarities in code and/or notation are not coincidental.

The Problems We Try To Solve

Before we dive straight into deep learning, it is important to think about what deep learning models can be used for. For now, we will focus on one type of problem that deep learning tries to solve: supervised learning problems. In supervised learning, we are given a data set of inputs and their corresponding correct outputs. We use this data set to train our deep learning model. Afterwards, we use the trained model to look at previously unseen inputs and predict their correct outputs. An analogy for supervised learning would be a student who prepares for an exam by studying a set of solved exam questions, assuming that all possible exam questions will come from the same book.

Note: The book analogy refers to the assumption that both the training and test sets come from the same data-generating distribution.

Model Representation

Let the input to the model be \( X \). The matrix \( X \) consists of \( m \) training examples, and each example \( \mathbf{x} \) consists of \( n_{x} \) features. The i-th training example is denoted as \( \mathbf{x}^{i} \), while the j-th feature of that training example is denoted as \( x_{j}^{i} \), as illustrated below.

\[X = \begin{pmatrix} x_{1}^{1} & \dots & x_{1}^{m} \\ \vdots & \ddots & \vdots \\ x_{n_{x}}^{1} & \dots & x_{n_{x}}^{m} \end{pmatrix}\]

The model has \( L \) layers. Each layer contains a set of weights and biases. Let the matrix \( W^{[l]} \) represent the weights of the l-th layer and the vector \( \mathbf{b}^{[l]} \) represent the biases of the l-th layer.

Let the output of the model be \( Y \). The matrix \( Y \) consists of \( m \) outputs, one for each training example \( \mathbf{x} \). In this tutorial, each \( \mathbf{y} \) will be a scalar (i.e. a single number).
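To make the shapes concrete, here is a small illustration with made-up dimensions (the values themselves are arbitrary random numbers):

import numpy as np

# Hypothetical dimensions: m = 5 training examples, n_x = 3 features each
m, n_x = 5, 3
X = np.random.randn(n_x, m)               # shape (n_x, m): one column per training example
Y = np.random.randint(0, 2, size=(1, m))  # shape (1, m): one scalar output per example

print(X.shape)  # (3, 5)
print(Y.shape)  # (1, 5)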

Initialization

Before a model can be trained on a data set, its weights must be set to initial values. This is called weight initialization. It might seem that setting all weight values to zero would be a simple solution. However, this would essentially mean that every neuron in a given layer \( l \) would learn the same thing, so the model would be no different from one having a single hidden unit (neuron) in each layer. Hence, we need to initialize the weights in a way that breaks symmetry between a layer’s units. This can be achieved by randomly initializing the weights using some distribution. For the purposes of our model, we choose an initialization method that works well for the activation function we will use in the hidden layers (ReLU). This method is known as He initialization and uses a Gaussian distribution with zero mean and a variance of

\[\mathrm{Var}(W^{[l]}) = \frac{2}{n_{l-1}}\]

where \( n_{l-1} \) denotes the number of units in the previous layer \( l-1 \), i.e. the number of inputs to each unit in layer \( l \). This matches the \( \sqrt{2 / n_{l-1}} \) scaling factor in the code below. Here is an implementation of this method:

import numpy as np

def initialize_params_he(net_dims):
    params = {}
    L = len(net_dims)            # number of layers in the network, including input
    # Initialize params for each layer of the network, excluding the input layer
    for l in range(1, L):
        params['W' + str(l)] = np.random.randn(net_dims[l], net_dims[l-1]) * np.sqrt(2 / net_dims[l-1])
        params['b' + str(l)] = np.zeros((net_dims[l], 1))

    return params

where the variable net_dims is a list containing the number of units in each layer, including the input layer.
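As a quick usage sketch (the layer sizes here are made up for illustration):

net_dims = [3, 4, 1]   # hypothetical network: 3 inputs, 4 hidden units, 1 output
params = initialize_params_he(net_dims)

print(params["W1"].shape)  # (4, 3)
print(params["b1"].shape)  # (4, 1)
print(params["W2"].shape)  # (1, 4)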

There are various other initialization methods used in deep learning and it is currently an active area of research. Further discussion on commonly used methods will be in another post.

Forward Propagation

Forward propagation is the sequence of computations that a model performs on an input to produce a prediction. Each layer takes the output from the previous layer and performs two main computation steps:

  • Linear combination
  • Activation

The linear combination step can be expressed as

\[Z^{[l]} = W^{[l]}A^{[l-1]} + \mathbf{b}^{[l]}\]

where \( A^{[l-1]} \) denotes the activation output of the previous layer.

Note: In the first layer, \( A^{[l-1]} \) would represent the input \( X \) (i.e. \( A^{[0]} = X \)).

Here is an implementation of the linear combination step:

def linear_combination(A_prev, W, b):
    Z = W.dot(A_prev) + b
    return Z

The activation step can be expressed as

\[A^{[l]} = g^{[l]}(Z^{[l]})\]

where \( g^{[l]}(\cdot) \) denotes an activation function that is applied element-wise to \( Z^{[l]} \). There are several types of activation functions that can be used in deep learning. For this project, we use the ReLU activation function for the hidden layers, which can be expressed as

\[g(z_{j}^{[l](i)}) = \begin{cases} 0 & \text{if $z_{j}^{[l](i)} \leq 0$} \\ z_{j}^{[l](i)} & \text{otherwise} \end{cases}\]

where \( z_{j}^{[l](i)} \) denotes an element in \( Z^{[l]} \), and \( g() \) denotes the ReLU activation function. Here is an implementation:

def relu(Z):
    A = np.maximum(0,Z)
    return A

In the final (output) layer, we use the sigmoid activation function, which can be expressed as

\[g(z_{j}^{[L](i)}) = \frac{1}{1 + e^{-z_{j}^{[L](i)}}}\]

where \( z_{j}^{[L](i)} \) is an element in \( Z^{[L]} \) and \( g() \) denotes the sigmoid activation function. Here is an implementation:

def sigmoid(Z):
    A = 1/(1+np.exp(-Z))
    return A

Forward propagation involves performing these two computations successively through the model’s hidden layers. The output of the final layer is the model’s prediction. Here is a Python implementation:

def forward_propagation(X, parameters):
    cache_list = []
    L = len(parameters) // 2     # number of layers (each layer has a W and a b)
    A = X                        # the input acts as the "activation" of layer 0

    # Hidden Layer Forward Propagation (ReLU activations)
    for l in range(1, L):
        # grab the weight matrix and bias vector of the layer
        W = parameters["W" + str(l)]
        b = parameters["b" + str(l)]
        A_prev = A
        # compute linear combination
        Z = linear_combination(A_prev, W, b)
        # cache values for later use in backprop
        cache_list.append((A_prev, W, b, Z))
        # compute non-linear activation
        A = relu(Z)

    # Output Layer Forward Propagation (sigmoid activation)
    W = parameters["W" + str(L)]
    b = parameters["b" + str(L)]
    A_prev = A
    Z = linear_combination(A_prev, W, b)
    AL = sigmoid(Z)
    cache_list.append((A_prev, W, b, Z))

    return AL, cache_list

Note: Notice how we cache \( A^{[l-1]} \), \( W^{[l]} \), \( \mathbf{b}^{[l]} \), and \( Z^{[l]} \) values. This is because we will re-use them during back propagation.
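Tying the pieces so far together, here is a small end-to-end check on random data (the dimensions are made up for illustration):

net_dims = [3, 4, 1]
params = initialize_params_he(net_dims)
X = np.random.randn(3, 5)   # 5 random examples with 3 features each

AL, cache_list = forward_propagation(X, params)
print(AL.shape)         # (1, 5): one sigmoid prediction per example
print(len(cache_list))  # 2: one cache tuple per layer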

Cost Function

The cost function computes the error \( E \) between the model’s prediction \( A^{[L]} \) and the true output \( Y \). The error can be thought of as a distance measure that describes how far the model is from perfectly predicting the output. For our purposes we will use the binary cross-entropy cost function, which is defined as

\[E(A^{[L]}, Y) = - \frac{1}{m} \sum \limits_{i=1}^{m} \left( y^{(i)} \log( a^{[L](i)}) + (1 - y^{(i)}) \log( 1 - a^{[L](i)}) \right)\]

where \( y^{(i)} \) is an element in \( Y \) and \( a^{[L](i)} \) is an element in \( A^{[L]} \). The cross-entropy cost function is useful for classification tasks and can be implemented as:

def compute_error(AL, Y):
    m = Y.shape[1]  # number of examples

    # binary cross-entropy, vectorized over all m examples
    error = -(1./m) * np.sum(np.dot(Y, np.log(AL).T) + np.dot(1 - Y, np.log(1 - AL).T))

    return error
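As a quick sanity check with made-up predictions: confident correct predictions should give a small error, and poor ones a large error:

Y = np.array([[1, 0, 1]])
AL_good = np.array([[0.99, 0.01, 0.99]])
AL_bad = np.array([[0.4, 0.6, 0.4]])

print(compute_error(AL_good, Y))  # ~0.01
print(compute_error(AL_bad, Y))   # ~0.92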

Note: In machine learning, the error is sometimes called the loss.

There are various cost functions that can be used in deep learning, and it is an active area of research. Further discussions on commonly used cost functions will be in a different post.

Backward Propagation

Backward propagation is the sequence of computations that a model performs to quantitatively express the relationships between the model’s error \( E \) and the weights \( W^{[l]} \) and biases \( \mathbf{b}^{[l]} \) in each layer \( l \). It is these relationships that we later use to update the model’s parameters. This is how the model learns from the training data.

In mathematics, we can describe the relationship between two variables using differentiation. The derivative of a variable \( y \) with respect to \( x \) describes the rate of change in \( y \) as \( x \) is changed. This can be expressed as

\[\frac{dy}{dx}\]

In our model, we need to express the derivative of the error with respect to many other variables (activations, weights, biases…etc.). For this we use partial differentiation where the partial derivative of \( y \) with respect to \( x \) describes the rate of change in \( y \) as \( x \) changes, but all other variables are kept constant.
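For example, if \( f(x, y) = x^{2}y \), then \( \frac{\partial f}{\partial x} = 2xy \) (treating \( y \) as a constant) and \( \frac{\partial f}{\partial y} = x^{2} \) (treating \( x \) as a constant).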

First, we express the partial derivative of \( E \) with respect to the model’s output:

\[\frac{\partial E}{\partial A^{[L]}} = -\left( \frac{Y}{A^{[L]}} - \frac{1-Y}{1-A^{[L]}} \right)\]

where \( Y \) is a matrix containing the correct outputs for each training example, and the divisions are applied element-wise.

Then, we need to go backwards and express the error with respect to the linear output of the last layer \( Z^{[L]} \). For this, we use the chain rule for partial differentiation as follows:

\[\frac{\partial E}{\partial Z^{[L]}} = \frac{\partial E}{\partial A^{[L]}} \frac{\partial A^{[L]}}{\partial Z^{[L]}} = \frac{\partial E}{\partial A^{[L]}} \odot \sigma^{'}(Z^{[L]}) = A^{[L]} - Y\]

where \( \sigma' \) denotes the derivative of the sigmoid activation function, which we use for our output layer, with respect to \( Z^{[L]} \). This can be expressed as

\[\frac{\partial A^{[L]}}{\partial Z^{[L]}} = \sigma'(Z^{[L]}) = A^{[L]} \odot (1 - A^{[L]})\]

and can be implemented as:

def sigmoid_gradient(Z):
    A = sigmoid(Z)
    dA_dZ = A * (1 - A)

    return dA_dZ
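As a sanity check, substituting the expressions for \( \frac{\partial E}{\partial A^{[L]}} \) and \( \sigma'(Z^{[L]}) \) into the chain rule recovers the compact form above:

\[\frac{\partial E}{\partial Z^{[L]}} = -\left( \frac{Y}{A^{[L]}} - \frac{1-Y}{1-A^{[L]}} \right) \odot A^{[L]} \odot (1 - A^{[L]}) = -Y \odot (1 - A^{[L]}) + (1 - Y) \odot A^{[L]} = A^{[L]} - Y\]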

Similarly, we then keep propagating the error backwards to express its relationship with the weights and biases of each layer in the model. The expressions can be summarised as follows:

\[\frac{\partial E}{\partial Z^{[l]}} = \frac{\partial E}{\partial A^{[l]}} \odot g^{[l]'}(Z^{[l]})\]

We previously showed how this can be implemented for the output layer (\( Z^{[L]} \)) where there is a sigmoid activation function. However, we use the ReLU activation function for the hidden layers, whose gradient can be expressed as

\[\frac{\partial A^{[l]}}{\partial Z^{[l]}} = relu^{'}(Z^{[l]}) = \begin{cases} 0 & \text{if $z_{j}^{[l](i)} \leq 0$} \\ 1 & \text{otherwise} \end{cases}, \qquad \forall z_{j}^{[l](i)} \in Z^{[l]}\]

where \( relu' \) denotes the derivative of the ReLU activation function, which is applied element-wise in layer \( l \), and it can be implemented as:

def relu_gradient(Z):
    # derivative of ReLU: 1 where Z > 0, 0 elsewhere
    dA_dZ = np.copy(Z)
    dA_dZ[dA_dZ > 0] = 1
    dA_dZ[dA_dZ <= 0] = 0
    return dA_dZ

We continue backpropagating to compute partial derivatives with respect to the weights, biases, and previous activations. This is repeated until we reach the first hidden layer. The expressions can be summarised as:

\[\frac{\partial E}{\partial W^{[l]}} = \frac{1}{m} \frac{\partial E}{\partial Z^{[l]}} A^{[l-1]T}\] \[\frac{\partial E}{\partial \mathbf{b}^{[l]}} = \frac{1}{m} \sum \limits_{i=1}^{m} \frac{\partial E}{\partial Z^{[l]}}\] \[\frac{\partial E}{\partial A^{[l-1]}} = W^{[l]T} \frac{\partial E}{\partial Z^{[l]}}\]

where \( A^{[l-1]T} \) denotes the transpose of matrix \( A^{[l-1]} \); and \( m \) is the number of training examples. And here is an implementation:

def backward_linear(dZ, A_prev, W, b):

    m = A_prev.shape[1]      # number of training examples
    # compute gradient of loss wrt to the weight matrix of the network's given layer l: dL/dWl
    dW = 1./m * np.matmul(dZ, A_prev.T)
    # compute gradient of loss wrt to the bias vector of the network's given layer l: dL/dbl
    db = 1./m * np.sum(dZ, axis=1, keepdims=True)
    # compute gradient of loss wrt to the activations of the previous layer l-1: dL/dAl-1
    dA_prev = np.dot(W.T, dZ)

    return dA_prev, dW, db
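Putting these pieces together, here is a minimal sketch of the full backward pass, assuming the cache layout produced by forward_propagation above. This glue function is my own wiring of the components, not necessarily the exact version in the repo:

def backward_propagation(AL, Y, cache_list):
    gradients = {}
    L = len(cache_list)   # number of layers (one cache tuple per layer)

    # derivative of the cost with respect to the model output
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # output layer (sigmoid activation)
    A_prev, W, b, Z = cache_list[L - 1]
    dZ = dAL * sigmoid_gradient(Z)
    gradients["dA" + str(L - 1)], gradients["dW" + str(L)], gradients["db" + str(L)] = backward_linear(dZ, A_prev, W, b)

    # hidden layers (ReLU activations), walking backwards through the cache
    for l in reversed(range(1, L)):
        A_prev, W, b, Z = cache_list[l - 1]
        dZ = gradients["dA" + str(l)] * relu_gradient(Z)
        gradients["dA" + str(l - 1)], gradients["dW" + str(l)], gradients["db" + str(l)] = backward_linear(dZ, A_prev, W, b)

    return gradients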

Updating Model Parameters

After backpropagation, we are left with a set of relationships between the model’s error and its parameters. The next step is to use them to update the values of the parameters in a way that reduces the error. These updates are carried out using optimizers. Stochastic gradient descent (SGD) is a very common optimizer in deep learning. The updates that SGD performs on the parameters can be expressed as:

\[W^{[l]}_{new} = W^{[l]}_{old} - \alpha \frac{\partial E}{\partial W^{[l]}_{old}}\] \[\mathbf{b}^{[l]}_{new} = \mathbf{b}^{[l]}_{old} - \alpha \frac{\partial E}{\partial \mathbf{b}^{[l]}_{old}}\]

where \( \alpha \) is the learning rate, which is a value that describes how fast (or slow) the model learns from the training examples. Here is a Python implementation:

def update_parameters_sgd(parameters, gradients, learning_rate):
    L = len(parameters) // 2 # number of layers

    for l in range(L):
        parameters["W" + str(l+1)] -= learning_rate * gradients["dW" + str(l+1)]
        parameters["b" + str(l+1)] -= learning_rate * gradients["db" + str(l+1)]
    return parameters

Training Our Model

That’s pretty much it. The general procedure behind deep learning involves the following steps:

  • Choose a suitable model architecture, number of hidden layers, and types of activation functions

  • Initialize the model parameters (weights and biases)

  • Feed the input into the model

  • Compute the model prediction using forward propagation

  • Compute the error using the cost function

  • Compute the partial derivatives of the error with respect to the parameters using back propagation

  • Update the model parameters using an optimizer

  • Repeat until error meets convergence criteria

Seems simple enough, right? Well, there are various methods we can use to carry out each step and finding just the right combination of methods can be rather elusive, especially for solving complicated problems.

At this point, we can use all of the different components we previously built to create a training script for our model. But before that, we need to prepare our data to make training effective. Basically, we will divide our training set into random and distinct chunks called mini-batches. During each training iteration, we will feed a mini-batch as input to the model. After feeding all the mini-batches from the training set, the model has been trained on all the training examples. This is called an epoch. We then repeat model training to complete more epochs until the model’s accuracy reaches a sufficiently high value.

Here is a basic implementation to generate mini-batches from a given dataset:

def generate_mini_batches(X, Y, batch_size=64, seed=0):
    np.random.seed(seed)
    m = X.shape[1]    # number of training examples
    mini_batches = []

    # Randomly shuffle the dataset
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Partition the shuffled data into mini-batches of size batch_size
    num_complete_minibatches = m // batch_size
    for k in range(0, num_complete_minibatches):

        mini_batch_X = shuffled_X[:,k * batch_size:(k + 1) * batch_size]
        mini_batch_Y = shuffled_Y[:,k * batch_size:(k + 1) * batch_size]

        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handle the last, smaller mini-batch if m is not divisible by batch_size
    if m % batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * batch_size:]
        mini_batches.append((mini_batch_X, mini_batch_Y))

    return mini_batches
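With all the components in place, a minimal training loop might look like the following sketch. It wires together the functions above, including the illustrative backward_propagation from earlier; the hyperparameter values are placeholders:

def train_model(X, Y, net_dims, learning_rate=0.01, num_epochs=100, batch_size=64):
    params = initialize_params_he(net_dims)

    for epoch in range(num_epochs):
        # reshuffle the training set into fresh mini-batches each epoch
        mini_batches = generate_mini_batches(X, Y, batch_size, seed=epoch)

        for mini_X, mini_Y in mini_batches:
            # forward pass: compute predictions and cache intermediate values
            AL, cache_list = forward_propagation(mini_X, params)
            # backward pass: compute gradients of the error w.r.t. the parameters
            gradients = backward_propagation(AL, mini_Y, cache_list)
            # update the parameters with one SGD step
            params = update_parameters_sgd(params, gradients, learning_rate)

        # track the error on the full training set after each epoch
        error = compute_error(forward_propagation(X, params)[0], Y)
        print("Epoch %d, error: %f" % (epoch, error))

    return params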

It is important to note that the code snippets provided in this post omit some lines for variable initialization and error checking. Please visit the GitHub repository linked above for the full working code for this tutorial.

Future Work

This is only the first part of this project. I am currently working on basic implementations of recently developed methods for deep learning. These methods include:

  • Alternative initialization methods, activation functions, cost functions, and optimizers
  • Hyperparameter search techniques (e.g. learning rate finder, architecture search)
  • Regularization (e.g. dropout, L2, and batch normalization)
  • Different network types (CNNs, LSTMs, GRUs)

I will make sure to post updates once they are completed.