Build a DeepLearning algorithm from scratch in F# 03 – Building A Deep Neural Network

Welcome to the third post of my “Build a DeepLearning algorithm from scratch in F#” series.

In the first article, we described how to implement logistic regression with a single neuron. In the second article, we saw how to implement a neural network with a single hidden layer.

In today’s article, we will see how to implement and use an \(L\)-layer deep neural network.

Notation

  • Superscript \([l]\) denotes a quantity associated with the \(l^{th}\) layer.
    • Example: \(a^{[L]}\) is the \(L^{th}\) layer activation. \(W^{[L]}\) and \(b^{[L]}\) are the \(L^{th}\) layer parameters.
  • Superscript \((i)\) denotes a quantity associated with the \(i^{th}\) example.
    • Example: \(x^{(i)}\) is the \(i^{th}\) training example.
  • Subscript \(i\) denotes the \(i^{th}\) entry of a vector.
    • Example: \(a^{[l]}_i\) denotes the \(i^{th}\) entry of the \(l^{th}\) layer’s activations.

Let’s get started!

Algorithm description

  • 1 – Initialize the parameters for an \(L\)-layer neural network.
  • 2 – Implement the forward propagation module (shown in purple in the figure below).
    • 2.1 – First, the LINEAR part of a layer’s forward propagation step (resulting in \(Z^{[l]}\)).
    • 2.2 – Then the different ACTIVATION functions (relu/sigmoid).
    • 2.3 – Combine the previous two steps into a new [LINEAR->ACTIVATION] forward function.
    • 2.4 – Stack the [LINEAR->RELU] forward function L-1 times (for layers 1 through L-1) and add a [LINEAR->SIGMOID] at the end (for the final layer \(L\)).
  • 3 – Compute the loss.
  • 4 – Implement the backward propagation module (denoted in red in the figure below).
    • 4.1 First, LINEAR part of a layer’s backward propagation step.
    • 4.2 Then the different ACTIVATION function derivatives (relu_backward/sigmoid_backward).
    • 4.3 Combine the previous two steps into a new [LINEAR->ACTIVATION] backward function.
    • 4.4 Stack [LINEAR->RELU] backward L-1 times and add [LINEAR->SIGMOID] backward in a new L_model_backward function
  • 5 – Finally update the parameters.

Figure 1
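
To see how these five steps fit together, here is a rough sketch of the training loop, written in terms of the functions we implement in the rest of this post (initialize_parameters_deep, L_model_forward, compute_cost, L_model_backward, update_parameters). The complete version, L_layer_model, appears in section 6; treat this as a roadmap only.

let train_sketch (X:Matrix<float>) (Y:Vector<float>) layer_dims learning_rate num_iterations =
    // 1 - initialize W and b for every layer
    let mutable parameters = initialize_parameters_deep layer_dims |> Seq.toList
    for i in 0..num_iterations - 1 do
        // 2 - forward propagation through all L layers
        let caches = L_model_forward X parameters
        let AL = (caches |> Seq.last).A.Row(0)
        // 3 - cross-entropy cost, to monitor learning
        let cost = compute_cost AL Y
        if i % 100 = 0 then printfn "Cost after iteration %i: %f" i cost
        // 4 - backward propagation, collecting the gradients
        let grads = L_model_backward AL Y caches
        // 5 - gradient descent update using the learning rate
        parameters <- update_parameters parameters grads learning_rate |> Seq.toList
    parameters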
 

1 – Initialization

The initialization for a deeper L-layer neural network is more complicated than for a single neuron or a single-hidden-layer neural net, because there are many more weight matrices and bias vectors. \(n^{[l]}\) is the number of units in layer \(l\). For example, if the size of our input \(X\) is \((12288, 209)\) (with \(m=209\) examples), then:

| | Shape of W | Shape of b | Activation | Shape of Activation |
|---|---|---|---|---|
| Layer 1 | \((n^{[1]}, 12288)\) | \((n^{[1]}, 1)\) | \(Z^{[1]} = W^{[1]} X + b^{[1]}\) | \((n^{[1]}, 209)\) |
| Layer 2 | \((n^{[2]}, n^{[1]})\) | \((n^{[2]}, 1)\) | \(Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\) | \((n^{[2]}, 209)\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| Layer L-1 | \((n^{[L-1]}, n^{[L-2]})\) | \((n^{[L-1]}, 1)\) | \(Z^{[L-1]} = W^{[L-1]} A^{[L-2]} + b^{[L-1]}\) | \((n^{[L-1]}, 209)\) |
| Layer L | \((n^{[L]}, n^{[L-1]})\) | \((n^{[L]}, 1)\) | \(Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]}\) | \((n^{[L]}, 209)\) |

1.1 – Importing nuget packages

As usual, we will import the NuGet packages that allow us to load the data from CSV files, do the matrix operations, and display results with nice charts.

#load "Paket.fsx"

Paket.Package
    [
        "FsLab"
    ]
#load "Paket.Generated.Refs.fsx"
open System
open FSharp.Data
open FSharp.Data.CsvExtensions
open MathNet.Numerics.LinearAlgebra
open MathNet.Numerics.LinearAlgebra.Double
#load "XPlot.Plotly.Paket.fsx"
#load "XPlot.Plotly.fsx"
open XPlot.Plotly

1.2 – Helper methods

let createVector size value =
    seq {
        for _ in 0..size-1  do
            yield value
        }

let initialize_with_zeros dim =
    let w = createVector dim 0.0 |> DenseVector.ofSeq
    let b = 0.0
    w, b
type MatrixVectorDU =
        | Vector of Vector<float>
        | Matrix of Matrix<float>

let shape x = match x with
                    | Vector n -> (1, n.Count)
                    | Matrix n -> (n.RowCount, n.ColumnCount)
type public NnParameter = {
    W: Matrix<float>;
    b: Vector<float>;
    }

1.3 – Initialize implementation

let initialize_parameters_deep layer_dims =
    let layer_count = layer_dims |> List.length
    seq {
        for l in 1..layer_count-1 do
            let sqrtPrevious = sqrt (double (layer_dims.[l-1]))
            let W = (Matrix.Build.Random( (layer_dims.[l]), (layer_dims.[l-1]) )).Map(fun x -> x / sqrtPrevious)
            let b = createVector layer_dims.[l] 0.0 |> DenseVector.ofSeq
           
            yield {
                    W = W;
                    b = b;
                }
        }
initialize_parameters_deep([5;4;3])
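
As a quick sanity check (just a sketch, reusing the shape helper defined above), we can print the dimensions of each layer’s parameters and compare them with the table in section 1:

initialize_parameters_deep [5; 4; 3]
|> Seq.iteri (fun i p ->
    // shape returns a (rows, columns) tuple; b is a column vector of length n[l]
    printfn "Layer %d: W shape = %A, b length = %d" (i + 1) (shape (MatrixVectorDU.Matrix p.W)) p.b.Count)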

2 – Forward propagation module

2.1 – Linear Forward

Now that we have initialized our parameters, we will implement the forward propagation module. We will implement three functions, in this order:

  • LINEAR
  • LINEAR -> ACTIVATION where ACTIVATION will be either ReLU or Sigmoid.
  • [LINEAR -> RELU] \(\times\) (L-1) -> LINEAR -> SIGMOID (whole model)

The linear forward module computes the following equations:

$$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}$$

where \(A^{[0]} = X\).

Reminder:
The mathematical representation of this unit is \(Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}\).

type public LinearForwardResult = {
    A: Matrix<float>;
    W: Matrix<float>;
    b: Vector<float>;
    Z: Matrix<float>;
    }

let linear_forward (A:Matrix<float>) (W:Matrix<float>) (b:Vector<float>) =
    //Helper method
    let multiply_matrix_by_matrix_of_another_size_and_add_b (X:Matrix<float>) (weights:Matrix<float>) (b:Vector<float>) =
        let weight_shape = shape (MatrixVectorDU.Matrix weights)
        let mutable weights_vectors = Seq.empty<Vector<float>>
        for i in 0..(snd weight_shape) - 1 do
            let ax_plus_b = X.LeftMultiply (weights.Column(i) |> DenseVector.ofSeq) |> Vector.map (fun x -> x + b.[i])
            weights_vectors <- Seq.append weights_vectors [ax_plus_b]
        weights_vectors |> DenseMatrix.ofRowSeq

    let Z = multiply_matrix_by_matrix_of_another_size_and_add_b A (W.Transpose()) b
    {
        A = A;
        W = W;
        b = b;
        Z = Z;
    }
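
As an aside, the helper above builds Z row by row; with MathNet’s operators the same computation can be written more compactly. This is only a sketch of the equivalent vectorized form (assuming MathNet’s Matrix.mapi), not the implementation used in the rest of the post:

let linear_forward_vectorized (A:Matrix<float>) (W:Matrix<float>) (b:Vector<float>) =
    // Z = W * A, then broadcast b across the m columns (one bias entry per row)
    W * A |> Matrix.mapi (fun row _ z -> z + b.[row])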

Test case:

let A_testcase = matrix [
                            [ 1.62434536; -0.61175641]
                            [-0.52817175; -1.07296862]
                            [ 0.86540763; -2.3015387 ]
                        ]
                       
let W_testcase =  matrix [
                    [ 1.74481176; -0.7612069; 0.3190391]
                ]
               
let b_testcase = vector [-0.24937038]

let linear_testcase_result = linear_forward A_testcase W_testcase b_testcase
printfn "%s" (linear_testcase_result.Z |> string)

Output:

Z [[ 3.26295337 -1.23429987]]

2.2 – Linear-Activation Forward

We will use two activation functions:

  • Sigmoid: \(\sigma(Z) = \sigma(W A + b) = \frac{1}{ 1 + e^{-(W A + b)}}\).
  • ReLU: The mathematical formula for ReLu is \(A = RELU(Z) = max(0, Z)\).
let sigmoid (z:Matrix<float>) =
    z.Map(fun x -> -x)
            |> (fun x -> x.PointwiseExp())
            |> (fun x -> 1.0 + x)
            |> (fun x -> 1.0 / x)
let relu (z:Matrix<float>) =
    let z_shape = shape (MatrixVectorDU.Matrix z)
    let other_matrix = DenseMatrix.Build.Dense(fst z_shape, snd z_shape)
    z.Map2((fun x y -> Math.Max(x, y)), other_matrix)

For more convenience, we are going to group two functions (Linear and Activation) into one function (LINEAR->ACTIVATION). Hence, we will implement a function that does the LINEAR forward step followed by an ACTIVATION forward step.

type public LinearActivationForwardResult = {
    A_prev: Matrix<float>;
    W: Matrix<float>;
    b: Vector<float>;
    Z: Matrix<float>;
    A: Matrix<float>;
    }

let linear_activation_forward (a_prev:Matrix<float>) (W:Matrix<float>) (b:Vector<float>) (activation:string) =
    let linear_forward_result = linear_forward a_prev W b
    let Z = linear_forward_result.Z
    let A = if activation = "sigmoid" then sigmoid Z else if activation = "tanh" then tanh Z else relu Z
   
    let W_shape = shape (MatrixVectorDU.Matrix W)
    let A_shape = shape (MatrixVectorDU.Matrix A)
    let A_prev_shape = shape (MatrixVectorDU.Matrix a_prev)
    assert (A_shape = ( (W_shape |> fst), (A_prev_shape |> snd)))

    {
        A_prev = a_prev;
        W = W;
        b = b;
        Z = Z;
        A = A;
    }

Test case:

let A_prev_testcase = matrix [
                                [-0.41675785; -0.05626683]
                                [-2.1361961;   1.64027081]
                                [-1.79343559; -0.84174737]
                            ]
                           
let W_testcase = matrix [
                    [ 0.50288142; -1.24528809; -1.05795222;]
                ]
               
let b_testcase = vector[-0.90900761]

let sigmoid_activation_testcase = linear_activation_forward A_prev_testcase W_testcase b_testcase "sigmoid"
printfn "With sigmoid: A = %s" (string sigmoid_activation_testcase.A)

let relu_activation_testcase = linear_activation_forward A_prev_testcase W_testcase b_testcase "relu"
printfn "With relu: A = %s" (string relu_activation_testcase.A)

let tanh_activation_testcase = linear_activation_forward A_prev_testcase W_testcase b_testcase "tanh"
printfn "With tanh: A = %s" (string tanh_activation_testcase.A)

Output:

With sigmoid: A [[ 0.96890023 0.11013289]]
With ReLU: A [[ 3.43896131 0. ]]
With tanh: A [[ 0.99794156 -0.96982745]]

2.3 – Forward propagation loop

For even more convenience when implementing the \(L\)-layer Neural Net, we will need a function that replicates the previous one (linear_activation_forward with RELU) \(L-1\) times, then follows that with one linear_activation_forward with SIGMOID.

We will keep a “cache” of every forward activation result, which will be used to compute the derivatives in the backward propagation step.

Figure 2 : *[LINEAR -> RELU] \(\times\) (L-1) -> LINEAR -> SIGMOID* model

Note: In the code below, the variable AL will denote \(A^{[L]} = \sigma(Z^{[L]}) = \sigma(W^{[L]} A^{[L-1]} + b^{[L]})\). (This is sometimes also called Yhat, i.e., this is \(\hat{Y}\).)

let L_model_forward (X:Matrix<float>) (parameters:list<NnParameter>) =
    let mutable caches = Seq.empty<LinearActivationForwardResult>
    let mutable A = X
    let L = parameters |> Seq.length
   
    //Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in 1..L-1 do
        let A_prev = A
        let forward_activation_result = linear_activation_forward A_prev parameters.[l-1].W parameters.[l-1].b "relu"
        A <- forward_activation_result.A
        caches <- Seq.append caches [forward_activation_result]
   
    //Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    let AL = linear_activation_forward A parameters.[L-1].W parameters.[L-1].b "sigmoid"
    caches <- Seq.append caches [AL]

    caches

Test case:

let parameters_testcase = [
                    {
                        W = matrix [
                                        [ 0.35480861;  1.81259031; -1.3564758; -0.46363197;  0.82465384]
                                        [-1.17643148;  1.56448966;  0.71270509; -0.1810066;  0.53419953]
                                        [-0.58661296; -1.48185327;  0.85724762;  0.94309899;  0.11444143]
                                        [-0.02195668; -2.12714455; -0.83440747; -0.46550831;  0.23371059]
                                    ];
                        b = vector [ 1.38503523;-0.51962709; -0.78015214; 0.95560959];
                    };
                    {
                   
                        W = matrix[
                                        [-0.12673638; -1.36861282;  1.21848065; -0.85750144]
                                        [-0.56147088; -1.0335199 ;  0.35877096;  1.07368134]
                                        [-0.37550472;  0.39636757; -0.47144628;  2.33660781]
                                ];
                        b = vector[ 1.50278553; -0.59545972; 0.52834106];
                    };
                    {
                        W = matrix[
                                    [ 0.9398248;  0.42628539; -0.75815703]
                                ];
                        b = vector [-0.16236698];
                    }
                ]
                       
let X_testcase = matrix [
                            [-0.31178367;  0.72900392;  0.21782079; -0.8990918 ]
                            [-2.48678065;  0.91325152;  1.12706373; -1.51409323]
                            [ 1.63929108; -0.4298936;   2.63128056;  0.60182225]
                            [-0.33588161;  1.23773784;  0.11112817;  0.12915125]
                            [ 0.07612761; -0.15512816;  0.63422534;  0.810655  ]
                        ]

let lmodel_forward_result = L_model_forward X_testcase parameters_testcase
let last_testcase = Seq.last lmodel_forward_result
printfn "AL = %s" (string(last_testcase.A))
printfn "Length of caches list = %s" (string(lmodel_forward_result |> Seq.length))

Output:

AL [[ 0.03921668 0.70498921 0.19734387 0.04728177]]
Length of caches list 3

Great! Now we have a full forward propagation that takes the input X and outputs a row vector \(A^{[L]}\) containing our predictions, and records all intermediate values in “caches”. Using \(A^{[L]}\), we can compute the cost of our predictions.

3 – Cost function

Before we implement backward propagation, we need to be able to compute the cost after each iteration, so that we can check whether our model is actually learning.

We will compute the cross-entropy cost \(J\), using the following formula: $$-\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)) $$

let compute_cost (AL:Vector<float>) (Y:Vector<float>) =
    let m = shape (MatrixVectorDU.Vector Y) |> snd |> float
    let logprobs =  AL.PointwiseLog().PointwiseMultiply(Y) + AL.Map(fun a -> 1.0 - a).PointwiseLog().PointwiseMultiply(Y.Map(fun y -> 1.0 - y))
    - logprobs.Sum() |> float |> (fun x -> x / m)

Test case:

let AL_testcase = vector [ 0.8;  0.9;  0.4]
let Y_testcase = vector [ 1.0; 1.0; 1.0]

let cost = compute_cost AL_testcase Y_testcase

printfn "cost = %s" (string cost)

Expected Output:

cost 0.41493159961539694

4 – Backward propagation module

Back propagation is used to calculate the gradient of the loss function with respect to the parameters.

Reminder:

Figure 3 : Forward and Backward propagation for *LINEAR->RELU->LINEAR->SIGMOID*
*The purple blocks represent the forward propagation, and the red blocks represent the backward propagation.*

Now, similar to forward propagation, we are going to build the backward propagation in three steps:

  • LINEAR backward
  • LINEAR -> ACTIVATION backward where ACTIVATION computes the derivative of either the ReLU or sigmoid activation
  • [LINEAR -> RELU] \(\times\) (L-1) -> LINEAR -> SIGMOID backward (whole model)

For layer \(l\), the linear part is: \(Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}\) (followed by an activation).

Suppose we have already calculated the derivative \(dZ^{[l]} = \frac{\partial \mathcal{L} }{\partial Z^{[l]}}\). We want to get \((dW^{[l]}, db^{[l]}, dA^{[l-1]})\).

 

Figure 4
The three outputs \((dW^{[l]}, db^{[l]}, dA^{[l-1]})\) are computed using the input \(dZ^{[l]}\). Here are the formulas we need:
$$ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} $$
$$ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)} $$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} $$

4.1 – Linear Backward

type public LinearBackwardResult = {
    dA_prev: Matrix<float>;
    dW: Matrix<float>;
    db: Vector<float>;
    }
   
let multiply_matrix_by_matrix_of_another_size (X:Matrix<float>) (weights:Matrix<float>) =
    let weight_shape = shape (MatrixVectorDU.Matrix weights)
    let mutable weights_vectors = Seq.empty<Vector<float>>
    for i in 0..(snd weight_shape) - 1 do
        let product = X.LeftMultiply (weights.Column(i) |> DenseVector.ofSeq)
        weights_vectors <- Seq.append weights_vectors [product]
    weights_vectors |> DenseMatrix.ofRowSeq

let linear_backward (dZ:Matrix<float>) (cache:LinearActivationForwardResult) =
    let A_prev = cache.A_prev
    let W = cache.W
    let b = cache.b
    let m = (shape (MatrixVectorDU.Matrix A_prev)) |> snd |> float
    let dW = 1.0 / m * (dZ * A_prev.Transpose())
    let db = 1.0 / m * (dZ |> Matrix.sumRows)
    let dA_prev =  multiply_matrix_by_matrix_of_another_size dZ W
   
    {
        dA_prev = dA_prev;
        dW = dW;
        db = db;
    }

Test case:

let linear_cache_testcase = {

    A_prev = matrix [
                        [-0.52817175; -1.07296862]
                        [ 0.86540763; -2.3015387 ]
                        [ 1.74481176; -0.7612069 ]
                    ];
                   
    W = matrix [
                    [ 0.3190391; -0.24937038;  1.46210794]
                ];
   
    b = vector [-2.06014071];
   
    Z = matrix [
                    [ 0.0 ]
                ];
   
    A = matrix [
                    [ 0.0 ]
                ];
}

let dZ_testcase = matrix [ [ 1.62434536; -0.61175641] ]

linear_backward dZ_testcase linear_cache_testcase

Output:

dA_prev [[ 0.51822968 -0.19517421]
[-0.40506361 0.15255393]
[ 2.37496825 -0.89445391]]
dW [[-0.10076895 1.40685096 1.64992505]]
db [[ 0.50629448]]

4.2 – Linear-Activation backward

Next, we will create a function, linear_activation_backward, that merges linear_backward with the backward step for the activation.

We will implement two backward functions:

  • sigmoid_backward: implements the backward propagation for a SIGMOID unit, using \(\sigma'(Z^{[l]}) = \sigma(Z^{[l]}) \cdot (1-\sigma(Z^{[l]}))\).
  • relu_backward: implements the backward propagation for a RELU unit, using \(g'(Z^{[l]}) = 1\) if \(Z^{[l]} > 0\) and \(0\) otherwise.

If \(g(.)\) is the activation function, sigmoid_backward and relu_backward compute $$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}).$$

let sigmoid_backward (dA:Matrix<float>) (cache:LinearActivationForwardResult) =
    let Z = cache.Z
   
    let s = Z |> Matrix.map (fun x -> -x)
                |> (fun x -> x.PointwiseExp())
                |> Matrix.map (fun x -> 1.0 + x)
                |> Matrix.map (fun x -> 1.0 / x)
       
    let dZ = dA.PointwiseMultiply(s).PointwiseMultiply( (s |> Matrix.map (fun x -> 1.0 - x) ) )
    dZ
let relu_backward (dA:Matrix<float>) (cache:LinearActivationForwardResult) =
    let Z = cache.Z
    let mutable dZ = dA.Clone()
   
    for i in 0..dA.RowCount - 1 do
        for j in 0..dA.ColumnCount - 1 do
            let value = if Z.Item(i, j) <= 0.0 then 0.0 else dA.Item(i, j)
            dZ.Item(i, j) <- value
    dZ

Implementation: Backpropagation for the LINEAR->ACTIVATION layer.

let linear_activation_backward (dA:Matrix<float>) (cache:LinearActivationForwardResult) (activation:string) =
    let dZ = if activation = "sigmoid" then sigmoid_backward dA cache else relu_backward dA cache
   
    let linear_backward_result = linear_backward dZ cache

    {
        dA_prev = linear_backward_result.dA_prev;
        dW = linear_backward_result.dW;
        db = linear_backward_result.db;
    }

Test case:

let linear_cache_testcase = {

    A_prev = matrix [
                        [ -2.1361961;   1.64027081]
                        [ -1.79343559; -0.84174737 ]
                        [ 0.50288142; -1.24528809 ]
                    ];
                   
    W = matrix [
                    [ -1.05795222; -0.90900761;  0.55145404]
                ];
   
    b = vector [ 2.29220801 ];
   
    Z = matrix [
                    [ 0.04153939; -1.11792545]
                ];
   
    A = matrix [
                    [ 0.0 ]
                ];
}

let AL_sigmoid_back_testcase = matrix [ [-0.41675785; -0.05626683] ]

linear_activation_backward AL_sigmoid_back_testcase linear_cache_testcase "sigmoid"

Output with sigmoid:

dA_prev [[ 0.11017994 0.01105339]
[ 0.09466817 0.00949723]
[-0.05743092 -0.00576154]]
dW [[ 0.10266786 0.09778551 -0.01968084]]
db [[-0.05729622]]
linear_activation_backward AL_sigmoid_back_testcase linear_cache_testcase "relu"

Output with relu:

dA_prev [[ 0.44090989 0. ]
[ 0.37883606 0. ]
[-0.2298228 0. ]]
dW [[ 0.44513824 0.37371418 -0.10478989]]
db [[-0.20837892]]

4.3 – Backpropagation loop

We will implement the backward function for the whole network. Recall that when we implemented the L_model_forward function, at each iteration we stored a cache containing (A_prev, W, b, and Z). In the backpropagation module, we will use those values to compute the gradients. Therefore, in the L_model_backward function, we will iterate through all the layers backward, starting from layer \(L\). At each step, we will use the cached values for layer \(l\) to backpropagate through layer \(l\). Figure 5 below shows the backward pass.

Figure 5: Backward pass
Initializing backpropagation:
To backpropagate through this network, we know that the output is \(A^{[L]} = \sigma(Z^{[L]})\). Our code thus needs to compute dAL \(= \frac{\partial \mathcal{L}}{\partial A^{[L]}}\).
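
For the cross-entropy cost defined in section 3, this gradient works out (element-wise) to
$$ dA^{[L]} = -\left(\frac{Y}{A^{[L]}} - \frac{1-Y}{1-A^{[L]}}\right), $$
which is exactly what the dAL line in the code below computes.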

We can then use this post-activation gradient dAL to keep going backward. As shown in Figure 5, we can now feed dAL into the LINEAR->SIGMOID backward function we implemented (which will use the cached values stored by the L_model_forward function). After that, we will use a loop to iterate through all the other layers with the LINEAR->RELU backward function, storing each dA, dW, and db along the way. Conceptually, for each layer \(l\) we keep

$$grads[“dW” + str(l)] = dW^{[l]} $$

so that, for example, \(dW^{[3]}\) would be stored in grads[“dW3”]. In our F# implementation, we will instead accumulate the gradients in a list of LinearBackwardResult records, ordered from layer \(L\) down to layer 1.

Implementation: Backpropagation for the [LINEAR->RELU] \(\times\) (L-1) -> LINEAR -> SIGMOID model.

let L_model_backward (AL:Vector<float>) (Y:Vector<float>) (caches:seq<LinearActivationForwardResult>) =
    let caches_array = caches |> Seq.toArray
    let mutable grads_seq = Seq.empty<LinearBackwardResult>
    let L = caches |> Seq.length // the number of layers
    let m = shape (MatrixVectorDU.Vector AL) |> snd

    // Initializing the backpropagation
    let dAL = - (Y.PointwiseDivide(AL) - (Y |> Vector.map(fun x -> 1.0 - x)).PointwiseDivide((AL |> Vector.map(fun x -> 1.0 - x))))
   
    // Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "AL, Y, caches". Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]
    let mutable current_cache = caches_array.[L-1]
    let mutable grads = linear_activation_backward (dAL |> DenseMatrix.OfRowVectors) current_cache "sigmoid"
   
    grads_seq <- Seq.append grads_seq [grads]
   
    for l = (L-2) downto 0 do
        // lth layer: (RELU -> LINEAR) gradients.
        // Inputs: "grads["dA" + str(l + 2)], caches". Outputs: "grads["dA" + str(l + 1)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)]
        current_cache <- caches_array.[l]
        let current_grads = linear_activation_backward grads.dA_prev current_cache "relu"
        grads <- current_grads
        grads_seq <- Seq.append grads_seq [current_grads]

    grads_seq |> Seq.toList

Test case:

let AL_testcase = vector [ 1.78862847;  0.43650985]
let Y_testcase = vector [ 1.0; 0.0 ]

let forward_caches_testcase = [|
    {
        A = matrix [  [ 0.0; 0.0; 0.0 ] ];

        A_prev = matrix [
                            [ 0.09649747; -1.8634927 ]
                            [-0.2773882; -0.35475898 ]
                            [-0.08274148; -0.62700068]
                            [-0.04381817; -0.47721803]
                        ];

        W = matrix [
                        [-1.31386475;  0.88462238;  0.88131804;  1.70957306]
                        [ 0.05003364; -0.40467741; -0.54535995; -1.54647732]
                        [ 0.98236743; -1.10106763; -1.18504653; -0.2056499 ]
                   ];

        b = vector [ 1.48614836; 0.23671627; -1.02378514];

        Z = matrix [
                        [-0.7129932;  0.62524497 ]
                        [-0.16051336; -0.76883635]
                        [-0.23003072;  0.74505627]
                   ];
        }; {
       
        A = matrix [  [ 0.0 ] ];

        A_prev = matrix [
                            [ 1.97611078; -1.24412333]
                            [-0.62641691; -0.80376609]
                            [-2.41908317; -0.92379202]
                        ];

        W = matrix  [
                        [-1.02387576;  1.12397796; -0.13191423]
                    ];

        b = vector [ -1.62328545 ];

        Z = matrix [ [ 0.64667545; -0.35627076] ];

    };
|]

let L_model_backward_testcase_result = L_model_backward AL_testcase Y_testcase forward_caches_testcase
L_model_backward_testcase_result
let dA1 = L_model_backward_testcase_result.[0].dA_prev
let dW1 = L_model_backward_testcase_result.[1].dW
let db1 = L_model_backward_testcase_result.[1].db

printfn "dA1 = %s" (string dA1)
printfn "dW1 = %s" (string dW1)
printfn "db1 = %s" (string db1)

Output:

dW1 [[ 0.41010002 0.07807203 0.13798444 0.10502167]
[ 0. 0. 0. 0. ]
[ 0.05283652 0.01005865 0.01777766 0.0135308 ]]
db1 [[-0.22007063]
[ 0. ]
[-0.02835349]]
dA1 [[ 0.12913162 -0.44014127]
[-0.14175655 0.48317296]
[ 0.01663708 -0.05670698]]

5 – Update Parameters

In this section we will update the parameters of the model, using gradient descent:

$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} $$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} $$

where \(\alpha\) is the learning rate.

let update_parameters (parameters:list<NnParameter>) (grads:seq<LinearBackwardResult>) (learning_rate:float) =

    let L = parameters |> Seq.length // the number of layers
    let gradsList = grads |> Seq.toList
   
    seq {
        for l in 0..L-1 do
       
            yield {
                    W = parameters.[l].W - (gradsList.[L - 1 - l].dW |> Matrix.map (fun x -> learning_rate * x))
                    b = parameters.[l].b - (gradsList.[L - 1 - l].db |> Vector.map (fun x -> learning_rate * x))
            }
        }

Test case:

let parameters = [
                    {
                        W = matrix [
                                        [-0.41675785; -0.05626683; -2.1361961 ;  1.64027081]
                                        [-1.79343559; -0.84174737;  0.50288142; -1.24528809]
                                        [-1.05795222; -0.90900761;  0.55145404;  2.29220801]
                                    ];
                                   
                        b = vector[ 0.04153939; -1.11792545; 0.53905832];
                    };
                    {
                         W = matrix [
                                         [-0.5961597; -0.0191305; 1.17500122];
                                    ];
                                   
                        b = vector [-0.74787095];
                    };
                ]
                   

let grads = [|
                {
                    dA_prev = matrix [ [ 0.0 ] ];
                   
                    dW = matrix [
                                   [-0.40467741; -0.54535995; -1.54647732]
                                ];
                         
                    db = vector [ 0.98236743 ];
                };
                {
                    dA_prev = matrix [ [ 0.0 ] ];
                   
                   
                   
                    dW = matrix [
                                   [ 1.78862847;  0.43650985;  0.09649747; -1.8634927 ]
                                   [-0.2773882 ; -0.35475898; -0.08274148; -0.62700068]
                                   [-0.04381817; -0.47721803; -1.31386475;  0.88462238]
                                ];
                         
                    db = vector [ 0.88131804; 1.70957306; 0.05003364];
                }
            |]

update_parameters parameters grads 0.1

Output:

W1 [[-0.59562069 -0.09991781 -2.14584584 1.82662008]
[-1.76569676 -0.80627147 0.51115557 -1.18258802]
[-1.0535704 -0.86128581 0.68284052 2.20374577]]
b1 [[-0.04659241]
[-1.28888275]
[ 0.53405496]]
W2 [[-0.55569196 0.0354055 1.32964895]]
b2 [[-0.84610769]]

6 – Assemble our L-layer Neural Network

We will use the functions we’ve implemented so far to build a deep network, and apply it to cat vs non-cat classification. Hopefully, we will see an improvement in accuracy relative to our logistic regression implementation.

Let’s get started!

let L_layer_model (X:Matrix<float>) (Y:Vector<float>) (layers_dims:list<int>)
                    (learning_rate:float) (num_iterations:int) (print_cost:bool) =
   
    //Parameters initialization.
    let mutable parameters = initialize_parameters_deep layers_dims |> Seq.toList
    let mutable costs = Seq.empty<float>
   
    //Loop (gradient descent)
    for i in 0..num_iterations - 1 do

        let forward_result = L_model_forward X parameters
        let AL = (forward_result |> Seq.last).A.Row(0)
        let cost = compute_cost AL Y
        let grads = L_model_backward AL Y forward_result
         
        // Update parameters.
        parameters <- update_parameters parameters grads learning_rate |> Seq.toList
               
        //Print the cost every 100 iterations
        if print_cost && i % 100 = 0 then
            (printfn "Cost after iteration %i: %f" i cost)
            costs <- Seq.append costs [cost]
       
    parameters, costs
let predict (X:Matrix<float>) (Y:Vector<float>) (parameters:list<NnParameter>) =

    let propagationResult = L_model_forward X parameters
    let AL = (propagationResult |> Seq.last).A.Row(0)
    let predictions = AL |> Vector.map (fun x -> if x > 0.5 then 1.0 else 0.0)

    // predictions and Y are both 0/1 vectors, so the mean of |predictions - Y| is the error rate
    printfn "Accuracy: {%f}" (100.0 * (1.0 - (predictions - Y |> Vector.map(fun x -> abs x) |> Vector.toArray |> Array.average)))
       
    predictions

7 – See our model in action

We will use the same “Cat vs non-Cat” dataset as in our first post. The model we built there had 70% test accuracy at classifying cat vs non-cat images. Hopefully, our new model will perform better!

7.1 – Load Dataset

//returns a sequence of arrays; for each row, the first element is the pixel array and the second is a one-item label array
let parse_csv (x:CsvFile) =
    seq {
        for row in x.Rows do
            let rowValues = row.Columns
                            |> Seq.map (fun c -> int c)
                            |> Seq.toArray
                       
            let pixelsValues = rowValues.[1..]
            let labelValues = rowValues.[..0]
            yield [| pixelsValues; labelValues |]
        }

//extract and split parsed value from csv to train/test x matrix and y vector
let load_dataset (train:CsvFile) (test:CsvFile) =
    let parsed_train_rows =  parse_csv train
    let parsed_test_rows = parse_csv test
   
    let extract_y (x:seq<int[][]>) =
        seq {
            for row in x do
                yield (float row.[1].[0])
            }
    let extract_x (x:seq<int[][]>) =
        seq {
            for row in x do
                yield row.[0] |> Seq.map (fun r -> float r) |> Seq.toArray
            }
               
    let train_x = extract_x parsed_train_rows |> Seq.toArray |> DenseMatrix.ofColumnArrays
    let train_y = extract_y parsed_train_rows |> DenseVector.ofSeq
    let test_x = extract_x parsed_test_rows |>  Seq.toArray |> DenseMatrix.ofColumnArrays
    let test_y = extract_y parsed_test_rows |> DenseVector.ofSeq
   
    train_x, train_y, test_x, test_y
     
//building our datasets
let train_ds =  CsvFile.Load("C:\\Users\\Mathieu\\Dropbox\\ML\\Blog\\03_Deep_Neural_Network\\datasets\\train.csv", ",", '"', false, true, 0)
let test_ds =  CsvFile.Load("C:\\Users\\Mathieu\\Dropbox\\ML\\Blog\\03_Deep_Neural_Network\\datasets\\test.csv", ",", '"', false, true, 0)
let train_x, train_y, test_x, test_y = load_dataset train_ds test_ds
let normalize_pixels (matrix:Matrix<float>) =
    let columnCount = matrix.ColumnCount - 1
    let rowCount = matrix.RowCount - 1
    for i in 0..rowCount do
        for j in 0..columnCount do
            matrix.Item(i, j) <- matrix.Item(i, j) / (float 255)
    matrix

7.2 – Explore Dataset

open System.Drawing

let showPicture (matrix:Matrix<float>) (index:int) (filename:string) =
    let pixelVector = matrix.Column(index)
    let mutable i = 0
    let mutable line = -1
    let mutable vectorIndex = 0
    let bitmap = new Bitmap(64, 64)
    while vectorIndex < pixelVector.Count - 3 do
        if i % 64 = 0 then
            i <- 0
            line <- line + 1
           
        bitmap.SetPixel(i, line, Color.FromArgb(
                                                int pixelVector.[vectorIndex],
                                                int pixelVector.[vectorIndex + 1],
                                                int pixelVector.[vectorIndex + 2]))
        vectorIndex <- vectorIndex + 3
        i <- i + 1
    bitmap.Save("C:\\Users\\Mathieu\\Dropbox\\ML\\Blog\\03_Deep_Neural_Network\\generated\\" + filename)
    "<img src='generated/" + filename + "' style='width:128;height:128;'>" |> Util.Html |> Display
showPicture train_x 2 "image_2.bmp"

As usual, we reshape and standardize the images before feeding them to the network.

normalize_pixels train_x
normalize_pixels test_x

7.3 – Train our model

let layers_dims = [12288; 20; 7; 5; 1] //  4-layer model (the first entry is the input size)
let startedTime = System.DateTime.Now
let llayer_output = L_layer_model train_x train_y layers_dims 0.0075 1800 true
let endedTime = System.DateTime.Now

Cost after iteration 0: 0.666255
Cost after iteration 100: 0.520911
Cost after iteration 200: 0.494446
Cost after iteration 300: 0.423527
Cost after iteration 400: 0.352914
Cost after iteration 500: 0.341232
Cost after iteration 600: 0.290970
Cost after iteration 700: 0.108312
Cost after iteration 800: 0.065780
Cost after iteration 900: 0.042299
Cost after iteration 1000: 0.032148
Cost after iteration 1100: 0.025922
Cost after iteration 1200: 0.021531
Cost after iteration 1300: 0.018271
Cost after iteration 1400: 0.016059
Cost after iteration 1500: 0.014428
Cost after iteration 1600: 0.013200
Cost after iteration 1700: 0.012193

llayer_output
    |> snd
    |> Seq.toArray
    |> Chart.Line
    |> Chart.WithLayout (Layout(title = "Learning rate=0.0075"))
    |> Chart.WithXTitle("iterations (per hundreds)")
    |> Chart.WithYTitle("cost")

We can see the cost going down every hundred iterations, meaning our model is learning well! Now let’s see how accurate the predictions are.

predict train_x train_y (llayer_output |> fst)
seq [0.0; 0.0; 1.0; 0.0; ...]
Accuracy: {100.000000}
predict test_x test_y (llayer_output |> fst)
seq [1.0; 1.0; 1.0; 1.0; ...]
Accuracy: {80.000000}
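
If we want to classify a single new image, we can reuse predict on a one-column matrix. Here is a small sketch (the column index 7 is arbitrary; DenseMatrix.OfColumnVectors simply wraps the chosen column into a 1-column matrix):

// build a 1-column input matrix and a 1-element label vector for image 7, then predict
let single_x = DenseMatrix.OfColumnVectors(test_x.Column(7))
let single_y = vector [ test_y.[7] ]
predict single_x single_y (llayer_output |> fst)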

We achieved 80% accuracy at predicting cat vs non-cat images on the test set. In the first article, we had 70% accuracy, so this is a good improvement.
