Welcome to the first article of a series dedicated to learning how to build neural networks from scratch using the F# language.

We will base our articles on Andrew Ng's tutorials, available on Coursera, but use F# instead of Python.

Today we will build a logistic regression classifier to recognize cats. We will do this with a Neural Network mindset, as a first step to build more complex deep learning architectures.

If you want to know more about how logistic regression relates to neural networks, see Sebastian Raschka's Machine Learning FAQ.

**We will learn to:**

- Build the general architecture of a learning algorithm, including:
  - Initializing parameters
  - Calculating the cost function and its gradient
  - Using an optimization algorithm (gradient descent)
- Gather all three functions above into a main model function, in the right order.

**Why use .NET?**

The .NET platform has the advantage of running on many platforms, including Windows, Linux and macOS, but also iOS and Android through Xamarin. One could imagine training a deep learning algorithm on a cloud platform such as Azure or Amazon, and then running the trained algorithm offline within a mobile app developed in Xamarin, or on any IoT device running a .NET virtual machine.

**Why use F#?**

Many reasons! First, I wanted to learn a new language, different from the traditional ones I already know, and functional programming attracted me. Plus, as said above, F# comes with the benefits of being a .NET language. I also have this crazy idea that, by the end of this series of posts, we could end up with a simple, reusable library for building and training deep learning algorithms. People could even contribute to the library, who knows…

There is one more reason. Machine learning and deep learning are fields where developers are involved, but not only developers: many data scientists and mathematicians are deeply interested too, and they are used to functional, scripting-style languages that let them describe and implement algorithms quickly. I believe this is why languages such as Python and R are so widely used in the machine learning world. F# and .NET make a winning combination here: a scientist can implement algorithms in a scientist-friendly language, a developer can reuse that work from a developer-friendly language such as C# through .NET language interoperability, and end users benefit because the resulting models can be used on a wide range of platforms and applications.

**Prerequisites:**

This post assumes you already have a Jupyter notebook environment set up. If not, please refer to the Anaconda webpage.

We will use the IfSharp F# kernel to execute F# code within our Jupyter notebook. You can download a build of the latest version here. The F# Jupyter notebook corresponding to this post is hosted on my personal blog and can be downloaded here.

## Importing NuGet packages

In Python, once modules are installed, they can be imported directly from a notebook. In the .NET world, packages need to be installed through NuGet before you can reference them in a project. The lines below download and install the FsLab NuGet packages in our notebook. FsLab is a set of packages that allows us to analyze, visualize and access data within our F# notebooks. We will mainly use the MathNet library for vector and matrix operations, and some charting functions from XPlot.Plotly. For more information about the FsLab package, you can visit the **FsLab website**.

```
#load "Paket.fsx"
Paket.Package
    [ "FsLab" ]
```

```
#load "Paket.Generated.Refs.fsx"
```

## Load Dataset

The original dataset in this assessment is in h5 file format. For simplicity, I exported it from h5 to CSV using a Python script, with one file for training and one for testing. On each line, the first column corresponds to the label and the following ones correspond to the pixel values flattened into one line.

Andrew's original Python notebook comes with a sanity check, so we can easily know whether our CSV export and the matrix building from CSV parsing work as expected.

```
open FSharp.Data
open FSharp.Data.CsvExtensions
open MathNet.Numerics.LinearAlgebra

// Returns a sequence of arrays: index 0 is the pixel array, index 1 is a one-item label array
let parse_csv (x:CsvFile) =
    seq {
        for row in x.Rows do
            let rowValues =
                row.Columns
                |> Seq.map (fun c -> int c)
                |> Seq.toArray
            let pixelsValues = rowValues.[1..]
            let labelValues = rowValues.[..0]
            yield [| pixelsValues; labelValues |]
    }

// Extracts and splits the parsed CSV values into train/test x matrices and y vectors
let load_dataset (train:CsvFile) (test:CsvFile) =
    let parsed_train_rows = parse_csv train
    let parsed_test_rows = parse_csv test
    let extract_y (x:seq<int[][]>) =
        seq {
            for row in x do
                yield (float row.[1].[0])
        }
    let extract_x (x:seq<int[][]>) =
        seq {
            for row in x do
                yield row.[0] |> Seq.map (fun r -> float r) |> Seq.toArray
        }
    // Each example becomes one column of the x matrix
    let train_x = extract_x parsed_train_rows |> Seq.toArray |> DenseMatrix.ofColumnArrays
    let train_y = extract_y parsed_train_rows |> DenseVector.ofSeq
    let test_x = extract_x parsed_test_rows |> Seq.toArray |> DenseMatrix.ofColumnArrays
    let test_y = extract_y parsed_test_rows |> DenseVector.ofSeq
    train_x, train_y, test_x, test_y

// Building our datasets (arguments: separator, quote character, hasHeaders, ignoreErrors, skipRows)
let train_ds = CsvFile.Load("C:\\SomeFolder\\Datasets\\train.csv", ",", '"', false, true, 0)
let test_ds = CsvFile.Load("C:\\SomeFolder\\Datasets\\test.csv", ",", '"', false, true, 0)
let shape = fun (matrix:Matrix<float>) -> matrix.RowCount, matrix.ColumnCount
let train_x, train_y, test_x, test_y = load_dataset train_ds test_ds
```

```
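printfn "train_x shape: %A, test_x shape: %A" (shape train_x) (shape test_x)
printfn "sanity check after reshaping: "
// F# ranges are inclusive, so 0..4 selects the first five values
train_x.[0..4,0]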
```

**Expected Values**:

| Value | Expected |
|---|---|
| **train_x shape** | (12288, 209) |
| **train_y shape** | (1, 209) |
| **test_x shape** | (12288, 50) |
| **test_y shape** | (1, 50) |
| **sanity check after reshaping** | [17 31 56 22 33] |

## Show images

It is quite easy to plot a picture in a Python notebook, but I haven't found an easy way to do so with our flattened pixel dataset in F#, so I am providing code that does it manually by generating the corresponding bitmap file and outputting the result using an HTML tag.

```
open System.Drawing

let showPicture (matrix:Matrix<float>) (index:int) (filename:string) =
    let pixelVector = matrix.Column(index)
    let mutable i = 0
    let mutable line = -1
    let mutable vectorIndex = 0
    let bitmap = new Bitmap(64, 64)
    // Each pixel uses three consecutive values (R, G, B); <= keeps the last pixel
    while vectorIndex <= pixelVector.Count - 3 do
        // Start a new bitmap row every 64 pixels
        if i % 64 = 0 then
            i <- 0
            line <- line + 1
        let color =
            Color.FromArgb(
                int pixelVector.[vectorIndex],
                int pixelVector.[vectorIndex + 1],
                int pixelVector.[vectorIndex + 2])
        bitmap.SetPixel(i, line, color)
        vectorIndex <- vectorIndex + 3
        i <- i + 1
    bitmap.Save("C:\\SomeFolder\\images\\" + filename)
    // Close the img tag so the HTML is well-formed
    "<img src='folderpath/" + filename + "' />" |> Util.Html |> Display
```

```
showPicture train_x 25 "cat__25.bmp"
```

To represent color images, the red, green and blue channels (RGB) must be specified for each pixel, and so the pixel value is actually a vector of three numbers ranging from 0 to 255.

One common preprocessing step in machine learning is to center and standardize your dataset, meaning that you subtract the mean of the whole array from each example, and then divide each example by the standard deviation of the whole array. For picture datasets, however, it is simpler, more convenient, and works almost as well to just divide every row of the dataset by 255 (the maximum value of a pixel channel).

Let’s standardize our dataset.

```
let normalize_pixels (matrix:Matrix<float>) =
    let columnCount = matrix.ColumnCount - 1
    let rowCount = matrix.RowCount - 1
    for i in 0..rowCount do
        for j in 0..columnCount do
            matrix.[i, j] <- matrix.[i, j] / 255.0
    matrix

// The matrices are modified in place, so the returned values can be ignored
normalize_pixels train_x |> ignore
normalize_pixels test_x |> ignore
```
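For comparison, here is a minimal sketch of the two preprocessing options described above, written against a plain `float[]` rather than a MathNet matrix so that it runs on its own. The helper names `scaleBy255` and `standardize` are illustrative and are not part of the notebook.

```fsharp
// Illustrative helpers (hypothetical names) operating on a plain array of values.

// Option 1: divide every value by 255, the maximum value of a pixel channel.
let scaleBy255 (pixels: float[]) =
    pixels |> Array.map (fun p -> p / 255.0)

// Option 2: subtract the mean, then divide by the (population) standard deviation.
let standardize (values: float[]) =
    let mean = Array.average values
    let variance = values |> Array.averageBy (fun v -> (v - mean) ** 2.0)
    let std = sqrt variance
    values |> Array.map (fun v -> (v - mean) / std)

let sample = [| 0.0; 51.0; 255.0 |]
printfn "%A" (scaleBy255 sample)   // values now lie in [0, 1]
printfn "%A" (standardize sample)  // values now have mean 0 and unit variance
```

Dividing by 255 needs no statistics from the dataset, which is why it is the more convenient choice for images.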

## 3 – General Architecture of the learning algorithm

It’s time to design a simple algorithm to distinguish cat images from non-cat images.

We will build a Logistic Regression, using a Neural Network mindset. The following Figure explains why **Logistic Regression is actually a very simple Neural Network!**

**Mathematical expression of the algorithm**:

For one example \(x^{(i)}\)

\(z^{(i)} = w^T x^{(i)} + b \tag{1}\)

\(\hat{y}^{(i)} = a^{(i)} = sigmoid(z^{(i)})\tag{2}\)

\( \mathcal{L}(a^{(i)}, y^{(i)}) = - y^{(i)} \log(a^{(i)}) - (1-y^{(i)} ) \log(1-a^{(i)})\tag{3}\)

The cost is then computed by averaging the loss over all training examples:

\( J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})\tag{6}\)
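To make the equations above concrete, here is a small sketch that evaluates them with plain F# arrays instead of MathNet types. The names `sigmoidScalar`, `linear`, `loss` and `cost` are illustrative only, and each inner array of `xs` is assumed to hold one example.

```fsharp
// Scalar versions of equations (1) to (3) and of the averaged cost J.
let sigmoidScalar (z: float) = 1.0 / (1.0 + exp (-z))

// Equation (1): z = w^T x + b for one example x
let linear (w: float[]) (b: float) (x: float[]) =
    Array.fold2 (fun acc wi xi -> acc + wi * xi) b w x

// Equation (3): cross-entropy loss for one activation a and its label y
let loss (a: float) (y: float) =
    -(y * log a) - (1.0 - y) * log (1.0 - a)

// Cost J: the mean loss over all examples
let cost (w: float[]) (b: float) (xs: float[][]) (ys: float[]) =
    Array.map2 (fun x y -> loss (sigmoidScalar (linear w b x)) y) xs ys
    |> Array.average

// With w = 0 and b = 0 every activation is 0.5, so the cost is log 2 whatever the labels are.
let j = cost [| 0.0; 0.0 |] 0.0 [| [| 1.0; 2.0 |]; [| 3.0; 4.0 |] |] [| 1.0; 0.0 |]
printfn "cost = %f" j  // prints cost = 0.693147
```

Since the parameters are initialized to zero later in this post, log 2 is also the cost to expect at the very first training iteration.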

## 4 – Building the parts of our algorithm

The main steps for building a Neural Network are:

1. Define the model structure (such as number of input features)
2. Initialize the model's parameters
3. Loop:
   - Calculate current loss (forward propagation)
   - Calculate current gradient (backward propagation)
   - Update parameters (gradient descent)

You often build steps 1-3 separately and integrate them into one function we call `model`.

### 4.1 – Helper functions

We will start by implementing the `sigmoid` helper function. As you've seen in the figure above, we need to compute $$sigmoid( w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$ to make predictions.

```
// Element-wise sigmoid: 1 / (1 + exp(-z))
let sigmoid (z:Vector<float>) =
    z.Map(fun x -> -x)
    |> (fun x -> x.PointwiseExp())
    |> (fun x -> 1.0 + x)
    |> (fun x -> 1.0 / x)
```

```
let testResult = vector[ 0.0; 2.0 ] |> sigmoid
printfn "sigmoid([0, 2]) = [%f, %f]" testResult.[0] testResult.[1]
```

### 4.2 – Initializing parameters

We will implement parameter initialization in the cell below. In this scenario, we have to initialize w as a vector of zeros.

```
// Builds a sequence of `size` copies of `value`
let createVector size value =
    seq {
        for _ in 0..size-1 do
            yield value
    }

let initialize_with_zeros dim =
    let w = createVector dim 0.0 |> DenseVector.ofSeq
    let b = 0.0
    w, b
```

```
let dim = 2
let mutable w, b = initialize_with_zeros dim
w, b
```

### 4.3 – Forward and Backward propagation

Now that our parameters are initialized, we can do the “forward” and “backward” propagation steps for learning the parameters.

We will implement a function `propagate` that computes the cost function and its gradient.

**Hints**:

Forward Propagation:

- You get X
- You compute \(A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, \dots, a^{(m-1)}, a^{(m)})\)
- You calculate the cost function: \(J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})\right]\)

Here are the two formulas you will be using:

\( \frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T\tag{7}\)

\( \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})\tag{8}\)
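
As a sketch of where equations (7) and (8) come from, the chain rule can be applied to a single example, using the standard identity \(\sigma'(z) = \sigma(z)(1-\sigma(z))\):

\( \frac{\partial \mathcal{L}}{\partial a^{(i)}} = -\frac{y^{(i)}}{a^{(i)}} + \frac{1-y^{(i)}}{1-a^{(i)}} \)

\( \frac{\partial a^{(i)}}{\partial z^{(i)}} = a^{(i)}(1-a^{(i)}) \)

\( \frac{\partial \mathcal{L}}{\partial z^{(i)}} = \frac{\partial \mathcal{L}}{\partial a^{(i)}} \cdot \frac{\partial a^{(i)}}{\partial z^{(i)}} = -y^{(i)}(1-a^{(i)}) + (1-y^{(i)})\,a^{(i)} = a^{(i)} - y^{(i)} \)

Since \(z^{(i)} = w^T x^{(i)} + b\), we have \(\frac{\partial z^{(i)}}{\partial w} = x^{(i)}\) and \(\frac{\partial z^{(i)}}{\partial b} = 1\). Averaging \((a^{(i)} - y^{(i)})\,x^{(i)}\) over the \(m\) examples gives equation (7), and averaging \(a^{(i)} - y^{(i)}\) gives equation (8).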

```
let propagate (w:Vector<float>) (b:float) (X:Matrix<float>) (Y:Vector<float>) =
    let m = float X.ColumnCount
    // FORWARD PROPAGATION (FROM X TO COST)
    // A = sigmoid(w^T X + b): one activation per example (column of X)
    let A =
        X.LeftMultiply w
        |> Vector.map (fun x -> x + b)
        |> sigmoid
    // cost = -(1/m) * sum(Y * log(A) + (1 - Y) * log(1 - A))
    let logProbs = A.PointwiseLog().PointwiseMultiply(Y)
    let logInvProbs = Y.Map(fun y -> 1.0 - y).PointwiseMultiply(A.Map(fun a -> 1.0 - a).PointwiseLog())
    let cost = -((logProbs + logInvProbs).Sum() / m)
    // BACKWARD PROPAGATION (TO FIND GRAD)
    let dw = X * (A - Y) |> (fun x -> x / m)
    let db = (A - Y).Sum() / m
    (dw, db), cost
```

```
w <- vector[ 1.0; 2.0 ]
b <- 2.0
// Each column of X is one example
let X = matrix[ [ 1.0; 2.0 ]
                [ 3.0; 4.0 ] ]
let Y = vector[ 1.0; 0.0 ]
propagate w b X Y
```
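One way to gain confidence in a gradient implementation is to compare it with a finite-difference estimate. The sketch below does this with plain F# arrays on the same numbers as the cell above; it is an independent re-implementation for checking purposes, not the notebook's `propagate`, and all names in it (`activations`, `costOf`, `gradients`, `numericalDw`) are illustrative. Each inner array of `xs` holds one example, i.e. one column of `X`.

```fsharp
let sigmoidScalar (z: float) = 1.0 / (1.0 + exp (-z))

// One activation per example: a = sigmoid(w^T x + b)
let activations (w: float[]) (b: float) (xs: float[][]) =
    xs |> Array.map (fun x -> sigmoidScalar (b + Array.sum (Array.map2 (*) w x)))

// Mean cross-entropy cost over all examples
let costOf (w: float[]) (b: float) (xs: float[][]) (ys: float[]) =
    Array.map2 (fun a y -> -(y * log a) - (1.0 - y) * log (1.0 - a)) (activations w b xs) ys
    |> Array.average

// Analytic gradient from equations (7) and (8)
let gradients (w: float[]) (b: float) (xs: float[][]) (ys: float[]) =
    let m = float xs.Length
    let errors = Array.map2 (-) (activations w b xs) ys   // a - y per example
    let dw =
        w |> Array.mapi (fun j _ ->
            (Array.map2 (fun (x: float[]) e -> x.[j] * e) xs errors |> Array.sum) / m)
    let db = Array.sum errors / m
    dw, db

// Central finite difference of the cost with respect to w.[j]
let numericalDw (w: float[]) b xs ys j =
    let eps = 1e-6
    let shift delta = w |> Array.mapi (fun k wk -> if k = j then wk + delta else wk)
    (costOf (shift eps) b xs ys - costOf (shift (-eps)) b xs ys) / (2.0 * eps)

// Same numbers as the cell above: w = [1; 2], b = 2, examples (1, 3) and (2, 4)
let w0, b0 = [| 1.0; 2.0 |], 2.0
let xs = [| [| 1.0; 3.0 |]; [| 2.0; 4.0 |] |]
let ys = [| 1.0; 0.0 |]
let dwCheck, dbCheck = gradients w0 b0 xs ys
printfn "analytic dw = %A, db = %f" dwCheck dbCheck
printfn "numeric  dw = [%f; %f]" (numericalDw w0 b0 xs ys 0) (numericalDw w0 b0 xs ys 1)
```

The two `dw` lines should agree to a few decimal places. Agreement is not exact because the activations here are very close to 1, which limits the precision of the finite-difference estimate.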

### 4.4 – Optimization

- We have initialized our parameters.
- We are also able to compute a cost function and its gradient.
- Now, we want to update the parameters using gradient descent.

We will now write the optimization function. The goal is to learn \(w\) and \(b\) by minimizing the cost function \(J\). For a parameter \(\theta\), the update rule is \(\theta = \theta - \alpha \text{ } d\theta\), where \(\alpha\) is the learning rate.

```
type TrainModelResult = {
    w: Vector<float>
    b: float
    dw: Vector<float>
    db: float
    costs: seq<float>
}

let train (w:Vector<float>) (b:float) (X:Matrix<float>)
          (Y:Vector<float>) (num_iterations:int)
          (learning_rate:float) (print_cost:bool) =
    let mutable costs = Seq.empty<float>
    let mutable dw = vector [0.0]
    let mutable db = 0.0
    let mutable w_internal = w
    let mutable b_internal = b
    for i in 0..num_iterations - 1 do
        // Cost and gradient calculation
        let (dw_i, db_i), cost = propagate w_internal b_internal X Y
        dw <- dw_i
        db <- db_i
        // Update rule
        w_internal <- w_internal - learning_rate * dw
        b_internal <- b_internal - learning_rate * db
        // Record the cost every 100 iterations
        if i % 100 = 0 then costs <- Seq.append costs [cost]
        // Print the cost every 100 iterations
        if print_cost && i % 100 = 0 then (printfn "Cost after iteration %i: %f" i cost)
    {
        w = w_internal
        b = b_internal
        dw = dw
        db = db
        costs = costs
    }
```

```
train w b X Y 100 0.005 false
```

The previous function outputs the learned w and b. We can use w and b to predict the labels for a dataset X. Let's implement the `predict()` function. There are two steps to computing predictions:

- Calculate $$\hat{Y} = A = \sigma(w^T X + b)$$
- Convert the entries of A into 0 (if activation <= 0.5) or 1 (if activation > 0.5), and return the prediction results

```
let predict (w:Vector<float>) (b:float) (X:Matrix<float>) =
    // Activation for each example (column of X)
    let A =
        X.LeftMultiply w
        |> Vector.map (fun x -> x + b)
        |> sigmoid
    // Threshold at 0.5 to obtain 0/1 predictions
    A |> Vector.map (fun x -> if x > 0.5 then 1.0 else 0.0)
```

```
predict w b X
```

**What to remember:**

We have implemented several functions that:

- Initialize (w,b)
- Optimize the loss iteratively to learn parameters (w,b):
  - computing the cost and its gradient
  - updating the parameters using gradient descent
- Use the learned (w,b) to predict the labels for a given set of examples

## 5 – Merge all functions into a model

You will now see how the overall model is structured by putting all the building blocks (the functions implemented in the previous parts) together, in the right order.

```
type EvaluateModelResult = {
    costs: seq<float>
    Y_prediction_test: Vector<float>
    Y_prediction_train: Vector<float>
    w: Vector<float>
    b: float
    learning_rate: float
    num_iterations: int
}

let model (x_train:Matrix<float>) (y_train:Vector<float>)
          (x_test:Matrix<float>) (y_test:Vector<float>)
          (num_iterations:int) (learning_rate:float) (print_cost:bool) =
    // One weight per input feature (64 * 64 * 3 = 12288 for our pictures)
    let w0, b0 = initialize_with_zeros x_train.RowCount
    // Gradient descent
    let results = train w0 b0 x_train y_train num_iterations learning_rate print_cost
    let costs = results.costs
    let w, b = results.w, results.b
    // Predict test/train set examples
    let Y_prediction_test = predict w b x_test
    let Y_prediction_train = predict w b x_train
    // Print train/test accuracy: 100 * (1 - mean absolute error between predictions and labels)
    printfn "train accuracy: %f %%" (100.0 * (1.0 - (Y_prediction_train - y_train |> Vector.map (fun x -> abs x) |> Vector.toArray |> Array.average)))
    printfn "test accuracy: %f %%" (100.0 * (1.0 - (Y_prediction_test - y_test |> Vector.map (fun x -> abs x) |> Vector.toArray |> Array.average)))
    {
        costs = costs
        Y_prediction_test = Y_prediction_test
        Y_prediction_train = Y_prediction_train
        w = w
        b = b
        learning_rate = learning_rate
        num_iterations = num_iterations
    }
```

```
let first_model = model train_x train_y test_x test_y 2000 0.005 true
```
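The accuracy printed by `model` relies on the fact that, for 0/1 labels, the mean absolute difference between predictions and labels equals the fraction of wrong predictions. Here is a minimal sketch of that calculation on plain arrays; the `accuracy` helper and the sample values are hypothetical and only serve as an illustration.

```fsharp
// Accuracy as computed in `model`: 100 * (1 - mean absolute difference) for 0/1 labels.
let accuracy (predictions: float[]) (labels: float[]) =
    let meanAbsError =
        Array.map2 (fun p y -> abs (p - y)) predictions labels
        |> Array.average
    100.0 * (1.0 - meanAbsError)

// Three of the four hypothetical predictions match their labels.
printfn "%.1f %%" (accuracy [| 1.0; 0.0; 1.0; 1.0 |] [| 1.0; 0.0; 0.0; 1.0 |])  // prints 75.0 %
```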

```
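// Example of a picture that was wrongly classified
// train_x was normalized in place, so scale the values back to the 0-255 range before drawing
showPicture (train_x * 255.0) 1 "cat_1_.bmp"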
```

Let’s also plot the cost function.

```
#load "XPlot.Plotly.Paket.fsx"
#load "XPlot.Plotly.fsx"
open XPlot.Plotly
```

```
first_model.costs
|> Seq.toArray
|> Chart.Line
|> Chart.WithLayout (Layout(title = "Learning rate=0.005"))
|> Chart.WithXTitle("iterations (per hundreds)")
|> Chart.WithYTitle("cost")
```

**Interpretation**:

You can see the cost decreasing. It shows that the parameters are being learned. However, you see that you could train the model even more on the training set. Try to increase the number of iterations in the cell above and rerun the cells. You might see that the training set accuracy goes up, but the test set accuracy goes down. This is called overfitting.

## 6 – Further analysis

Now that we have built our first image classification model, let’s analyze it further, and examine possible choices for the learning rate $\alpha$.

#### Choice of learning rate

**Reminder**:

In order for Gradient Descent to work you must choose the learning rate wisely. The learning rate $\alpha$ determines how rapidly we update the parameters. If the learning rate is too large we may “overshoot” the optimal value. Similarly, if it is too small we will need too many iterations to converge to the best values. That’s why it is crucial to use a well-tuned learning rate.
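
To see these effects in isolation, here is a minimal sketch that applies the update rule to the toy cost \(J(\theta) = \theta^2\), whose gradient is \(2\theta\) and whose minimum is at 0. This one-dimensional cost and the `descend` helper are illustrative stand-ins, not the model's actual cost.

```fsharp
// Gradient descent on the toy cost J(theta) = theta^2 (gradient: 2 * theta).
let descend (learningRate: float) (steps: int) (start: float) =
    let mutable theta = start
    for _ in 1 .. steps do
        let dTheta = 2.0 * theta                // gradient of theta^2
        theta <- theta - learningRate * dTheta  // update rule: theta = theta - alpha * dtheta
    theta

for alpha in [ 0.01; 0.1; 1.1 ] do
    printfn "alpha = %.2f -> theta after 20 steps: %f" alpha (descend alpha 20 1.0)
```

Each step multiplies \(\theta\) by \(1 - 2\alpha\). With \(\alpha = 0.01\) the parameter has only moved part of the way towards 0 after 20 steps, with \(\alpha = 0.1\) it is already close to 0, and with \(\alpha = 1.1\) every step overshoots the minimum and the magnitude of \(\theta\) grows.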

Let’s compare the learning curve of our model with several choices of learning rates. Run the cell below. This should take about 1 minute. Feel free also to try values other than the three we have initialized the `learning_rates` variable with, and see what happens.

```
let learning_rates = [| 0.01; 0.001; 0.0001 |]
let mutable models = Seq.empty<EvaluateModelResult>
for i in learning_rates do
    printfn "learning rate is: %f" i
    let model = model train_x train_y test_x test_y 1500 i false
    models <- Seq.append models [model]
    printfn "-------------------------------------------------------"

// Pairs each recorded cost with its 1-based position in the sequence
let extract_key_value (x_seq:seq<float>) =
    x_seq |> Seq.mapi (fun i value -> (i + 1, value))

let labels = learning_rates |> Seq.map (fun x -> "learning rate: " + string x)

// One line per learning rate
models
|> Seq.map (fun x -> extract_key_value x.costs)
|> Chart.Line
|> Chart.WithLabels(labels)
|> Chart.WithLegend(true)
|> Chart.WithXTitle("iterations (per hundreds)")
|> Chart.WithYTitle("cost")
```

**Interpretation**:

- Different learning rates give different costs and thus different prediction results.
- If the learning rate is too large (0.01), the cost may oscillate up and down. It may even diverge (though in this example, using 0.01 still eventually ends up at a good value for the cost).
- A lower cost doesn’t mean a better model. You have to check whether the model is overfitting, which happens when the training accuracy is a lot higher than the test accuracy.
- In deep learning, we usually recommend that you:
  - Choose the learning rate that better minimizes the cost function.
  - If your model overfits, use other techniques to reduce overfitting. (We’ll talk about this in later posts.)