Welcome to the first article of a series dedicated to learning how to build neural networks from scratch using the F# language.
We will base our articles on Andrew Ng's original tutorials, available on Coursera, but using F# instead of Python.
Today we will build a logistic regression classifier to recognize cats. We will do this with a neural network mindset, as a first step toward building more complex deep learning architectures.
If you want to know more about how logistic regression relates to neural networks, here is a link to Sebastian Raschka's Machine Learning FAQ.
We will learn to:
 Build the general architecture of a learning algorithm, including:
 Initializing parameters
 Calculating the cost function and its gradient
 Using an optimization algorithm (gradient descent)
 Gather all three functions above into a main model function, in the right order.
Why use .NET?
The .NET platform runs on many operating systems, including Windows, Linux and macOS, and also on iOS and Android through Xamarin. One could imagine training a deep learning algorithm on a cloud platform such as Azure or Amazon, then running the trained algorithm offline within a mobile app developed in Xamarin, or on any IoT device running a .NET virtual machine.
Why use F#?
Many reasons! First, I wanted to learn a new language, different from the traditional ones I already know, and functional programming attracted me. Plus, as said above, F# comes with the benefits of being a .NET language. I also have this crazy idea that by the end of this series of posts, we could end up with a simple reusable library to easily build and train deep learning algorithms. People could even contribute to the library, who knows…
And here is the main point: machine learning and deep learning are fields where developers are involved, but not only developers. Many data scientists and mathematicians are deeply interested in these fields, and they are used to functional, scripted languages that let them quickly describe and implement algorithms. I believe this is why languages such as Python and R are so widely used in the machine learning world. F# and .NET make a winning combination here: a scientist can implement algorithms in a scientist-friendly language, a developer can reuse this work in a developer-friendly language such as C# through .NET language interoperability, and end users benefit because the resulting models can easily be used on a wide range of platforms and applications.
Prerequisite:
This post assumes you already have a Jupyter notebook environment set up. If not, please refer to the Anaconda webpage.
We will use the IfSharp F# kernel to execute F# code within our Jupyter notebook. You can download a build of the latest version here. The F# Jupyter notebook corresponding to this post is hosted on my personal blog and can be downloaded here.
Importing nuget packages
In Python, once a module is installed it can be imported directly from a notebook. In the .NET world, packages need to be installed through NuGet before you can reference them in a project. The lines below download and install the FsLab NuGet packages in our notebook. FsLab is a set of packages that allow us to analyze, visualize and access data within our F# notebooks. We will mainly use the MathNet library for vector and matrix operations, and some charting functions from XPlot.Plotly. For more information about the FsLab package, you can visit the FsLab website.

    #load "Paket.fsx"

    Paket.Package
      [ "FsLab" ]

    #load "Paket.Generated.Refs.fsx"

Load Dataset
The original dataset in this assignment is in h5 file format. For simplicity, I exported it from h5 to CSV using a Python script, producing one file for training and one for testing. On each line, the first column corresponds to the label, and the following ones correspond to the pixel values flattened into a single row.
Andrew’s original Python notebook comes with a sanity check, so we can easily know whether our CSV export and the matrix built from parsing it work as expected.

    open FSharp.Data
    open FSharp.Data.CsvExtensions
    open MathNet.Numerics.LinearAlgebra

    // Returns a sequence of arrays: index 0 holds the pixel array, index 1 a one-item label array
    let parse_csv (x:CsvFile) =
        seq {
            for row in x.Rows do
                let rowValues = row.Columns |> Seq.map (fun c -> int c) |> Seq.toArray
                let pixelsValues = rowValues.[1..]
                let labelValues = rowValues.[..0]
                yield [| pixelsValues; labelValues |]
        }

    // Extract and split the parsed CSV values into train/test x matrices and y vectors
    let load_dataset (train:CsvFile) (test:CsvFile) =
        let parsed_train_rows = parse_csv train
        let parsed_test_rows = parse_csv test

        let extract_y (x:seq<int[][]>) =
            seq { for row in x do yield (float row.[1].[0]) }

        let extract_x (x:seq<int[][]>) =
            seq { for row in x do yield row.[0] |> Seq.map (fun r -> float r) |> Seq.toArray }

        // Each example becomes one column of the matrix
        let train_x = extract_x parsed_train_rows |> Seq.toArray |> DenseMatrix.ofColumnArrays
        let train_y = extract_y parsed_train_rows |> DenseVector.ofSeq
        let test_x = extract_x parsed_test_rows |> Seq.toArray |> DenseMatrix.ofColumnArrays
        let test_y = extract_y parsed_test_rows |> DenseVector.ofSeq

        train_x, train_y, test_x, test_y

    // Building our datasets
    let train_ds = CsvFile.Load("C:\\SomeFolder\\Datasets\\train.csv", ",", '"', false, true, 0)
    let test_ds = CsvFile.Load("C:\\SomeFolder\\Datasets\\test.csv", ",", '"', false, true, 0)

    let shape = fun (matrix:Matrix<float>) -> matrix.RowCount, matrix.ColumnCount

    let train_x, train_y, test_x, test_y = load_dataset train_ds test_ds

Input:

    printfn "sanity check after reshaping: "
    // F# slices are inclusive, so 0..4 selects the first five values of column 0
    train_x.[0..4,0]

Output:

    seq [17.0; 31.0; 56.0; 22.0; ...]

Expected Values:
**train_x shape**  (12288, 209) 
**train_y shape**  (1, 209) 
**test_x shape**  (12288, 50) 
**test_y shape**  (1, 50) 
**sanity check after reshaping**  [17 31 56 22 33] 
Show images
It is quite easy to plot a picture in a Python notebook, but I haven’t found an easy way to do so with our flattened pixel dataset using F#, so I am providing the code to do it manually by generating the corresponding bitmap file and outputting the result using an HTML tag.

    open System.Drawing

    let showPicture (matrix:Matrix<float>) (index:int) (filename:string) =
        let pixelVector = matrix.Column(index)
        let mutable i = 0
        let mutable line = -1
        let mutable vectorIndex = 0
        let bitmap = new Bitmap(64, 64)

        while vectorIndex < pixelVector.Count - 3 do
            // Move to the next row of the bitmap every 64 pixels
            if i % 64 = 0 then
                i <- 0
                line <- line + 1
            bitmap.SetPixel(i, line,
                            Color.FromArgb(int pixelVector.[vectorIndex],
                                           int pixelVector.[vectorIndex + 1],
                                           int pixelVector.[vectorIndex + 2]))
            vectorIndex <- vectorIndex + 3
            i <- i + 1

        bitmap.Save("C:\\SomeFolder\\images\\" + filename)
        "<img src='folderpath/" + filename + "' />" |> Util.Html |> Display

Input:

    showPicture train_x 25 "cat__25.bmp"

Output:
To represent color images, the red, green and blue channels (RGB) must be specified for each pixel, and so the pixel value is actually a vector of three numbers ranging from 0 to 255.
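To make the flattened layout concrete, here is a minimal sketch of how a pixel's three channel values can be located inside a flattened vector. It assumes the same layout that showPicture reads (pixels stored row by row, each as three consecutive R, G, B values); the names channelOffset, pixelAt and toyImage are hypothetical and not part of the notebook.

```fsharp
// Assumed layout: pixels stored row by row, each as three consecutive values (R, G, B).
let channelOffset (width: int) (x: int) (y: int) = 3 * (y * width + x)

// Returns the (R, G, B) triple of the pixel at column x, row y.
let pixelAt (flat: float[]) (width: int) (x: int) (y: int) =
    let o = channelOffset width x y
    flat.[o], flat.[o + 1], flat.[o + 2]

// A 2x2 toy image: red and green on the first row, blue and white on the second.
let toyImage =
    [| 255.0; 0.0; 0.0
       0.0; 255.0; 0.0
       0.0; 0.0; 255.0
       255.0; 255.0; 255.0 |]

let bottomLeft = pixelAt toyImage 2 0 1   // the blue pixel: (0.0, 0.0, 255.0)
```

With this layout, a 64 x 64 picture takes 64 * 64 * 3 = 12288 values, which matches the number of rows expected for train_x.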
One common preprocessing step in machine learning is to center and standardize your dataset, meaning that you subtract the mean of the whole array from each example, and then divide each example by the standard deviation of the whole array. But for picture datasets, it is simpler and more convenient, and works almost as well, to just divide every row of the dataset by 255 (the maximum value of a pixel channel).
Let’s standardize our dataset.

    let normalize_pixels (matrix:Matrix<float>) =
        let columnCount = matrix.ColumnCount - 1
        let rowCount = matrix.RowCount - 1
        for i in 0..rowCount do
            for j in 0..columnCount do
                matrix.Item(i, j) <- matrix.Item(i, j) / (float 255)
        matrix

    normalize_pixels train_x
    normalize_pixels test_x

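For comparison, the two preprocessing options described above can be sketched on a plain array. This is only an illustration under simplifying assumptions: it uses float arrays rather than MathNet matrices, the population standard deviation, and hypothetical helper names (scaleBy255, standardize) that do not appear in the notebook.

```fsharp
// Option 1: scale each 8-bit channel value into the range [0, 1].
let scaleBy255 (pixels: float[]) =
    pixels |> Array.map (fun p -> p / 255.0)

// Option 2: subtract the mean, then divide by the (population) standard deviation.
let standardize (values: float[]) =
    let mean = Array.average values
    let variance = values |> Array.averageBy (fun v -> (v - mean) ** 2.0)
    let std = sqrt variance
    values |> Array.map (fun v -> (v - mean) / std)

let sample = [| 0.0; 51.0; 102.0; 255.0 |]
let scaledSample = scaleBy255 sample          // values 0.0, 0.2, 0.4 and 1.0
let standardizedSample = standardize sample   // mean close to 0, standard deviation close to 1
```

Scaling by 255 keeps every value between 0 and 1 without needing any statistics from the dataset, which is why it is the convenient choice for images.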
3 – General Architecture of the learning algorithm
It’s time to design a simple algorithm to distinguish cat images from non-cat images.
We will build a Logistic Regression, using a Neural Network mindset. The following Figure explains why Logistic Regression is actually a very simple Neural Network!
Mathematical expression of the algorithm:
For one example \(x^{(i)}\)
\(z^{(i)} = w^T x^{(i)} + b \tag{1}\)
\(\hat{y}^{(i)} = a^{(i)} = sigmoid(z^{(i)})\tag{2}\)
\( \mathcal{L}(a^{(i)}, y^{(i)}) = - y^{(i)} \log(a^{(i)}) - (1-y^{(i)} ) \log(1-a^{(i)})\tag{3}\)
The cost is then computed by summing over all training examples:
\( J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})\tag{6}\)
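As a quick sanity check of the equations above, here is a minimal sketch on plain floats and arrays rather than MathNet types. The helper names (linear, sigmoidScalar, exampleLoss, averageCost) are hypothetical and only serve this illustration.

```fsharp
// Equation (1): z = w^T x + b for one example.
let linear (w: float[]) (x: float[]) (b: float) =
    Array.fold2 (fun acc wi xi -> acc + wi * xi) b w x

// Equation (2): the sigmoid activation.
let sigmoidScalar (z: float) = 1.0 / (1.0 + exp (-z))

// Equation (3): cross-entropy loss for one example.
let exampleLoss (a: float) (y: float) =
    -(y * log a) - (1.0 - y) * log (1.0 - a)

// Equation (6): the cost is the average loss over all examples.
let averageCost (activations: float[]) (labels: float[]) =
    Array.map2 exampleLoss activations labels |> Array.average

// With all-zero parameters, z = 0 and a = 0.5 for every example,
// so the cost equals log 2 whatever the labels are.
let zeroW = [| 0.0; 0.0 |]
let a1 = sigmoidScalar (linear zeroW [| 1.0; 3.0 |] 0.0)
let a2 = sigmoidScalar (linear zeroW [| 2.0; 4.0 |] 0.0)
let initialCost = averageCost [| a1; a2 |] [| 1.0; 0.0 |]
```

Here initialCost is log 2, roughly 0.693147, which is also the cost reported at iteration 0 in the training log further down, where the parameters start at zero.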
4 – Building the parts of our algorithm
The main steps for building a Neural Network are:
 Define the model structure (such as number of input features)
 Initialize the model’s parameters
 Loop:
 Calculate current loss (forward propagation)
 Calculate current gradient (backward propagation)
 Update parameters (gradient descent)
You often build steps 1-3 separately and integrate them into one function we call model.
4.1 – Helper functions
We will start by implementing the sigmoid helper function. As you’ve seen in the figure above, we need to compute $$sigmoid( w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$ to make predictions.

    let sigmoid (z:Vector<float>) =
        z.Map(fun x -> -x)                    // negate: -z
        |> (fun x -> x.PointwiseExp())        // e^(-z)
        |> (fun x -> 1.0 + x)                 // 1 + e^(-z)
        |> (fun x -> 1.0 / x)                 // 1 / (1 + e^(-z))

Input:

    let testResult = vector [ 0.0; 2.0 ] |> sigmoid
    printfn "sigmoid([0, 2]) = [%f, %f]" testResult.[0] testResult.[1]

Output:

    sigmoid([0, 2]) = [0.500000, 0.880797]

4.2 – Initializing parameters
We will implement parameter initialization in the cell below. In this scenario, we have to initialize w as a vector of zeros.

    let createVector size value =
        seq {
            for _ in 0..size - 1 do
                yield value
        }

    let initialize_with_zeros dim =
        let w = createVector dim 0.0 |> DenseVector.ofSeq
        let b = 0.0
        w, b

Input:

    let dim = 2
    let mutable w, b = initialize_with_zeros dim
    w, b

Output:

    (seq [0.0; 0.0], 0.0)

For image inputs, w will be of shape (num_px * num_px * 3, 1)
4.3 – Forward and Backward propagation
Now that our parameters are initialized, we can do the “forward” and “backward” propagation steps for learning the parameters.
We will implement a function propagate that computes the cost function and its gradient.
Hints:
Forward Propagation:
 You get X
 You compute \(A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, \dots, a^{(m-1)}, a^{(m)})\)
 You calculate the cost function: \(J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})\right]\)
Here are the two formulas you will be using:
\( \frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T\tag{7}\)
\( \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})\tag{8}\)
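As a short derivation sketch (not part of the original course material), both formulas follow from the chain rule applied to equations (1) to (3), using the standard identity \(\sigma'(z) = \sigma(z)(1-\sigma(z))\):

\( \frac{\partial \mathcal{L}}{\partial a^{(i)}} = -\frac{y^{(i)}}{a^{(i)}} + \frac{1-y^{(i)}}{1-a^{(i)}}, \qquad \frac{\partial a^{(i)}}{\partial z^{(i)}} = a^{(i)}(1-a^{(i)}) \)

\( \frac{\partial \mathcal{L}}{\partial z^{(i)}} = \frac{\partial \mathcal{L}}{\partial a^{(i)}} \cdot \frac{\partial a^{(i)}}{\partial z^{(i)}} = -y^{(i)}(1-a^{(i)}) + (1-y^{(i)})\,a^{(i)} = a^{(i)} - y^{(i)} \)

Since \(\partial z^{(i)} / \partial w = x^{(i)}\) and \(\partial z^{(i)} / \partial b = 1\), averaging \((a^{(i)} - y^{(i)})\,x^{(i)}\) and \((a^{(i)} - y^{(i)})\) over the \(m\) examples gives equations (7) and (8).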

    let propagate (w:Vector<float>) (b:float) (X:Matrix<float>) (Y:Vector<float>) =
        let m = X.ColumnCount

        // FORWARD PROPAGATION (FROM X TO COST)
        let A = X.LeftMultiply w |> Vector.map (fun x -> x + b) |> sigmoid
        let cost =
            A.PointwiseLog().PointwiseMultiply(Y) + Y.Map(fun y -> 1.0 - y).PointwiseMultiply(A.Map(fun a -> 1.0 - a).PointwiseLog())
            |> (fun x -> x.Sum() |> float)
            |> (fun x -> (-1.0 / float m * x))

        // BACKWARD PROPAGATION (TO FIND GRAD)
        let dw =
            X * (A - Y)
            |> (fun x -> (1.0 / float m * x))
        let db =
            A - Y
            |> (fun x -> x.Sum() |> float)
            |> (fun x -> (1.0 / float m * x))

        (dw, db), cost

Input:

    w <- vector [ 1.0; 2.0 ]
    b <- 2.0

    let X = matrix [ [ 1.0; 2.0 ]
                     [ 3.0; 4.0 ] ]

    let Y = vector [ 1.0; 0.0 ]

    propagate w b X Y

Output:

    ((seq [0.9999321585; 1.99980262], 0.4999352306), 6.000064773)

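One way to gain confidence in the gradient formulas is to compare them with a finite-difference estimate. The following is a minimal, self-contained sketch on plain arrays (no MathNet), using the same toy values as the test above. All names in it (sigmoidScalar, activationsOf, costOf, gradientsOf, numericalDw and the ...Toy values) are hypothetical and chosen so they do not clash with the notebook's own functions.

```fsharp
let sigmoidScalar (z: float) = 1.0 / (1.0 + exp (-z))

// xs.[i] is the feature array of example i; ys.[i] is its label.
let activationsOf (w: float[]) (b: float) (xs: float[][]) =
    xs |> Array.map (fun x -> sigmoidScalar (Array.fold2 (fun acc wi xi -> acc + wi * xi) b w x))

// Average cross-entropy cost over all examples.
let costOf (w: float[]) (b: float) (xs: float[][]) (ys: float[]) =
    let a = activationsOf w b xs
    Array.map2 (fun ai yi -> -(yi * log ai) - (1.0 - yi) * log (1.0 - ai)) a ys
    |> Array.average

// Analytic gradient, following equations (7) and (8).
let gradientsOf (w: float[]) (b: float) (xs: float[][]) (ys: float[]) =
    let m = float xs.Length
    let errors = Array.map2 (fun ai yi -> ai - yi) (activationsOf w b xs) ys
    let weightGradient j =
        (Array.map2 (fun e (x: float[]) -> e * x.[j]) errors xs |> Array.sum) / m
    let dw = Array.init w.Length weightGradient
    let db = Array.sum errors / m
    dw, db

// Central finite difference of the cost with respect to weight j.
let numericalDw (w: float[]) (b: float) (xs: float[][]) (ys: float[]) (j: int) =
    let eps = 1e-6
    let shift delta = w |> Array.mapi (fun k wk -> if k = j then wk + delta else wk)
    (costOf (shift eps) b xs ys - costOf (shift (-eps)) b xs ys) / (2.0 * eps)

// Same toy data as above: the columns of X are the examples (1, 3) and (2, 4).
let wToy = [| 1.0; 2.0 |]
let bToy = 2.0
let xsToy = [| [| 1.0; 3.0 |]; [| 2.0; 4.0 |] |]
let ysToy = [| 1.0; 0.0 |]

let dwToy, dbToy = gradientsOf wToy bToy xsToy ysToy
let dwNumeric0 = numericalDw wToy bToy xsToy ysToy 0
```

The analytic values should reproduce the output shown above, and dwNumeric0 should agree with dwToy.[0] to several decimal places. A large gap between the two would point to a mistake in either the cost or the gradient.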
4.4 – Optimization
 We have initialized our parameters.
 We are also able to compute a cost function and its gradient.
 Now, we want to update the parameters using gradient descent.
We will write down the optimization function. The goal is to learn \(w\) and \(b\) by minimizing the cost function \(J\). For a parameter \(\theta\), the update rule is \(\theta = \theta - \alpha \text{ } d\theta\), where \(\alpha\) is the learning rate.

    type TrainModelResult = {
        w: Vector<float>;
        b: float;
        dw: Vector<float>;
        db: float;
        costs: seq<float>
    }

    let train (w:Vector<float>) (b:float) (X:Matrix<float>) (Y:Vector<float>) (num_iterations:int) (learning_rate:float) (print_cost:bool) =
        let mutable costs = Seq.empty<float>
        let mutable dw = vector [0.0]
        let mutable db = 0.0
        let mutable w_internal = w
        let mutable b_internal = b

        for i in 0..num_iterations - 1 do
            // Cost and gradient calculation
            let results = propagate w_internal b_internal X Y
            let grads = fst results
            let cost = snd results
            dw <- fst grads
            db <- snd grads

            // Update rule
            w_internal <- w_internal - learning_rate * dw
            b_internal <- b_internal - learning_rate * db

            // Record the costs
            if i % 100 = 0 then
                costs <- Seq.append costs [cost]

            // Print the cost every 100 iterations
            if print_cost && i % 100 = 0 then
                (printfn "Cost after iteration %i: %f" i cost)

        {
            w = w_internal;
            b = b_internal;
            dw = dw;
            db = db;
            costs = costs
        }

Input:

    train w b X Y 100 0.005 false

Output:

    (seq [0.1124578971; 0.2310677468], 1.559304925), (seq [0.9015842801; 1.762508423], 0.4304620717), seq [6.000064773])

The previous function outputs the learned w and b. We can use w and b to predict the labels for a dataset X. We will now implement the predict function. There are two steps to computing predictions:

 Calculate $$\hat{Y} = A = \sigma(w^T X + b)$$

 Convert the entries of A into 0 (if activation <= 0.5) or 1 (if activation > 0.5), and return the prediction results

    let predict (w:Vector<float>) (b:float) (X:Matrix<float>) =
        let A = X.LeftMultiply w |> Vector.map (fun x -> x + b) |> sigmoid
        A |> Vector.map (fun x -> if x > 0.5 then 1.0 else 0.0)

Input:

    predict w b X

Output:

    seq [1.0; 1.0]

What to remember: We have implemented several functions that:

 Initialize (w,b)

 Optimize the loss iteratively to learn parameters (w,b):

 computing the cost and its gradient

 updating the parameters using gradient descent

 Use the learned (w,b) to predict the labels for a given set of examples
5 – Merge all functions into a model
You will now see how the overall model is structured by putting all the building blocks (the functions implemented in the previous parts) together, in the right order.

    type EvaluateModelResult = {
        costs: seq<float>;
        Y_prediction_test: Vector<float>;
        Y_prediction_train: Vector<float>;
        w: Vector<float>;
        b: float;
        learning_rate: float;
        num_iterations: int;
    }

    let model (x_train:Matrix<float>) (y_train:Vector<float>) (x_test:Matrix<float>) (y_test:Vector<float>) (num_iterations:int) (learning_rate:float) (print_cost:bool) =
        let mutable w, b = initialize_with_zeros (64 * 64 * 3)

        // Gradient descent
        let results = train w b x_train y_train num_iterations learning_rate print_cost
        let costs = results.costs
        w <- results.w
        b <- results.b

        // Predict test/train set examples
        let Y_prediction_test = predict w b x_test
        let Y_prediction_train = predict w b x_train

        // Print train/test accuracy
        printfn "train accuracy: {%f}" (100.0 * (1.0 - (Y_prediction_train - y_train |> Vector.map (fun x -> abs x) |> Vector.toArray |> Array.average)))
        printfn "test accuracy: {%f} " (100.0 * (1.0 - (Y_prediction_test - y_test |> Vector.map (fun x -> abs x) |> Vector.toArray |> Array.average)))

        {
            costs = costs;
            Y_prediction_test = Y_prediction_test;
            Y_prediction_train = Y_prediction_train;
            w = w;
            b = b;
            learning_rate = learning_rate;
            num_iterations = num_iterations;
        }

    let first_model = model train_x train_y test_x test_y 2000 0.005 true

Cost after iteration 0: 0.693147
Cost after iteration 100: 0.584508
Cost after iteration 200: 0.466949
Cost after iteration 300: 0.376007
Cost after iteration 400: 0.331463
Cost after iteration 500: 0.303273
Cost after iteration 600: 0.279880
Cost after iteration 700: 0.260042
Cost after iteration 800: 0.242941
Cost after iteration 900: 0.228004
Cost after iteration 1000: 0.214820
Cost after iteration 1100: 0.203078
Cost after iteration 1200: 0.192544
Cost after iteration 1300: 0.183033
Cost after iteration 1400: 0.174399
Cost after iteration 1500: 0.166521
Cost after iteration 1600: 0.159305
Cost after iteration 1700: 0.152667
Cost after iteration 1800: 0.146542
Cost after iteration 1900: 0.140872
train accuracy: {99.043062}
test accuracy: {70.000000}
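The accuracy figures above come from the expression 100 * (1 - mean(|prediction - label|)) used in model. Here is a minimal sketch of that computation on plain arrays; the helper name accuracyPercent and the toy values are hypothetical.

```fsharp
// Predictions and labels are both 0.0 or 1.0, so |prediction - label| is 1.0 for
// every mistake and 0.0 otherwise; its mean is therefore the error rate.
let accuracyPercent (predictions: float[]) (labels: float[]) =
    let errorRate = Array.map2 (fun p y -> abs (p - y)) predictions labels |> Array.average
    100.0 * (1.0 - errorRate)

// Three of the four toy predictions match their labels.
let toyAccuracy = accuracyPercent [| 1.0; 0.0; 1.0; 1.0 |] [| 1.0; 0.0; 0.0; 1.0 |]
// toyAccuracy = 75.0
```

Read this way, a train accuracy of 99.043062% on 209 examples corresponds to 2 misclassified pictures, and a test accuracy of 70% on 50 examples corresponds to 15.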

    // Example of a picture that was wrongly classified
    showPicture train_x 1 "cat_1_.bmp"

Let’s also plot the cost function and the gradients.

    #load "XPlot.Plotly.Paket.fsx"
    #load "XPlot.Plotly.fsx"
    open XPlot.Plotly

Input:

    first_model.costs
        |> Seq.toArray
        |> Chart.Line
        |> Chart.WithLayout (Layout(title = "Learning rate=0.005"))
        |> Chart.WithXTitle("iterations (per hundreds)")
        |> Chart.WithYTitle("cost")

Output:
Interpretation:
You can see the cost decreasing. It shows that the parameters are being learned. However, you see that you could train the model even more on the training set. Try to increase the number of iterations in the cell above and rerun the cells. You might see that the training set accuracy goes up, but the test set accuracy goes down. This is called overfitting.
6 – Further analysis
Now that we have built our first image classification model, let’s analyze it further, and examine possible choices for the learning rate $\alpha$.
Choice of learning rate
Reminder:
In order for gradient descent to work, you must choose the learning rate wisely. The learning rate $\alpha$ determines how rapidly we update the parameters. If the learning rate is too large, we may “overshoot” the optimal value. Similarly, if it is too small, we will need too many iterations to converge to the best values. That’s why it is crucial to use a well-tuned learning rate.
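The effect is easy to reproduce on a toy problem. The sketch below is an illustration only, not the cat classifier: it assumes the one-dimensional cost J(theta) = theta^2, whose gradient is 2 * theta and whose minimum is at theta = 0. The function name descend and the three rates are hypothetical choices.

```fsharp
// Applies the update rule theta = theta - alpha * dtheta a fixed number of times.
let descend (learningRate: float) (steps: int) (start: float) =
    let mutable theta = start
    for _ in 1 .. steps do
        theta <- theta - learningRate * (2.0 * theta)
    theta

let tooSmall = descend 0.001 20 1.0    // barely moves away from the starting point 1.0
let reasonable = descend 0.1 20 1.0    // ends close to the minimum at 0.0
let tooLarge = descend 1.1 20 1.0      // overshoots further on every step and diverges
```

Each step multiplies theta by (1 - 2 * alpha), so the iterates shrink toward 0 only when that factor has magnitude below 1. With alpha = 1.1 the factor is -1.2, so the value flips sign and grows at every step.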
Let’s compare the learning curve of our model with several choices of learning rates. Run the cell below. This should take about 1 minute. Feel free also to try different values than the three we have initialized the learning_rates variable to contain, and see what happens.

    let learning_rates = [| 0.01; 0.001; 0.0001 |]
    let mutable models = Seq.empty<EvaluateModelResult>
    let mutable scatters = Seq.empty<seq<int * float>>

    for i in learning_rates do
        printfn "learning rate is: %f" i
        let model = model train_x train_y test_x test_y 1500 i false
        models <- Seq.append models [model]
        printfn "------------------------------------------------------"

    // Pairs each recorded cost with its 1-based position in the sequence
    let extract_key_value (x_seq:seq<float>) =
        seq {
            let mutable key = 0
            for value in x_seq do
                key <- key + 1
                let extract_tuple = (key, value)
                yield extract_tuple
        }

    let labels = learning_rates |> Array.toSeq |> Seq.map (fun x -> "learning rate: " + string x)

    models
        |> Seq.map (fun x -> x.costs)
        |> Seq.map (fun x -> extract_key_value x)
        |> (fun x -> scatters <- Seq.append scatters x)

    scatters
        |> Chart.Line
        |> Chart.WithLabels(labels)
        |> Chart.WithLegend(true)
        |> Chart.WithXTitle("iterations")
        |> Chart.WithYTitle("cost")

Output:
learning rate is: 0.010000
train accuracy: {99.521531}
test accuracy: {68.000000}
------------------------------------------------------
learning rate is: 0.001000
train accuracy: {88.995215}
test accuracy: {64.000000}
------------------------------------------------------
learning rate is: 0.000100
train accuracy: {68.421053}
test accuracy: {36.000000}
------------------------------------------------------
Interpretation:
 Different learning rates give different costs and thus different prediction results.
 If the learning rate is too large (0.01), the cost may oscillate up and down. It may even diverge (though in this example, using 0.01 still eventually ends up at a good value for the cost).
 A lower cost doesn’t mean a better model. You have to check for possible overfitting, which happens when the training accuracy is a lot higher than the test accuracy.
 In deep learning, we usually recommend that you:
 Choose the learning rate that better minimizes the cost function.
 If your model overfits, use other techniques to reduce overfitting. (We’ll talk about this in later posts.)
Tell me, please, how can I get the original or CSV dataset?
The article has a link to GitHub where the notebook is hosted. The dataset comes along with the notebook: https://github.com/mathieuclerici/Blog_Articles_Code/tree/master/01_Logistic_Regression_Neural_Net_Mindset/Notebook/datasets
Hello, great job, congratulations and thanks for sharing! Could you recommend a book about deep learning?
I would recommend “Deep Learning” by Ian Goodfellow if you want to get a deep understanding of how deep learning algorithms work under the hood.
Then there are plenty of books on the subject, depending on which programming language you want to use (usually R or Python) and the corresponding frameworks.
I definitely recommend the Coursera deep learning course by Andrew Ng.
Thanks for your time!