Logistic regression

Logistic regression is used to classify things into positive case or negative case. It is a special case of linear regression which outputs value between 0 and 1, which denote the probability or the likelihood of the given sample being positive. For e.g. what is the probability of the given email being a spam?

We can then predict a positive case if the hypothesis outputs a value above a certain threshold, which normally is 0.5 but could be more or less. The ideal threshold can be derived based on the cross validation result. More on this in a later post.


As mentioned above, logistic regression is similar to linear regression, but the hypothesis always returns an output between 0 and 1. This is achieved by passing the hypothesis of the linear regression to a Sigmoid function.
Thus the hypothesis is denoted as:

hθ(x) = g(θTx) 

where g is the sigmoid function defined as:


Cost Function

The cost function for classification problems are a bit more involved due to the fact that simple squared error is not sufficient to work with the small differences here, since the max difference possible is 1, predictive 0% probability for a positive case. Therefore to magnify the differences a log scale is used.
cost function of logistic regression
Vectorized version:

J = sum(-y .* log(hypo) - (1-y).*log(1-hypo)) / -m

Derivative of cost function (gradient)

The gradient for hypothesis is the same as that of linear regression:
Derivative of cost function
The hypothesis of course is different as stated above.

Multiclass Classification

So far we have only looked at 2 distinct results from the algorithm, either negative (y=0) or positive (y=1). But what if there are more than 2 distinct possibilities? This is covered under multi class or one-vs-all classification.

The basic idea is simple. We train the classifier once per each possible outcome. For e.g. If we need to predict if the given sentence is in English, German or French, we will derive the value of θ for each.

So the θ for English will be able to predict whether the sentence is in English or not, thus changing the problem to a single class (positive/negative) classification. Similarly the θ for German and French are derived.

Then for a given sentence we will calculate the probability of it being English, German or French and return the language having the highest probability.


Linear regression

Linear regression is a fancy name to a simple technique. This algorithm (model) predicts the most likely result (y) given the input features (x). To be able to predict, the model needs (lots of) historical data with the correct output (thus it is supervised learning).

This model finds weights (θ) to be assigned to each feature, such that sum of the weighted features is closest to the given answer y.

Education Technology

Machine Learning: My notes.

I recently finished the machine learning class offered online by Stanford. It was a great experience. Since I would not be using ML any time soon, I plan to make a few blog posts to capture my learning while they are still current. This should help me recollect the concepts on a later date. If someone finds these notes useful, that is an added benefit.

Machine learning

Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed.