In linear regression, the function learned is used to estimate the value of the target $y$ using values of input $x$. While it could be used for classification purposes by setting the target value to a distinct constant for each class, it’s a poor choice for this task. The target attribute takes on a finite number of values, yet the linear model produces a continuous range.
For classification tasks, logistic regression is a better choice. As with linear regression, logistic regression learns weights for a linear equation. However, instead of using the equation to predict a target attribute’s value, it separates instances into classes.
Maths of logistic regression
Hypothesis function
Simple logistic regression works on two classes, i.e. $y=0$ or $y=1$. Rather than using $h_\theta (x) = \sum_{j=0}^k \theta_j x_j$ as the hypothesis, we use a hypothesis based on the sigmoid or logistic function, which is continuous and bounded by 0 and 1:

$$f(z) = \frac{1}{1 + e^{-z}}$$
Note that $f(z)$ increases as $z$ increases, and $f(0) = 0.5$. Multiplying $z$ by a constant affects how steep the curve is, while adding a constant to $z$ shifts the curve to the left or right. The hypothesis function used in logistic regression is:

$$h_\theta (x) = f\left(\sum_{j=0}^k \theta_j x_j\right) = \frac{1}{1 + e^{-\sum_{j=0}^k \theta_j x_j}}$$
The value of $h_\theta (x)$ is interpreted as the probability that $y=1$ for the instance $x$. Once the proper weights $\theta$ are learned, $\sum_{j=0}^k \theta_j x_j = 0$ defines the decision boundary for the classifier. Any instance $x$ where $\sum_{j=0}^k \theta_j x_j \geq 0$ falls on one side of the boundary, where $h_\theta (x) \geq 0.5$ and the prediction is $y=1$. Any instance where $\sum_{j=0}^k \theta_j x_j < 0$ falls on the other side, where the prediction is $y=0$.
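To make the hypothesis and the decision boundary concrete, here is a minimal NumPy sketch. The function names, the weights, and the sample instances are made up for illustration, and each row of `X` is assumed to start with $x_0 = 1$ so that $\theta_0$ acts as the intercept.

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) for each row of X (rows are assumed to start with x_0 = 1)."""
    return sigmoid(X @ theta)

def predict(theta, X):
    """Class 1 where theta . x >= 0 (i.e. h_theta(x) >= 0.5), class 0 otherwise."""
    return (X @ theta >= 0).astype(int)

# Hypothetical weights: the boundary is -1 + 2 * x_1 = 0, i.e. x_1 = 0.5.
theta = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.0],    # below the boundary
              [1.0, 0.5],    # exactly on the boundary
              [1.0, 2.0]])   # above the boundary

print(hypothesis(theta, X))  # probabilities of class 1
print(predict(theta, X))     # prints [0 1 1]
```

The instance that sits exactly on the boundary gets a probability of 0.5 and, with the $\geq$ convention used above, is assigned to class 1.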
Cost function
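In this context, using $h_\theta$ in the cost function used for linear regression results in a function that’s not convex. To ensure a convex function, the following function is used:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\theta (x^{(i)}) + (1 - y^{(i)}) \log \left(1 - h_\theta (x^{(i)})\right) \right]$$

Here $m$ is the number of training instances and $(x^{(i)}, y^{(i)})$ is the $i$-th instance. This is actually the combination of two functions. Given one instance $(x, y)$,

$$\text{cost}(h_\theta (x), y) = \begin{cases} -\log (h_\theta (x)) & \text{if } y = 1 \\ -\log (1 - h_\theta (x)) & \text{if } y = 0 \end{cases}$$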
The idea is that if $y=1$, then the cost function approaches 0 as $h_\theta(x)$ approaches 1, and as $h_\theta (x)$ approaches 0, the cost function increases. The case is similar when $y=0$. In other words, if the actual and predicted values are the same, then the error is 0. However, as the predicted value moves away from the actual one, the error grows. Note how the cost is always nonnegative.
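The behaviour described above is easy to check numerically. The following is a small sketch, assuming the per-instance cost is $-\log(h_\theta(x))$ when $y=1$ and $-\log(1 - h_\theta(x))$ when $y=0$, and that the overall cost is the average over instances. The function names are illustrative.

```python
import numpy as np

def instance_cost(h, y):
    """Cost for one instance: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

def cost(h, y):
    """Average cost over all instances, written as one expression."""
    h = np.asarray(h, dtype=float)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# The cost shrinks as the prediction approaches the true label and grows as it moves away.
for h in (0.99, 0.5, 0.01):
    print(f"h = {h:4.2f}   cost if y=1: {instance_cost(h, 1):.3f}   "
          f"cost if y=0: {instance_cost(h, 0):.3f}")
```

A prediction of exactly 0 or 1 makes one of the logarithms infinite, so practical code often clips $h_\theta(x)$ slightly away from both ends before taking the log.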
Gradient descent
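Given the hypothesis and the cost function, we can work out that the update for each weight $\theta_j$ in gradient descent is the same as it is in linear regression:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta (x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Here $\alpha$ is the learning rate and the sum runs over the $m$ training instances. The only difference is that $h_\theta$ is now the sigmoid-based hypothesis rather than the linear one.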
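To see where this update comes from, here is a sketch of the derivation. It assumes the cost is the average over $m$ instances of $-y \log h_\theta (x) - (1-y) \log (1 - h_\theta (x))$, with $h_\theta (x) = f(z)$ and $z = \sum_{j=0}^k \theta_j x_j$. The learning rate $\alpha$ and the $\frac{1}{m}$ scaling are conventions assumed here. The sigmoid satisfies $f'(z) = f(z)(1 - f(z))$, so $\frac{\partial h_\theta (x)}{\partial \theta_j} = h_\theta (x)(1 - h_\theta (x))\, x_j$, and for a single instance:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} \left[ -y \log h_\theta (x) - (1-y) \log (1 - h_\theta (x)) \right]
&= \left( -\frac{y}{h_\theta (x)} + \frac{1-y}{1 - h_\theta (x)} \right) h_\theta (x)(1 - h_\theta (x))\, x_j \\
&= \left( -y (1 - h_\theta (x)) + (1-y)\, h_\theta (x) \right) x_j \\
&= \left( h_\theta (x) - y \right) x_j
\end{aligned}
$$

Averaging over the $m$ instances gives $\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta (x^{(i)}) - y^{(i)} \right) x_j^{(i)}$, and gradient descent moves each $\theta_j$ a step of size $\alpha$ against this gradient.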
Implementation in Python
The code for logistic regression is very similar to the code for linear regression.
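A minimal sketch of such an implementation, assuming NumPy and plain batch gradient descent, could look like the following. The function names, the default learning rate `alpha`, the iteration count `n_iters`, and the tiny dataset are all illustrative choices rather than tuned values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_intercept(X):
    """Prepend a column of ones so that theta_0 acts as the intercept (x_0 = 1)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_logistic(X, y, alpha=0.1, n_iters=5000):
    """Learn theta by batch gradient descent on the averaged log-loss."""
    Xb = add_intercept(X)
    m, n = Xb.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(Xb @ theta)          # predicted probabilities
        gradient = Xb.T @ (h - y) / m    # (1/m) * sum_i (h_i - y_i) * x_ij
        theta -= alpha * gradient
    return theta

def predict(theta, X):
    """Predict class 1 where theta . x >= 0, class 0 otherwise."""
    return (add_intercept(X) @ theta >= 0).astype(int)

# Tiny made-up 1-D dataset: class 1 for the positive values of x.
X = np.array([[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

theta = fit_logistic(X, y)
print(theta)
print(predict(theta, X))   # prints [0 0 0 1 1 1]
```

The only change from a linear regression loop is the `sigmoid` call around `Xb @ theta`; the gradient line has the same form.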


To test this function, we generate a classification dataset using scikit-learn and compare the prediction results:
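One way to set up such a comparison is sketched below. It assumes the reference model is scikit-learn's `LogisticRegression` with default settings, uses `make_classification` for the data, and includes a compact gradient descent routine (`fit_gd`, an illustrative name) as a stand-in for the hand-written model. The dataset size, random seeds, and hyperparameters are arbitrary.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_gd(X, y, alpha=0.1, n_iters=5000):
    """Stand-in for the hand-written model: batch gradient descent on the log-loss."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x_0 = 1 for the intercept
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-(Xb @ theta)))
        theta -= alpha * Xb.T @ (h - y) / len(y)
    return theta

def predict_gd(theta, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (Xb @ theta >= 0).astype(int)

# Synthetic two-class dataset, split into training and test parts.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

start = time.perf_counter()
theta = fit_gd(X_train, y_train)
gd_time = time.perf_counter() - start
gd_acc = np.mean(predict_gd(theta, X_test) == y_test)

start = time.perf_counter()
clf = LogisticRegression().fit(X_train, y_train)
sk_time = time.perf_counter() - start
sk_acc = clf.score(X_test, y_test)

print(f"gradient descent: accuracy {gd_acc:.3f}, fit time {gd_time:.4f}s")
print(f"scikit-learn:     accuracy {sk_acc:.3f}, fit time {sk_time:.4f}s")
```

By default, scikit-learn's `LogisticRegression` applies L2 regularization and uses a different solver, so the two sets of weights are not expected to match exactly even when the accuracies do.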


The values are pretty close, although our method takes much longer to run.