Gradient Descent and Linear Regression
In a classification problem, we aim to learn a function that correctly places a newly seen instance into a class based upon the instance's attributes. This is in contrast to *regression*, where we attempt to learn a real-valued function, i.e. a *model*, from attributes $x_1, \cdots, x_k$ to a target attribute $y$.
Linear models
In a linear model, the *hypothesis function* has the form

$$ h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_k x_k $$

where each $\theta_j$ is a weight, and $h_\theta(x)$ (the hypothesis) is the estimated value of $y$. If we define $x_0$ to be a constant 1, then the function can be written more succinctly:

$$ h_\theta(x) = \sum_{j=0}^{k} \theta_j x_j $$
The function $h_\theta(x)$ defines a $k$-dimensional *hyperplane*. If $k=1$, it's a line separating a 2D space into two pieces; if $k=2$, it's a 2D plane, and so on. Intuitively, each weight $\theta_j$ represents the “slope” of the plane in the $j$-th dimension.
The problem in constructing a linear model is to find weights that fit the data. Specifically, we want to minimize the squared error between each $y^{(i)}$ and $h_\theta(x^{(i)})$, the actual and predicted values for observation $i$. The term $y^{(i)} - h_\theta(x^{(i)})$ is called the *residual* of $y^{(i)}$ with respect to $h_\theta(x^{(i)})$. The problem can be cast as minimizing the *cost function* $J(\theta)$:

$$ J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2 $$
The $1/2$ term doesn't change which weights are best; it simply cancels when we differentiate, simplifying the calculations. Note that $J$ is a function of $\theta$. The space of possible values for $\theta$ defines an *error surface*, and we want to find the values for $\theta$ that correspond to the lowest point on that surface.
Gradient descent
Minimizing error is a standard objective in machine learning. For linear regression, there exists a unique *global minimum* of $J(\theta)$, and analytic techniques can be used to find it[^1]. For other functions, techniques such as *gradient descent* are more appropriate.
In gradient descent, the *gradient vector* at a given point on the error surface represents the direction of the greatest rate of increase in $J$ at that point:

$$ \nabla J = \left( \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \cdots, \frac{\partial J}{\partial \theta_k} \right) $$

Here the point is defined by specific values for $\theta_0, \cdots, \theta_k$. Intuitively, $\frac{\partial J}{\partial \theta_j}$ represents the slope of the surface in the $j$-th dimension.
The idea behind gradient descent is to adjust the current weights in the opposite direction of the gradient, thereby decreasing the error value. The weights are initialized (typically to 0), and then iteratively adjusted. The equation to update each existing weight $\theta_j$ to a new value $\theta_j'$ is:

$$ \theta_j' = \theta_j - \alpha \frac{\partial J}{\partial \theta_j} \label{eq:gradient-descent} $$
where $\alpha$ is a small real-valued constant called the *learning rate*[^2]. Note that all weights are updated in parallel. If $\nabla J = 0$ at a given point, then further updates yield no improvement; the point constitutes a *local minimum* of $J(\theta)$, and gradient descent stops when it is reached. In the context of linear regression, the cost function $J$ is *convex*, so there is only one minimum – the global minimum. In other cases, gradient descent is not guaranteed to find the global minimum.
Batch and stochastic gradient descent
In the simplest form of gradient descent, all instances in the data are examined before updates are made. This is called *batch gradient descent*. We take the average of the gradients over all instances and use that mean gradient to update the parameters.
A variant is called *stochastic gradient descent* (SGD), where a randomly chosen instance[^3] is used instead of the entire data set, and the parameters are updated using the gradient computed from that single instance. This often reduces error more quickly. The algorithm might not converge to a minimum, but the results are typically good approximations.
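As a rough sketch, an SGD version of the update loop might look like this (the function name `sgd`, its defaults, and the epoch-wise shuffling are illustrative choices):

```python
import numpy as np

def sgd(X, y, alpha=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent: update on one instance at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):    # visit instances in random order
            residual = y[i] - X[i] @ theta   # single-instance residual
            theta += alpha * residual * X[i] # step for this instance alone
    return theta
```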
Scaling
When fitting a function of several attributes, it's often the case that the attributes have very different magnitudes. This can drastically reduce the efficiency of gradient descent. As such, it's good practice to *scale* the inputs (the target is left alone). Different techniques can be used; one of the most common is the z-score, where each $x_j^{(i)}$ is replaced with

$$ x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{\sigma_j} $$

where $\mu_j$ is the mean of $x_j$ across all instances and $\sigma_j$ is the standard deviation. In practice, not standardizing the values can make tuning the gradient descent procedure more difficult.
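As a quick sketch (assuming the observations are stored in a NumPy array `X` with one row per instance):

```python
import numpy as np

# Standardize each column (attribute) to zero mean and unit variance
mu = X.mean(axis=0)     # per-attribute means
sigma = X.std(axis=0)   # per-attribute standard deviations
X_scaled = (X - mu) / sigma
```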
Implementation in Python
With the math written out, the implementation of the method is surprisingly simple. We just update the weights based on Equation $\eqref{eq:gradient-descent}$.
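For the squared-error cost defined above, differentiating gives

$$ \frac{\partial J}{\partial \theta_j} = -\sum_{i=1}^{n} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} $$

so each update pushes $\theta_j$ in proportion to the residuals, weighted by the $j$-th attribute.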
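A minimal sketch in NumPy (the helper's name, signature, and defaults are illustrative, and the division by $n$ reflects the mean-gradient form of batch gradient descent described above):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, n_iter=1000):
    """Fit linear-regression weights by batch gradient descent.

    Assumes the first column of X is all ones (the x_0 = 1 term).
    Returns the weights and the cost J after each iteration.
    """
    n = len(y)
    theta = np.zeros(X.shape[1])
    errs = []
    for _ in range(n_iter):
        residuals = y - X @ theta                # y^(i) - h_theta(x^(i))
        theta += alpha * (X.T @ residuals) / n   # step along the mean gradient
        errs.append((residuals ** 2).sum() / 2)  # current value of J
    return theta, errs
```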
To test our function, we will generate a fake dataset designed for regression problems. Scikit-learn never ceases to amaze me:
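Something along these lines does the trick (the noise level and random seed are arbitrary choices here):

```python
from sklearn.datasets import make_regression

# 500 observations, 2 informative features, a bit of noise on the target
X, y = make_regression(n_samples=500, n_features=2, noise=10, random_state=0)
```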
We have a data set with 500 observations of two features that can predict $y$ pretty well. Now let's try applying our function to this data:
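For example, prepending the constant $x_0 = 1$ column so that $\theta_0$ plays the role of the intercept (the learning rate and iteration count here are just reasonable guesses):

```python
import numpy as np

# Prepend the x_0 = 1 column so theta[0] acts as the intercept
X1 = np.hstack([np.ones((X.shape[0], 1)), X])
theta, errs = gradient_descent(X1, y, alpha=0.1, n_iter=100)
print(theta)  # intercept followed by the two feature weights
```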
Checking results
To see if these fitted coefficients are actually the ones that minimize the mean squared error, we may use the formula[^1] to compute the coefficients directly:
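In NumPy, that formula translates directly (reusing `X1` from above, whose column of ones yields the intercept):

```python
import numpy as np

# Normal equation: beta = (X'X)^(-1) X'y, solved without an explicit inverse
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta)
```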
Yep, exactly the same values! In fact, if we plot the `errs` variable, it reaches the global minimum after fewer than 20 iterations.
Using scikit-learn
Finally, we fit this multiple regression model using scikit-learn:
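Using `LinearRegression`, which fits the intercept itself, so we pass the unaugmented `X`:

```python
from sklearn.linear_model import LinearRegression

# fit_intercept=True is the default, so no column of ones is needed
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)
```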
Once again, we get the same coefficients.
[^1]: In matrix notation, if our model is $\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}$, then the weight coefficients are
    $$ \boldsymbol{\beta} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y} $$
    Note that the first column of $\boldsymbol{X}$ is a column of ones, and $\beta_0$ is the intercept term.
[^2]: If $\alpha$ is too large, then $J$ might not converge; it either increases without bound or oscillates between points. If $\alpha$ is too small, then gradient descent may take forever to converge.
[^3]: In *mini-batch gradient descent*, the two approaches are combined: we choose a random sample of the training data and use its mean gradient to update the parameters.