The problem of

classification is core to machine learning; given a dataset of labelled examples, e.g.

images of known digits, can a model be learned so that future instances can be classified automatically? Whether dealing with vision, audio, or natural language, classification problems are abundant and have many practical applications. It is not surprising that researchers are always coming up with new ways of learning and modeling data to better tackle classification.

Support vector machines (SVMs) are a commonly used method for solving such problems that performs very well; as such, algorithms for learning SVMs and the theory behind SVMs are both hot topics in machine learning research. One of the most interesting such algorithms (in my opinion), is

Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, which exhibits a unique and counter-intuitive relationship between classification error, training set size, and runtime.

Before I discuss the ideas behind Pegasos, first let me present some background on classification problems and SVMs. The formal definition of the classification problem is as follows. Suppose you are given a training set $S = \{(\textbf{x}_i, y_i)\}_{i=1}^m$ of size $m$ where $\textbf{x}_i \in \mathbb{R}^n$ are feature vectors and $y_i \in \{-1, +1\}$ is the classification label (for simplicity, we are assuming a

binary classification problem which SVMs are most naturally suited for). For example, each $\textbf{x}_i$ might be the representation of an image after

whitening, and $y_i$ might indicate whether the image contains a mug or not. We can think of the training examples as being drawn from some unknown probability distribution $P(\textbf{X}, Y)$ that we want to estimate. The goal is to find a classifier $h: \mathbb{R}^n \rightarrow \{-1, +1\}$ which minimizes $\Pr[h(\textbf{x}) \neq y]$ for $(\text{x}, y)$ drawn from $P$, i.e. the probability of a classification error given a random sample from the true distribution. Intuitively, a learning algorithm will attempt to build a model for $P$ using $S$ as a representative sample of the distribution.

The idea behind SVMs is to model the classifier $h$ as a separating

hyperplane in $\mathbb{R}^n$ where points on one side of the hyperplane are classified as $-1$ and points on the other side are classified as $+1$. One way of representing this is to let $h(\textbf{x}) = \text{sign}(\langle \textbf{w}, \textbf{x} \rangle)$ for some predictor $\textbf{w} \in \mathbb{R}^n$. Then we can form the optimization problem typically used when learning a SVM:

$$\min_{\textbf{w}} \frac{1}{m} \sum_{(\textbf{x}, y) \in S} \ell(\textbf{w}; (\textbf{x}, y)) + \frac{\lambda}{2} ||\textbf{w}||^2$$

where $\ell(\textbf{w}; (\textbf{x}, y)) = \max\{0,~1 - y \langle \textbf{w}, \textbf{x} \rangle\}$ is the

hinge loss. The hinge loss will be zero if the inner product is sufficiently far in the same direction as $y$, and it increases linearly with the inner product for training samples that are "close" to the hyperplane or on the wrong side. The first term in the optimization function represents the error for the classifier, while the second term is for

regularization, i.e. adding a penalty to large values of $\textbf{w}$ to prevent

overfitting the training set.

One of the most useful features of SVMs is that the optimization problem can be solved without depending on the representation of the data itself, but only on the inner product between $\textbf{w}$ and the $\textbf{x}_i$'s. This leads to the idea of the

kernel trick for SVMs, which means that instead of operating on the raw representation of the data that may not be

linearly separable, we map the data into a high dimensional feature space. This space may potentially even be infinite, e.g. with a

Gaussian kernel, but as long as the inner product in that space is easy to compute the optimization problem can be solved. An important detail to note is that, when applying the kernel trick, the vector $\textbf{w}$ also lives in the high dimensional feature space. But thanks to the

Karush-Kuhn-Tucker conditions, it is known that $\textbf{w}$ can always be represented as a linear combination of the $\textbf{x}_i$'s, so we can represent $\textbf{w}$ as a set of coefficients. Thus the kernel trick allows SVMs to efficiently perform nonlinear classification of data and makes them much more powerful.

Researchers have been working on solving the SVM optimization problem as efficiently as possible since it was introduced. There have been a number of techniques developed, including

interior point methods, decomposition methods such as the

SMO algorithm, and

gradient methods. Typically, interior point methods have runtimes which grow as $m^3$ while other algorithms can be anywhere from quadratic to linear in $m$ (the interior point methods compensate by having much weaker dependence on the accuracy of the solution). Pegasos is a

stochastic subgradient method which falls roughly into the family of gradient methods, but it turns out that the runtime necessary for it to achieve a fixed accuracy not dependent on $m$ (in the case of a linear kernel). This might seem strange at first, but if you think about it, adding more training examples should make it easier for a learning algorithm to achieve the same accuracy. We'll see next time how Pegasos works at a high level and the implications of the algorithm on the relationship between training set size and runtime.