Machine Learning - Lecture 15: Support Vector Machines

Chris Thornton

Introduction to max-margin classifiers

If there is a line (hyperplane) separating two sets of datapoints, we can use error-correction to work out what it is (see previous lecture).

Another approach is to maximize the weight vector's 'safety margin', i.e., its (normalized) inner product with the datapoint that comes closest to being misclassified.

This gives us the so-called maximum margin classifier.
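As a rough illustration of what is being maximized, the sketch below computes the margin of two candidate hyperplanes on a small made-up dataset. The function name `margin`, the bias term `b`, and the data values are assumptions chosen for the example, not part of the lecture.

```python
import numpy as np

def margin(w, b, X, y):
    """Signed distance from the hyperplane w.x + b = 0 to the datapoint
    that is closest to being misclassified (negative if any point is wrong)."""
    w = np.asarray(w, dtype=float)
    scores = y * (X @ w + b)           # positive when a point is on its correct side
    return scores.min() / np.linalg.norm(w)

# Hypothetical 2-D dataset with labels +1 / -1.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Two candidate weight vectors that both separate the data.
margin_a = margin([1.0, 1.0], 0.0, X, y)
margin_b = margin([1.0, 0.0], 0.0, X, y)

print(margin_a, margin_b)
```

Both candidates classify every point correctly, but the first leaves a larger safety margin, so a max-margin learner would prefer it.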

Max-margin hyperplane (linear SVM)

Non-linear SVMs

Unfortunately, we often have datasets that have no separating hyperplane.

We need to move to a non-linear solution, as we did in moving from delta-rule learning to MLPs.

Ideally, we'd like to map the data into a feature space in which we can form a separating hyperplane.

Separating data in a higher-dimensional space
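A minimal sketch of the idea, assuming a made-up one-dimensional dataset and a hand-picked feature map `phi(x) = (x, x^2)`: no single threshold separates the classes in the original space, but a hyperplane does once the squared feature is added. The helper `separable_by_threshold`, the weights `w` and the bias `b` are illustrative choices only.

```python
import numpy as np

# Hypothetical 1-D dataset: outer points are class +1, inner points class -1.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1, -1, -1, 1])

def separable_by_threshold(x, y):
    """Check whether any single threshold on x (in either orientation) splits the classes."""
    xs = np.sort(x)
    # Candidate thresholds: below all points, between neighbours, above all points.
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    for t in candidates:
        for s in (1.0, -1.0):
            if np.array_equal(np.sign(s * (x - t)), y):
                return True
    return False

def phi(x):
    """Map each 1-D point to the 2-D feature vector (x, x^2)."""
    return np.column_stack([x, x ** 2])

separable_in_data_space = separable_by_threshold(x, y)

# In feature space the hyperplane x^2 = 2.5 separates the classes.
w = np.array([0.0, 1.0])
b = -2.5
predictions = np.sign(phi(x) @ w + b)

print(separable_in_data_space, predictions)
```

The straight boundary `x^2 = 2.5` in the feature space corresponds to two cut points in the original space, which is why it can isolate the inner points.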

The kernel trick

But where do we get the features for the mapping?

We'd like them to be non-linear functions (curved boundaries are needed).

But there are infinitely many of these.

One solution is to use the so-called kernel trick.

A kernel function maps a pair of datapoints onto their inner product in the feature space (i.e., it works much like a distance function, although a larger value means the two points are more alike).

A feature space based on a kernel function has one dimension for every pair of datapoints.

Mathematical minimization can then be used to find the max-margin hyperplane in the feature-space.

The effect is to identify a non-linear (curved) boundary in the original data space.
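The following sketch shows the trick for one well-known case, the degree-2 polynomial kernel on 2-D points. It is an illustration under assumed data, not the method used in the lecture: the names `poly_kernel` and `phi` and the three datapoints are hypothetical. The kernel value computed directly in the data space matches the inner product of explicitly mapped feature vectors, so the mapping never has to be carried out.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces poly_kernel for 2-D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

# Hypothetical datapoints.
X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]])

# Matrix of kernel values, one entry for every pair of datapoints.
K_kernel = np.array([[poly_kernel(a, b) for b in X] for a in X])

# The same matrix obtained by mapping first and then taking inner products.
Phi = np.array([phi(a) for a in X])
K_explicit = Phi @ Phi.T

print(np.allclose(K_kernel, K_explicit))
```

The matrix of pairwise kernel values is all that the minimization step needs as input, which is why the features themselves never have to be listed.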

Illustration

What's really going on?

In using a kernel function, we are moving from the original data space to a space that has one dimension for every pair of original points.

Manipulating points in the feature space then has the effect of 'stretching' or 'compressing' areas of the data space.

This can be a way of 'pulling' differently classified datapoints apart, or 'pushing' same-class points together.

Getting past the hype

SVMs using kernel functions have been getting a lot of attention.

But their practical value remains unclear at this stage.

Derivation of weights for a separating hyperplane may still be best done using iterative error-correction.

Key problems with SVM/kernel method

A practical problem is the leap in complexity that results from mapping a data space containing $n$ datapoints onto an $n \times n$ dimensional inner-product space.
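To get a feel for this growth, the sketch below counts the pairwise inner products for a given number of datapoints and estimates the storage they need. The function names and the figure of 8 bytes per entry (a double-precision float) are assumptions made for the example.

```python
def pairwise_entries(n):
    """Number of inner products when every datapoint is paired with every datapoint."""
    return n * n

def storage_bytes(n, bytes_per_entry=8):
    """Rough storage for all pairwise inner products, assuming 8-byte floats."""
    return pairwise_entries(n) * bytes_per_entry

for n in (1_000, 100_000):
    print(n, pairwise_entries(n), storage_bytes(n) / 1e9, "GB")
```

Under these assumptions, a thousand datapoints need about 8 MB, while a hundred thousand need about 80 GB.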

Another problem is the kernel function itself.

With primitive data (e.g., 2D datapoints), good kernels are easy to come by.

With the forms of data we're often interested in (web pages, MRI scans etc.), finding a sensible kernel function may be much harder.

How would we go about defining a function that gives the distance between two web pages?

As usual, success depends on getting the problem into the right representation.

Summary

Max-margin classifiers can be derived by minimization.
Kernel-based SVMs
Complexity problems
The difficulty of finding good kernel functions.

Questions

In what ways might we calculate the distance (dissimilarity) between web pages?
In the SVM method, we distort the data space so as to enable simple (e.g., hyperplane-based) representation of the target function. Can the components of the distortion be viewed as genuine features?
How is generalization performance likely to be affected when the SVM produces a high degree of data-space distortion?