Support Vector Machines are a popular type of Supervised Learning algorithm. They perform well on smaller datasets and are thus a viable alternative to data-hungry algorithms such as deep neural networks when training data is limited. In Scikit-Learn they can be used for both Classification and Regression tasks.

In this post, the concepts behind how they work with a Linear Kernel are explained, including Margins, Support Vectors, and Kernels. Examples are given for both binary and multi-class classification. More complexity is added with Pipelines and Randomised Grid Search Cross-Validation in the later examples.

## How does an SVM work?

In the fundamental case, Binary Classification, we have 2 classes and are attempting to find a boundary between them. The basic idea is that a boundary must be found such that the margin between 2 vectors called Support Vectors is maximised.

The figure immediately below illustrates these elements. There are two clearly separable classes, denoted by different shades of blue, plotted against the generic features X1 and X2.

The red dots represent points used to form the support vectors (the dotted lines). The solid line at the center of the support vectors is the classification boundary.

### Support Vectors

These support vectors are so called because they are vectors of points used to form the margin which creates the boundary between classes. An important property is that they are affected only by the points near the margin. In other words, moving any point that does not help form the margin and the support vectors will not affect the location of the support vectors or that margin.

Note that this is not necessarily always the case. For instance, when using an rbf, poly or sigmoid kernel, the hyperparameter gamma can be used to adjust how much the points further away affect the margin shape.

### The Margin and Regularisation

Maximising the margin aids in generalising to new data. In Scikit-Learn the hyperparameter C controls this and can “soften” the margin such that there is more of a mix of data on both sides of the boundary and the margin to improve generalisation. Smaller values of C give a larger margin. (A silly analogy that might help in remembering this is to imagine the margin as a balloon attached to an air pump shaped like a C. As you press down on the C/pump making the C smaller the margin/balloon increases in size as air is pushed into it. I always have trouble remembering this so I made this up to help. :) )

This is because in real data it is often the case that there is some intermingling (see the plot for the Breast Cancer Dataset) of the points in the classes and they are not as cleanly separated as shown in the previous diagram. Smaller values of C increase the “softness.”
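A quick sketch of this effect, using a synthetic two-blob dataset (not data from this post): as C shrinks, the margin softens, more points fall inside it, and more of them become support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic blobs; random_state is arbitrary.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# A small C softens the margin, so more points sit inside it and
# become support vectors; a large C hardens it.
soft = SVC(kernel="linear", C=0.01).fit(X, y)
hard = SVC(kernel="linear", C=100).fit(X, y)

print(len(soft.support_vectors_), len(hard.support_vectors_))
```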

### Sensitivity to Feature Scaling

Feature Scaling affects the size of margins, and thus generalisation ability. Thus a feature scaling preprocessing step like the StandardScaler() should be applied with a pipeline when using SVMs.
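A minimal sketch of such a pipeline on the Breast Cancer data, assuming a simple train/test split with random_state=0 (the split details are an illustrative assumption, not part of this post's example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training data only,
# then applies the same transform at prediction time.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```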

### Kernels

There are several kernels that can be used with SVMs. Different ones affect the way the data is classified and the shapes of the boundaries. The default is rbf. Other options are linear, poly, sigmoid and precomputed. A linear kernel is used in this post.

RBF kernels wind themselves around the data, and poly kernels create bowl like shapes similar to a polynomial function. The diagram below shows how varied the results they produce can be. Experimentation can help determine which performs the best for a particular dataset.
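One hedged way to run such an experiment on the Breast Cancer data is to compare mean cross-validated accuracy per kernel after scaling; the exact scores will vary by Scikit-Learn version and data split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Mean 5-fold accuracy for each kernel, after scaling.
for kernel in ("linear", "rbf", "poly", "sigmoid"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    score = cross_val_score(model, X, y, cv=5).mean()
    print(kernel, round(score, 3))
```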

### Performance

The performance of the algorithm largely depends on the type of kernel and the amount of data used. Linear kernels train very quickly compared to polynomial and RBF kernels which can take considerably longer on the same volume of data.

### Math

The equation of the boundary looks quite similar to the equation of the line for Linear Regression, with slight alterations: $f(X) = w \cdot X + b$.

Here, $w$ represents a vector of weights which acts as the coefficients for the corresponding $X$ feature values. $b$ is the y-intercept.

A sign function applied to this score is used for classification. If the output is greater than 0 that means there is a positive result for a particular class.

For example, the weights for the diagram below for the Breast Cancer Dataset are $w_1 = -0.67104362$ and $w_2 = -0.13771607$, and the intercept is $b = 12.66432746$. To classify a point with feature values of 20, 30 (the star in the diagram), substitute these into the boundary equation.

Substituting the feature values for the new point gives $-0.67104362 \times 20 - 0.13771607 \times 30 + 12.66432746 \approx -4.888$.

Positive in this case means benign (1) and negative means malignant (0). Since the value is less than 0 the sample is malignant.
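The same arithmetic can be checked with a few lines of NumPy, using the weights and intercept quoted above:

```python
import numpy as np

w = np.array([-0.67104362, -0.13771607])  # weights from above
b = 12.66432746                           # intercept from above

x_new = np.array([20, 30])    # the starred point
score = np.dot(w, x_new) + b  # w . x + b
print(score)                  # approximately -4.888, negative -> malignant (0)
```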

## Basic Binary Classification

### Imports

First, import the Support Vector Classifier SVC and the built-in standard toy datasets from Scikit-Learn. The toy datasets are efficient options for learning and quick prototyping because, unlike raw data, they do not have to be preprocessed before use.

Matplotlib should also be imported since visualisation is critical to understanding and evaluating a model. If using a Jupyter Notebook, %matplotlib notebook will make plots interactive.
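The imports described above look like this (restricted to only what this section names):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# In a Jupyter Notebook, also run `%matplotlib notebook` for interactivity.
```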

### Load and Explore the Data

Next, we load the Breast Cancer Wisconsin (Diagnostic) toy dataset from Scikit-Learn. breast_cancer.data gives the feature data and breast_cancer.target gives the labels for that data.
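The loading step can be sketched as:

```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
X = breast_cancer.data    # feature matrix, shape (569, 30)
y = breast_cancer.target  # labels: 0 = malignant, 1 = benign
print(X.shape)
```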

According to breast_cancer.DESCR, this dataset is from the UCI repository. (Quick Tip: Use print(breast_cancer.DESCR) to make the output easier to read.) It has 569 observations and 30 features in total. The target shows whether the masses examined were malignant (0) or benign (1).

breast_cancer.feature_names shows all the feature names for the masses examined.

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension'],
dtype='<U23')


Loading the data into a Pandas DataFrame outputs the data in an easier to read format. A quick glance at the scales of the numbers in the data suggests that applying a Scaling preprocessor with a Pipeline would be a good option.
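One straightforward way to build that DataFrame, taking the column names from the dataset itself:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

# Feature names come straight from the dataset bunch.
df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
print(df.head())
```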

mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compactness error concavity error concave points error symmetry error fractal dimension error worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

### Train the Model

Next, carry on following the typical Scikit-Learn pattern and instantiate the classifier. Recall that the kernel hyperparameter can take several different values. The default is rbf or Radial Basis Function which can produce non-linear boundaries. For this demonstration a linear kernel is used. C is the hyperparameter used to control regularisation. The lower it is the higher the regularisation effect as mentioned previously.

Then fit the data with fit().
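A runnable sketch of these steps, restricted to the first two features (mean radius and mean texture) to match the two-weight model discussed earlier; the train/test split details here are an assumption, as the post does not show them:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

breast_cancer = load_breast_cancer()
X = breast_cancer.data[:, :2]  # mean radius and mean texture only
y = breast_cancer.target

# Assumed split; the post does not show how the data was divided.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear", C=1, random_state=0)
clf.fit(X_train, y_train)
```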

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=0, shrinking=True,
tol=0.001, verbose=False)


#### Support Vectors

The support_vectors_ attribute of the trained model contains the support vectors themselves: the training points that define the margin. 152 of the 569 data points were support vectors in this case.

array([[ 17.99,  10.38],
[ 11.42,  20.38],
[ 12.45,  15.7 ],
[ 13.71,  20.83]])

(152, 2)


### Metrics

The model scores quite well on the test set. This score ranges in value from 0-1, where 1 is best.

0.88497652582159625


The Mean Squared Error is similarly low at 0.097902097902097904.

0.097902097902097904
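Both numbers can be reproduced with a sketch like the following (the exact values depend on the train/test split, which the post does not show; note that for 0/1 labels the MSE of the predictions is just the misclassification rate):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

breast_cancer = load_breast_cancer()
X = breast_cancer.data[:, :2]  # the two plotted features
y = breast_cancer.target

# Assumed split; the post does not show how the data was divided.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel="linear", C=1, random_state=0).fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)                  # mean accuracy, 0-1
mse = mean_squared_error(y_test, clf.predict(X_test)) # fraction misclassified
print(accuracy, mse)
```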


### Visualise the Data

Plotting the data further explains the high scores. The classes are largely separated except for a handful of data points.
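A simple scatter plot of the two features, coloured by class, can be sketched as follows (colour map and styling are illustrative choices; in a notebook the figure renders inline):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
X = breast_cancer.data[:, :2]
y = breast_cancer.target

# Colour the points by class (0 = malignant, 1 = benign).
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="Blues", edgecolors="k")
plt.xlabel(breast_cancer.feature_names[0])  # mean radius
plt.ylabel(breast_cancer.feature_names[1])  # mean texture
```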

## Binary Classification with Randomised Grid Search and a Pipeline

### Overview

This example builds on the prior one by adding complexity to the training process. Firstly, all 30 features from the Breast Cancer dataset are used. Feature Selection can thus be incorporated to increase efficiency and accuracy. This also means Feature Scaling is useful to account for the different value ranges in the data.

A slight drawback of using all the features is that they cannot be graphed. At least not without a technique like Principal Component Analysis, which in itself has the drawback of merging features, removing their names, and giving them generic ones. This can take away from understanding the data.

Randomised Grid Search Cross-Validation adds in the benefits of Cross-Validation. Cross-Validation gives a more realistic average score over the different folds of the dataset, improving on the Train/Test Split score. Randomised Grid Search automatically finds the best parameters, if desired, for both the preprocessing steps (Feature Selection and Scaling) and the model (aka the hyperparameters).

All these working parts are pulled together by using a Pipeline which is passed into the Grid Search function along with the range of variable values to consider.

### Imports

Firstly, some additional imports are needed.
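These are the additional imports implied by the steps that follow (the lowercase step names in the pipeline output further down suggest make_pipeline was used):

```python
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
```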

### Model and Data

Next, the classifier is instantiated as before with a linear kernel. X is reset to the original full dataset with X = breast_cancer.data.

Next comes the interesting part with the Pipeline and Grid Search. A summary of the code considered is shown below with explanations immediately following.

#### The Parameter Grid

The param_dist dictionary is populated with the parameters to search over for Feature Selection and the value of C, where C controls Regularisation. SelectKBest is a type of Univariate Feature Selection, where the best performing features are selected to train the final model. K denotes the number of features to select.
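A sketch of such a dictionary (the exact ranges searched in the post are not shown, so these values are illustrative; the keys follow Scikit-Learn's "stepname__parameter" convention for addressing pipeline steps):

```python
# Illustrative search space; the post's exact ranges are not shown.
param_dist = {
    "selectkbest__k": range(1, 31),  # how many of the 30 features to keep
    "svc__C": [0.1, 1, 10, 100],     # candidate regularisation strengths
}
```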

#### The Pipeline

Then, the pipeline, pipe, is created taking the feature selector, scaler, and instantiated model as parameters.

These are all passed in to the Randomised Grid Search function along with the param_dist and some other parameters for the search. The number of Cross-Validation folds cv is set to 10. The Mean Squared Error is used for scoring in this case. n_iter is set to 10 to limit the number of combinations of parameters searched over to 10. Finally, setting random_state to any number makes the results reproducible for that specific number.
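Putting these pieces together, a runnable sketch (the param_dist values are illustrative; the step order is feature selection, then scaling, then the classifier):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Illustrative search space; the post's exact ranges are not shown.
param_dist = {"selectkbest__k": range(1, 31), "svc__C": [0.1, 1, 10]}

# Select features, scale them, then classify.
pipe = make_pipeline(SelectKBest(), StandardScaler(),
                     SVC(kernel="linear", random_state=0))

search = RandomizedSearchCV(pipe, param_distributions=param_dist,
                            cv=10, scoring="neg_mean_squared_error",
                            n_iter=10, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```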

#### Fitting the Model

Now, the model can be fit/trained with the data.

#### Compute Time

Note that RandomizedSearchCV can be quite time-consuming depending on the processing power available. Thus, though linear kernels are typically quite fast for Support Vector Machines, adding in this extra complexity can make the process comparably lengthy. Higher values for cv and n_iter will increase compute time.

#### Experimentation

This is a good opportunity to do some experimentation with variables to see which models perform best. Some ideas include altering the values of C, removing preprocessing steps (Feature Selection and Scaling), changing the number of Cross-Validation folds, and modifying n_iter. (Note that C is a float so decimal values can also be searched through.)

For example, the score was 0.0298769771529 with scaling vs 0.0439367311072 without for a particular run. This is a small difference, but it confirms that feature scaling affects the performance of SVCs.

### Performance

RandomizedSearchCV has some attributes of the trained model (denoted by the underscore appended to the end of the attribute) which can be used to check the best score, best parameters, and the best estimator found in the search.

The best_score_ shows the best Cross-Validation score for the best parameters and model. The error here is very low. A negative sign appears because it is an error, and the convention in Scikit-Learn is to negate errors so that higher scores are always better. Thus a value further below zero (tending to negative infinity) means a higher error and a value closer to 0 means a lower error.

Note that the additional complexity has reduced the error from the previous example by 0.0697825899934979 or 71.3%.

0.0281195079086


best_params_ shows the best parameters chosen from the range specified in param_dist. Changing random_state in RandomizedSearchCV may alter these values. The regularisation parameter C was set to 1 and 20 of the 30 features were selected.

{'svc__C': 1, 'selectkbest__k': 20}


best_estimator_ shows the parameters for all the components of the pipeline that were used to train the best model.

Pipeline(memory=None,
steps=[('selectkbest', SelectKBest(k=20, score_func=<function f_classif at 0x10f1da730>)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=0, shrinking=True,
tol=0.001, verbose=False))])


## Multi-Class Classification

Multi-Class Classification is done using LinearSVC(). This is implemented with liblinear rather than the libsvm implementation behind SVC(), and thus can produce different results.

For Multi-Class Classification it uses a one-vs-the-rest technique. Essentially there is a conversion of the data into a set of Binary Classification problems where each class is separated from the others as though all the other classes were merged into one.

This is repeated for all classes until each class has its own boundary. At prediction time, a new data point is compared to all the boundaries and the new point is assigned to the class with the highest score.
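The mechanics can be seen on a small synthetic three-class dataset (make_blobs here is purely illustrative): decision_function returns one score per class, and the prediction is the class with the highest score.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Three synthetic classes just to illustrate one-vs-the-rest.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

clf = LinearSVC(random_state=0).fit(X, y)

# One boundary per class: one score per class for each sample.
scores = clf.decision_function(X[:1])
print(scores.shape)  # (1, 3)
print(clf.predict(X[:1])[0], scores.argmax())
```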

A similar workflow will be followed to classify multiple classes as done in the previous section, so explanations will be kept minimal.

### Load and Explore the Data

The wine dataset is another toy dataset which comes with Scikit-Learn. It is new in version 0.19, so an update may be required to access it. To start exploring, import the dataset with load_wine(), then set X and y.
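The loading step mirrors the earlier example:

```python
from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data    # shape (178, 13)
y = wine.target  # classes 0, 1 and 2
print(X.shape)
```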

According to wine.DESCR, the dataset is also from the UCI repository. It shows the data collected via a chemical analysis of wines produced by 3 different growers in the same region in Italy. There are 178 observations with 13 features.

(178, 13)


#### Classes

It has 3 classes labelled 0, 1 and 2 in the target corresponding to generic class names of ‘class_0’, ‘class_1’, ‘class_2’ as shown with wine.target_names.

array(['class_0', 'class_1', 'class_2'],
dtype='<U7')


#### Features

The 13 features used to classify the different wines are shown below:

['alcohol',
'malic_acid',
'ash',
'alcalinity_of_ash',
'magnesium',
'total_phenols',
'flavanoids',
'nonflavanoid_phenols',
'proanthocyanins',
'color_intensity',
'hue',
'od280/od315_of_diluted_wines',
'proline']


#### Tabular data with Pandas

Again, importing the data into a Pandas DataFrame makes the data more easily digestible. Feature Scaling will again be imperative given the data ranges.

alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0

#### Visualisation

Two of the features are visualised below to aid in understanding.

### Train the model

The explanation given above for Binary Classification with Randomised Grid Search and Pipelines holds here as well, so it will not be repeated. The major differences are that LinearSVC is used instead of SVC and the range for k has to be capped at 13 because there are only 13 features.
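A runnable sketch of the multi-class version (param_dist values are illustrative, as before):

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_wine(return_X_y=True)

# k is capped at 13 because the wine data has only 13 features.
param_dist = {"selectkbest__k": range(1, 14), "linearsvc__C": [0.1, 1, 10]}

pipe = make_pipeline(SelectKBest(), StandardScaler(),
                     LinearSVC(random_state=0))

search = RandomizedSearchCV(pipe, param_distributions=param_dist,
                            cv=10, scoring="neg_mean_squared_error",
                            n_iter=10, random_state=0)
search.fit(X, y)
print(search.best_params_)
```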

#### Performance

Again the best_score_, best_params_, and best_estimator_ are shown. The MSE is very low at 0.0280898876404. 11 features were used with C=1 in the best model.

0.0280898876404

{'selectkbest__k': 11, 'linearsvc__C': 1}

Pipeline(memory=None,
steps=[('selectkbest', SelectKBest(k=11, score_func=<function f_classif at 0x10b3901e0>)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linearsvc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
verbose=0))])


## Conclusion

This post examined how to use Linear Support Vector Machines for Classification using SVC() and LinearSVC() from Scikit-Learn. Both binary (two classes) and multi-class classification (>2 classes) were covered. We started with the theory of how SVMs work including an overview of margins, support vectors and kernels.

Examples using toy datasets from Scikit-Learn were given for both types of classification. The more complex examples included Pipelines, Feature Selection, Scaling, and Randomised Grid Search.

Considering non-linear classification with RBF and polynomial kernels next would be instructive.

## References

1. Collins-Thompson, Kevin. Linear Classifiers: Support Vector Machines. Coursera, https://www.coursera.org/learn/python-machine-learning/lecture/uClaN/linear-classifiers-support-vector-machines.

2. Géron, Aurélien. “Hands-on Machine Learning with Scikit-Learn and Tensorflow.” (2017).

3. Hunter, John D. “Matplotlib: A 2D graphics environment.” Computing In Science & Engineering 9.3 (2007): 90-95.

4. McKinney, Wes. “Data structures for statistical computing in python.” Proceedings of the 9th Python in Science Conference. Vol. 445. Austin, TX: SciPy, 2010.

5. Pedregosa, Fabian, et al. “Scikit-learn: Machine learning in Python.” Journal of Machine Learning Research 12.Oct (2011): 2825-2830.

6. Pérez, Fernando, and Brian E. Granger. “IPython: a system for interactive scientific computing.” Computing in Science & Engineering 9.3 (2007).