Linear Classification
Support Vector Machines
Support Vector Machines are a popular type of Supervised Learning algorithm. They perform well on smaller datasets and are thus a viable alternative when there is not enough data to train a data-hungry algorithm like a deep neural network. In Scikit-Learn they can be used for both Classification and Regression tasks.
In this post, concepts on how they work with a Linear Kernel are explained, including Margins, Support Vectors, and Kernels. Examples are given for both binary and multiclass classification. More complexity is added with Pipelines and Randomised Grid Search Cross-Validation in the later examples.
How does an SVM work?
In the fundamental case, Binary Classification, we have 2 classes and are attempting to find a boundary between them. The basic idea is that a boundary must be found such that the margin between two vectors, called Support Vectors, is maximised.
The figure immediately below illustrates these elements. There are two clearly separable classes, denoted by different shades of blue, formed by the generic features X1 and X2. The red dots represent points used to form the support vectors (the dotted lines). The solid line at the center of the support vectors is the classification boundary.
Support Vectors
These support vectors are so called because they are vectors of points used to form the margin which creates the boundary between classes. An important property is that they are only affected by the points near the margin. In other words, any point that does not form the margin and the support vectors will not affect the location of the support vectors or that margin.
Note that this is not necessarily always the case. For instance, when using an `rbf`, `poly` or `sigmoid` kernel, the hyperparameter `gamma` can be used to adjust how much the points further away affect the margin shape.
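The effect of `gamma` can be sketched in a few lines. This is an illustrative example rather than code from the post; the dataset and the two `gamma` values are arbitrary choices:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data: two partially overlapping clusters (illustrative only)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

# Same rbf kernel, two gamma values: gamma controls how far each
# point's influence reaches when shaping the margin
svm_low = SVC(kernel='rbf', gamma=0.01).fit(X, y)
svm_high = SVC(kernel='rbf', gamma=10).fit(X, y)

# The differing boundaries show up in the support vectors each model picks
print(len(svm_low.support_), len(svm_high.support_))
```

Plotting the two decision surfaces side by side makes the difference in boundary shape immediately visible.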
The Margin and Regularisation
Maximising the margin aids in generalising to new data. In Scikit-Learn the hyperparameter `C` controls this and can “soften” the margin such that there is more of a mix of data on both sides of the boundary and the margin, to improve generalisation. Smaller values of `C` give a larger margin. (A silly analogy that might help in remembering this is to imagine the margin as a balloon attached to an air pump shaped like a C. As you press down on the C/pump, making the C smaller, the margin/balloon increases in size as air is pushed into it. I always have trouble remembering this so I made this up to help. :) )

This is because in real data it is often the case that there is some intermingling of the points in the classes (see the plot for the Breast Cancer Dataset) and they are not as cleanly separated as shown in the previous diagram. Smaller values of `C` increase the “softness.”
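The relationship between `C` and margin softness can be demonstrated directly. This is a sketch on assumed toy data, not code from the post; with a smaller `C`, more points fall inside the wider margin, so more points typically become support vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two mildly intermingled clusters (illustrative only)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=42)

# Small C = soft, wide margin; large C = hard, narrow margin
soft = SVC(kernel='linear', C=0.01).fit(X, y)
hard = SVC(kernel='linear', C=100).fit(X, y)

# The soft-margin model relies on more support vectors
print(len(soft.support_), len(hard.support_))
```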
Sensitivity to Feature Scaling
Feature Scaling affects the size of the margins, and thus the generalisation ability of the model. A feature scaling preprocessing step like `StandardScaler()` should therefore be applied, via a pipeline, when using SVMs.
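A minimal sketch of scaling inside a pipeline might look like the following (the train/test split and its seed are assumptions for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

breast_cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.data, breast_cancer.target, random_state=0)

# Scaling happens inside the pipeline, so it is fit on the training
# data only and applied consistently at prediction time
model = make_pipeline(StandardScaler(), SVC(kernel='linear'))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Putting the scaler in the pipeline (rather than scaling up front) also prevents test data from leaking into the scaling statistics.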
Kernels
There are several kernels that can be used with SVMs. Different ones affect the way the data is classified and the shapes of the boundaries. The default is `rbf`. Other options are `linear`, `poly`, `sigmoid` and `precomputed`. A linear kernel is used in this post.
RBF kernels wind themselves around the data, and poly kernels create bowl-like shapes similar to a polynomial function. The diagram below shows how varied the results they produce can be. Experimentation can help determine which performs best for a particular dataset.
Performance
The performance of the algorithm largely depends on the type of kernel and the amount of data used. Linear kernels train very quickly compared to polynomial and RBF kernels which can take considerably longer on the same volume of data.
Math
The equation of the boundary looks quite similar to the equation of the line for Linear Regression, with slight alterations:

$$f(X) = \operatorname{sign}(w \cdot X + b)$$

Here, $w$ represents a vector of weights which acts as the coefficients for the corresponding $X$ feature values, and $b$ is the y-intercept. The sign function is used to find a score for classification: if $w \cdot X + b$ is greater than 0, there is a positive result for a particular class.
For example, the weights for the diagram below for the Breast Cancer Dataset are -0.67104362 and -0.13771607 and the intercept is 12.66432746. To classify a point with feature values of 20 and 30 (the star in the diagram), substitute the feature values into the equation:

$$(-0.67104362 \times 20) + (-0.13771607 \times 30) + 12.66432746 \approx -4.89$$

Positive in this case means benign (1) and negative means malignant (0). Since the value is less than 0, the sample is classified as malignant.
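The arithmetic can be checked with a few lines of NumPy. The weight and intercept values come from the worked example above; note that the weights enter negatively, which is what produces the malignant (negative-score) result:

```python
import numpy as np

# Weights and intercept from the worked example for the
# Breast Cancer boundary
w = np.array([-0.67104362, -0.13771607])
b = 12.66432746
x_new = np.array([20, 30])  # the starred point

score = np.dot(w, x_new) + b
label = 1 if score > 0 else 0  # 1 = benign, 0 = malignant
print(score, label)  # score is about -4.89, so label is 0 (malignant)
```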
Basic Binary Classification
Imports
First, import the Support Vector Classifier `SVC` and the built-in standard toy datasets from Scikit-Learn. The toy datasets are efficient options for learning and quick prototyping because, unlike raw data, they do not have to be preprocessed before use.

Matplotlib should also be imported, since visualisation is critical to understanding and evaluating a model. If using a Jupyter Notebook, `%matplotlib notebook` will make plots interactive.
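The import cell might look like this (a minimal sketch of the imports just described):

```python
# Classifier, toy datasets, and plotting
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
```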
Load and Explore the Data
Next, we load the Breast Cancer Wisconsin (Diagnostic) toy dataset from Scikit-Learn. `breast_cancer.data` gives the feature data and `breast_cancer.target` gives the labels for that data.
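A sketch of the loading step, using the attributes just mentioned:

```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
X = breast_cancer.data    # feature matrix: 569 observations x 30 features
y = breast_cancer.target  # labels: 0 = malignant, 1 = benign
print(X.shape, y.shape)
```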
According to `breast_cancer.DESCR`, this dataset is from the UCI repository. (Quick Tip: use `print(breast_cancer.DESCR)` to make the output easier to read.) It has 569 observations and 30 features in total. The target shows whether the masses examined were malignant (0) or benign (1).
`breast_cancer.feature_names` shows all the feature names for the masses examined.
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension'], dtype='<U23')
Loading the data into a Pandas DataFrame outputs the data in an easier-to-read format. A quick glance at the scales of the numbers in the data suggests that applying a scaling preprocessor with a Pipeline would be a good option.
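The DataFrame can be built directly from the dataset's attributes; a sketch:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
# Column names come straight from the dataset's feature_names
df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
print(df.head())
```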
mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  mean fractal dimension  radius error  texture error  perimeter error  area error  smoothness error  compactness error  concavity error  concave points error  symmetry error  fractal dimension error  worst radius  worst texture  worst perimeter  worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension  

0  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001  0.14710  0.2419  0.07871  1.0950  0.9053  8.589  153.40  0.006399  0.04904  0.05373  0.01587  0.03003  0.006193  25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119  0.2654  0.4601  0.11890 
1  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869  0.07017  0.1812  0.05667  0.5435  0.7339  3.398  74.08  0.005225  0.01308  0.01860  0.01340  0.01389  0.003532  24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416  0.1860  0.2750  0.08902 
2  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974  0.12790  0.2069  0.05999  0.7456  0.7869  4.585  94.03  0.006150  0.04006  0.03832  0.02058  0.02250  0.004571  23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504  0.2430  0.3613  0.08758 
3  11.42  20.38  77.58  386.1  0.14250  0.28390  0.2414  0.10520  0.2597  0.09744  0.4956  1.1560  3.445  27.23  0.009110  0.07458  0.05661  0.01867  0.05963  0.009208  14.91  26.50  98.87  567.7  0.2098  0.8663  0.6869  0.2575  0.6638  0.17300 
4  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980  0.10430  0.1809  0.05883  0.7572  0.7813  5.438  94.44  0.011490  0.02461  0.05688  0.01885  0.01756  0.005115  22.54  16.67  152.20  1575.0  0.1374  0.2050  0.4000  0.1625  0.2364  0.07678 
Train the Model
Next, carry on following the typical Scikit-Learn pattern and instantiate the classifier. Recall that the `kernel` hyperparameter can take several different values. The default is `rbf` (Radial Basis Function), which can produce nonlinear boundaries. For this demonstration a linear kernel is used. `C` is the hyperparameter used to control regularisation; the lower it is, the stronger the regularisation effect, as mentioned previously.

Then fit the data with `fit()`.
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001, verbose=False)
Support Vectors
The `support_vectors_` attribute of the trained model shows the points used to create the support vectors. 152 of the 569 data points were used in this case.
array([[ 17.99, 10.38], [ 11.42, 20.38], [ 12.45, 15.7 ], [ 13.71, 20.83]])
(152, 2)
Metrics
The model scores quite well on the test set. This score ranges from 0 to 1, where 1 is best.
0.88497652582159625
The Mean Squared Error is similarly low at 0.097902097902097904.
0.097902097902097904
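The two metrics can be computed as follows (the split, seed, and two-feature choice are assumptions carried over from the training sketch). Note that for 0/1 labels the MSE is exactly the misclassification rate, i.e. one minus the accuracy:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

breast_cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.data[:, :2], breast_cancer.target, random_state=0)

clf = SVC(kernel='linear', C=1, random_state=0).fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)  # mean accuracy, between 0 and 1
mse = mean_squared_error(y_test, clf.predict(X_test))
print(accuracy, mse)
```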
Visualise the Data
Plotting the data further explains the high scores. The classes are largely separated except for a handful of data points.
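A plotting sketch for a linear boundary on two features follows. For a linear kernel, `coef_` and `intercept_` give the line $w_0 x_0 + w_1 x_1 + b = 0$, which can be rearranged to plot $x_1$ against $x_0$; the styling choices are assumptions:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

breast_cancer = load_breast_cancer()
X = breast_cancer.data[:, :2]
y = breast_cancer.target

clf = SVC(kernel='linear', C=1).fit(X, y)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='Blues')

# Boundary: w0*x0 + w1*x1 + b = 0  =>  x1 = -(w0*x0 + b) / w1
w = clf.coef_[0]
b = clf.intercept_[0]
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
ax.plot(xs, -(w[0] * xs + b) / w[1], 'k-')
ax.set_xlabel(breast_cancer.feature_names[0])
ax.set_ylabel(breast_cancer.feature_names[1])
fig.savefig('boundary.png')
```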
Binary Classification with Randomised Grid Search and a Pipeline
Overview
This example builds on the prior one by adding complexity to the training process. Firstly, all 30 features from the Breast Cancer dataset are used. Feature Selection can thus be incorporated to increase efficiency and accuracy. This also means Feature Scaling is useful to account for the different value ranges in the data.
A slight drawback of using all the features is that they cannot be graphed. At least not without a technique like Principal Component Analysis, which in itself has the drawback of merging features, removing their names, and giving them generic ones. This can take away from understanding the data.
Randomised Grid Search Cross-Validation adds in the benefits of Cross-Validation. Cross-Validation gives a more realistic average score over the different folds of the dataset, improving on the single Train/Test Split score. Randomised Grid Search automatically finds the best parameters, if desired, for both the preprocessing steps (Feature Selection and Scaling) and the model (i.e. the hyperparameters).
All these working parts are pulled together by using a Pipeline which is passed into the Grid Search function along with the range of variable values to consider.
Imports
Firstly, some additional imports are needed.
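A sketch of the additional imports implied by the steps that follow:

```python
# Feature selection, scaling, pipeline construction, and randomised search
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
```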
Model and Data
Next, the classifier is instantiated as before with a linear kernel. `X` is reset to the original full dataset with `X = breast_cancer.data`.
Pipeline and Randomised Grid Search
Next comes the interesting part with the Pipeline and Grid Search. A summary of the code considered is shown below with explanations immediately following.
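The following sketch reconstructs the pipeline and search described in the next few sections. `cv=10`, `n_iter=10`, the MSE scoring, and the pipeline steps come from the text; the exact candidate values in `param_dist` are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

breast_cancer = load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target

# Double-underscore prefixes route each parameter to the right pipeline step
param_dist = {
    'selectkbest__k': list(range(1, 30)),
    'svc__C': [0.1, 1, 10, 100],
}

pipe = make_pipeline(SelectKBest(), StandardScaler(),
                     SVC(kernel='linear', random_state=0))

search = RandomizedSearchCV(
    pipe, param_dist, cv=10, n_iter=10,
    scoring='neg_mean_squared_error', random_state=0)
search.fit(X, y)
print(search.best_score_, search.best_params_)
```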
The Parameter Grid
The `param_dist` dictionary is populated with the parameters to search over for Feature Selection and the value of `C`, where `C` controls Regularisation. `SelectKBest` is a type of Univariate Feature Selection, where the best performing features are selected to train the final model. `k` denotes the number of features to select.
The Pipeline
Then the pipeline, `pipe`, is created, taking the feature selector, scaler, and instantiated model as parameters.
Randomised Grid Search
These are all passed into the Randomised Grid Search function along with the `param_dist` and some other parameters for the search. The number of Cross-Validation folds `cv` is set to 10. The Mean Squared Error is used for scoring in this case. `n_iter` is set to 10 to limit the number of combinations of parameters searched over to 10. Finally, setting `random_state` to any number makes the results reproducible for that specific number.
Fitting the Model
Now, the model can be fit/trained with the data.
Compute Time
Note that `RandomizedSearchCV` can be quite time-consuming depending on the processing power available. Thus, though linear kernels are typically quite fast for Support Vector Machines, adding this extra complexity can make the process comparatively lengthy. Higher values for `cv` and `n_iter` will increase compute time.
Experimentation
This is a good opportunity to do some experimentation with variables to see which models perform best. Some ideas include altering the values of `C`, removing preprocessing steps (Feature Selection and Scaling), changing the number of Cross-Validation folds, and modifying `n_iter`. (Note that `C` is a float, so decimal values can also be searched through.)
For example, in one particular run the score was 0.0298769771529 with scaling vs 0.0439367311072 without. This is a small difference, but it confirms that scaling affects the performance of SVCs.
Performance
`RandomizedSearchCV` has some attributes of the trained model (denoted by the underscore appended to the end of the attribute name) which can be used to check the best score, best parameters, and best estimator found in the search.
`best_score_` shows the best Cross-Validation score for the best parameters and model. The error here is very low. Since it is an error, the convention in Scikit-Learn is to negate the value, which is why it carries a negative sign. Thus a value tending toward negative infinity means a higher error and one tending toward 0 means a lower error. Note that the additional complexity has reduced the error from the previous example by 0.0697825899934979, or 71.3%.
-0.0281195079086
`best_params_` shows the best parameters chosen from the range specified in `param_dist`. Changing `random_state` in `RandomizedSearchCV` will allow these values to alter. The regularisation parameter `C` was set to 1 and 20 of the 30 features were selected.
{'svc__C': 1, 'selectkbest__k': 20}
`best_estimator_` shows the parameters for all the components of the pipeline that were used to train the best model.
Pipeline(memory=None, steps=[('selectkbest', SelectKBest(k=20, score_func=<function f_classif at 0x10f1da730>)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001, verbose=False))])
Multi-Class Classification
Multi-Class Classification is done using `LinearSVC()`. This is implemented with liblinear, differently from `SVC()`, and thus can produce different results.

For Multi-Class Classification it uses a one-vs-the-rest technique. Essentially, the data is converted into a set of Binary Classification problems where each class is separated from the others as though all the other classes were merged into one. This is repeated for all classes until each class has its own boundary. At prediction time, a new data point is compared against all the boundaries and assigned to the class with the highest score.
A similar workflow will be followed to classify multiple classes as done in the previous section, so explanations will be kept minimal.
Load and Explore the Data
The wine dataset is another toy dataset which comes with Scikit-Learn. It is new in version 0.19, so an update may be required to access it. To start exploring, import the dataset with `load_wine()`, then set `X` and `y`.
About the Data
According to `wine.DESCR`, the dataset is also from the UCI repository. It shows the data collected via a chemical analysis of wines produced by 3 different growers in the same region of Italy. There are 178 observations with 13 features.
(178, 13)
Classes
It has 3 classes labelled 0, 1 and 2 in the target, corresponding to the generic class names ‘class_0’, ‘class_1’ and ‘class_2’ as shown with `wine.target_names`.
array(['class_0', 'class_1', 'class_2'], dtype='<U7')
Features
The 13 features used to classify the different wines are shown below:
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Tabular data with Pandas
Again, importing the data into a Pandas DataFrame makes the data more easily digestible. Feature Scaling will again be imperative given the data ranges.
alcohol  malic_acid  ash  alcalinity_of_ash  magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity  hue  od280/od315_of_diluted_wines  proline  

0  14.23  1.71  2.43  15.6  127.0  2.80  3.06  0.28  2.29  5.64  1.04  3.92  1065.0 
1  13.20  1.78  2.14  11.2  100.0  2.65  2.76  0.26  1.28  4.38  1.05  3.40  1050.0 
2  13.16  2.36  2.67  18.6  101.0  2.80  3.24  0.30  2.81  5.68  1.03  3.17  1185.0 
3  14.37  1.95  2.50  16.8  113.0  3.85  3.49  0.24  2.18  7.80  0.86  3.45  1480.0 
4  13.24  2.59  2.87  21.0  118.0  2.80  2.69  0.39  1.82  4.32  1.04  2.93  735.0 
Visualisation
Two of the features are visualised below to aid in understanding.
Train the model
The explanation of the steps above for Binary Classification with Randomised Grid Search and Pipelines holds here, and so will not be repeated. The major differences are that `LinearSVC` is used instead of `SVC`, and the range for `k` has to be capped at 13 because there are only 13 features.
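A sketch of the multi-class version of the workflow follows. As before, the pipeline steps, `cv=10`, `n_iter=10`, and the MSE scoring come from the text, while the candidate values in `param_dist` are assumptions; note the `k` range stopping at 13:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

wine = load_wine()
X, y = wine.data, wine.target

# k is capped at 13 because the wine data has only 13 features
param_dist = {
    'selectkbest__k': list(range(1, 14)),
    'linearsvc__C': [0.1, 1, 10, 100],
}

pipe = make_pipeline(SelectKBest(), StandardScaler(),
                     LinearSVC(random_state=0))

search = RandomizedSearchCV(
    pipe, param_dist, cv=10, n_iter=10,
    scoring='neg_mean_squared_error', random_state=0)
search.fit(X, y)
print(search.best_score_, search.best_params_)
```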
Performance
Again the `best_score_`, `best_params_`, and `best_estimator_` are shown. The MSE is again very low at 0.0280898876404. 11 features were used with `C=1` in the best model.
-0.0280898876404
{'selectkbest__k': 11, 'linearsvc__C': 1}
Pipeline(memory=None, steps=[('selectkbest', SelectKBest(k=11, score_func=<function f_classif at 0x10b3901e0>)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linearsvc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=0, tol=0.0001, verbose=0))])
Conclusion
This post examined how to use Linear Support Vector Machines for Classification using `SVC()` and `LinearSVC()` from Scikit-Learn. Both binary (two classes) and multi-class (>2 classes) classification were covered. We started with the theory of how SVMs work, including an overview of margins, support vectors and kernels.
Examples using toy datasets from ScikitLearn were given for both types of classification. The more complex examples included Pipelines, Feature Selection, Scaling, and Randomised Grid Search.
Considering nonlinear classification with RBF and polynomial kernels next would be instructive.
References

Collins-Thompson, Kevin. Linear Classifiers: Support Vector Machines. Coursera, https://www.coursera.org/learn/python-machine-learning/lecture/uClaN/linear-classifiers-support-vector-machines.

Géron, Aurélien. “Hands-On Machine Learning with Scikit-Learn and TensorFlow.” (2017).

Hunter, John D. “Matplotlib: A 2D graphics environment.” Computing in Science & Engineering 9.3 (2007): 90-95.

McKinney, Wes. “Data structures for statistical computing in python.” Proceedings of the 9th Python in Science Conference. Vol. 445. Austin, TX: SciPy, 2010.

Pedregosa, Fabian, et al. “Scikit-learn: Machine learning in Python.” Journal of Machine Learning Research 12.Oct (2011): 2825-2830.

Pérez, Fernando, and Brian E. Granger. “IPython: a system for interactive scientific computing.” Computing in Science & Engineering 9.3 (2007).