# Novelty detection and outlier detection with scikit-learn

# Machine learning models

## Novelty detection: one-class SVMs

One-class SVM is often used for novelty detection, where a clean dataset containing only one class is assumed to be known a priori. Training separates the data points from the origin with maximum margin. Once the model is trained on the clean dataset, it is used to flag novelties: new points that fall outside the learned decision boundary (frontier). A useful property of one-class SVM is that it can capture non-linear patterns through a kernel function, e.g. a Gaussian RBF kernel.
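As a minimal sketch of this workflow, the snippet below fits scikit-learn's `OneClassSVM` on a clean training set and then scores two new observations. The toy data and the values of `nu` and `gamma` are assumptions chosen for illustration, not settings from the experiments in this post.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Clean training set: a single tight Gaussian blob (assumed toy data).
X_train = 0.3 * rng.randn(200, 2)

# nu upper-bounds the fraction of training points treated as errors;
# gamma sets the RBF kernel width. Both values are illustrative.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5)
model.fit(X_train)

# New observations: one near the training data, one far away from it.
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])
labels = model.predict(X_new)            # +1 = inlier, -1 = novelty
scores = model.decision_function(X_new)  # negative values lie outside the frontier
```

The sign of `decision_function` tells on which side of the frontier a point falls, and `predict` simply thresholds it at zero.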

## Outlier detection: robust covariance estimation

Compared to one-class SVM, robust covariance estimation is designed for outlier detection, where we usually have a mixed dataset containing inliers and a few outliers. The model fits a robust estimate of the covariance of the dataset and flags a data point as an outlier when it lies farther from the bulk of the data than that estimate allows, typically measured by the Mahalanobis distance.
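A minimal sketch of this idea uses scikit-learn's `EllipticEnvelope`, which fits a robust covariance estimate and thresholds the Mahalanobis distance. The sample sizes, the outlier range, and the `contamination` value are assumptions made for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)

# Contaminated data: 190 Gaussian inliers plus 10 uniform outliers (assumed).
inliers = 0.3 * rng.randn(190, 2)
outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([inliers, outliers])

# contamination is the assumed fraction of outliers in the data.
model = EllipticEnvelope(contamination=0.05, random_state=0)
model.fit(X)

labels = model.predict(X)          # +1 = inlier, -1 = outlier
distances = model.mahalanobis(X)   # squared Mahalanobis distances to the robust fit
```

Points labelled `-1` are those with the largest Mahalanobis distances; `contamination` controls how many points end up on that side of the threshold.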

# Empirical evaluations

In this section, I test the performance of the two algorithms described above in the context of outlier detection. Robust covariance estimation is designed for outlier detection, whereas one-class SVM, which is designed for novelty detection, is applied here outside its intended setting.

## Experiment settings

Let’s assume that data points lie in a 2-dimensional space described by x and y coordinates. Inliers are generated from Gaussian distributions with fixed means and standard deviations. Outliers are generated from a uniform distribution over the same space. In particular, I generate four different datasets. Each dataset consists of four 2D Gaussian distributions with $\sigma=0.3$ and means $\mu$ ranging from [(0,0),(0,0),(0,0),(0,0)] to [(4,4),(-4,-4),(-4,4),(4,-4)] across the datasets. One-class SVM and robust covariance estimation are then applied to these four datasets.
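The data generation described above could be sketched as follows. The helper `make_dataset` is hypothetical, and the sample sizes, the outlier range of [-6, 6], and the evenly spaced cluster offsets between 0 and 4 are assumptions; only $\sigma=0.3$ and the two extreme sets of means come from the description.

```python
import numpy as np

def make_dataset(offset, n_inliers=200, n_outliers=50, sigma=0.3, seed=0):
    """Return (X, y): four Gaussian clusters plus uniform outliers.

    y is +1 for inliers and -1 for outliers. Sample sizes and the
    outlier range are assumptions, not values from the experiment.
    """
    rng = np.random.RandomState(seed)
    # Cluster means at (+-offset, +-offset); offset=0 collapses them into one blob.
    means = offset * np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]])
    per_cluster = n_inliers // 4
    inliers = np.vstack(
        [mu + sigma * rng.randn(per_cluster, 2) for mu in means]
    )
    # Outliers drawn uniformly from a square covering all clusters.
    outliers = rng.uniform(low=-6, high=6, size=(n_outliers, 2))
    X = np.vstack([inliers, outliers])
    y = np.concatenate([np.ones(len(inliers)), -np.ones(n_outliers)])
    return X, y

# Four datasets with cluster separation growing from 0 to 4 (spacing assumed).
datasets = [make_dataset(offset) for offset in np.linspace(0, 4, 4)]
```

Keeping the ground-truth labels `y` alongside `X` makes it straightforward to count classification errors afterwards.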

## Results

Classification errors of the two models on the four datasets are shown in the following table:

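| Dataset | One-class SVM | Robust covariance estimation |
|---------|---------------|------------------------------|
| 1       | 6             | 6                            |
| 2       | 26            | 6                            |
| 3       | 40            | 54                           |
| 4       | 46            | 98                           |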

I also plot the decision boundaries of the two models in the following figures.

From the results, we can observe that when the dataset has a clear cluster structure, one-class SVM is able to capture that structure with a Gaussian kernel. Robust covariance estimation, on the other hand, fails to capture the cluster structure: it tries to estimate a single covariance from all data points across the different clusters, which leads to a drop in performance as the clusters move further apart.
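To make the comparison concrete, here is a minimal sketch that fits both models on one well-separated four-cluster dataset and counts misclassified points. The sample sizes, the outlier range, and the hyperparameters (`nu`, `gamma`, `contamination`) are assumptions, so the counts it prints will not necessarily match the table above.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)

# Four well-separated clusters plus uniform outliers (sizes are assumed).
means = 4 * np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]])
inliers = np.vstack([mu + 0.3 * rng.randn(50, 2) for mu in means])
outliers = rng.uniform(low=-6, high=6, size=(50, 2))
X = np.vstack([inliers, outliers])
y_true = np.concatenate([np.ones(len(inliers)), -np.ones(len(outliers))])

outlier_fraction = len(outliers) / len(X)

models = {
    "One-class SVM": OneClassSVM(kernel="rbf", nu=outlier_fraction, gamma=0.1),
    "Robust covariance": EllipticEnvelope(
        contamination=outlier_fraction, random_state=0
    ),
}

errors = {}
for name, model in models.items():
    y_pred = model.fit(X).predict(X)
    # Count points whose predicted label disagrees with the ground truth.
    errors[name] = int((y_pred != y_true).sum())
    print(f"{name}: {errors[name]} errors out of {len(X)}")
```

Both models are fitted on the contaminated data itself, which is the outlier detection setting used throughout this section.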

## Code

The Python code used to run the experiments is shown in the following code block.

10 October 2015