Variance Threshold

`VarianceThreshold` is a simple baseline approach to feature selection. It removes all features whose variance doesnâ€™t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

Constructor Parameters

• \$threshold (float) - features with a variance lower than this threshold will be removed (default 0.0)
``````use Phpml\FeatureSelection\VarianceThreshold;

\$transformer = new VarianceThreshold(0.15);
``````

Example of use

As an example, suppose that we have a dataset with boolean features and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by

``````Var[X] = p(1 - p)
``````

so we can select using the threshold .8 * (1 - .8):

``````use Phpml\FeatureSelection\VarianceThreshold;

\$samples = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]];
\$transformer = new VarianceThreshold(0.8 * (1 - 0.8));

\$transformer->fit(\$samples);
\$transformer->transform(\$samples);

/*
\$samples = [[0, 1], [1, 0], [0, 0], [1, 1], [1, 0], [1, 1]];
*/
``````

Pipeline

`VarianceThreshold` implements `Transformer` interface so it can be used as part of pipeline:

``````use Phpml\FeatureSelection\VarianceThreshold;
use Phpml\Classification\SVC;
use Phpml\FeatureExtraction\TfIdfTransformer;
use Phpml\Pipeline;

\$transformers = [
new TfIdfTransformer(),
new VarianceThreshold(0.1)
];
\$estimator = new SVC();

\$pipeline = new Pipeline(\$transformers, \$estimator);
``````