An Introduction to the exprso Package

Thomas Quinn

2017-12-01

Introduction

In this tutorial, we present the exprso package for R, a library built to tackle a wide variety of supervised machine learning tasks, including the construction of ensemble classifiers. We designed exprso using a modular framework, whereby each function acts as a self-contained, yet interchangeable, part of the whole. With these modules, the investigator has access to multiple tools that they can combine in almost any sequence to build their own personalized machine learning pipelines on the fly. In this way, we balance the simplicity of automation with endless customization, all while maintaining software extensibility.

We can install the most recent version of exprso directly from CRAN.

install.packages("exprso")
library(exprso)

exprso Objects

This package contains five object types that handle the machine learning procedures:

exprso Functions

Functions included in this package rely on the objects listed above. Some of these functions return an updated version of the same object type provided, while others return a new object type. We have adopted a nomenclature to help organize the functions available in this package. In this scheme, most functions have a few letters in the beginning of their name which designate their use:

Importing data

We recommend importing data using the exprso function. This function has two arguments. The first expects the data with samples as rows and features as columns. The second expects the annotations with samples as rows where the first column contains the outcome to predict.

data(iris)
array <- exprso(iris[1:80, 1:4], iris[1:80, 5])
## [1] "Preparing data for binary classification."

Subsetting data

To subset an ExprsArray object, we provide methods for the [ and $ operators that access the @annot annotations slot directly. Alternatively, one could use the modSubset or subset functions. Note that the “defineCase” column always contains the outcome to predict. For binary classification, this is always coded as “Case” and “Control”.

sub <- array[array$defineCase == "Case", ]
sub <- modSubset(array, colBy = "defineCase", include = "Case")
sub <- subset(array, subset = array$defineCase == "Case")

Splitting data

When performing classification, an investigator will typically withhold some percentage of the data to use later when assessing classifier performance, effectively splitting the data into two. The first dataset, called the training set, gets used to build the model, while the other, called the external validation or test set, gets used to evaluate the model. This package offers two convenience functions for splitting the data, splitSample and splitStratify. The former builds the training set based on simple random sampling (with or without replacement), assigning the remaining subjects to the test set. The latter builds the training set using stratified random sampling. These functions both return a list of two ExprsArray objects corresponding to the training set and test set respectively. Below, we use the splitStratify function to build the training and test sets through a stratified random sample across the dichotomous (binary) classification annotation.

arrays <-
  splitStratify(array,
                percent.include = 67,
                colBy = NULL)
## 
## Pre-stratification table:
## df
##    Case Control 
##      30      50 
## 
## Weighted stratification results:
## 
##    Case Control 
##      20      20
array.train <- arrays[[1]]

Balancing data

All subjects not included in the training set (based on the percent.include argument) will automatically get assigned to the test set. Sometimes, when using splitStratify on a dataset with an unequal number of annotated subjects, the resultant test set may contain relative class frequencies that differ from the training set. If needed, we can fix this so-called “imbalance” at the cost of reducing sample size by performing splitStratify a second time. Now, we will use the test set as the input and let percent.include = 100 (keeping the other parameters the same). This will split the test set such that the new “training set” (i.e., slot 1) now contains the balanced test set and the new “test set” (i.e., slot 2) now contains the “spillover”.

balance <-
  splitStratify(arrays[[2]],
                percent.include = 100,
                colBy = NULL)
## 
## Pre-stratification table:
## df
##    Case Control 
##      10      30 
## 
## Weighted stratification results:
## 
##    Case Control 
##      10      10
array.test <- balance[[1]]

Selecting features

Considering the high-dimensionality of many datasets, it is prudent and often necessary to prioritize which features to include during classifier construction. This package provides functions for some of the most frequently used feature selection methods. Each function works as a self-contained wrapper that (1) pre-processes the ExprsArray input, (2) performs the feature selection, and (3) returns an ExprsArray output with an updated feature selection history. These histories get passed along at every step of the way until they eventually get used to pre-process an unlabeled dataset during classifier deployment (i.e., prediction).

One feature selection function is fsStats. This performs basic feature selection based on simple statistical tests. Specifically, this function will rank features using either the Student’s \(t\)-test or the Kolmogorov-Smirnov test. Below, the argument top = 0 tells tells the program to rank all features.

array.train <-
  fsStats(array.train, top = 0, how = "t.test")

The argument top specifies either the names or the number of features to supply to the feature selection method, not what the user intends to retrieve from the feature selection method. When calling the first feature selection method (or the first build method, if skipping feature selection), a numeric top argument will select a “top ranked” feature set according to their default order in the ExprsArray input. Then, because each feature selection method returns an ExprsArray object with the features implicitly (re-)ranked, all subsequent numeric top arguments will select a “top ranked” feature set according to the results of the previous feature selection method. For example, the third feature selection call draws the top features from the second feature ranking. The user may deploy, in tandem, any number of these functions in whatever order they choose.

Another feature selection function is fsPrcomp. This performs dimension reduction by way of principal components analysis (PCA). Like the feature selection steps, all dimension reduction models get saved in the ExprsArray history to deploy later on a test set. Below, we use the top 50 features (as selected by fsStats) for PCA.

array.train <-
  fsPrcomp(array.train, top = 50)

The other feature selection methods included in this package all follow the same use pattern. Below, we plot the first three components of the training set in 3-dimensional space.

plot(array.train)

Constructing classifiers

This package provides functions for several supervised machine learning methods, including support vector machines, artificial neural networks, random forests, and more. These functions require an ExprsArray object as input and return an ExprsModel object as output. This ExprsModel object contains the feature selection history that led up to classifier construction as well as the classifier itself. Below, we build an artificial neural network with five intermediate nodes in the hidden layer using the top 10 components from the training set above.

mach <-
  buildANN(array.train, top = 10, size = 5)
## Setting range to 0.38759719586878 (default behavior, override explicitly)...
## Setting decay to 0.5 (default behavior, override explicitly)...
## Setting maxit to 1000 (default behavior, override explicitly)...
## # weights:  31
## initial  value 40.316989 
## iter  10 value 12.124033
## iter  20 value 12.078803
## iter  20 value 12.078803
## iter  20 value 12.078803
## final  value 12.078803 
## converged

Deploying classifiers

We deploy an ExprsModel object using predict. This function returns an ExprsPredict object containing the prediction results in three forms: prediction, probability, and decision boundary predictions. The probability and decision boundary predictions relate to one another by a logistic transformation. The prediction (@pred) slot converts these metrics into a single “all-or-nothing” class label assignment.

Another function, calcStats, allows us to compare the prediction results against the actual class labels. The aucSkip argument specifies whether to calculate the area under the receiver operating characteristic (ROC) curve. Note, however, that performance metrics calculated using the ROC curve may differ from those calculated using a confusion matrix because the former may adjust the discrimination threshold to optimize sensitivity and specificity. The discrimination threshold is automatically chosen as the point along the ROC curve which minimizes the Euclidean distance from (0, 1). Below, we deploy a classifier on the test set, then use the result to calculate classifier performance.

pred <-
  predict(mach, array.test)
## Individual classifier performance:
## Arguments not provided in an ROCR AUC format. Calculating accuracy outside of ROCR...
## Classification confusion table:
##          actual
## predicted Control Case
##   Control      10    0
##   Case          0   10
##   acc sens spec
## 1   1    1    1
calcStats(pred)
## Calculating accuracy using ROCR based on prediction probabilities...

##   acc sens spec auc
## 1   1    1    1   1

Pipeline methods

This package includes several functions named with the prefix pl. These “pipeline” functions exist to help with high-throughput learning. In other words, they wrap repetitive tasks into a single call. This includes extensive parameter searches as well as some elaborate cross-validation. Some of these pl functions can even have other pl functions embedded within them. For example, the function plGrid contains the function plCV for managing simple \(v\)-fold and leave-one-out cross-validation.

High-throughput parameter searches

When constructing a classifier using a build method, we can only specify one set of parameters at a time. However, we often want to test models across a vast range of parameters. For this task, we provide the plGrid function. This function builds and deploys a model for each combination of all provided arguments. For example, calling plGrid with the arguments how = "buildSVM", top = c(3, 5, 10), cost = 10^(-3:3), and kernel = c("linear", "radial") will yield 42 classifiers.

We note here that this function only accepts one how per run. To analyze the results of multiple build parameter searches jointly, combine the results of multiple plGrid function calls using ?conjoin. We will also note that plGrid does not execute any data splitting or feature selection, both of which the user may perform beforehand. However, plGrid does allow the user to specify multiple classifier sizes by providing a numeric vector as the top argument.

The plGrid function can also calculate \(v\)-fold cross-validation accuracy at each step of the parameter search (toggled by supplying a non-NULL argument to fold). We emphasize, however, that the cross-validation method embedded within plGrid (i.e., plCV) does not re-select features with each fold, which may lead to overly-optimistic measures of classifier performance in the setting of prior feature selection.

Below, we run through a few different support vector machine builds, calculating leave-one-out cross-validation accuracy (i.e., via fold = 0) at each step.

gs <-
  plGrid(array.train = array.train,
         array.valid = array.test,
         top = c(2, 4),
         how = "buildSVM",
         fold = 0,
         kernel = "linear",
         cost = 10^(-3:3)
)

The returned object contains two slots, @summary and @machs, which store the performance summary and corresponding ExprsModel objects, respectively. The performance summary contains columns detailing the parameters used to build each machine along with performance metrics for the training set (and test set, if provided). Columns named with “train” describe training set performances. Columns named with “valid” describe test set performances. The column, "train.plCV", contains the cross-validation accuracy, if performed. The returned ExprsPipeline object also contains an ExprsModel object for each entry in the performance summary.

To subset an ExprsPipeline object, we provide methods for the [ and $ operators that access the @summary performance summary slot directly. Alternatively, one could use the pipeSubset or subset functions.

sub <- gs[gs$cost == 1, ]
sub <- pipeSubset(gs, colBy = "cost", include = 1)
sub <- subset(gs, subset = gs$cost == 1)

Other cross-validation

The exprso package also provides a means by which to perform elaborate cross-validation, including Monte Carlo style and 2-layer “nested” cross-validation. Analogous to how plGrid manages multiple build and predict tasks, these pipelines (i.e., plMonteCarlo and plNested) effectively manage multiple plGrid tasks. In order to organize the sheer number of arguments necessary to execute these functions, we have implemented argument handler functions (i.e., ctrlSplitSet, ctrlFeatureSelect, and ctrlGridSearch) that handle data splitting, feature selection, and grid searching, respectively.

In simplest terms, plMonteCarlo and plNested use a single training set to calculate classifier performances on a withheld internal validation set. This internal validation set serves as a kind of proxy for a statistically independent test set. The main difference between plMonteCarlo and plNested stems from how the internal validation set gets constructed. On one hand, the plMonteCarlo method uses the ctrlSplitSet argument handler to split the training set into a training subset and an intenral validation set with each bootstrap. On the other hand, the plNested method splits the training set into \(v\)-folds, treating each fold as an internal validation set while treating those outside that fold as the training subset.

For clarity, we call any performance measured on an internal validation set as the outer-loop cross-validation performance and any cross-validation accuracy measured using the training subset (i.e., via plGrid) the inner-loop cross-validation performance. In the performance summaries of the ExprsPipeline objects returned by plMonteCarlo and plNested, columns named with “train” describe training subset performances while columns named with “valid” describe internal validation set performances. Although the inner-loop cross-validation performances (i.e., via plCV) can still over-estimate cross-validation through prior feature selection, the outer-loop cross-validation performances derive from classifiers that have undergone feature selection anew with each bootstrap or fold. However, we emphasize here that performing feature selection on a training set prior to the use of plMonteCarlo or plNested can still result in overly optimistic outer-loop cross-validation performances.

In the example below, we perform five iterations of plMonteCarlo using the original training set as it existed before it underwent any feature selection (i.e., the first slot of the object arrays). With each iteration, we (1) sample the subjects randomly through bagging (i.e., random sampling with replacement), (2) perform feature selection using the Student’s t-test, and then (3) execute a grid-search across multiple support vector machine parameters and classifier sizes. In this framework, the user could instead perform any number of feature selection tasks simply by supplying a list of multiple ctrlFeatureSelect argument handlers to the ctrlFS argument below.

ss <-
  ctrlSplitSet(func = "splitSample", percent.include = 67, replace = TRUE)
fs <-
  ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
gs <-
  ctrlGridSearch(func = "plGrid",
                 how = "buildSVM",
                 top = c(2, 4),
                 kernel = "linear",
                 cost = 10^(-3:3),
                 fold = 10)
boot <-
  plMonteCarlo(arrays[[1]],
               B = 5,
               ctrlSS = ss,
               ctrlFS = fs,
               ctrlGS = gs)

Next, we reduce the results of plMonteCarlo to a single performance metric by feeding the returned ExprsPipeline object through calcMonteCarlo. Note that this helper function will fail unless plGrid has called plCV during the parameter grid-search.

calcMonteCarlo(boot, colBy = "valid.auc")
## Averaging best accuracies across all boots...
## [1] 1

Ensemble classifiers

This package provides two ways to build ensemble classifiers. The first involves manually combining multiple ExprsModel objects together through the function buildEnsemble (or ?conjoin). The second involves an orchestrated manipulation of an ExprsPipeline object through the pipeFilter function.

This latter approach filters an ExprsPipeline object in (up to) three steps. First, a threshold filter gets imposed, whereby any model with a performance less than the threshold filter, how, gets excluded. Second, a ceiling filter gets imposed, whereby any model with a performance greater than the ceiling filter, gate, gets excluded. Third, an arbitrary subset occurs, whereby the top N models in the ExprsPipeline object get selected based on the argument top. In the case that the @summary slot contains the column “boot” (e.g., in the results of plMonteCarlo), pipeFilter selects the top N models for each unique bootstrap. The user may skip any one of these three filter steps by setting the respective argument to 0.

When calling the buildEnsemble method for an ExprsPipeline object, any classifiers remaining after the pipeFilter filter will get assembled into a single ensemble classifier. Ensemble classifiers get stored as an ExprsEnsemble object which is simply a container for a list of multiple ExprsModel objects.

In the example below, we we will build an ensemble using the single best classifier from each plMonteCarlo bootstrap, and then deploy that ensemble on the withheld test set from above.

ens <- buildEnsemble(boot, top = 1, colBy = "valid.auc")
pred <- predict(ens, array.test, how = "majority")
## Arguments not provided in an ROCR AUC format. Calculating accuracy outside of ROCR...
## Classification confusion table:
##          actual
## predicted Control Case
##   Control      10    0
##   Case          0   10
##   acc sens spec
## 1   1    1    1

Owing to how the pipeFilter function handles ExprsPipeline objects that contain a “boot” column in the performance summary (i.e., @summary), we include the pipeUnboot function to rename this “boot” column to “unboot”. To learn more about how ExprsEnsemble predicts class labels, we refer the user to the documentation, ?'exprso-predict'. In addition, we encourage the user to visit the documentation, ?'ExprsPipeline-class' and ?'ExprsEnsemble-class'.

Multi-class classification methods

We conclude this vignette by alerting the user that the exprso package also includes a framework for performing multi-class classification in an automated manner. These methods use the “1-vs-all” approach to multi-class classification, whereby each individual class label has a turn getting treated as the positive class label in a dichotomous (binary) scheme. Then, the results of each iteration get integrated into a single construct. To learn more about multi-class classification, we refer the user to the documentation for ?doMulti and the companion vignette, “Advanced Topics for the exprso package”.

Regression methods

The exprso package framework also extends to building and deploying regression models to predict continuous outcomes. All pipeline and ensemble methods discussed here also apply to regression, although some fs and build methods work only for classification.

Final remarks

Thank you for your interest in exprso. Although we have made tremendous progress in formalizing this library in a reliable package framework, some of the tools included here may change. To the best of knowledge, we have followed the machine learning “best practices” when developing this software, but if you know better than us, please let us know! File any and all issues at GitHub. In addition, we always welcome suggestions for new tools that we could include in future releases. Happy learning!