Running Experiments

The simplest way to use SKLL is to create configuration files that describe experiments you would like to run on pre-generated features. This document describes the supported feature file formats, how to create configuration files (and lay out your directories), and how to use run_experiment to get things going.

Quick Example

If you don’t want to read the whole document, and just want an example of how things work, do the following from the command prompt:

$ cd examples
$ python make_example_iris_data.py          # download a simple dataset
$ cd iris
$ run_experiment --local evaluate.cfg        # run an experiment

Feature file formats

The following feature file formats are supported:

arff

The same file format used by Weka with the following added restrictions:

  • Only simple numeric, string, and nominal values are supported.
  • Nominal values are converted to strings.
  • If the data has instance IDs, there should be an attribute with the name specified by id_col in the Input section of the configuration file you create for your experiment. This defaults to id. If there is no such attribute, IDs will be generated automatically.
  • If the data is labelled, there must be an attribute with the name specified by label_col in the Input section of the configuration file you create for your experiment. This defaults to y. This must also be the final attribute listed (like in Weka).
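
For illustration, a minimal labelled ARFF file satisfying these restrictions might look like the following (the attribute names, labels, and values are hypothetical):

@relation example

@attribute id string
@attribute f1 numeric
@attribute f2 numeric
@attribute y {cat,dog}

@data
EX1,1.0,2.0,cat
EX2,0.5,1.5,dog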

csv/tsv

A simple comma or tab-delimited format with the following restrictions:

  • If the data is labelled, there must be a column with the name specified by label_col in the Input section of the configuration file you create for your experiment. This defaults to y.
  • If the data has instance IDs, there should be a column with the name specified by id_col in the Input section of the configuration file you create for your experiment. This defaults to id. If there is no such column, IDs will be generated automatically.
  • All other columns contain feature values, and every feature value must be specified (making this a poor choice for sparse data).
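
For illustration, a small labelled CSV file matching these conventions might look like this (feature names and values are hypothetical):

id,f1,f2,y
EX1,1.0,2.0,cat
EX2,0.5,1.5,dog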

libsvm

While we can process the standard input file format supported by LibSVM, LibLinear, and SVMLight, we also support specifying extra metadata usually missing from the format in comments at the end of each line. The comments are not mandatory, but without them, your labels and features will not have names. The comment is structured as follows:

ID | 1=ClassX | 1=FeatureA 2=FeatureB

The entire format would look like this:

2 1:2.0 3:8.1 # Example1 | 2=ClassY | 1=FeatureA 3=FeatureC
1 5:7.0 6:19.1 # Example2 | 1=ClassX | 5=FeatureE 6=FeatureF

Note

IDs, labels, and feature names cannot contain the following characters: | # =

megam

An expanded form of the input format for the MegaM classification package with the -fvals switch.

The basic format is:

# Instance1
CLASS1    F0 2.5 F1 3 FEATURE_2 -152000
# Instance2
CLASS2    F1 7.524

where the optional comments before each instance specify the ID for the following line, class names are separated from feature-value pairs with a tab, and feature-value pairs are separated by spaces. Any omitted features for a given instance are assumed to be zero, so this format is handy when dealing with sparse data. We also include several utility scripts for converting to/from this MegaM format and for adding/removing features from the files.

Creating configuration files

The experiment configuration files that run_experiment accepts are standard Python configuration files that are similar in format to Windows INI files. [1] There are four expected sections in a configuration file: General, Input, Tuning, and Output. A detailed description of each possible setting for each section is provided below, but to summarize:

  • If you want to do cross-validation, specify a path to training feature files, and set task to cross_validate. Please note that the cross-validation currently uses StratifiedKFold. You also can optionally use predetermined folds with the cv_folds_file setting.
  • If you want to train a model and evaluate it on some data, specify a training location, a test location, and a directory to store results, and set task to evaluate.
  • If you want to just train a model and generate predictions, specify a training location, a test location, and set task to predict.
  • If you want to just train a model, specify a training location, and set task to train.

Example configuration files are available here.
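
As a quick sketch, a minimal evaluate configuration might look like the following (the experiment name, paths, featureset prefix, and learner choice are all hypothetical):

[General]
experiment_name = my_experiment
task = evaluate

[Input]
train_directory = train
test_directory = test
featuresets = [["example_features"]]
learners = ["LogisticRegression"]
suffix = .csv

[Tuning]
grid_search = false

[Output]
results = output
log = output
predictions = output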

General

Both fields in the General section are required.

experiment_name

A string used to identify this particular experiment configuration. When generating result summary files, this name helps prevent overwriting previous summaries.

task

What types of experiment we’re trying to run. Valid options are: cross_validate, evaluate, predict, and train.

Input

The Input section has only one required field, learners, but must also contain either train_file or train_directory.

learners

List of scikit-learn models to try using. A separate job will be run for each combination of classifier and feature-set. Acceptable values are described below. Custom learners can also be specified. See custom_learner_path.

Classifiers:

Regressors:

For all regressors you can also prepend Rescaled to the beginning of the full name (e.g., RescaledSVR) to get a version of the regressor where predictions are rescaled and constrained to better match the training set.
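
For example, to compare a plain support vector regressor with its rescaled variant, the learners field might look like this (assuming your data is suited to regression):

learners = ["SVR", "RescaledSVR"]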

train_file (Optional)

Path to a file containing the features to train on. Cannot be used in combination with featuresets, train_directory, or test_directory.

Note

If train_file is not specified, train_directory must be.

train_directory (Optional)

Path to directory containing training data files. There must be a file for each featureset. Cannot be used in combination with train_file or test_file.

Note

If train_directory is not specified, train_file must be.

test_file (Optional)

Path to a file containing the features to test on. Cannot be used in combination with featuresets, train_directory, or test_directory.

test_directory (Optional)

Path to directory containing test data files. There must be a file for each featureset. Cannot be used in combination with train_file or test_file.

featuresets (Optional)

List of lists of prefixes for the files containing the features you would like to train/test on. Each list will end up being a job. IDs are required to be the same in all of the feature files, and a ValueError will be raised if this is not the case. Cannot be used in combination with train_file or test_file.
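
For example, to run one job on the combination of two feature files with prefixes f1 and f2, and a second job on f3 alone, you might specify (prefixes hypothetical):

featuresets = [["f1", "f2"], ["f3"]]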

Note

If specifying train_directory or test_directory, featuresets is required.

suffix (Optional)

The file format the training/test files are in. Valid options are .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv.

If you omit this field, it is assumed that the “prefixes” listed in featuresets are actually complete filenames. This can be useful if you have feature files that are all in different formats that you would like to combine.
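
For example, with suffix omitted, a featureset combining files in two different formats might look like this (filenames hypothetical):

featuresets = [["family.csv", "vitals.megam"]]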

id_col (Optional)

If you’re using ARFF, CSV, or TSV files, the IDs for each instance are assumed to be in a column with this name. If no column with this name is found, the IDs are generated automatically. Defaults to id.

label_col (Optional)

If you’re using ARFF, CSV, or TSV files, the class labels for each instance are assumed to be in a column with this name. If no column with this name is found, the data is assumed to be unlabelled. Defaults to y. For ARFF files only, this must also be the final column to count as the label (for compatibility with Weka).

ids_to_floats (Optional)

If you have a dataset with lots of examples, and your input files have IDs that look like numbers (can be converted by float()), then setting this to True will save you some memory by storing IDs as floats. Note that this will cause IDs to be printed as floats in prediction files (e.g., 4.0 instead of 4 or 0004 or 4.000).

shuffle (Optional)

If True, shuffle the examples in the training data before using them for learning. This happens automatically when doing a grid search but it might be useful in other scenarios as well, e.g., online learning. Defaults to False.

class_map (Optional)

If you would like to collapse several labels into one, or otherwise modify your labels (without modifying your original feature files), you can specify a dictionary mapping from new class labels to lists of original class labels. For example, if you wanted to collapse the labels beagle and dachshund into a dog class, you would specify the following for class_map:

{'dog': ['beagle', 'dachshund']}

Any labels not included in the dictionary will be left untouched.

cv_folds (Optional)

The number of folds to use for cross-validation. Defaults to 10.

random_folds (Optional)

Whether to use random folds for cross-validation. Defaults to False.

cv_folds_file (Optional)

Path to a csv file specifying folds for cross-validation. The first row must be a header row; it is ignored, so its contents do not matter, but it must be present. The first column should consist of training set IDs and the second should be a string fold ID (e.g., 1 through 5, A through D, etc.). If specified, the CV and grid search will leave one fold ID out at a time. [2]
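
For example, a simple two-fold file might look like this (the IDs and fold names are hypothetical):

id,fold
EX1,A
EX2,B
EX3,A
EX4,B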

custom_learner_path (Optional)

Path to a .py file that defines a custom learner. This file will be imported dynamically. This is only required if a custom learner is specified in the list of learners. Custom learners must implement the fit and predict methods and inherit from sklearn.base.BaseEstimator. Custom regressors must also inherit from sklearn.base.RegressorMixin. Models that require dense matrices should implement a method requires_dense that returns True.
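
As a sketch, a minimal custom classifier satisfying these requirements might look like the following (the class name MajorityClassLearner is hypothetical; you would list it under learners and point custom_learner_path at the file containing it):

import numpy as np
from collections import Counter
from sklearn.base import BaseEstimator


class MajorityClassLearner(BaseEstimator):
    # A toy classifier that always predicts the most frequent training label.

    def fit(self, X, y):
        # Remember the most common label seen during training.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Predict the stored majority label for every row in X.
        return np.array([self.majority_] * X.shape[0])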

sampler (Optional)

Specifies a sampler that performs non-linear transformations of the input, which can serve as a basis for linear classification or other algorithms. Valid options are: Nystroem, RBFSampler, SkewedChi2Sampler, and AdditiveChi2Sampler. For additional information see the scikit-learn documentation.

sampler_parameters (Optional)

dict containing parameters you want to have fixed for the sampler. If empty, the defaults will be used.

The default fixed parameters (beyond those that scikit-learn sets) are:

Nystroem
{'random_state': 123456789}
RBFSampler
{'random_state': 123456789}
SkewedChi2Sampler
{'random_state': 123456789}

feature_hasher (Optional)

If “true”, this enables a high-speed, low-memory vectorizer that uses feature hashing for converting feature dictionaries into NumPy arrays instead of using a DictVectorizer. This flag will drastically reduce memory consumption for data sets with a large number of features. If enabled, the user should also specify the number of features in the hasher_features field. For additional information see the scikit-learn documentation.

hasher_features (Optional)

The number of features used by the FeatureHasher if the feature_hasher flag is enabled.

Note

To avoid collisions, you should always set this to the smallest power of two larger than the number of features in the data set. For example, if you had 17 features, you would set hasher_features to 32.
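
For example, with 17 features the relevant Input settings might look like this:

feature_hasher = true
hasher_features = 32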

featureset_names (Optional)

Optional list of names for the feature sets. If omitted, then the prefixes will be munged together to make names.

fixed_parameters (Optional)

List of dicts containing parameters you want to have fixed for each classifier in the learners list. Any empty ones will be ignored (and the defaults will be used).

The default fixed parameters (beyond those that scikit-learn sets) are:

LogisticRegression
{'random_state': 123456789}
LinearSVC
{'random_state': 123456789}
SVC
{'cache_size': 1000}
DecisionTreeClassifier and DecisionTreeRegressor
{'random_state': 123456789}
RandomForestClassifier and RandomForestRegressor
{'n_estimators': 500, 'random_state': 123456789}
GradientBoostingClassifier and GradientBoostingRegressor
{'n_estimators': 500, 'random_state': 123456789}
SVR
{'cache_size': 1000, 'kernel': 'linear'}

Note

This option allows us to deal with imbalanced data sets by using the parameter class_weight for the classifiers SVC, LogisticRegression, LinearSVC, and SGDClassifier.

Two options are available. The first is auto, which automatically adjusts weights to be inversely proportional to class frequencies, as shown in the following code:

{'class_weight': 'auto'}

The second option allows you to assign a specific weight to each class. The default weight for each class is 1. For example:

{'class_weight': {1: 10}}

Additional examples and information can be seen here.
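
Putting this together, for a learners list of ["SVC", "LogisticRegression"], a hypothetical fixed_parameters entry weighting classes for the first learner and fixing C for the second might be:

fixed_parameters = [{'class_weight': 'auto'}, {'C': 10.0}]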

feature_scaling (Optional)

Whether to scale features by their mean and/or their standard deviation. If you scale by mean, your data will automatically be converted to dense, so use caution when you have a very large dataset. Valid options are:

none
Perform no feature scaling at all.
with_std
Scale feature values by their standard deviation.
with_mean
Center features by subtracting their mean.
both
Perform both centering and scaling.

Defaults to none.

Tuning

grid_search (Optional)

Whether or not to perform grid search to find optimal parameters for the classifier. Defaults to False.

grid_search_folds (Optional)

The number of folds to use for grid search. Defaults to 3.

grid_search_jobs (Optional)

Number of folds to run in parallel when using grid search. Defaults to the number of grid search folds.

min_feature_count (Optional)

The minimum number of examples for which the value of a feature must be nonzero to be included in the model. Defaults to 1.

objective (Optional)

The objective function to use for tuning. Valid options are:

Classification:

  • accuracy: Overall accuracy
  • precision: Precision
  • recall: Recall
  • f1: The default scikit-learn F1 score (F1 of the positive class for binary classification, or the weighted average F1 for multiclass classification)
  • f1_score_micro: Micro-averaged F1 score
  • f1_score_macro: Macro-averaged F1 score
  • f1_score_weighted: Weighted average F1 score
  • f1_score_least_frequent: F1 score of the least frequent class. The least frequent class may vary from fold to fold for certain data distributions.
  • average_precision: Area under PR curve (for binary classification)
  • roc_auc: Area under ROC curve (for binary classification)

Regression or classification with integer labels:

  • unweighted_kappa: Unweighted Cohen’s kappa (any floating point values are rounded to ints)
  • linear_weighted_kappa: Linear weighted kappa (any floating point values are rounded to ints)
  • quadratic_weighted_kappa: Quadratic weighted kappa (any floating point values are rounded to ints)
  • uwk_off_by_one: Same as unweighted_kappa, but all ranking differences are discounted by one. In other words, a ranking of 1 and a ranking of 2 would be considered equal.
  • lwk_off_by_one: Same as linear_weighted_kappa, but all ranking differences are discounted by one.
  • qwk_off_by_one: Same as quadratic_weighted_kappa, but all ranking differences are discounted by one.

Regression or classification with binary labels:

Regression:

Defaults to f1_score_micro.

param_grids (Optional)

List of parameter grids to search for each learner. Each parameter grid should be a list of dictionaries mapping from strings to lists of parameter values. When you specify an empty list for a learner, the default parameter grid for that learner will be searched.

The default parameter grids for each learner are:

AdaBoostClassifier and AdaBoostRegressor
[{'learning_rate': [0.01, 0.1, 1.0, 10.0, 100.0]}]
DecisionTreeClassifier and DecisionTreeRegressor
[{'max_features': ["auto", None]}]
ElasticNet, Lasso, and Ridge
[{'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}]
GradientBoostingClassifier and GradientBoostingRegressor
[{'max_depth': [1, 3, 5]}]
KNeighborsClassifier and KNeighborsRegressor
[{'n_neighbors': [1, 5, 10, 100],
  'weights': ['uniform', 'distance']}]
LinearSVC
[{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}]
LogisticRegression
[{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}]
MultinomialNB
[{'alpha': [0.1, 0.25, 0.5, 0.75, 1.0]}]
RandomForestClassifier and RandomForestRegressor
[{'max_depth': [1, 5, 10, None]}]
SGDClassifier and SGDRegressor
[{'alpha': [0.000001, 0.00001, 0.0001, 0.001, 0.01],
  'penalty': ['l1', 'l2', 'elasticnet']}]
SVC
[{'C': [0.01, 0.1, 1.0, 10.0, 100.0],
  'gamma': [0.01, 0.1, 1.0, 10.0, 100.0]}]
SVR
[{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}]
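
For example, to search over only C for a single LinearSVC learner, you might specify (values hypothetical):

param_grids = [[{'C': [0.1, 1.0, 10.0]}]]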

pos_label_str (Optional)

The string label for the positive class in the binary classification setting. If unspecified, an arbitrary class is picked.

Output

probability (Optional)

Whether or not to output probabilities for each class instead of the most probable class for each instance. Only really makes a difference when storing predictions. Defaults to False.

results (Optional)

Directory to store result files in. If omitted, the current working directory is used.

log (Optional)

Directory to store log files in. If omitted, the current working directory is used.

models (Optional)

Directory to store trained models in. If omitted, models are not stored.

predictions (Optional)

Directory to store prediction files in. If omitted, predictions are not stored.

Note

You can use the same directory for results, log, models, and predictions.

Using run_experiment

Once you have created the configuration file for your experiment, you can usually just get your experiment started by running run_experiment CONFIGFILE. That said, there are a few options that are specified via command-line arguments instead of in the configuration file:

-a <num_features>, --ablation <num_features>

Runs an ablation study where repeated experiments are conducted with the specified number of feature files in each featureset in the configuration file held out. For example, if you have three feature files (A, B, and C) in your featureset and you specify --ablation 1, there will be three experiments conducted with the following featuresets: [[A, B], [B, C], [A, C]]. Additionally, since every ablation experiment includes a run with all the features as a baseline, the following featureset will also be run: [[A, B, C]].

If you would like to try all possible combinations of feature files, you can use the run_experiment --ablation_all option instead.

-A, --ablation_all

Runs an ablation study where repeated experiments are conducted with all combinations of feature files in each featureset.

Warning

This can create a huge number of jobs, so please use with caution.

-k, --keep-models

If trained models already exist for any of the learner/featureset combinations in your configuration file, just load those models and do not retrain/overwrite them.

-r, --resume

If result files already exist for an experiment, do not overwrite them. This is very useful when doing a large ablation experiment and part of it crashes.

-v, --verbose

Print more status information. For every additional time this flag is specified, output gets more verbose.

--version

Show program’s version number and exit.

GridMap options

If you have GridMap installed, run_experiment will automatically schedule jobs on your DRMAA-compatible cluster. You can use the following options to customize this behavior.

-l, --local

Run jobs locally instead of using the cluster. [3]

-q <queue>, --queue <queue>

Use this queue for GridMap. (default: all.q)

-m <machines>, --machines <machines>

Comma-separated list of machines to add to GridMap’s whitelist. If not specified, all available machines are used.

Note

Full names must be specified (e.g., nlp.research.ets.org).

Output files

The result, log, model, and prediction files generated by run_experiment will all share the automatically generated prefix EXPERIMENT_FEATURESET_LEARNER, where the following definitions hold:

EXPERIMENT
The name specified as experiment_name in the configuration file.
FEATURESET
The feature set we’re training on joined with “+”.
LEARNER
The learner the current results/model/etc. was generated using.

For every experiment you run, there will also be a result summary file generated that is a tab-delimited file summarizing the results for each learner-featureset combination you have in your configuration file. It is named EXPERIMENT_summary.tsv.
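
For example, with experiment_name set to my_experiment, a featureset named example_features, and the LogisticRegression learner, all generated files would share the prefix my_experiment_example_features_LogisticRegression, and the result summary would be written to my_experiment_summary.tsv.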

Footnotes

[1]We are considering adding support for YAML configuration files in the future, but we have not added this functionality yet.
[2]K-1 folds will be used for grid search within CV, so there should be at least 3 fold IDs.
[3]This will happen automatically if GridMap cannot be imported.