Train a new model from scratch
usage: radon-defect-predictor train [-h] [--balancers BALANCERS] [--normalizers NORMALIZERS] path_to_csv classifiers
positional arguments:
path_to_csv the path to the csv file containing the data for training
classifiers a list of classifiers to train. Possible choices [dt, logit, nb, rf, svm]
optional arguments:
-h, --help show this help message and exit
--balancers BALANCERS
a list of balancer to balance training data. Possible choices [none, rus, ros]
--normalizers NORMALIZERS
a list of normalizers to normalize data. Possible choices [none, minmax, std]
Output
This command will generate a radondp_model.joblib file in the user working directory.
The file contains information about the best estimator, subset of features selected by the training, and the results
of cross-validation.
Note: The new radondp_model.joblib overwrites any existing file with the same name in the user working directory.
path_to_csv
radon-defect-predictor train path/to/repository-data.csv "dt"
The path to the training data (a .csv file). You can generate training data for IaC defect-prediction through radon-miner. An example observation is the following:
| filepath | commit | committed_at | failure_prone | metric_1 | ... | metric_n |
|---|---|---|---|---|---|---|
| roles/tasks/main.yml | 25c04... | 1526444640 | 1 | value_1 | ... | value_n |
- filepath (string): the path to the file from the repository root;
- commit (string): the commit sha the file belongs to. In release-based defect-prediction, it is the commit sha of a release, and it is used to group observations of the same release;
- committed_at (string): the commit datetime. In release-based defect-prediction, it is the release date. In just-in-time defect-prediction, it is the commit date. It is used to sort releases/commits for walk-forward validation;
- failure_prone (integer): 1 if the observation is failure-prone; 0 otherwise;
- metric_i (float): a metric.
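How commit and committed_at support walk-forward validation can be illustrated with a minimal sketch. It assumes each observation is a dict with the columns described above, that committed_at holds a Unix timestamp as in the example row, and that each step trains on all earlier releases and tests on the next one. The helper name walk_forward_splits is hypothetical, and this is not necessarily how radon-defect-predictor implements the validation.

```python
from itertools import groupby


def walk_forward_splits(rows):
    """Yield (train, test) lists of rows, one pair per release after the first.

    Hypothetical helper: rows are dicts with at least 'commit' and
    'committed_at' keys, as in the table above.
    """
    # Sort observations chronologically (assumes Unix timestamps); ties are
    # broken by commit sha so rows of the same release stay adjacent.
    ordered = sorted(rows, key=lambda r: (int(r["committed_at"]), r["commit"]))

    # Group observations that belong to the same release/commit.
    releases = [list(group) for _, group in groupby(ordered, key=lambda r: r["commit"])]

    # Train on every release seen so far, test on the next one.
    for i in range(1, len(releases)):
        train = [row for release in releases[:i] for row in release]
        yield train, releases[i]
```

With three releases, this yields two splits: the first trains on release 1 and tests on release 2, the second trains on releases 1 and 2 and tests on release 3.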
Warning
An error is raised if any of the following columns is missing: filepath, commit, committed_at, failure_prone.
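To catch a missing column before starting a training run, the header of the .csv file can be checked up front. The following is a minimal sketch using only the Python standard library; the function name missing_columns is hypothetical and is not part of radon-defect-predictor.

```python
import csv

# Columns the training command requires, as listed in the warning above.
REQUIRED_COLUMNS = ("filepath", "commit", "committed_at", "failure_prone")


def missing_columns(path_to_csv):
    """Return the required columns that are absent from the CSV header."""
    with open(path_to_csv, newline="") as handle:
        # The first row is assumed to be the header; an empty file has none.
        header = next(csv.reader(handle), [])
    return [name for name in REQUIRED_COLUMNS if name not in header]
```

An empty list means that all four required columns are present.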
classifiers
radon-defect-predictor train path/to/repository-data.csv "dt logit nb rf svm"
- dt: Train a model using a sklearn.tree.DecisionTreeClassifier classifier;
- logit: Train a model using a sklearn.linear_model.LogisticRegression classifier;
- nb: Train a model using a sklearn.naive_bayes.GaussianNB classifier;
- rf: Train a model using a sklearn.ensemble.RandomForestClassifier classifier;
- svm: Train a model using a support vector machine classifier from sklearn.svm.
--balancers
radon-defect-predictor train --balancers="none rus ros"
- none: Do not balance training data;
- rus: Balance training data using Random Under-Sampling;
- ros: Balance training data using Random Over-Sampling.
Omitting --balancers is equivalent to passing none.
However, none can also be passed together with the other choices to train models both with and without balancing the training set.
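The two balancing techniques can be sketched in plain Python to show what they do to the class distribution. This is an illustration only, assuming each observation is a dict with a failure_prone key; the function names are hypothetical and this is not the code radon-defect-predictor runs.

```python
import random
from collections import defaultdict


def _by_label(rows, label="failure_prone"):
    """Group rows by their class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label]].append(row)
    return groups


def random_under_sample(rows, seed=0):
    """Randomly drop rows from larger classes until all match the smallest one."""
    rng = random.Random(seed)
    groups = _by_label(rows)
    target = min(len(g) for g in groups.values())
    return [row for g in groups.values() for row in rng.sample(g, target)]


def random_over_sample(rows, seed=0):
    """Randomly duplicate rows of smaller classes until all match the largest one."""
    rng = random.Random(seed)
    groups = _by_label(rows)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(g, k=target - len(g)))
    return balanced
```

For a set with 2 failure-prone and 6 neutral observations, under-sampling keeps 2 of each class, while over-sampling yields 6 of each.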
--normalizers
radon-defect-predictor train --normalizers="none minmax std"
- none: Do not normalize training data;
- minmax: Transform features by scaling each feature to the range [0,1]. It uses the sklearn.preprocessing.MinMaxScaler;
- std: Standardize features by removing the mean and scaling to unit variance. It uses the sklearn.preprocessing.StandardScaler.
Omitting --normalizers is equivalent to passing none.
However, none can also be passed together with the other choices to train models both with and without normalizing the training set.
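The arithmetic behind the two normalizers can be shown on a single feature column. The sketch below uses only the standard library and hypothetical function names; the tool itself relies on the scikit-learn scalers named above. Mapping a constant column to zeros is a choice made for this sketch.

```python
from statistics import fmean, pstdev


def minmax_scale(values):
    """Scale values linearly to [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]
```

For example, minmax_scale([2, 4, 6]) gives [0.0, 0.5, 1.0], and standardize([1, 2, 3]) gives values centered on 0 with unit variance.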
Example
Assuming we are training a new model for the ansible-community/molecule project, download the following training set molecule.csv generated using radon-miner. This is the "ground truth" to train a model for that project.
You can now run radon-defect-predictor train ... from anywhere on your system.
For the sake of example, let's create and move to a new working directory:
mkdir radon_example
cd radon_example
mv /home/<user>/Downloads/molecule.csv .
Now run:
radon-defect-predictor train molecule.csv "dt" --balancers "none rus" --normalizers "minmax"
or (equivalent)
radon-defect-predictor train molecule.csv "dt" -b "none rus" -n "minmax"
The previous command loads and prepares the .csv file. Then, it builds a model using:
- the Decision Tree classifier ("dt");
- the Random Under-Sampling technique to balance the training data (rus), or none (none);
- the minmax normalization to scale data within the range [0,1] (minmax).
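The combinations implied by the options above can be enumerated with a short sketch. It assumes that the tool tries every (classifier, balancer, normalizer) combination and keeps the best estimator; the source does not spell out this search, and the function name candidate_pipelines is hypothetical.

```python
from itertools import product


def candidate_pipelines(classifiers, balancers="none", normalizers="none"):
    """Expand space-separated option strings into (classifier, balancer, normalizer) tuples."""
    return list(product(classifiers.split(), balancers.split(), normalizers.split()))


# The options used in the example command above.
for combo in candidate_pipelines("dt", balancers="none rus", normalizers="minmax"):
    print(combo)
# prints:
# ('dt', 'none', 'minmax')
# ('dt', 'rus', 'minmax')
```

Under the same assumption, passing all five classifiers, three balancers, and three normalizers would yield 45 combinations.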
The built model (radondp_model.joblib) is saved into the current working directory.
You can see it by running:
ls
molecule.csv
radondp_model.joblib

You can run the same command with different combinations of balancers, normalizers, and classifiers, as explained in previous sections.