Train a new model from scratch
usage: radon-defect-predictor train [-h] [--balancers BALANCERS] [--normalizers NORMALIZERS] path_to_csv classifiers
positional arguments:
path_to_csv the path to the csv file containing the data for training
classifiers a list of classifiers to train. Possible choices [dt, logit, nb, rf, svm]
optional arguments:
-h, --help show this help message and exit
--balancers BALANCERS
a list of balancer to balance training data. Possible choices [none, rus, ros]
--normalizers NORMALIZERS
a list of normalizers to normalize data. Possible choices [none, minmax, std]
Output
This command generates a radondp_model.joblib file in the user's working directory.
The file contains information about the best estimator, the subset of features selected during training, and the results of cross-validation.
Note: The new radondp_model.joblib overwrites any existing file with the same name in the user's working directory.
path_to_csv
radon-defect-predictor train path/to/repository-data.csv classifiers
The path to the training data (a .csv file). You can generate training data for IaC defect-prediction through radon-miner. An example observation is the following:
filepath | commit | committed_at | failure_prone | metric_1 | ... | metric_n |
---|---|---|---|---|---|---|
roles/tasks/main.yml | 25c04... | 1526444640 | 1 | value_1 | ... | value_n |
- filepath: string. The path to the file from the repository root;
- commit: string. The commit sha the file belongs to. In per-release defect-prediction, it is the commit sha of a release, and it is used to group observations of the same release;
- committed_at: string. The commit datetime. In release-based defect-prediction, it is the release date. In just-in-time defect-prediction, it is the commit date. It is used to sort releases/commits for walk-forward validation;
- failure_prone: integer. 1 if the observation is failure-prone; 0 otherwise;
- metric_i: float. A metric.
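The grouping and sorting roles of commit and committed_at can be illustrated with a minimal walk-forward sketch. It is not the tool's implementation: the function name walk_forward_splits and the sample observations are hypothetical, and it assumes committed_at holds epoch seconds (as in the example row above) so that int() gives a sortable key.

```python
from collections import defaultdict

# Hypothetical observations; metric columns are omitted for brevity.
observations = [
    {"filepath": "a.yml", "commit": "c2", "committed_at": "200", "failure_prone": 0},
    {"filepath": "a.yml", "commit": "c1", "committed_at": "100", "failure_prone": 1},
    {"filepath": "b.yml", "commit": "c1", "committed_at": "100", "failure_prone": 0},
    {"filepath": "a.yml", "commit": "c3", "committed_at": "300", "failure_prone": 1},
]


def walk_forward_splits(observations):
    """Yield (train, test) pairs: train on all earlier groups, test on the next one."""
    groups = defaultdict(list)
    for obs in observations:
        # Assumption: committed_at is epoch seconds stored as a string.
        groups[(int(obs["committed_at"]), obs["commit"])].append(obs)
    # Sort groups chronologically (ties broken by commit sha).
    ordered = [groups[key] for key in sorted(groups)]
    for i in range(1, len(ordered)):
        train = [obs for group in ordered[:i] for obs in group]
        yield train, ordered[i]


for train, test in walk_forward_splits(observations):
    print(len(train), "training rows ->", len(test), "test rows")
```

Each test set contains only observations that are newer than everything in the corresponding training set, which is the property walk-forward validation relies on.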
Warning
Missing one of the following columns will raise an error: filepath, commit, committed_at, failure_prone.
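To catch this before training, the header of a .csv file can be checked up front. The following is a small sketch using only the Python standard library; the helper name missing_columns is hypothetical and not part of radon-defect-predictor, and the in-memory sample stands in for a real file.

```python
import csv
import io

REQUIRED_COLUMNS = ("filepath", "commit", "committed_at", "failure_prone")


def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# In-memory stand-in for a real file, with 'committed_at' left out on purpose.
sample = io.StringIO(
    "filepath,commit,failure_prone,metric_1\n"
    "roles/tasks/main.yml,25c04,1,3.0\n"
)
print(missing_columns(sample))  # prints ['committed_at']
```

With a real file, the same function can be called as missing_columns(open("path/to/repository-data.csv", newline="")).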
classifiers
radon-defect-predictor train path_to_csv "dt logit nb rf svm"
- dt: Train a model using a sklearn.tree.DecisionTreeClassifier classifier;
- logit: Train a model using a sklearn.linear_model.LogisticRegression classifier;
- nb: Train a model using a sklearn.naive_bayes.GaussianNB classifier;
- rf: Train a model using a sklearn.ensemble.RandomForestClassifier classifier;
- svm: Train a model using a support vector machine classifier from sklearn.svm.
--balancers
radon-defect-predictor train --balancers="none rus ros"
- none: Do not balance training data;
- rus: Balance training data using Random Under-Sampling;
- ros: Balance training data using Random Over-Sampling.
Omitting --balancers is the same as passing none. However, none can also be passed along with the other values to train the model both with and without balancing the training set.
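For readers unfamiliar with the two balancing techniques, here is a minimal pure-Python sketch of what random under-sampling and random over-sampling do to a training set. The function names and the toy data are hypothetical; this illustrates the techniques in general and is not the code the tool runs.

```python
import random


def _group_by_class(rows, label):
    """Group rows by their class label."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label(row), []).append(row)
    return by_class


def random_under_sample(rows, label, rng):
    """Randomly drop rows until every class is as small as the minority class."""
    by_class = _group_by_class(rows, label)
    size = min(len(group) for group in by_class.values())
    return [row for group in by_class.values() for row in rng.sample(group, size)]


def random_over_sample(rows, label, rng):
    """Randomly duplicate rows until every class is as large as the majority class."""
    by_class = _group_by_class(rows, label)
    size = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=size - len(group)))
    return balanced


# Toy training set: 2 failure-prone rows (label 1) and 6 neutral rows (label 0).
rows = [(i, 1) for i in range(2)] + [(i, 0) for i in range(6)]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

under = random_under_sample(rows, label=lambda r: r[1], rng=rng)
over = random_over_sample(rows, label=lambda r: r[1], rng=rng)
print(len(under), len(over))  # prints 4 12
```

Under-sampling discards majority-class observations, while over-sampling repeats minority-class ones; trying both alongside none lets the results be compared.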
--normalizers
radon-defect-predictor train --normalizers="none minmax std"
- none: Do not normalize training data;
- minmax: Transform features by scaling each feature to the range [0,1]. It uses the sklearn.preprocessing.MinMaxScaler;
- std: Standardize features by removing the mean and scaling to unit variance. It uses the sklearn.preprocessing.StandardScaler.
Omitting --normalizers is the same as passing none. However, none can also be passed along with the other values to train the model both with and without normalizing the training set.
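The arithmetic behind the two normalizers can be sketched for a single feature column in plain Python. This is only an illustration of the formulas, with hypothetical function names; the tool itself relies on the scikit-learn scalers named above. The sketch uses the population standard deviation and maps a constant column to zeros.

```python
from statistics import fmean, pstdev


def minmax_scale(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


metric = [2.0, 4.0, 6.0]  # hypothetical values of one metric column
print(minmax_scale(metric))  # prints [0.0, 0.5, 1.0]
print(standardize(metric))
```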
Example
Assuming we are training a new model for the ansible-community/molecule project, download the following training set molecule.csv generated using radon-miner. This is the "ground truth" to train a model for that project.
You can now run radon-defect-predictor train ... from anywhere on your system.
For the sake of example, let's create and move to a new working directory:
mkdir radon_example
cd radon_example
mv /home/<user>/Downloads/molecule.csv .
Now run:
radon-defect-predictor train molecule.csv "dt" --balancers "none rus" --normalizers "minmax"
or (equivalent)
radon-defect-predictor train molecule.csv "dt" -b "none rus" -n "minmax"
The previous command loads and prepares the .csv file. Then, it builds a model using:
- the Decision Tree classifier ("dt");
- the Random Under-Sampling technique to balance the training data (rus), or no balancing (none);
- the minmax normalization to scale data within the range [0,1] (minmax).
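One way to picture the options in this command is as a grid of candidate configurations. The sketch below enumerates that grid with the standard library; it assumes each classifier/balancer/normalizer combination is one candidate, which is an interpretation of the description above and not a statement about how the tool searches internally.

```python
from itertools import product

# The space-separated values passed on the command line above.
classifiers = "dt".split()
balancers = "none rus".split()
normalizers = "minmax".split()

# Assumption: every combination of the three options is one candidate pipeline.
combos = list(product(classifiers, balancers, normalizers))
for combo in combos:
    print(combo)
# prints:
# ('dt', 'none', 'minmax')
# ('dt', 'rus', 'minmax')
```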
The built model (radondp_model.joblib) is saved into the current working directory.
You can see it by running:
ls
molecule.csv
radondp_model.joblib
You can run the same command with different combinations of balancers, normalizers, and classifiers, as explained in previous sections.