The xFusionCorp Industries ML team manages model hyperparameters through params.yaml so experiments can vary without code changes. The fraud-detection project’s train stage already wires params.yaml for n_estimators, but dvc repro currently fails. Correct the parameter wiring and demonstrate that DVC re-runs the train stage when the parameter changes.
A project exists at /root/code/fraud-detection/ with a three-stage DVC pipeline (process_data, split_data, train) and a params.yaml already in place. Do not modify the Python files.
The train stage in dvc.yaml references the n_estimators parameter. Every name listed under params: must resolve to a key in params.yaml.
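The resolution rule above can be sketched in a few lines of Python. This is only an illustration, not DVC's actual implementation: `resolve` and `missing_params` are hypothetical helpers, and the dictionaries stand in for the parsed contents of dvc.yaml and params.yaml. The empty "broken" file is an assumption, since the original params.yaml is not shown. The sketch also assumes that dotted names such as train.n_estimators address nested keys.

```python
from typing import Any


def resolve(params: dict[str, Any], dotted_name: str) -> bool:
    """Return True if dotted_name (e.g. 'train.n_estimators') exists in params."""
    node: Any = params
    for part in dotted_name.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def missing_params(stage_params: list[str], params: dict[str, Any]) -> list[str]:
    """List the names a stage declares that params.yaml does not define."""
    return [name for name in stage_params if not resolve(params, name)]


# Hypothetical stand-ins for the parsed files in this exercise.
train_stage_params = ["n_estimators"]   # names listed under params: in dvc.yaml
broken_params: dict[str, Any] = {}      # assumed: params.yaml without the key
fixed_params = {"n_estimators": 100}    # params.yaml after the fix

print(missing_params(train_stage_params, broken_params))  # ['n_estimators']
print(missing_params(train_stage_params, fixed_params))   # []
```

A non-empty result corresponds to the situation in which dvc repro reports missing parameters.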
Review params.yaml, correct whatever prevents dvc repro from completing, and run the full pipeline.
Demonstrate that DVC tracks parameter changes by updating n_estimators to a different value (for example 200). Run dvc repro again—only the train stage should re-execute, the new value must be recorded in dvc.lock, and models/model.pkl must be regenerated.
The DVC extension’s PARAMS section under the DVC view will surface the values from params.yaml directly in the editor.
First, let’s run dvc repro to see what prevents the pipeline from executing:
dvc repro
Running stage 'process_data':
> python src/data/process_data.py
Processed 15 rows
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Running stage 'split_data':
> python src/data/split_data.py
Train: 12 rows, Test: 3 rows
Updating lock file 'dvc.lock'
ERROR: failed to reproduce 'train': Parameters 'n_estimators' are missing from 'params.yaml'.
So the pipeline is looking for the n_estimators parameter, which is missing from params.yaml.
Let’s update params.yaml:
n_estimators: 100
Now, we can run dvc repro again:
dvc repro
Stage 'process_data' didn't change, skipping
Stage 'split_data' didn't change, skipping
Running stage 'train':
> python src/models/train.py
Trained RandomForestClassifier with n_estimators=100
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add models/.gitignore dvc.lock
To enable auto staging, run:
dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
The run now succeeds. The first two stages are skipped because nothing they depend on has changed.
Let’s change n_estimators in params.yaml from 100 to 200 and run dvc repro again:
n_estimators: 200
dvc repro
Stage 'process_data' didn't change, skipping
Stage 'split_data' didn't change, skipping
Running stage 'train':
> python src/models/train.py
Trained RandomForestClassifier with n_estimators=200
Updating lock file 'dvc.lock'
We can confirm that models/model.pkl and dvc.lock were updated as expected.
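One way to check that models/model.pkl really changed is to compare a content hash taken before and after the second run. The sketch below is an illustration under stated assumptions: `file_md5` and `was_regenerated` are hypothetical helpers, and MD5 is used only because it is the hash DVC records for outputs in dvc.lock.

```python
import hashlib
from pathlib import Path


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def was_regenerated(hash_before: str, path: str) -> bool:
    """True if the file's current hash differs from a hash taken earlier."""
    return file_md5(path) != hash_before
```

To use it, call file_md5("models/model.pkl") before editing params.yaml, run dvc repro, and then pass the saved digest to was_regenerated. This assumes that training with a different n_estimators produces different bytes in the pickle.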
DVC tracks the hyperparameters declared in params.yaml. When a value changes, DVC re-runs only the stages that depend on that parameter, so experiments stay fast. dvc.lock records the exact parameter values used by each stage, which keeps the pipeline reproducible.
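The re-run decision can be pictured as a comparison between the values stored in dvc.lock and the current values in params.yaml. The following is a simplified sketch, not DVC's implementation, which also checks commands, dependencies, and outputs. `stale_stages` is a hypothetical helper, the dictionaries stand in for the parsed files, and only flat (non-nested) parameter keys are handled.

```python
from typing import Any


def stale_stages(lock: dict[str, Any], params: dict[str, Any]) -> list[str]:
    """Return stages whose locked parameter values differ from params."""
    stale = []
    for name, stage in lock.get("stages", {}).items():
        # Stages without a params entry depend on no parameters.
        locked = stage.get("params", {}).get("params.yaml", {})
        if any(params.get(key) != value for key, value in locked.items()):
            stale.append(name)
    return stale


# Simplified stand-in for dvc.lock after the first successful run (assumed shape).
lock = {
    "stages": {
        "process_data": {"cmd": "python src/data/process_data.py"},
        "split_data": {"cmd": "python src/data/split_data.py"},
        "train": {
            "cmd": "python src/models/train.py",
            "params": {"params.yaml": {"n_estimators": 100}},
        },
    }
}

print(stale_stages(lock, {"n_estimators": 100}))  # []
print(stale_stages(lock, {"n_estimators": 200}))  # ['train']
```

Only train declares a parameter, so it is the only stage flagged when n_estimators changes, which matches the dvc repro output in this walkthrough.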