100-days-mlops-kodekloud

Parameterize a DVC Pipeline

Problem

The xFusionCorp Industries ML team manages model hyperparameters through params.yaml so experiments can vary without code changes. The fraud-detection project’s train stage already wires params.yaml for n_estimators, but dvc repro currently fails. Correct the parameter wiring and demonstrate that DVC re-runs the train stage when the parameter changes.

A project exists at /root/code/fraud-detection/ with a three-stage DVC pipeline (process_data, split_data, train) and a params.yaml already in place. Do not modify the Python files.
The train stage in dvc.yaml references the n_estimators parameter. Every name listed under params: must resolve to a key in params.yaml.
Review params.yaml, correct whatever prevents dvc repro from completing, and run the full pipeline.
Demonstrate that DVC tracks parameter changes by updating n_estimators to a different value (for example 200). Run dvc repro again—only the train stage should re-execute, the new value must be recorded in dvc.lock, and models/model.pkl must be regenerated.

The DVC extension’s PARAMS section under the DVC view will surface the values from params.yaml directly in the editor.

Solution

First, let’s run dvc repro to see what prevents the pipeline from executing:

 dvc repro

 Running stage 'process_data':
 > python src/data/process_data.py
 Processed 15 rows
 Generating lock file 'dvc.lock'
 Updating lock file 'dvc.lock'

 Running stage 'split_data':
 > python src/data/split_data.py
 Train: 12 rows, Test: 3 rows
 Updating lock file 'dvc.lock'

 ERROR: failed to reproduce 'train': Parameters 'n_estimators' are missing from 'params.yaml'.

The error shows that the train stage expects an n_estimators parameter, which params.yaml does not define.
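
The rule behind this error (every name under params: must resolve to a key in params.yaml) can be sketched in a few lines of Python. This is only an illustration, not DVC's own implementation: `missing_params` is a hypothetical helper, the dictionaries stand in for a parsed params.yaml, and the misspelled key is an invented example of a broken file.

```python
from typing import Any, Iterable


def missing_params(names: Iterable[str], params: dict[str, Any]) -> list[str]:
    """Return the names that do not resolve to a key in the parsed params.yaml."""
    missing = []
    for name in names:
        node: Any = params
        for part in name.split("."):  # dotted names address nested keys
            if not isinstance(node, dict) or part not in node:
                missing.append(name)
                break
            node = node[part]
    return missing


# Hypothetical contents of params.yaml before and after the fix.
broken = {"n_estimator": 100}   # assumed typo, for illustration only
fixed = {"n_estimators": 100}

print(missing_params(["n_estimators"], broken))  # ['n_estimators']
print(missing_params(["n_estimators"], fixed))   # []
```

An empty result means every name the stage lists has a matching key, which is the condition dvc repro needs before it can run train.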

Let’s update params.yaml:
```
 n_estimators: 100
```

Now, we can run dvc repro again:

 dvc repro
 Stage 'process_data' didn't change, skipping
 Stage 'split_data' didn't change, skipping
 Running stage 'train':
 > python src/models/train.py
 Trained RandomForestClassifier with n_estimators=100
 Updating lock file 'dvc.lock'

 To track the changes with git, run:

         git add models/.gitignore dvc.lock

 To enable auto staging, run:

         dvc config core.autostage true
 Use `dvc push` to send your updates to remote storage.

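The pipeline now completes successfully. The process_data and split_data stages are skipped because nothing they depend on changed, so only train runs.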

Next, let’s change n_estimators in params.yaml from 100 to 200 and run dvc repro again:

 n_estimators: 200

 dvc repro
 Stage 'process_data' didn't change, skipping
 Stage 'split_data' didn't change, skipping
 Running stage 'train':
 > python src/models/train.py
 Trained RandomForestClassifier with n_estimators=200
 Updating lock file 'dvc.lock'

Only the train stage re-executed. models/model.pkl was regenerated with the new value, and dvc.lock was updated to record n_estimators: 200 for that stage.
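
One way to make the dvc.lock check explicit is to look up the recorded value in the parsed lock file. The sketch below assumes the lock file has already been loaded into a dictionary (for example with a YAML parser) and uses an illustrative excerpt of its layout; `locked_param` is a hypothetical helper, and the deps and outs entries are left out.

```python
from typing import Any, Optional


def locked_param(lock: dict[str, Any], stage: str, name: str,
                 params_file: str = "params.yaml") -> Optional[Any]:
    """Return the value dvc.lock recorded for `name` in `stage`, or None."""
    stage_entry = lock.get("stages", {}).get(stage, {})
    return stage_entry.get("params", {}).get(params_file, {}).get(name)


# Illustrative excerpt of a parsed dvc.lock after the second run.
lock_after_second_run = {
    "schema": "2.0",
    "stages": {
        "train": {
            "cmd": "python src/models/train.py",
            "params": {"params.yaml": {"n_estimators": 200}},
        }
    },
}

print(locked_param(lock_after_second_run, "train", "n_estimators"))  # 200
```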

Good to Know

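DVC tracks the parameters that each stage lists under params: in dvc.yaml and reads their values from params.yaml. When a value changes, DVC re-runs only the stages that depend on that parameter, along with any downstream stages whose inputs change as a result, so experiments stay fast.
dvc.lock records the exact parameter values used by each stage. On the next dvc repro, DVC compares them with params.yaml to decide what to re-run, which keeps the pipeline reproducible.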
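
The selective re-run can be pictured as a comparison between locked and current values. The following is a deliberately simplified model, not DVC's actual logic: it ignores dependency and output hashes, flattens the per-stage lock entries into one dictionary, and assumes that process_data and split_data track no parameters (consistent with them being skipped above). `stages_to_rerun` is a hypothetical helper.

```python
from typing import Any


def stages_to_rerun(stage_params: dict[str, list[str]],
                    locked: dict[str, Any],
                    current: dict[str, Any]) -> list[str]:
    """Return stages with a tracked parameter whose value differs from the lock."""
    stale = []
    for stage, names in stage_params.items():
        if any(locked.get(n) != current.get(n) for n in names):
            stale.append(stage)
    return stale


# Assumed parameter wiring: only train tracks n_estimators.
stage_params = {"process_data": [], "split_data": [], "train": ["n_estimators"]}

print(stages_to_rerun(stage_params, {"n_estimators": 100}, {"n_estimators": 200}))
# ['train']
print(stages_to_rerun(stage_params, {"n_estimators": 200}, {"n_estimators": 200}))
# []
```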
