100-days-mlops-kodekloud

Parameterize a DVC Pipeline

Problem

The xFusionCorp Industries ML team manages model hyperparameters through params.yaml so experiments can vary without code changes. The fraud-detection project’s train stage already wires params.yaml for n_estimators, but dvc repro currently fails. Correct the parameter wiring and demonstrate that DVC re-runs the train stage when the parameter changes.

  1. A project exists at /root/code/fraud-detection/ with a three-stage DVC pipeline (process_data, split_data, train) and a params.yaml already in place. Do not modify the Python files.

  2. The train stage in dvc.yaml references the n_estimators parameter. Every name listed under params: must resolve to a key in params.yaml.

  3. Review params.yaml, correct whatever prevents dvc repro from completing, and run the full pipeline.

  4. Demonstrate that DVC tracks parameter changes by updating n_estimators to a different value (for example 200). Run dvc repro again—only the train stage should re-execute, the new value must be recorded in dvc.lock, and models/model.pkl must be regenerated.

The DVC extension’s PARAMS section under the DVC view will surface the values from params.yaml directly in the editor.
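
Outside the editor, requirement 2 can also be checked from a shell before running the pipeline. The sketch below is a minimal example, assuming the parameter is a flat, top-level key in params.yaml (as n_estimators is in this task). The function name has_param is hypothetical, and nested keys would need a YAML-aware tool instead.

```shell
# has_param KEY [FILE]: succeed if KEY is a top-level key in FILE
# (FILE defaults to params.yaml). Hypothetical helper: it matches plain
# text at the start of a line and does not parse YAML.
has_param() {
    key="$1"
    file="${2:-params.yaml}"
    # -q: no output, -s: stay quiet if the file does not exist
    grep -Eqs "^${key}[[:space:]]*:" "$file"
}

# Run from the project root so params.yaml is found.
if has_param n_estimators; then
    echo "n_estimators found in params.yaml"
else
    echo "n_estimators missing from params.yaml"
fi
```

The check only looks at params.yaml. dvc repro remains the authoritative test, because DVC itself resolves the names listed under params: in dvc.yaml.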

Solution

  1. First, run dvc repro to see what prevents the pipeline from completing:

     dvc repro
    
     Running stage 'process_data':
     > python src/data/process_data.py
     Processed 15 rows
     Generating lock file 'dvc.lock'
     Updating lock file 'dvc.lock'
    
     Running stage 'split_data':
     > python src/data/split_data.py
     Train: 12 rows, Test: 3 rows
     Updating lock file 'dvc.lock'
    
     ERROR: failed to reproduce 'train': Parameters 'n_estimators' are missing from 'params.yaml'.
    

    The train stage expects an n_estimators parameter, but params.yaml does not define that key.

  2. Update params.yaml so that it defines n_estimators:

     n_estimators: 100
    
  3. Now, we can run dvc repro again:

     dvc repro
     Stage 'process_data' didn't change, skipping
     Stage 'split_data' didn't change, skipping
     Running stage 'train':
     > python src/models/train.py
     Trained RandomForestClassifier with n_estimators=100
     Updating lock file 'dvc.lock'
    
     To track the changes with git, run:
    
             git add models/.gitignore dvc.lock
    
     To enable auto staging, run:
    
             dvc config core.autostage true
     Use `dvc push` to send your updates to remote storage.
    

    The pipeline now completes. The first two stages are skipped because nothing they depend on changed, and only train runs.

  4. Now change n_estimators in params.yaml from 100 to a different value, such as 200, and run dvc repro again:

     n_estimators: 200
    
     dvc repro
     Stage 'process_data' didn't change, skipping
     Stage 'split_data' didn't change, skipping
     Running stage 'train':
     > python src/models/train.py
     Trained RandomForestClassifier with n_estimators=200
     Updating lock file 'dvc.lock'
    

    Only the train stage re-ran, and its log shows the new value, n_estimators=200. models/model.pkl was regenerated and dvc.lock was updated, as expected.
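
The confirmation in the last step can be made explicit from the shell. The sketch below is one possible approach, assuming it runs from the project root and that dvc.lock stores the train stage's parameters as plain "key: value" lines under its params: entry. The helper name yaml_value is hypothetical, and it is a text match rather than a YAML parser.

```shell
# yaml_value KEY FILE: print the value on the first "KEY: value" line in FILE,
# at any indentation. Hypothetical helper; a plain-text match, not a YAML parser.
yaml_value() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "$2"
}

# Run from the project root (/root/code/fraud-detection/ in this task).
if [ -f dvc.lock ] && [ -f params.yaml ]; then
    wanted=$(yaml_value n_estimators params.yaml)
    locked=$(yaml_value n_estimators dvc.lock)
    if [ "$wanted" = "$locked" ]; then
        echo "dvc.lock records n_estimators=$locked"
    else
        echo "mismatch: params.yaml=$wanted dvc.lock=$locked"
    fi

    # Reports whether any stage is out of date relative to dvc.lock.
    dvc status

    # The modification time shows whether the last run rewrote the model.
    ls -l models/model.pkl
fi
```

Comparing the two files directly avoids relying on the console log alone. If the values differ, dvc repro has not been run since params.yaml was last edited.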

Good to Know?