100-days-mlops-kodekloud

Build Complete DVC ML Pipeline with Remote Storage and Experiments

Problem

Complete the xFusionCorp Industries fraud-detection production DVC pipeline. Three stages are already wired in dvc.yaml, two remain, and the pipeline must finish as a reproducible, SeaweedFS-backed, v1.0-tagged release.

  1. A project exists at /root/code/ml-pipeline/ with Git and DVC initialised. The params.yaml is in place and the .dvc/config is pre-configured to push to the SeaweedFS bucket dvc-storage at http://localhost:8333.

  2. The ingest, validate, and preprocess stages are already declared in dvc.yaml, but one of them contains an incorrect output path that prevents dvc repro from completing. Find and fix it.

  3. The remaining two stages need to be added:

    • train – Depends on the preprocessed dataset and scripts/train.py; reads n_estimators, max_depth, test_size, and random_seed from params.yaml; outputs models/model.pkl and data/processed/test_split.csv; declares metrics.json as a DVC metric with cache: false.
    • evaluate – Depends on models/model.pkl, data/processed/test_split.csv, and scripts/evaluate.py; outputs reports/evaluation.json declared with cache: false.

  4. The two scripts you need are pre-staged at /root/code/ml-pipeline/scripts-staging/train.py and scripts-staging/evaluate.py. Copy them into scripts/ before adding the stages.

  5. Run the full pipeline with dvc repro, push the cache to the SeaweedFS remote with dvc push, and tag the current state as v1.0.

  6. Commit every change to Git so the release is fully captured.

Open the SeaweedFS Filer button at the top of the lab and navigate to /buckets/dvc-storage/ to confirm that the bucket holds the pushed artefacts under the files/md5/... layout.
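
To make the files/md5/... layout concrete: DVC stores each cached object under a path built from its MD5 hash, with the first two hex characters as a directory and the rest as the file name. The sketch below is only an illustration; the helper name remote_object_path is hypothetical, and it assumes the bucket has no extra prefix in front of files/md5/. The hashes themselves are listed per output in dvc.lock; outputs declared with cache: false are not stored in the cache, so they should not be expected in the bucket.

```shell
# Hypothetical helper: map an MD5 hash (as recorded in dvc.lock) to the
# Filer path where the pushed object is expected to appear.
remote_object_path() {
    md5="$1"
    prefix=$(printf '%s' "$md5" | cut -c1-2)   # first two hex characters
    rest=$(printf '%s' "$md5" | cut -c3-)      # remaining characters
    printf '/buckets/dvc-storage/files/md5/%s/%s\n' "$prefix" "$rest"
}

# Example (the hash here is the MD5 of an empty file, used only for illustration):
#   remote_object_path d41d8cd98f00b204e9800998ecf8427e
```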

Solution

  1. Let’s move into the project directory and run dvc repro to identify the issue:

     cd /root/code/ml-pipeline
     dvc repro
    
     Running stage 'ingest':
     > python scripts/ingest.py
     Data ingested successfully: 20 rows, 5 columns
     Generating lock file 'dvc.lock'
     Updating lock file 'dvc.lock'
    
     Running stage 'validate':
     > python scripts/validate.py
     Validation: 20 rows, valid=True
     Updating lock file 'dvc.lock'
    
     Running stage 'preprocess':
     > python scripts/preprocess.py
     Preprocessed: 20 clean rows
     ERROR: failed to reproduce 'preprocess': output 'data/processed/cleaned.csv' does not exist
    

    So we can see it is failing at the preprocess stage: dvc.yaml declares data/processed/cleaned.csv as the output, but the preprocess.py script actually writes clean.csv. Let’s change the output path in dvc.yaml to data/processed/clean.csv and run dvc repro again.
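
The fix described above is a one-line change in dvc.yaml. A minimal sketch, assuming the wrong path appears in dvc.yaml exactly as printed in the error message; the helper name fix_preprocess_output is hypothetical, and the rewrite goes through a temporary file so it does not rely on any particular sed -i dialect:

```shell
# Hypothetical helper: replace the wrong output path reported by dvc repro
# (data/processed/cleaned.csv) with the file preprocess.py actually writes.
fix_preprocess_output() {
    file="${1:-dvc.yaml}"
    sed 's#data/processed/cleaned\.csv#data/processed/clean.csv#' "$file" > "$file.tmp" \
        && mv "$file.tmp" "$file"
}

# From the project root:
#   fix_preprocess_output dvc.yaml
#   dvc repro
```

Editing the line by hand in a text editor works just as well; the point is that only the path under outs of the preprocess stage changes.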

  2. Add the remaining two stages:

    Let’s add two more stages, train and evaluate, to complete the pipeline according to the requirements.

     stages:
         ingest:
             cmd: python scripts/ingest.py
             deps:
                 - scripts/ingest.py
                 - data/raw/data.csv
    
         validate:
             cmd: python scripts/validate.py
             deps:
                 - data/raw/data.csv
                 - scripts/validate.py
             outs:
                 - reports/validation.json:
                     cache: false
    
         preprocess:
             cmd: python scripts/preprocess.py
             deps:
                 - data/raw/data.csv
                 - scripts/preprocess.py
             outs:
                 - data/processed/clean.csv
         train:
             cmd: python scripts/train.py
             deps:
                 - data/processed/clean.csv
                 - scripts/train.py
             params:
                 - n_estimators
                 - random_seed
                 - test_size
                 - max_depth
             outs:
                 - models/model.pkl
                 - data/processed/test_split.csv
             metrics:
                 - metrics.json:
                     cache: false
         evaluate:
             cmd: python scripts/evaluate.py
             deps:
                 - scripts/evaluate.py
                 - models/model.pkl
                 - data/processed/test_split.csv
             outs:
                 - reports/evaluation.json:
                     cache: false
    

    We have now added the two missing stages; the full dvc.yaml is available in the day 19 source code.
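
As an alternative to editing dvc.yaml by hand, the same two stages can be declared with dvc stage add. This is a sketch rather than the lab's reference solution; the wrapper name add_remaining_stages is hypothetical. Here -M declares a metric with cache: false and -O an output with cache: false. Run it in place of the hand edit rather than after it, since DVC will not silently redefine a stage that already exists.

```shell
# Hypothetical wrapper around two `dvc stage add` calls that mirror the
# train and evaluate stages shown in the dvc.yaml above.
add_remaining_stages() {
    dvc stage add -n train \
        -d data/processed/clean.csv -d scripts/train.py \
        -p n_estimators,max_depth,test_size,random_seed \
        -o models/model.pkl -o data/processed/test_split.csv \
        -M metrics.json \
        python scripts/train.py &&
    dvc stage add -n evaluate \
        -d models/model.pkl -d data/processed/test_split.csv -d scripts/evaluate.py \
        -O reports/evaluation.json \
        python scripts/evaluate.py
}

# From the project root:
#   add_remaining_stages
```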

  3. Add the missing scripts:

    Let’s copy the train.py and evaluate.py scripts from the staging area to the main scripts directory:

     cp scripts-staging/train.py scripts/
     cp scripts-staging/evaluate.py scripts/
    
  4. Let’s run dvc repro and push the cache to the SeaweedFS remote:

     dvc repro
     dvc push
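
Before committing, it can help to confirm that the pipeline and the remote are in the state the task expects. A minimal sketch using standard DVC commands; the wrapper name verify_pipeline is hypothetical and the commands only report, they change nothing:

```shell
# Hypothetical wrapper: three read-only checks after dvc repro and dvc push.
verify_pipeline() {
    # Reports stages whose dependencies or outputs differ from dvc.lock.
    dvc status
    # Compares the local cache with the default remote (SeaweedFS here).
    dvc status --cloud
    # Prints the metrics tracked by the pipeline (metrics.json from train).
    dvc metrics show
}

# From the project root:
#   verify_pipeline
```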
    
  5. Commit all changes to Git and tag the release as v1.0:

     git add .
     git commit -m "Build complete DVC full pipeline v1.0"
     git tag v1.0
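
To check that the release is fully captured, one option is to confirm that the tag exists, points at the current commit, and that nothing is left uncommitted. The function below is an illustrative sketch; check_release is a hypothetical name, and it relies only on standard git commands:

```shell
# Hypothetical helper: sanity-check a tagged release in the current repository.
check_release() {
    tag="${1:-v1.0}"
    # The tag must exist...
    [ -n "$(git tag --list "$tag")" ] || { echo "missing tag $tag"; return 1; }
    # ...point at the commit that is currently checked out...
    [ "$(git rev-parse "$tag^{commit}")" = "$(git rev-parse HEAD)" ] \
        || { echo "$tag is not at HEAD"; return 1; }
    # ...and the working tree must have no uncommitted or untracked changes.
    [ -z "$(git status --porcelain)" ] || { echo "uncommitted changes"; return 1; }
    echo "release $tag looks complete"
}

# From the project root:
#   check_release v1.0
```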
    

Good to Know?

Want to test locally?

Clone the full project and follow the guidelines to test the complete ml-pipeline DVC pipeline locally.