100-days-mlops-kodekloud

Build Complete DVC ML Pipeline with Remote Storage and Experiments

Problem

Complete the xFusionCorp Industries fraud-detection production DVC pipeline. Three stages are already wired in dvc.yaml, two remain, and the pipeline must finish as a reproducible, SeaweedFS-backed, v1.0-tagged release.

A project exists at /root/code/ml-pipeline/ with Git and DVC initialised. The params.yaml is in place and the .dvc/config is pre-configured to push to the SeaweedFS bucket dvc-storage at http://localhost:8333.
The ingest, validate, and preprocess stages are already declared in dvc.yaml, but one of them contains an incorrect output path that prevents dvc repro from completing. Find and fix it.
The remaining two stages need to be added:
- train – Depends on the preprocessed dataset and scripts/train.py; reads n_estimators, max_depth, test_size, and random_seed from params.yaml; outputs models/model.pkl and data/processed/test_split.csv; declares metrics.json as a DVC metric with cache: false.
- evaluate – Depends on models/model.pkl, data/processed/test_split.csv, and scripts/evaluate.py; outputs reports/evaluation.json declared with cache: false.
The two scripts you need are pre-staged at /root/code/ml-pipeline/scripts-staging/train.py and scripts-staging/evaluate.py. Copy them into scripts/ before adding the stages.
Run the full pipeline with dvc repro, push the cache to the SeaweedFS remote with dvc push, and tag the current state as v1.0.
Commit every change to Git so the release is fully captured.

Open the SeaweedFS Filer button at the top of the lab and navigate to /buckets/dvc-storage/ to confirm that the bucket holds the pushed artefacts under the files/md5/... layout.

Solution

Let’s move into the project directory and run dvc repro to find the failing stage:

 cd /root/code/ml-pipeline
 dvc repro

 dvc repro
 Running stage 'ingest':
 > python scripts/ingest.py
 Data ingested successfully: 20 rows, 5 columns
 Generating lock file 'dvc.lock'
 Updating lock file 'dvc.lock'

 Running stage 'validate':
 > python scripts/validate.py
 Validation: 20 rows, valid=True
 Updating lock file 'dvc.lock'

 Running stage 'preprocess':
 > python scripts/preprocess.py
 Preprocessed: 20 clean rows
 ERROR: failed to reproduce 'preprocess': output 'data/processed/cleaned.csv' does not exist

The run fails at the preprocess stage. dvc.yaml declares data/processed/cleaned.csv as the stage output, but scripts/preprocess.py actually writes clean.csv. Change the output path in dvc.yaml to data/processed/clean.csv and run dvc repro again.
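
One way to apply that fix from the command line is sketched below. It assumes the shell is in /root/code/ml-pipeline and that GNU sed is available; fix_clean_path is a hypothetical helper name used only for illustration, not part of DVC.

```shell
# Rewrite the wrong output path in the dvc.yaml-style file passed as $1.
# Assumption: GNU sed, which accepts -i without a backup suffix.
fix_clean_path() {
    sed -i 's#data/processed/cleaned\.csv#data/processed/clean.csv#' "$1"
}

# Apply it to the real pipeline file only when it is present.
if [ -f dvc.yaml ]; then
    grep -n 'cleaned.csv' dvc.yaml || true   # show the offending line, if any
    fix_clean_path dvc.yaml
fi
```

Editing dvc.yaml by hand works just as well; the point is that the pipeline definition changes, not the script.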

Add the remaining two stages:

Let’s add the train and evaluate stages to complete the pipeline according to the requirements.

 stages:
     ingest:
         cmd: python scripts/ingest.py
         deps:
             - scripts/ingest.py
             - data/raw/data.csv

     validate:
         cmd: python scripts/validate.py
         deps:
             - data/raw/data.csv
             - scripts/validate.py
         outs:
             - reports/validation.json:
                 cache: false

     preprocess:
         cmd: python scripts/preprocess.py
         deps:
             - data/raw/data.csv
             - scripts/preprocess.py
         outs:
             - data/processed/clean.csv
     train:
         cmd: python scripts/train.py
         deps:
             - data/processed/clean.csv
             - scripts/train.py
         params:
             - n_estimators
             - random_seed
             - test_size
             - max_depth
         outs:
             - models/model.pkl
             - data/processed/test_split.csv
         metrics:
             - metrics.json:
                 cache: false
     evaluate:
         cmd: python scripts/evaluate.py
         deps:
             - scripts/evaluate.py
             - models/model.pkl
             - data/processed/test_split.csv
         outs:
             - reports/evaluation.json:
                 cache: false

Both missing stages are now declared. The full source is in the day 19 dvc.yaml.
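
Because the train stage lists four parameters, it can help to confirm that params.yaml actually defines them before running the pipeline. The following is a minimal sketch, assuming the keys are top-level entries in params.yaml (which is what the plain names in the params list imply); missing_params is a hypothetical helper, not a DVC command.

```shell
# Print every key from the argument list that is not a top-level
# entry in the params file given as the first argument.
missing_params() {
    file=$1
    shift
    for key in "$@"; do
        grep -q "^${key}:" "$file" || echo "$key"
    done
}

# No output means all four keys used by the train stage are present.
if [ -f params.yaml ]; then
    missing_params params.yaml n_estimators max_depth test_size random_seed
fi
```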

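Add the missing scripts:

Copy the train.py and evaluate.py scripts from the staging area into the main scripts directory. This has to happen before the pipeline runs, because both new stages list these scripts as dependencies: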
```
 cp scripts-staging/train.py scripts/
 cp scripts-staging/evaluate.py scripts/
```
Run dvc repro to execute the full pipeline, then dvc push to upload the cache to the SeaweedFS remote:
```
 dvc repro
 dvc push
```

Commit all changes to Git and tag the release:

 git add .
 git commit -m "Build complete DVC full pipeline v1.0"
 git tag v1.0
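
A quick check that the tag landed on the final commit can be sketched as follows. It assumes the commands run inside the project's Git repository; tag_at_head is a hypothetical helper name.

```shell
# Succeeds only when the given tag exists and resolves to the same commit as HEAD.
tag_at_head() {
    tag_commit=$(git rev-parse --verify --quiet "$1^{commit}") || return 1
    [ "$tag_commit" = "$(git rev-parse HEAD)" ]
}

# Only run the check when the current directory is inside a Git repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
    tag_at_head v1.0 && echo "v1.0 points at HEAD" || echo "v1.0 is missing or stale"
    git status --short   # empty output means nothing was left uncommitted
fi
```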

Good to Know?

Run git add . and git commit before git tag v1.0; the tag must point to the final committed release, not uncommitted work.
After dvc push, confirm artifacts land in the SeaweedFS bucket under files/md5/... by opening the SeaweedFS Filer and browsing /buckets/dvc-storage/.
If dvc repro fails on preprocess, check that dvc.yaml uses data/processed/clean.csv, not data/processed/cleaned.csv.
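
To know where to look in the Filer for a particular file, the object key can be derived from the file's MD5 hash. The sketch below assumes the DVC 3.x layout, in which the first two hex characters of the hash form a directory and the remaining characters form the file name. dvc_object_key is a hypothetical helper, and md5sum is assumed to be installed.

```shell
# Print the key a file is expected to have under the remote,
# assuming the files/md5/<first 2 hex chars>/<remaining chars> layout.
dvc_object_key() {
    hash=$(md5sum "$1" | cut -d' ' -f1)
    prefix=$(printf '%s' "$hash" | cut -c1-2)
    rest=$(printf '%s' "$hash" | cut -c3-)
    echo "files/md5/${prefix}/${rest}"
}

# Example: where the trained model should appear under /buckets/dvc-storage/.
if [ -f models/model.pkl ]; then
    dvc_object_key models/model.pkl
fi
```

Outputs declared with cache: false, such as metrics.json and reports/evaluation.json, are not stored in the DVC cache, so they should not be expected in the bucket; they are tracked by Git instead.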

Want to test locally?

Clone the full project and follow the guidelines to test the full ml-pipeline DVC pipeline locally.
