Complete the xFusionCorp Industries fraud-detection production DVC pipeline. Three stages are already wired in dvc.yaml, two remain, and the pipeline must finish as a reproducible, SeaweedFS-backed, v1.0-tagged release.
A project exists at /root/code/ml-pipeline/ with Git and DVC initialised. The params.yaml is in place and the .dvc/config is pre-configured to push to the SeaweedFS bucket dvc-storage at http://localhost:8333.
The ingest, validate, and preprocess stages are already declared in dvc.yaml, but one of them contains an incorrect output path that prevents dvc repro from completing. Find and fix it.
The remaining two stages need to be added:
train – Depends on the preprocessed dataset and scripts/train.py; reads n_estimators, max_depth, test_size, and random_seed from params.yaml; outputs models/model.pkl and data/processed/test_split.csv; declares metrics.json as a DVC metric with cache: false.
evaluate – Depends on models/model.pkl, data/processed/test_split.csv, and scripts/evaluate.py; outputs reports/evaluation.json declared with cache: false.
The two scripts you need are pre-staged at /root/code/ml-pipeline/scripts-staging/train.py and scripts-staging/evaluate.py. Copy them into scripts/ before adding the stages.
Run the full pipeline with dvc repro, push the cache to the SeaweedFS remote with dvc push, and tag the current state as v1.0.
Commit every change to Git so the release is fully captured.
Open the SeaweedFS Filer button at the top of the lab and navigate to /buckets/dvc-storage/ to confirm that the bucket holds the pushed artefacts under the files/md5/... layout.
Let’s move into the project directory and run dvc repro to identify the issue:
cd ml-pipeline
dvc repro
Running stage 'ingest':
> python scripts/ingest.py
Data ingested successfully: 20 rows, 5 columns
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Running stage 'validate':
> python scripts/validate.py
Validation: 20 rows, valid=True
Updating lock file 'dvc.lock'
Running stage 'preprocess':
> python scripts/preprocess.py
Preprocessed: 20 clean rows
ERROR: failed to reproduce 'preprocess': output 'data/processed/cleaned.csv' does not exist
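So we can see that it fails at the preprocess stage. If we look at the preprocess.py script, the actual output file is clean.csv, while dvc.yaml declares data/processed/cleaned.csv. Let’s rename the output path in dvc.yaml to data/processed/clean.csv and run dvc repro again.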
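The rename can be done by hand in an editor, or with a one-line substitution. The following is a minimal sketch, assuming GNU sed (as found on most Linux lab machines); fix_preprocess_out is a hypothetical helper name, not part of DVC.

```shell
# Sketch: rewrite the wrong preprocess output path in a dvc.yaml file.
# fix_preprocess_out is a hypothetical helper, not a DVC command.
fix_preprocess_out() {
  local file="${1:-dvc.yaml}"
  # '#' is used as the sed delimiter because the paths contain '/'.
  # -i edits the file in place (GNU sed syntax).
  sed -i 's#data/processed/cleaned\.csv#data/processed/clean.csv#' "$file"
}
```

Calling fix_preprocess_out dvc.yaml from the project root and then running dvc repro should let the preprocess stage complete, provided that path was the only mismatch.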
Add the remaining two stages:
Let’s add two more stages, train and evaluate, to complete the pipeline according to the requirements.
stages:
  ingest:
    cmd: python scripts/ingest.py
    deps:
      - scripts/ingest.py
      - data/raw/data.csv
  validate:
    cmd: python scripts/validate.py
    deps:
      - data/raw/data.csv
      - scripts/validate.py
    outs:
      - reports/validation.json:
          cache: false
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
      - data/raw/data.csv
      - scripts/preprocess.py
    outs:
      - data/processed/clean.csv
  train:
    cmd: python scripts/train.py
    deps:
      - data/processed/clean.csv
      - scripts/train.py
    params:
      - n_estimators
      - random_seed
      - test_size
      - max_depth
    outs:
      - models/model.pkl
      - data/processed/test_split.csv
    metrics:
      - metrics.json:
          cache: false
  evaluate:
    cmd: python scripts/evaluate.py
    deps:
      - scripts/evaluate.py
      - models/model.pkl
      - data/processed/test_split.csv
    outs:
      - reports/evaluation.json:
          cache: false
We have added two missing stages. Full source code of day 19 dvc.yaml
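Before running the pipeline, it can help to confirm that all five stages are actually declared. The sketch below is a plain-text check, not a YAML parser: it assumes stage names sit at two-space indentation under stages:, and missing_stages is a hypothetical helper name. The dvc stage list command gives DVC's own view of the declared stages.

```shell
# Sketch: print every expected stage name that has no entry in dvc.yaml.
# missing_stages is a hypothetical helper; it assumes two-space indentation.
missing_stages() {
  local file="${1:-dvc.yaml}" stage
  for stage in ingest validate preprocess train evaluate; do
    # A stage entry looks like "  train:" on a line of its own.
    grep -Eq "^  ${stage}:[[:space:]]*$" "$file" || echo "$stage"
  done
}
```

An empty result from missing_stages dvc.yaml means all five stage names were found; any name it prints still needs to be added.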
Add missing scripts:
Let’s copy the train.py and evaluate.py scripts from the staging area to the main scripts directory:
cp scripts-staging/train.py scripts/
cp scripts-staging/evaluate.py scripts/
Let’s run dvc repro again, then push the cache to the SeaweedFS remote:
dvc repro
dvc push
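To know what to look for in the Filer after the push, you can work out where a given file should land. This is a sketch under the assumption that the project uses the DVC 3.x layout named in the task (files/md5/, then the first two hex characters of the content's MD5, then the remaining thirty); dvc_object_key is a hypothetical helper, and md5sum is assumed to be available as on most Linux systems.

```shell
# Sketch: derive the object key a DVC 3.x remote would use for a file.
# dvc_object_key is a hypothetical helper, not a DVC command.
dvc_object_key() {
  local hash
  # md5sum prints "<hash>  <path>"; keep only the 32-character hash.
  hash=$(md5sum "$1" | cut -d' ' -f1)
  # Layout: files/md5/<first two hex chars>/<remaining thirty chars>.
  printf 'files/md5/%s/%s\n' "${hash:0:2}" "${hash:2}"
}
```

For example, dvc_object_key models/model.pkl prints a key that should appear under /buckets/dvc-storage/ and whose hash should match the md5 recorded for that output in dvc.lock.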
Commit all changes to Git and tag the release:
git add .
git commit -m "Build complete DVC full pipeline v1.0"
git tag v1.0
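Since the tag has to point at the final commit, a quick check after tagging can catch a tag created too early. The following is a minimal sketch; tag_points_to_head is a hypothetical helper name, and it only compares commit hashes using git rev-parse.

```shell
# Sketch: succeed only when the given tag resolves to the current HEAD commit.
# tag_points_to_head is a hypothetical helper, not a Git command.
tag_points_to_head() {
  local tag="${1:-v1.0}"
  # ^{commit} peels an annotated tag down to the commit it references.
  [ "$(git rev-parse "${tag}^{commit}" 2>/dev/null)" = "$(git rev-parse HEAD)" ]
}
```

Running tag_points_to_head v1.0 && echo "tag is current" prints the message only if v1.0 and HEAD are the same commit. Running git status --short alongside it shows whether any uncommitted changes were left out of the release.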
Run git add . and git commit before git tag v1.0; the tag must point to the final committed release, not uncommitted work.
After dvc push, confirm artifacts land in the SeaweedFS bucket under files/md5/... by opening the SeaweedFS Filer and browsing /buckets/dvc-storage/.
If dvc repro fails on preprocess, check that dvc.yaml uses data/processed/clean.csv, not data/processed/cleaned.csv.
Clone the full project and follow the guidelines to test the ml-pipeline full DVC pipeline locally.