100-days-mlops-kodekloud

Create a DVC Pipeline for Data Processing

Problem

The xFusionCorp Industries ML team uses DVC pipelines to keep data processing reproducible. A draft dvc.yaml exists in the fraud-detection project, but dvc repro does not complete the full pipeline. Correct the pipeline definition so it runs cleanly end to end.

A project exists at /root/code/fraud-detection/ with DVC initialised. Python scripts are at src/data/process_data.py and src/data/split_data.py; raw input is at data/raw/transactions.csv. Do not modify the Python files or the input data.
The corrected pipeline must declare two stages with the following behaviour:
- process_data – Depends on data/raw/transactions.csv and src/data/process_data.py; produces data/processed/clean_transactions.csv.
- split_data – Depends on data/processed/clean_transactions.csv and src/data/split_data.py; produces data/processed/train.csv and data/processed/test.csv.
Review the existing dvc.yaml and correct everything that prevents dvc repro from completing.
After your changes, dvc repro must run end to end and dvc status must report no stale stages.

Once the pipeline is valid, the DVC extension’s PIPELINES section under the DVC view will list both stages and visualise the dependency graph between them.

Solution

The problem is in the pipeline definition, not the Python code. The process_data stage calls src/data/process.py, which does not exist (the script is src/data/process_data.py), and declares the wrong output path, so dvc repro fails at the first stage.

 dvc repro

 root@controlplane fraud-detection on main ➜  dvc repro
 Running stage 'process_data':
 > python src/data/process.py
 python: can't open file '/root/code/fraud-detection/src/data/process.py': [Errno 2] No such file or directory
 ERROR: failed to reproduce 'process_data': failed to run: python src/data/process.py, exited with 2

Correct the process_data stage so it uses the right script, depends on the raw input, and writes the expected cleaned file:

 process_data:
   cmd: python src/data/process_data.py
   deps:
     - data/raw/transactions.csv
     - src/data/process_data.py
   outs:
     - data/processed/clean_transactions.csv

Add the downstream split_data stage so it depends on the cleaned output and produces both train and test files:

 split_data:
   cmd: python src/data/split_data.py
   deps:
     - data/processed/clean_transactions.csv
     - src/data/split_data.py
   outs:
     - data/processed/train.csv
     - data/processed/test.csv

Re-run the pipeline and confirm the workspace is clean:
```
 dvc repro
 dvc status
```
Review the full dvc.yaml pipeline before submission.
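
For that review, the two corrected stages can be combined into a single file along these lines. This is a sketch: it assumes the file holds only the two stages required by the task, nested under DVC's top-level stages key. Any other keys the draft may contain are not shown in this write-up and are left out here.

```yaml
# dvc.yaml (sketch; assumes no stages beyond the two the task requires)
stages:
  # Cleans the raw transactions file.
  process_data:
    cmd: python src/data/process_data.py
    deps:
      - data/raw/transactions.csv
      - src/data/process_data.py
    outs:
      - data/processed/clean_transactions.csv

  # Splits the cleaned data into train and test sets.
  split_data:
    cmd: python src/data/split_data.py
    deps:
      - data/processed/clean_transactions.csv
      - src/data/split_data.py
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
```

There is no explicit ordering between the stages. DVC infers that split_data runs after process_data because data/processed/clean_transactions.csv is an output of one stage and a dependency of the other.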

Key Points

- dvc repro runs stages in dependency order and stops at the first stage that fails.
- A wrong script path, missing dependency, or incorrect output path is enough to break the pipeline.
- split_data must depend on data/processed/clean_transactions.csv so DVC runs it after process_data and reruns it when the cleaned data changes.
- dvc status should report no stale stages once the pipeline is fixed.
- The DVC extension can visualise the dependency graph after both stages are valid.
