The xFusionCorp Industries ML team uses DVC pipelines to keep data processing reproducible. A draft dvc.yaml exists in the fraud-detection project, but dvc repro does not complete the full pipeline. Correct the pipeline definition so it runs cleanly end to end.
A project exists at /root/code/fraud-detection/ with DVC initialised. Python scripts are at src/data/process_data.py and src/data/split_data.py; raw input is at data/raw/transactions.csv. Do not modify the Python files or the input data.
The corrected pipeline must declare two stages with the following behaviour:
process_data: depends on data/raw/transactions.csv and src/data/process_data.py; produces data/processed/clean_transactions.csv.
split_data: depends on data/processed/clean_transactions.csv and src/data/split_data.py; produces data/processed/train.csv and data/processed/test.csv.
Review the existing dvc.yaml and correct everything that prevents dvc repro from completing.
After your changes, dvc repro must run end to end and dvc status must report no stale stages.
Once the pipeline is valid, the DVC extension’s PIPELINES section under the DVC view will list both stages and visualise the dependency graph between them.
The problem is in the pipeline definition, not the Python code. process_data points to the wrong script and output path, so dvc repro fails immediately.
dvc repro
root@controlplane fraud-detection on main ➜ dvc repro
Running stage 'process_data':
> python src/data/process.py
python: can't open file '/root/code/fraud-detection/src/data/process.py': [Errno 2] No such file or directory
ERROR: failed to reproduce 'process_data': failed to run: python src/data/process.py, exited with 2
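The log shows that the stage's cmd names a script that does not exist on disk. A minimal pre-flight sketch of that check follows, assuming every command has the form python <script> as in this project. The helper name missing_scripts is hypothetical and is not part of DVC.

```python
import shlex
from pathlib import Path


def missing_scripts(commands, project_root="."):
    """Return {stage: script} for stages whose script file does not exist.

    Assumes each command looks like "python <script> [args...]".
    """
    root = Path(project_root)
    missing = {}
    for stage, cmd in commands.items():
        parts = shlex.split(cmd)
        # Assumption: the script is the first token ending in ".py".
        script = next((p for p in parts if p.endswith(".py")), None)
        if script is not None and not (root / script).is_file():
            missing[stage] = script
    return missing


if __name__ == "__main__":
    # The draft command taken from the error log above.
    print(missing_scripts({"process_data": "python src/data/process.py"}))
```

Run from the project root, this would flag the same path that dvc repro complains about, without executing any stage.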
Correct the process_data stage so it uses the right script, depends on the raw input, and writes the expected cleaned file:
process_data:
  cmd: python src/data/process_data.py
  deps:
    - data/raw/transactions.csv
    - src/data/process_data.py
  outs:
    - data/processed/clean_transactions.csv
Add the downstream split_data stage so it depends on the cleaned output and produces both train and test files:
split_data:
  cmd: python src/data/split_data.py
  deps:
    - data/processed/clean_transactions.csv
    - src/data/split_data.py
  outs:
    - data/processed/train.csv
    - data/processed/test.csv
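The two stages are connected only through paths: an outs entry of one stage that also appears in the deps of another makes the second stage downstream of the first. The sketch below is a simplified model of that matching, not DVC's actual implementation. The STAGES dict and the stage_edges helper are hypothetical names that mirror the entries under the stages: key in dvc.yaml.

```python
# Simplified model of the two stages; the dicts mirror the dvc.yaml fields.
STAGES = {
    "process_data": {
        "deps": ["data/raw/transactions.csv", "src/data/process_data.py"],
        "outs": ["data/processed/clean_transactions.csv"],
    },
    "split_data": {
        "deps": ["data/processed/clean_transactions.csv", "src/data/split_data.py"],
        "outs": ["data/processed/train.csv", "data/processed/test.csv"],
    },
}


def stage_edges(stages):
    """Return (upstream, downstream) pairs where an out of one stage is a dep of another."""
    edges = []
    for up_name, up in stages.items():
        for down_name, down in stages.items():
            # An edge exists when the two stages share a path as out and dep.
            if up_name != down_name and set(up["outs"]) & set(down["deps"]):
                edges.append((up_name, down_name))
    return edges


if __name__ == "__main__":
    print(stage_edges(STAGES))  # [('process_data', 'split_data')]
```

If split_data did not list data/processed/clean_transactions.csv in its deps, this model would find no edge at all, which is why that dependency has to be declared explicitly.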
Re-run the pipeline and confirm the workspace is clean:
dvc repro
dvc status
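As an extra sanity check alongside dvc status, the three output files named in the task can be checked directly. This is a minimal sketch, assuming the project root from the task description. The helper missing_outputs is a hypothetical name, not a DVC API.

```python
from pathlib import Path

# Output paths taken from the task description.
EXPECTED_OUTPUTS = [
    "data/processed/clean_transactions.csv",
    "data/processed/train.csv",
    "data/processed/test.csv",
]


def missing_outputs(project_root, expected=EXPECTED_OUTPUTS):
    """Return the expected output paths that are not present under project_root."""
    root = Path(project_root)
    return [p for p in expected if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("/root/code/fraud-detection")
    print("all outputs present" if not missing else f"missing: {missing}")
```

An empty result only confirms that the files exist; dvc status remains the check that they are up to date with their dependencies.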
Review the full dvc.yaml pipeline before submission.
dvc repro runs stages in dependency order and stops at the first stage that fails.
split_data must depend on data/processed/clean_transactions.csv so DVC knows to re-run it when the upstream data changes.
dvc status should report no stale stages once the pipeline is fixed.