100-days-mlops-kodekloud

Create a DVC Pipeline for Data Processing

Problem

The xFusionCorp Industries ML team uses DVC pipelines to keep data processing reproducible. A draft dvc.yaml exists in the fraud-detection project, but dvc repro does not complete the full pipeline. Correct the pipeline definition so it runs cleanly end to end.

  1. A project exists at /root/code/fraud-detection/ with DVC initialised. Python scripts are at src/data/process_data.py and src/data/split_data.py; raw input is at data/raw/transactions.csv. Do not modify the Python files or the input data.

  2. The corrected pipeline must declare two stages with the following behaviour:

    • process_data – Depends on data/raw/transactions.csv and src/data/process_data.py; produces data/processed/clean_transactions.csv.
    • split_data – Depends on data/processed/clean_transactions.csv and src/data/split_data.py; produces data/processed/train.csv and data/processed/test.csv.

  3. Review the existing dvc.yaml and correct everything that prevents dvc repro from completing.

  4. After your changes, dvc repro must run end to end and dvc status must report no stale stages.

Once the pipeline is valid, the DVC extension’s PIPELINES section under the DVC view will list both stages and visualise the dependency graph between them.

Solution

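  1. The fault is in the pipeline definition (dvc.yaml), not the Python code. The process_data stage runs a script that does not exist (src/data/process.py instead of src/data/process_data.py) and declares the wrong output path, so dvc repro fails on the first stage: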

     dvc repro
    
     root@controlplane fraud-detection on main ➜  dvc repro
     Running stage 'process_data':
     > python src/data/process.py
     python: can't open file '/root/code/fraud-detection/src/data/process.py': [Errno 2] No such file or directory
     ERROR: failed to reproduce 'process_data': failed to run: python src/data/process.py, exited with 2
    
  2. Correct the process_data stage, which sits under the top-level stages: key in dvc.yaml, so it runs the right script, depends on the raw input, and writes the expected cleaned file:

     process_data:
       cmd: python src/data/process_data.py
       deps:
         - data/raw/transactions.csv
         - src/data/process_data.py
       outs:
         - data/processed/clean_transactions.csv
    
  3. Add the downstream split_data stage so it depends on the cleaned output and produces both train and test files:

     split_data:
       cmd: python src/data/split_data.py
       deps:
         - data/processed/clean_transactions.csv
         - src/data/split_data.py
       outs:
         - data/processed/train.csv
         - data/processed/test.csv
    
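  4. Re-run the pipeline and confirm the workspace is clean. dvc repro should run both stages without errors, and dvc status should then report that data and pipelines are up to date: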

     dvc repro
     dvc status
    
  5. Review the full dvc.yaml pipeline before submission.
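
For reference, the two corrected stages can be assembled into one file. This is a sketch of what the finished dvc.yaml might look like, assuming the draft contains nothing else worth keeping (no params, metrics, or extra stages); the stage names, commands, and paths are taken from the task description, and the top-level stages: key is the standard DVC layout.

```yaml
# dvc.yaml (sketch): assumes the draft has no other keys that must be preserved
stages:
  # Stage 1: clean the raw transactions
  process_data:
    cmd: python src/data/process_data.py
    deps:
      - data/raw/transactions.csv
      - src/data/process_data.py
    outs:
      - data/processed/clean_transactions.csv

  # Stage 2: split the cleaned data; depends on the output of process_data
  split_data:
    cmd: python src/data/split_data.py
    deps:
      - data/processed/clean_transactions.csv
      - src/data/split_data.py
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
```

Because split_data lists data/processed/clean_transactions.csv as a dependency and process_data lists it as an output, DVC infers the stage order from the file itself. Running dvc dag should show process_data feeding into split_data.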

Key Points