100-days-mlops-kodekloud

Version Datasets and Models Across Git Branches

Problem

The xFusionCorp Industries ML team keeps different dataset and model versions on different Git branches so that the team can switch between versions cleanly. Tag the current state as v1.0, create a v2-improved branch based on a newer dataset, and confirm that switching back restores the original data.

  1. A project exists at /root/code/fraud-detection/ with a working DVC pipeline and the baseline data/raw/transactions.csv already tracked.

  2. An improved dataset has been pre-staged at /root/code/fraud-detection/data/raw/transactions_v2.csv and is visible in the file explorer. Do not delete this file.

  3. On the main branch, tag the current state as v1.0.

  4. Create a new branch named v2-improved. Replace the tracked dataset with the contents of the v2 file, re-track it with DVC, re-run the pipeline, and commit the changes.

  5. Switch back to the main branch and use dvc checkout to restore the v1 dataset on disk. The restored content must match the hash recorded by the v1.0 tag.

The DVC extension’s DVC TRACKED section in the EXPLORER panel reflects the tracked state of the current branch, so it should show different dataset hashes on main and v2-improved.

Solution

To version the dataset across branches, follow these steps:

  1. Go to the project directory and tag the current state of main as v1.0:

    cd /root/code/fraud-detection/
    git switch main
    git tag v1.0
    git push origin v1.0    # optional: the task does not require it, and it needs a remote named origin
    
  2. Create branch for improved dataset:

    git switch -c v2-improved
    
  3. Replace tracked dataset with staged v2 file, then re-track it with DVC:

    cp data/raw/transactions_v2.csv data/raw/transactions.csv
    dvc add data/raw/transactions.csv
    
  4. Re-run DVC pipeline and commit updated branch state:

    dvc repro
    git add data/raw/transactions.csv.dvc dvc.lock .gitignore
    git commit -m "Use improved transactions dataset"
    git push origin v2-improved    # optional: the task does not require it, and it needs a remote named origin
    
  5. Switch back to main and restore original tracked data from DVC:

    git switch main
    dvc checkout
    
  6. Verify transactions.csv on main matches hash stored at v1.0:

    git show v1.0:data/raw/transactions.csv.dvc
    dvc status
    

The v2-improved branch now points to the newer dataset and regenerated pipeline outputs, while main recovers the original v1.0 tracked state after dvc checkout.
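Step 6 prints the recorded hash and the DVC status, but it leaves the actual comparison to the reader. The sketch below makes that comparison explicit. The helper names (dvc_md5, matches_recorded_hash, verify_against_ref) are hypothetical and are not DVC or Git commands. The sketch assumes a single-output .dvc file containing an `md5: <hash>` entry and a DVC release that records the plain content md5; older releases normalized line endings in text files before hashing, so a mismatch there does not necessarily mean the data differs.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not DVC or Git commands.

# Print the first md5 value found in .dvc file content read from stdin.
# Assumes a single-output .dvc file with an "md5: <hash>" entry.
dvc_md5() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "md5:") { print $(i + 1); exit } }'
}

# Succeed only if the md5 of file $2 equals the recorded hash $1.
matches_recorded_hash() {
    actual=$(md5sum "$2" | awk '{ print $1 }')
    [ "$actual" = "$1" ]
}

# Compare the dataset on disk with the hash recorded at a Git ref.
# $1: ref (e.g. v1.0), $2: path to the .dvc file, $3: path to the data file.
verify_against_ref() {
    recorded=$(git show "$1:$2" | dvc_md5)
    if matches_recorded_hash "$recorded" "$3"; then
        echo "match: $recorded"
    else
        echo "mismatch: expected $recorded"
        return 1
    fi
}
```

For this task, the intended call would be `verify_against_ref v1.0 data/raw/transactions.csv.dvc data/raw/transactions.csv`, run from the project root on main after dvc checkout.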
