run-cache: cache stage runs with no dependencies? #10258

Open
skshetry opened this issue Jan 27, 2024 · 7 comments
Labels
A: run-cache Related to the run-cache feature

Comments

@skshetry
Member

stages:
  stage1:
    cmd: echo foo > foo
    outs:
    - foo

Say we have the above stage, with no dependencies and one output. When I run it and then rerun it, it says:

$ dvc repro
Running stage 'stage1':
> echo foo > foo
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add dvc.lock .gitignore

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
Stage 'stage1' didn't change, skipping

$ dvc repro
Stage 'stage1' didn't change, skipping
Data and pipelines are up to date.

But if the lock file is missing or the stage name changes, it will force a rerun.
Ideally, the run-cache is supposed to prevent this scenario, but it does not work for a stage without any dependencies. Should it cache those kinds of stages?

cc @efiop

Related: https://iterativeai.slack.com/archives/C044738NACC/p1706207735608469

@skshetry added the A: run-cache label Jan 27, 2024
@dberenbaum
Contributor

Related to #6718. Do we have a good reason not to add no-dep stages to the run-cache? With all the templating we have now, it seems more important to handle these cases since the templated values may take the place of actual dependencies.
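
For illustration, a minimal sketch of such a stage (the stage name and `data_url` value are hypothetical; `data_url` is assumed to be defined in params.yaml), where the templated value is effectively the only input:

stages:
  download:
    # Hypothetical example: changing `data_url` changes the command, which
    # invalidates the dvc.lock entry; with no deps there is nothing else for
    # DVC to compare against, and the run-cache currently won't restore it.
    cmd: wget ${data_url} -O raw.csv
    outs:
    - raw.csv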

@skshetry
Member Author

skshetry commented Jan 31, 2024

> Related to #6718. Do we have a good reason not to add no-dep stages to the run-cache? With all the templating we have now, it seems more important to handle these cases since the templated values may take the place of actual dependencies.

These no-deps stages that get invalidated (due to a missing dvc.lock entry, or a name change due to templating) are an edge case. It can be an annoyance for sure, but it is technically still valid to rerun them. In fact, we used to always run these kinds of stages until 2.0, where changing that was a breaking change (#5187).

> Do we have a good reason not to add no-dep stages to the run-cache?

There's a risk that run-cache will check out to a very old state. Since there are no dependencies to match, there can be many "valid" states from older runs.

No-deps stages are the first stages to run in the pipeline and are usually used to download from remotes. They act as a trigger for the downstream stages. Using a very old state might affect the whole pipeline.

@dberenbaum
Contributor

> No-deps stages are the first stages to run in the pipeline and are usually used to download from remotes. They act as a trigger for the downstream stages. Using a very old state might affect the whole pipeline.

Sorry, I don't follow what type of stage you have in mind. Could you show an example?

I think we under-utilize the run-cache, and I have talked to a few people who intuitively expect the run-cache to always work since they think "I ran this before, so DVC should know not to run it again." We have an easy solution if users always want to run it, but no solution for people who want to use the run-cache here.
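
The "easy solution if users always want to run it" is presumably `always_changed` — a sketch, assuming that is the option meant (the stage itself is hypothetical):

stages:
  load_data:
    cmd: wget https://example.com/raw.csv
    outs:
    - raw.csv
    # Marks the stage as always modified, so `dvc repro` never skips it.
    # That covers the "always rerun" case; the open question in this issue
    # is the opposite one, where the run-cache should avoid a rerun.
    always_changed: true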

@skshetry
Member Author

> Sorry, I don't follow what type of stage you have in mind. Could you show an example?

stages:
  load_data:
    cmd: 
    - wget https://example.com/raw.csv
    outs:
    - raw.csv
 
  extract_data:
    cmd: python extract_data.py
    deps:
    - raw.csv
    outs:
    - train.csv
    - test.csv

  train:
    cmd: python train.py
    params:
    - train
    deps:
    - train.csv
    outs:
    - model.joblib

  evaluate:
    cmd: python evaluate.py
    params:
    - evaluate
    deps:
    - test.csv
    - train.csv
    - model.joblib
    metrics:
    - reference.json
flowchart TD
        node1["evaluate"]
        node2["extract_data"]
        node3["load_data"]
        node4["train"]
        node2-->node1
        node2-->node4
        node3-->node2
        node4-->node1

@dberenbaum
Contributor

Why not include https://example.com/raw.csv in deps or use import-url? I don't see why this scenario should not have any deps.
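
A sketch of what that would look like, assuming DVC's support for HTTP URLs as external dependencies (the alternative being dvc import-url https://example.com/raw.csv raw.csv):

stages:
  load_data:
    cmd: wget https://example.com/raw.csv
    deps:
    # External HTTP dependency: DVC tracks the remote file and only reruns
    # the stage when it detects that the source has changed.
    - https://example.com/raw.csv
    outs:
    - raw.csv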

@skshetry
Member Author

skshetry commented Jan 31, 2024

> Why not include https://example.com/raw.csv in deps or use import-url? I don't see why this scenario should not have any deps.

The pipeline is very simple and does the job. External deps and import-url are additional concepts to learn.

Besides, this example is "inspired" by a recent tutorial, but I have seen a lot of dvc.yaml files like this.

https://github.com/iterative/evidently-dvc/blob/f0ed5c0f526c9eaf2b5dde57d500abc08d063614/pipelines/train/dvc.yaml

You can find similar examples like this through GitHub Search:

https://github.com/search?q=path%3A**%2Fdvc.yaml+cmd+wget+OR+curl&type=code

@dberenbaum
Contributor

Thanks, looks like this is indeed common.

> There's a risk that run-cache will check out to a very old state. Since there are no dependencies to match, there can be many "valid" states from older runs.

Still not sure I agree with this concern, though. In the examples I see, it looks like it's a static dataset and expected to only run once, and enabling the run-cache makes it more likely that it is only run once.
