run-cache: cache stage runs with no dependencies? #10258

Open
skshetry opened this issue Jan 27, 2024 · 7 comments
Labels
A: run-cache Related to the run-cache feature

Comments

@skshetry
Member

stages:
  stage1:
    cmd: echo foo > foo
    outs:
    - foo

Say we have the above stage, with no dependencies and one output. When I run it and then rerun it, it says:

$ dvc repro
Running stage 'stage1':
> echo foo > foo
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add dvc.lock .gitignore

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
Stage 'stage1' didn't change, skipping

$ dvc repro
Stage 'stage1' didn't change, skipping
Data and pipelines are up to date.

But if the lock file is missing or the stage name changes, it will force a rerun.
Ideally, the run-cache is supposed to prevent this scenario, but it does not work for a stage without any dependencies. Should it cache those kinds of stages?

cc @efiop

Related: https://iterativeai.slack.com/archives/C044738NACC/p1706207735608469

@skshetry added the A: run-cache label Jan 27, 2024
@dberenbaum
Contributor

Related to #6718. Do we have a good reason not to add no-dep stages to the run-cache? With all the templating we have now, it seems more important to handle these cases since the templated values may take the place of actual dependencies.
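
For illustration, a minimal sketch of such a stage (the stage name and `data_url` value are hypothetical; `data_url` is assumed to be defined in params.yaml), where the templated value is effectively the only input:

stages:
  download:
    # Hypothetical example: changing `data_url` changes the command, which
    # invalidates the dvc.lock entry; with no deps there is nothing else for
    # DVC to compare against, and the run-cache currently won't restore it.
    cmd: wget ${data_url} -O raw.csv
    outs:
    - raw.csv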

@skshetry
Member Author

skshetry commented Jan 31, 2024

> Related to #6718. Do we have a good reason not to add no-dep stages to the run-cache? With all the templating we have now, it seems more important to handle these cases since the templated values may take the place of actual dependencies.

These no-deps stages that get invalidated (due to a missing dvc.lock entry, or a name change due to templating) are an edge case. It can be an annoyance for sure, but it is technically still valid to rerun them. In fact, we used to always run these kinds of stages until 2.0, where changing that was a breaking change (#5187).

> Do we have a good reason not to add no-dep stages to the run-cache?

There's a risk that run-cache will check out to a very old state. Since there are no dependencies to match, there can be many "valid" states from older runs.

No-deps stages are the first stages to run in the pipeline and are usually used to download from remotes. They act as a trigger for the downstream stages. Using a very old state might affect the whole pipeline.

@dberenbaum
Contributor

> No-deps stages are the first stages to run in the pipeline and are usually used to download from remotes. They act as a trigger for the downstream stages. Using a very old state might affect the whole pipeline.

Sorry, I don't follow what type of stage you have in mind. Could you show an example?

I think we under-utilize the run-cache, and I have talked to a few people who intuitively expect the run-cache to always work since they think "I ran this before, so DVC should know not to run it again." We have an easy solution if users always want to run it, but no solution for people who want to use the run-cache here.
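
The "easy solution if users always want to run it" is presumably `always_changed` — a sketch, assuming that is the option meant (the stage itself is hypothetical):

stages:
  load_data:
    cmd: wget https://example.com/raw.csv
    outs:
    - raw.csv
    # Marks the stage as always modified, so `dvc repro` never skips it.
    # That covers the "always rerun" case; the open question in this issue
    # is the opposite one, where the run-cache should avoid a rerun.
    always_changed: true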

@skshetry
Member Author

> Sorry, I don't follow what type of stage you have in mind. Could you show an example?

stages:
  load_data:
    cmd: 
    - wget https://example.com/raw.csv
    outs:
    - raw.csv
 
  extract_data:
    cmd: python extract_data.py
    deps:
    - raw.csv
    outs:
    - train.csv
    - test.csv

  train:
    cmd: python train.py
    params:
    - train
    deps:
    - train.csv
    outs:
    - model.joblib

  evaluate:
    cmd: python evaluate.py
    params:
    - evaluate
    deps:
    - test.csv
    - train.csv
    - model.joblib
    metrics:
    - reference.json
flowchart TD
        node1["evaluate"]
        node2["extract_data"]
        node3["load_data"]
        node4["train"]
        node2-->node1
        node2-->node4
        node3-->node2
        node4-->node1

@dberenbaum
Contributor

Why not include https://example.com/raw.csv in deps or use import-url? I don't see why this scenario should not have any deps.
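
A sketch of what that would look like, assuming DVC's support for HTTP URLs as external dependencies (the alternative being dvc import-url https://example.com/raw.csv raw.csv):

stages:
  load_data:
    cmd: wget https://example.com/raw.csv
    deps:
    # External HTTP dependency: DVC tracks the remote file and only reruns
    # the stage when it detects that the source has changed.
    - https://example.com/raw.csv
    outs:
    - raw.csv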

@skshetry
Member Author

skshetry commented Jan 31, 2024

> Why not include https://example.com/raw.csv in deps or use import-url? I don't see why this scenario should not have any deps.

The pipeline is very simple and does the job. External deps and import-url are additional concepts to learn.

Besides, this example is "inspired" by a recent tutorial, but I have seen a lot of dvc.yaml files like this.

https://github.com/iterative/evidently-dvc/blob/f0ed5c0f526c9eaf2b5dde57d500abc08d063614/pipelines/train/dvc.yaml

You can find similar examples like this through GitHub Search:

https://github.com/search?q=path%3A**%2Fdvc.yaml+cmd+wget+OR+curl&type=code

@dberenbaum
Contributor

Thanks, looks like this is indeed common.

> There's a risk that run-cache will check out to a very old state. Since there are no dependencies to match, there can be many "valid" states from older runs.

Still not sure I agree with this concern, though. In the examples I see, it looks like it's a static dataset and expected to only run once, and enabling the run-cache makes it more likely that it is only run once.
