run-cache: cache stage runs with no dependencies? #10258
Related to #6718. Do we have a good reason not to add no-dep stages to the run-cache? With all the templating we have now, it seems more important to handle these cases, since templated values may take the place of actual dependencies.
These no-deps stages that are invalidated (due to a missing dvc.lock entry, or a name change caused by templating) are an edge case. It can be an annoyance for sure, but it is technically still valid to rerun them. In fact, we always ran these kinds of stages until 2.0, where this was a breaking change (#5187).
There's a risk that the run-cache will check out a very old state. Since there are no dependencies to match against, there can be many "valid" states from older runs. No-deps stages are the first stages to run in the pipeline and are usually used to download data from remotes; they act as a trigger for the downstream stages. Checking out a very old state might affect the whole pipeline.
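The "many valid states" concern can be illustrated with a simplified sketch of how a run-cache key might be derived. This is not DVC's actual implementation; `stage_key` and the hashing scheme here are purely illustrative:

```python
import hashlib
import json


def stage_key(cmd, dep_hashes):
    """Illustrative run-cache key: a hash over the command and the
    content hashes of its dependencies (sorted for determinism)."""
    payload = json.dumps({"cmd": cmd, "deps": sorted(dep_hashes)})
    return hashlib.sha256(payload.encode()).hexdigest()


# A stage with dependencies: the key changes whenever an input changes,
# so only past runs with matching inputs can be restored.
k1 = stage_key("python train.py", ["abc123"])
k2 = stage_key("python train.py", ["def456"])
assert k1 != k2

# A no-deps stage: every past run of the same command collapses to the
# same key, so arbitrarily old runs all look equally "valid".
k3 = stage_key("wget https://example.com/raw.csv", [])
k4 = stage_key("wget https://example.com/raw.csv", [])
assert k3 == k4
```

Under this sketch, a no-deps stage's key depends only on its command, which is why nothing distinguishes a recent run from a stale one.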
Sorry, I don't follow what type of stage you have in mind. Could you show an example? I think we under-utilize the run-cache, and I have talked to a few people who intuitively expect the run-cache to always work, since they think "I ran this before, so DVC should know not to run it again." We have an easy solution for users who always want to rerun, but no solution for people who want to use the run-cache here.
```yaml
stages:
  load_data:
    cmd:
      - wget https://example.com/raw.csv
    outs:
      - raw.csv
  extract_data:
    cmd: python extract_data.py
    deps:
      - raw.csv
    outs:
      - train.csv
      - test.csv
  train:
    cmd: python train.py
    params:
      - train
    deps:
      - train.csv
    outs:
      - model.joblib
  evaluate:
    cmd: python evaluate.py
    params:
      - evaluate
    deps:
      - test.csv
      - train.csv
      - model.joblib
    metrics:
      - reference.json
```

```mermaid
flowchart TD
    node1["evaluate"]
    node2["extract_data"]
    node3["load_data"]
    node4["train"]
    node2-->node1
    node2-->node4
    node3-->node2
    node4-->node1
```
Why not include
The pipeline is very simple and does the job. Besides, this example is "inspired" by a recent tutorial, and I have seen a lot of dvc.yaml files like this. You can find similar examples through GitHub code search: https://github.com/search?q=path%3A**%2Fdvc.yaml+cmd+wget+OR+curl&type=code
Thanks, looks like this is indeed common.
Still not sure I agree with this concern, though. In the examples I see, it looks like the dataset is static and expected to be downloaded only once, and enabling the run-cache makes it more likely that the stage is in fact only run once.
Say we have the above stage, with no dependencies and an output. When I run it and then rerun it, it reports that nothing changed and skips the stage.
But if the lock file is missing or the stage name has changed, it will force a rerun.
Ideally, the run-cache is supposed to prevent this scenario, but it does not work for a stage without any dependencies. Should it cache those kinds of stages?
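The decision being described can be sketched as a tiny predicate. This is a simplification with hypothetical names, not DVC's code: the point is that with no dependencies, the dvc.lock entry is the only evidence a stage already ran, so losing it forces a rerun even when the run-cache holds that exact run:

```python
def should_rerun(has_lock_entry: bool, deps_changed: bool) -> bool:
    """Simplified sketch of the current behavior: a stage is skipped
    only when its dvc.lock entry exists and no dependency changed."""
    return not (has_lock_entry and not deps_changed)


# Normal rerun: lock entry present, nothing changed -> skipped.
assert should_rerun(has_lock_entry=True, deps_changed=False) is False

# dvc.lock missing (or the stage was renamed): forced rerun, even
# though for a no-deps stage the run-cache may contain the outputs of
# this exact same command.
assert should_rerun(has_lock_entry=False, deps_changed=False) is True
```

Enabling the run-cache for no-deps stages would amount to consulting it in the second case instead of unconditionally rerunning.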
cc @efiop
Related: https://iterativeai.slack.com/archives/C044738NACC/p1706207735608469