[WAITING FOR PL CORE MAINTAINER OPINION] Bugfix/17958 multi optimizer step count behaviour #19589
Conversation
for more information, see https://pre-commit.ci
…to bugfix/17958_multi_optimizer_step_count_behaviour
@carmocca Tagging you as you seem most relevant regarding the manual optimization loop, sorry 😬. I'm eager to continue this / clean it up / adjust according to discussions - any opinion?
…gy stateful so it has access each time when re-assigning. Other bug left: multi dict does not return all should_increment, ending with mismatch of optimizer and should_increment size
@awaelchli - didn't hear back from carmocca yet, trying you instead :) Any feedback is much appreciated 🙏
PR UNFINISHED BUT MAJOR SETUP READY FOR REVIEW TO DISCUSS HOW TO CONTINUE (i.e. review global logic & ignore 'DO NOT SUBMIT' blocks)
(before updating docs & tests in vain in case community/maintainers settle on a different approach or default)
What does this PR do?
Background
Prior to this PR, in a setup using multiple optimizers (and thus a manual optimization loop), each optimizer contributes to the global step counter.
For many people this is unexpected (see #17958), as (my conjecture) the default expectation [1] is that each call to `training_step` counts as 1 increment to `trainer.global_step`.
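For illustration (this snippet is not from the PR), here is a minimal sketch of the setup described above: a toy `LightningModule` with manual optimization and two optimizers, where, before this PR, `trainer.global_step` advances on every `optimizer.step()` call and therefore by 2 per `training_step`. Module, data, and hyperparameters are placeholders, assuming the `lightning.pytorch` 2.x API.

```python
# Minimal sketch (toy module/data; lightning.pytorch 2.x assumed). Before this
# PR, trainer.global_step advances on every optimizer.step() call, so it grows
# by 2 per training_step here.
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class TwoOptimizerModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # manual optimization loop
        self.net_a = torch.nn.Linear(4, 1)
        self.net_b = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        opt_a, opt_b = self.optimizers()
        x, y = batch

        opt_a.zero_grad()
        self.manual_backward(torch.nn.functional.mse_loss(self.net_a(x), y))
        opt_a.step()  # pre-PR: increments global_step

        opt_b.zero_grad()
        self.manual_backward(torch.nn.functional.mse_loss(self.net_b(x), y))
        opt_b.step()  # pre-PR: increments global_step again

        # One training_step, but the counter moved by 2 (pre-PR behaviour).
        print(f"batch_idx={batch_idx} global_step={self.trainer.global_step}")

    def configure_optimizers(self):
        return (
            torch.optim.SGD(self.net_a.parameters(), lr=0.01),
            torch.optim.SGD(self.net_b.parameters(), lr=0.01),
        )


if __name__ == "__main__":
    data = TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
    trainer = pl.Trainer(max_epochs=1, logger=False, enable_checkpointing=False)
    trainer.fit(TwoOptimizerModule(), DataLoader(data, batch_size=8))
```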
This unexpected behaviour can easily go unnoticed, and leads to issues:
- `global_step`-based logic (e.g. logging, manual lr schedulers, etc.) is unexpected/wrong (in the eye of the user who does not expect this behaviour)
- In my own case, calling `step` on both optimizers in each `training_step` somehow indirectly caused my gradients to become 0: my model stopped training without me initially realizing why. The quick fix mentioned in #17958 (incorrect global_step with multiple optimizers and automatic_optimization=False) immediately resolved this again.
Proposed behaviour
- By default, each call to `training_step` counts as exactly 1 increment to `trainer.global_step`, regardless of the number of optimizers and how often `step` is being called in each `training_step`.
- Make explicit how users should expect `global_step` to behave (by updating the documentation accordingly).

Actual PR implications
The current PR does NOT implement expectation [1] per se. Instead, it is a simpler update in that direction that still allows for likely sufficient versatility (if there is a single incrementing optimizer, the number of calls to `training_step` can still be misaligned if said optimizer is not called every step, or is called multiple times per step): the user chooses which optimizer(s) increment the `global_step` counter.
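To make that concrete, the snippet below is a purely hypothetical sketch of what per-optimizer control over the increment might look like in `configure_optimizers`. The `increments_global_step` flag is invented for illustration and is not the API introduced by this PR.

```python
# Hypothetical sketch only: "increments_global_step" is an invented flag, NOT
# the API of this PR; it merely illustrates choosing which optimizer(s)
# should advance trainer.global_step.
import torch
import lightning.pytorch as pl


class GANLikeModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False
        self.generator = torch.nn.Linear(8, 8)
        self.discriminator = torch.nn.Linear(8, 1)

    def configure_optimizers(self):
        opt_gen = torch.optim.Adam(self.generator.parameters(), lr=1e-4)
        opt_disc = torch.optim.Adam(self.discriminator.parameters(), lr=1e-4)
        return (
            {"optimizer": opt_gen},                                    # drives global_step
            {"optimizer": opt_disc, "increments_global_step": False},  # hypothetical flag
        )
```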
Alternative solutions:
`trainer.global_step` can be separated from the optimizers and simply be a counter for the number of `training_step` function calls. Clean on the surface, this sounds more invasive under the hood (expecting there was a good reason to tie it to `opt.step()` calls and not `training_step` function calls). It also no longer allows custom counting behaviour, unless we link the two again, which sounds like it'll lead to a messy code setup.
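For completeness, and not as part of this PR: counting `training_step` calls independently of `optimizer.step()` calls is something a user can already approximate today with a small callback; a minimal sketch:

```python
# Not part of this PR: a user-side sketch of counting training_step calls
# independently of optimizer.step() calls, e.g. as a stop-gap for step-based
# logic that should not depend on how many optimizers exist.
from lightning.pytorch.callbacks import Callback


class TrainingStepCounter(Callback):
    """Counts completed training batches, i.e. training_step calls."""

    def __init__(self):
        self.num_training_steps = 0

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self.num_training_steps += 1


# Usage: trainer = pl.Trainer(callbacks=[TrainingStepCounter()])
```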
Fixes #17958

Indirectly: this changes the `global_step` counter and thus changes behaviour. Anybody who does not expect the former behaviour will have their counter fixed and making sense to them; anybody who does expect the former behaviour has to update their `configure_optimizers` to have that behaviour persist.

Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--19589.org.readthedocs.build/en/19589/