
create minimal universal checkpoint info for client state #5526

Closed · wants to merge 2 commits

Conversation

@xylian86

This PR resolves issue #5430.

The PR enables the universal checkpoint feature for other platforms, such as the HuggingFace Trainer, without requiring changes to the HuggingFace code. It does this by injecting minimal universal checkpoint info (specifically, the checkpoint version) into the client state by default when a checkpoint is saved.
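For illustration, here is a minimal sketch of what such an injection helper could look like, assuming the constants imported in the checkpointing diff further down; the PR's actual helper may differ in its details:

```python
from deepspeed.checkpoint.constants import (UNIVERSAL_CHECKPOINT_INFO, UNIVERSAL_CHECKPOINT_VERSION_KEY,
                                            UNIVERSAL_CHECKPOINT_VERSION_VALUE)


def inject_universal_info(state):
    # Only inject a default when the client (e.g. HuggingFace Trainer)
    # did not supply its own universal checkpoint info.
    if UNIVERSAL_CHECKPOINT_INFO not in state:
        state[UNIVERSAL_CHECKPOINT_INFO] = {UNIVERSAL_CHECKPOINT_VERSION_KEY: UNIVERSAL_CHECKPOINT_VERSION_VALUE}
```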

@tjruwase requested review from samadejacobs, tohtana, and lekurile and removed the review request for mrwyattii on May 13, 2024.
@tjruwase (Contributor)

@xylian86, thanks for this great work. Can you please add convergence curves of an HF model as a demo?

@@ -3319,6 +3320,7 @@ def _save_checkpoint(self, save_dir, tag, client_state={}, exclude_frozen_parameters
                      ds_config=self.config,
                      ds_version=version)
         state.update(client_state)
+        inject_universal_info(state)
Contributor

Shouldn't we show a warning when we don't have the necessary info?
It will silently produce an incorrect checkpoint if the checkpoint is loaded for TP or PP.
We could note that the converted checkpoint is only valid for pure DP scaling.
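(For concreteness, a hypothetical guard along these lines; the helper name and message are assumptions, not code from the PR:)

```python
from deepspeed.utils import logger
from deepspeed.checkpoint.constants import UNIVERSAL_CHECKPOINT_INFO


def warn_if_missing_universal_info(state):
    # Hypothetical sketch of the suggested warning; not part of this PR.
    if UNIVERSAL_CHECKPOINT_INFO not in state:
        logger.warning("No universal checkpoint info in client state; the converted "
                       "checkpoint is only valid for pure DP scaling, not TP or PP.")
```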

Contributor

After discussing this with Tunji, we are considering another approach; he will share it. I will keep this comment here, but please disregard it.

Contributor

@xylian86, the new approach is to do the injection in the conversion script rather than during saving. The injection should be done into ds_checkpoint before this assertion. Furthermore, the injection should be enabled by a command-line argument (disabled by default) so that users are fully aware of what is going on. The command-line arg could be called --inject-missing-state.
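(A sketch of how such an opt-in flag might be wired into the conversion script's argument parser; only the flag name comes from the comment above, the rest is an assumption:)

```python
import argparse

parser = argparse.ArgumentParser()
# Disabled by default so that injecting state is an explicit, opt-in choice.
parser.add_argument('--inject-missing-state',
                    action='store_true',
                    help='Inject missing universal checkpoint info (e.g. the version) '
                    'into the checkpoint during conversion.')
args = parser.parse_args()
```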

@@ -5,7 +5,7 @@

 import os
 import torch
-from .constants import (MODEL_FILE_PREFIX, MODEL_FILE_SUFFIX, OPTIM_FILE_SUFFIX, ZERO_FILE_PREFIX)
+from .constants import (MODEL_FILE_PREFIX, MODEL_FILE_SUFFIX, OPTIM_FILE_SUFFIX, ZERO_FILE_PREFIX, UNIVERSAL_CHECKPOINT_INFO, UNIVERSAL_CHECKPOINT_VERSION_KEY, UNIVERSAL_CHECKPOINT_VERSION_VALUE)
Contributor

FYI @xylian86 - can you run the pre-commit formatter on this branch so it will pass our Formatting check?

pre-commit run --all-files

@xylian86 (Author) commented Jun 3, 2024

Closing this PR, as I have opened a new one at PR #5608 with the new implementation that @tjruwase suggested.

@xylian86 closed this Jun 3, 2024