Minimize the risk of expired SAS tokens AND simplify node management #627

Open
BMurri opened this issue Feb 29, 2024 · 0 comments
Labels

  • Code Quality Improvements (Make code make code more readable, maintainable, prevent bugs, improve security)
  • enhancement (New feature or request)
  • Robustness (Enable users can run tasks w/o bugs or with mitigation of known bugs)
  • Scalability (Enable users can scale TES workloads)
  • TES Priority: P2 (Groomed to a Priority 2 issue)

BMurri commented Feb 29, 2024

Problem:
For every work task, two things stored in Azure Blob Storage MUST be on the node: the task runner and the task runner's task JSON. Both currently need a SAS token appended to their URLs, because they cannot be downloaded without one. Any start task that needs a resource from blob storage has the same problem.

A task that has been added to a job cannot have its command line changed (e.g. to update a SAS token) without first terminating it, deleting it from the job, and replacing it with a new task (which goes to the back of the queue). This is a problem when running at scale, because the token can expire before the task finally starts running (and this has actually happened).
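To make the timing concrete, here is a minimal sketch in Python (chosen only for brevity; it is not code from this repository). The function name and the 24-hour and 30-hour figures are made up for illustration. It compares a token minted at submission time with one minted when the task actually starts.

```python
from datetime import datetime, timedelta, timezone


def token_valid_at_start(issued_at: datetime,
                         token_lifetime: timedelta,
                         queue_wait: timedelta) -> bool:
    """Return True if a token minted at `issued_at` is still valid
    when the task starts after waiting `queue_wait` in the job."""
    expires_at = issued_at + token_lifetime
    starts_at = issued_at + queue_wait
    return starts_at < expires_at


# Hypothetical numbers: a 24-hour token, and a task that waits 30 hours.
issued = datetime(2024, 2, 29, 12, 0, tzinfo=timezone.utc)

# Token baked into the command line at submission: expired before the task starts.
print(token_valid_at_start(issued, timedelta(hours=24), timedelta(hours=30)))

# Token minted by the runner at task start: the queue wait no longer matters.
print(token_valid_at_start(issued, timedelta(hours=24), timedelta(0)))
```

The first call prints `False` and the second prints `True`. Under these assumptions, the only way to keep a submission-time token safe is to make its lifetime longer than the worst-case queue wait, which runs against the trend toward shorter token lifetimes.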

Start tasks that must download anything requiring a SAS token have it worse, because the start task is generated at pool creation (and thus becomes a long-lived entity). In Terra today, SAS tokens have shorter lifetimes than pools do (and a pool's "lifetime" setting limits when new tasks can be added to the pool's job, NOT when tasks START). Start tasks can be updated, but that appears to require either a different Batch client (in the C# library) than the one we currently use, or a different approach to calling the Batch data-plane APIs.

Solution:

  1. As proposed in "Load node runner binary in start task (so once per node) rather then for each task" (#520), load the node task runner via the start task. As an expansion of "Move all Batch node initialization and clean-up operations to the runner" (#363), perform all start-task-related work via that runner. Further, use that runner for all tasks scheduled or run on that node.
  2. Alter the runner so that it accepts, from its command line and/or environment variables, all the information it needs to generate a SAS token and download the task JSON (thus eliminating the need for the TES server to supply any SAS token in the task command line or task script file).
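The second item could look roughly like the following. This is a language-neutral sketch in Python, not the runner's actual code. The environment variable names, `TaskLocation`, `load_task_location`, and `build_download_url` are all hypothetical, `mint_sas` stands in for whatever credential mechanism the node ends up using (the issue does not specify one), and the URL assumes the public-cloud blob endpoint suffix.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Hypothetical variable names; the issue does not define them.
REQUIRED_VARS = (
    "TES_STORAGE_ACCOUNT",
    "TES_STORAGE_CONTAINER",
    "TES_TASK_JSON_PATH",
)


@dataclass(frozen=True)
class TaskLocation:
    account: str
    container: str
    blob_path: str


def load_task_location(env: Mapping[str, str]) -> TaskLocation:
    """Read the task JSON location from environment variables (no SAS token involved)."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return TaskLocation(
        account=env["TES_STORAGE_ACCOUNT"],
        container=env["TES_STORAGE_CONTAINER"],
        blob_path=env["TES_TASK_JSON_PATH"].lstrip("/"),
    )


def build_download_url(loc: TaskLocation,
                       mint_sas: Callable[[TaskLocation], str]) -> str:
    """Mint a token at the moment of download and attach it to the blob URL."""
    sas = mint_sas(loc).lstrip("?")
    # Assumes the public Azure cloud endpoint suffix.
    return (f"https://{loc.account}.blob.core.windows.net/"
            f"{loc.container}/{loc.blob_path}?{sas}")


# Usage with a fake environment and a fake token provider.
fake_env = {
    "TES_STORAGE_ACCOUNT": "exampleaccount",
    "TES_STORAGE_CONTAINER": "tes-internal",
    "TES_TASK_JSON_PATH": "tasks/abc123/runner-task.json",
}
url = build_download_url(load_task_location(fake_env),
                         lambda loc: "sv=fake&sig=fake")
print(url)
```

Because the token is requested inside `build_download_url`, the task's command line and environment carry only stable identifiers, so nothing in them can expire while the task waits in the job.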

Describe alternatives you've considered
Do nothing, knowing that these problems will persist and will get worse as environments ask for shorter SAS token lifetimes over time.

Sub Tasks

Code dependencies
Will this require code changes in:

  • CoA, for new and/or existing deployments? No
  • TES standalone, for new and/or existing deployments? No
  • Terra, for new and/or existing deployments? No
  • Build pipeline? No
  • Integration tests? No

Additional context
Completing this feature will enable easier implementation and/or largely or fully complete the following issues:

@BMurri BMurri added enhancement New feature or request TES Priority: P2 Groomed to a Priority 2 issue Robustness Enable users can run tasks w/o bugs or with mitigation of known bugs Scalability Enable users can scale TES workloads Code Quality Improvements Make code make code more readable, maintainable, prevent bugs, improve security labels Feb 29, 2024
@BMurri BMurri added this to the next milestone Feb 29, 2024