Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

馃挜 engine: Remove --oci-max-parallelism #7395

Closed
wants to merge 1 commit into from

Conversation

gerhard
Copy link
Member

@gerhard gerhard commented May 16, 2024

Fixes #6894

This is not an option that we should support. Justin puts it best:

I think with (Dagger) Modules, this effectively means that
`max-parallelism` is pretty much just completely broken - it just fits
very badly with the idea of nested containers.

... but to me this kind of highlights that we need a better way of
constraining resources - max-parallelism is just really not fit for
the shape that pipelines tend to take these days. Maybe we should
provide some cgroup config options instead (if the end desire is to
constrain CPU usage/etc).

While this can still be configured if a custom Engine toml config is provided, it means that users went out of their way to have this. It's hard on purpose and we don't recommend using this setting anymore.

This also removes it from the Helm chart.

WDYT @sipsma @jedevc @aluzzardi @vito ?


@matipan if everyone is in agreement with this change, we should remove this from all our infra. We can do that before the release goes out.

@gerhard gerhard added this to the v0.12.0 milestone May 16, 2024
@gerhard gerhard changed the title Remove oci-max-parallelism entirely 馃挜 BREAKING CHANGE 馃挜 Remove oci-max-parallelism May 16, 2024
@gerhard gerhard changed the title 馃挜 BREAKING CHANGE 馃挜 Remove oci-max-parallelism 馃挜 engine: Remove oci-max-parallelism May 16, 2024
@gerhard gerhard changed the title 馃挜 engine: Remove oci-max-parallelism 馃挜 engine: Remove --oci-max-parallelism May 16, 2024
@gerhard gerhard force-pushed the remove-oci-max-parallelism branch 2 times, most recently from c533cee to 3e145f0 Compare May 16, 2024 17:28
This is not an option that we should support. Justin puts it best:

    I think with (Dagger) Modules, this effectively means that
    `max-parallelism` is pretty much just completely broken - it just fits
    very badly with the idea of nested containers.

    ... but to me this kind of highlights that we need a better way of
    constraining resources - max-parallelism is just really not fit for
    the shape that pipelines tend to take these days. Maybe we should
    provide some cgroup config options instead (if the end desire is to
    constrain CPU usage/etc).

While this can still be configured if a custom Engine toml config is
provided, it means that users went out of their way to have this. It's
hard on purpose and we don't recommend using this setting anymore.

This also removes it from the Helm chart.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
@gerhard gerhard force-pushed the remove-oci-max-parallelism branch from 3e145f0 to 5bc265f Compare May 21, 2024 16:48
@gerhard
Copy link
Member Author

gerhard commented May 21, 2024

Speaking with @aluzzardi, these checks may have failed due to a ClickHouse migration which killed some in-flight traces. Rebasing & force pushing so that we force re-run all failed checks.

@jedevc
Copy link
Member

jedevc commented May 22, 2024

If possible, I'd like to hold off on merging this. If I have the time, I think keeping this option would be best for people using it, but we'd need the improved semantics previously described. But we can keep this open at least as a tracking issue.

@matipan if everyone is in agreement with this change, we should remove this from all our infra. We can do that before the release goes out.

IMO, we should do this as soon as possible - it's definitely a potential candidate for some of the weird ephemeral CI failures we've seen. With this option it's possible to get every running test is a stuck state (if the right combination of tests all start at the right time) - which prevents any progress being made (which sounds like a symptom we've seen before).

@sipsma
Copy link
Contributor

sipsma commented May 23, 2024

Agree with @jedevc, this option is used already by multiple users and while it's totally buggy once modules are used with too low a value, that feels more like a bug than enough justification to remove this entirely.

  • BTW, @gerhard you asked about the num-cpu option; that was added because it was requested by a user a while back. It makes it easier for users to scale this setting across many different hosts with many different CPU counts without having to jump through hoops to update the engine config on each host.

I'd totally be in favor of adding a scary warning message to the help for this though and tracking a long term fix to the problem.

  • Actually now that I think about it, all the session related refactoring work should make fixing this a lot more tenable (what would have been a large project to medium at most).

Copy link
Contributor

github-actions bot commented Jun 7, 2024

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Jun 7, 2024
@jedevc jedevc added kind/stale and removed stale labels Jun 7, 2024
@gerhard gerhard changed the title 馃挜 engine: Remove --oci-max-parallelism 馃挜 helm: Remove --oci-max-parallelism Jun 10, 2024
@gerhard gerhard changed the title 馃挜 helm: Remove --oci-max-parallelism 馃挜 engine: Remove --oci-max-parallelism Jun 10, 2024
@gerhard
Copy link
Member Author

gerhard commented Jun 10, 2024

Thank you for the extra context @sipsma & @jedevc.

I will close this & re-open it as a non-breaking change that is restricted to the Helm Chart only.

@gerhard gerhard closed this Jun 10, 2024
gerhard added a commit to gerhard/dagger that referenced this pull request Jun 10, 2024
While we are keeping this default so that we don't break existing users,
setting this value is something that we want to move away from.

The problem is that this setting limits how many operations can run in
parallel. It is still possible for a single operation to max out all
available cores. It is also known for a value of `2` to cause deadlocks,
i.e. dagger#6894

For now, we just allow this to be disabled with either `--set
engine.args=''` or by explicitly setting this value to an empty list.

This started as dagger#7395 which turned
out to be too big of a change. We since scaled back the initial ambition
& are taking a smaller step towards eventually phasing this out.

FWIW, all the Dagger Engines that we run inside the Dagger infra do not
use the `--oci-max-parallelism` option.

This also removes the option from tekton-dagger-task docs example.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
gerhard added a commit that referenced this pull request Jun 10, 2024
While we are keeping this default so that we don't break existing users,
setting this value is something that we want to move away from.

The problem is that this setting limits how many operations can run in
parallel. It is still possible for a single operation to max out all
available cores. It is also known for a value of `2` to cause deadlocks,
i.e. #6894

For now, we just allow this to be disabled with either `--set
engine.args=''` or by explicitly setting this value to an empty list.

This started as #7395 which turned
out to be too big of a change. We since scaled back the initial ambition
& are taking a smaller step towards eventually phasing this out.

FWIW, all the Dagger Engines that we run inside the Dagger infra do not
use the `--oci-max-parallelism` option.

This also removes the option from tekton-dagger-task docs example.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

馃悶 Setting --oci-max-parallelism 2 causes deadlocks
3 participants