
feat(conf): add concurrency_timeout option #13047

Draft
wants to merge 1 commit into base: master

Conversation

@outsinre (Contributor) commented May 20, 2024

Summary

Currently, Kong Gateway uses a fixed timeout value (60s) for concurrency jobs.

However, some concurrency jobs may take longer. Consider a declarative config reconfigure case (a rough sketch of the underlying lock follows the list):

  1. The operator pushes config A, and Kong starts applying it. This job takes 65s.
  2. The operator pushes config B (e.g. a changed plugin config). Kong waits for job A, but times out after 60s.
  3. The operator assumes config B has been applied, but it has not, and goes for a cup of coffee.
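
As a rough illustration only (not Kong's actual implementation), a worker mutex with a fixed wait built on resty.lock behaves roughly as below. The shared dict name "my_locks", the helper name with_worker_mutex, and the lock key are hypothetical.

```lua
-- Sketch: a worker mutex with a fixed wait. "my_locks" must be declared
-- via lua_shared_dict in nginx.conf; names here are made up.
local resty_lock = require "resty.lock"

local function with_worker_mutex(key, timeout, fn)
  local lock, err = resty_lock:new("my_locks", { timeout = timeout })
  if not lock then
    return nil, "failed to create lock: " .. err
  end

  local elapsed, lerr = lock:lock(key)
  if not elapsed then
    -- lerr is "timeout" when the mutex could not be acquired in time;
    -- this is the "timeout acquiring ... lock" situation described above
    return nil, "timeout acquiring " .. key .. " lock: " .. lerr
  end

  local ok, res = pcall(fn)
  lock:unlock()
  if not ok then
    return nil, res
  end
  return res
end

-- With a fixed timeout of 60, a reconfigure that waits behind a 65s job
-- gives up and logs an error instead of applying config B.
```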

feat(conf): add concurrency_timeout option

Sets the timeout (in ms) for which a concurrency
job (e.g. declarative reconfigure) should wait
for the mutex before giving up. After the timeout,
Kong Gateway will skip the job and log an error
message.

If you see "timeout acquiring ... lock" log
entries, try adjusting this value accordingly.
Be careful: setting a large value might result
in a long list of queued concurrency jobs, which
could overflow the queue.

Timing out is fine when a job is safe to be
overridden by subsequent jobs, such as declarative
reconfigure. Do not touch this option unless you
are clear about its implications.
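
A hedged sketch of what the proposed kong.conf setting could look like; the value shown is an arbitrary example, not a recommended or actual default:

```
# kong.conf (sketch only; 120000 is an example value, not a default)
# Milliseconds a concurrency job waits for the mutex before giving up.
concurrency_timeout = 120000
```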

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong, or the skip-changelog label has been added to the PR if a changelog is unnecessary (see README.md)
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix #FTI-5969

@bungle (Member) commented May 20, 2024

Duplicate of already closed PR:
#13027

If these take 60+ secs, we have bigger problems somewhere, and I have proposed fixes to make this faster.

@bungle (Member) commented May 20, 2024

If we want to do something, I would limit the change strictly to reconfigure_timeout or _reconfigure_timeout and call it that. Changing the concurrency library's default value can have unwanted side effects elsewhere. But I am not sure we need this at all. It is much better to make reconfigure not take 60+ seconds, e.g. by loading entities faster (easy to do). And it definitely does not take 60+ secs in 99.99999% of cases. I have tested some pretty large configs on my laptop and it never gets close to those numbers. It is possible that something has gotten much slower, like router rebuilding or plugin iterator rebuilding.

Also, when this issue happens it is already too late to change this value. If the customer even asks, what would we tell them to raise _reconfigure_timeout to? 2000s?

I think it is rather easy to make reconfigure_timeout self-adjusting, e.g. make it math.min(math.max(longest_rebuild_that_has_happened_on_this_worker * 1.5, 60), 600).
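
A minimal sketch of that idea, assuming a per-worker variable that records the slowest rebuild seen so far (the names below are hypothetical, not existing Kong code):

```lua
-- Sketch of the self-adjusting idea above; names are hypothetical.
local longest_rebuild = 0  -- slowest rebuild (in seconds) seen on this worker

local function record_rebuild(duration)
  if duration > longest_rebuild then
    longest_rebuild = duration
  end
end

local function reconfigure_timeout()
  -- 50% headroom over the worst observed rebuild, clamped to [60s, 600s]
  return math.min(math.max(longest_rebuild * 1.5, 60), 600)
end
```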

Why would you need to restart the node and make manual config changes for this, when the machine can do it?

And if you never want to time out, then why do we even have a timeout at all? Should we just let these reconfigures pile up? Surely we have a timeout for some reason.
