[CHORE] adding sample limit to scrape class #6589

Open
wants to merge 1 commit into
base: main

nicolastakashi
Contributor

Description

Describe the big picture of your changes here to communicate to the maintainers why we should accept this pull request.
If it fixes a bug or resolves a feature request, be sure to link to that issue.

When managing sample limits for different targets, a scrape class can provide a default configuration for each group of targets, so that each group gets its own sample limit.

Type of change

What type of changes does your code introduce to the Prometheus operator? Put an x in the box that apply.

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Verification

Please check the Prometheus-Operator testing guidelines for recommendations about automated tests.

Changelog entry

Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.

scrape class sample limit 

@nicolastakashi nicolastakashi requested a review from a team as a code owner May 14, 2024 15:00
Member

@ArthurSens ArthurSens left a comment

Sorry, I haven't had time to do a very detailed review... just a quick walkthrough.

// SampleLimit defines per-scrape limit on number of scraped samples that will be accepted.
// Only valid in Prometheus versions 2.45.0 and newer.
//
// +optional
Member

Should we explain the order in which the configuration is generated, similar to what we do with the other ScrapeClass fields?

We have quite a few ways to set limits now:

  • Limits set in .*Monitor objects
  • Limits in ScrapeClass, set in Prometheus
  • Enforced limits, set in Prometheus

The order could get very confusing for beginners 😬

Contributor Author

@ArthurSens can you check it again, please?

@simonpasquier
Contributor

I'm worried that it makes it too complicated to understand which limits are applied eventually. IIUC the use case for a scrape class limit is that an admin wants to apply a sane default value when a scrape object doesn't specify a limit and keep the object's limit when defined (even if greater than the default limit). Is this correct?

I wonder if such a feature shouldn't be delegated to policy engines like Kyverno?

@nicolastakashi
Contributor Author

I'm worried that it makes it too complicated to understand which limits are applied eventually. IIUC the use case for a scrape class limit is that an admin wants to apply a sane default value when a scrape object doesn't specify a limit and keep the object's limit when defined (even if greater than the default limit). Is this correct?

Here are different use cases for setting sample limits in Prometheus:

Different Sample Limits for Different Target Groups:

  • Scenario: An admin wants to set different sample limits for various target groups while using a single Prometheus server to scrape metrics from these groups.

  • Solution: The admin can define multiple scrape classes, each with its own sample limit, ensuring that each target group has an appropriate limit.

  • Challenge: Since the sample limits vary by target group, the global sample limit in the Prometheus configuration cannot be used.

Soft and Hard Sample Limits:

  • Scenario: An admin wants to set both soft and hard limits for the sample limits in Prometheus.

  • Current State: Prometheus has two properties, sampleLimit and enforcedSampleLimit, which are mutually exclusive. If both are set, enforcedSampleLimit takes precedence.

  • Solution: By setting sample limits at the scrape class level, admins can specify an enforcedSampleLimit that only overrides the scrape class's sample limit if it exceeds the enforcedSampleLimit.

  • Challenge: Yet another field to define sample limits.

Comprehensive documentation can help clarify these complexities.

Regarding using an additional tool, I believe this change will allow the operator to handle the use case without needing extra tools. However, I may be biased as I'm proposing this change.

Signed-off-by: Nicolas Takashi <nicolas.tcs@hotmail.com>
@nicolastakashi nicolastakashi force-pushed the chore/adding-sample-limit-scrape-class branch from 1c6c88b to fe9d3bd Compare May 16, 2024 08:53
@simonpasquier
Contributor

simonpasquier commented May 16, 2024

Current State: Prometheus has two properties, sampleLimit and enforcedSampleLimit, which are mutually exclusive. If both are set, enforcedSampleLimit takes precedence.

Just to be clear: as of today, the enforced limit only takes precedence when it's less than the scrape object's limit.
Sorry I missed that you're talking about the Prometheus sampleLimit. I was talking about scrape objects.

@simonpasquier
Contributor

Scenario: An admin wants to set both soft and hard limits for the sample limits in Prometheus.

sorry I don't understand this use case.

@nicolastakashi
Contributor Author

nicolastakashi commented May 16, 2024

Scenario: An admin wants to set both soft and hard limits for the sample limits in Prometheus.

sorry I don't understand this use case.

OK, let me try a different way!
Imagine you have a service monitor with a sample limit of 2k by default and an enforcedSampleLimit of 4k.
The enforcedSampleLimit only takes precedence over the sample limit defined on the service monitor if the service monitor's limit is greater than 4k.

The one defined on the service monitor is a soft limit since the service monitor object can increase that value up to the enforcedSampleLimit.

The scrape class in this context acts as a default value for the sample limit in case the service monitor owner didn't define one.
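
The soft/hard relationship described here can be sketched in Go. This is only an illustration under stated assumptions, not the operator's implementation: `effectiveLimit`, `ptr`, and the parameter names are hypothetical, a nil pointer stands for "not set", the 3000 scrape class default is an invented value, and applying the enforced limit when nothing else is set is an assumption.

```go
package main

import "fmt"

// ptr returns a pointer to v; a nil pointer stands for "limit not set".
func ptr(v uint64) *uint64 { return &v }

// effectiveLimit is a hypothetical helper, not operator code.
// monitorLimit is the soft limit set on the service monitor,
// classDefault is the default carried by the scrape class, and
// enforced is the hard cap (enforcedSampleLimit).
func effectiveLimit(monitorLimit, classDefault, enforced *uint64) *uint64 {
	limit := monitorLimit
	if limit == nil {
		// The monitor owner did not define a limit: use the class default.
		limit = classDefault
	}
	if enforced == nil {
		return limit
	}
	// Assumption: with no limit at all, the hard cap applies.
	if limit == nil || *limit > *enforced {
		return enforced
	}
	return limit
}

func main() {
	enforced := ptr(4000)
	fmt.Println(*effectiveLimit(ptr(2000), nil, enforced)) // 2000: soft limit kept
	fmt.Println(*effectiveLimit(ptr(5000), nil, enforced)) // 4000: clamped to the hard cap
	fmt.Println(*effectiveLimit(nil, ptr(3000), enforced)) // 3000: scrape class default
}
```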

Does it look better? @simonpasquier

@simonpasquier
Contributor

Thanks it clarifies a lot!

The scrape class in this context acts as a default value for the sample limit in case the service monitor owner didn't define one.

Could it be solved if we consider that when both sampleLimit and enforcedSampleLimit are specified, we take the min if the scrape object has no limit itself?

Given

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
spec:
  sampleLimit: 1000
  enforcedSampleLimit: 2000

  • For a ServiceMonitor with no sampleLimit defined => 1000.
  • For a ServiceMonitor with a sampleLimit of 500 => 500.
  • For a ServiceMonitor with a sampleLimit of 1500 => 1500.
  • For a ServiceMonitor with a sampleLimit of 2500 => 2000.

WDYT?

@nicolastakashi
Contributor Author

sampleLimit: 1000
enforcedSampleLimit: 2000

I thought the same, but the enforced sample limit currently takes precedence over the sample limit defined on the Prometheus object.

I think sampleLimit from Prometheus is being configured as a global sample limit and not at the monitor object level @simonpasquier

@simonpasquier
Contributor

Reading the code again, I think that we could improve the generated config and leverage the fact that the global sample_limit is always applied if no limit is set (instead of setting the value on all scrape configs).

Going back to my examples:

  • For a ServiceMonitor A with no sampleLimit defined.
  • For a ServiceMonitor B with a sampleLimit of 500.
  • For a ServiceMonitor C with a sampleLimit of 1500.
  • For a ServiceMonitor D with a sampleLimit of 2500.

With a global sampleLimit = 1000 and enforcedSampleLimit = 2000, we should generate:

global:
  sample_limit: 1000 # applies to all monitors without an explicit limit
scrape_configs:
- job_name: "A" # sample_limit: 1000 because of global
- job_name: "B"
  sample_limit: 500
- job_name: "C"
  sample_limit: 1500
- job_name: "D"
  sample_limit: 2000 # clamped down by the enforced limit

With a global sampleLimit = 1000 and no enforcedSampleLimit, we should generate:

global:
  sample_limit: 1000 # applies to all monitors without an explicit limit
scrape_configs:
- job_name: "A" # sample_limit: 1000 because of global
- job_name: "B"
  sample_limit: 500
- job_name: "C"
  sample_limit: 1500
- job_name: "D"
  sample_limit: 2500

With no global sampleLimit and enforcedSampleLimit = 2000, we should generate:

global:
  sample_limit: 2000 # applies to all monitors without an explicit limit
scrape_configs:
- job_name: "A" # sample_limit: 2000 because of global
- job_name: "B"
  sample_limit: 500
- job_name: "C"
  sample_limit: 1500
- job_name: "D"
  sample_limit: 2000
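
The three scenarios above can be reproduced with a small Go sketch. It is only a model of the proposal, not the operator's config generator: the `config` type, `generate`, `ptr`, and `show` are hypothetical names, and a nil pointer means the field is omitted from the generated output. Taking the minimum of the global and enforced limits follows the "take the min" suggestion made earlier in this thread.

```go
package main

import "fmt"

// config is a simplified stand-in for the generated Prometheus configuration.
// A nil pointer means the corresponding sample_limit is omitted.
type config struct {
	Global *uint64            // global.sample_limit
	Jobs   map[string]*uint64 // per-job sample_limit, keyed by job name
}

func ptr(v uint64) *uint64 { return &v }

// generate is a hypothetical sketch of the proposed behaviour.
func generate(global, enforced *uint64, monitors map[string]*uint64) config {
	cfg := config{Global: global, Jobs: make(map[string]*uint64, len(monitors))}
	// The global default never exceeds the enforced limit; without a
	// global limit, the enforced limit becomes the default.
	if enforced != nil && (cfg.Global == nil || *cfg.Global > *enforced) {
		cfg.Global = enforced
	}
	for name, limit := range monitors {
		// Explicit limits are kept unless they exceed the enforced limit.
		if limit != nil && enforced != nil && *limit > *enforced {
			limit = enforced
		}
		// Monitors without a limit stay nil and inherit the global value.
		cfg.Jobs[name] = limit
	}
	return cfg
}

// show renders a limit for printing.
func show(v *uint64) string {
	if v == nil {
		return "omitted (inherits global)"
	}
	return fmt.Sprint(*v)
}

func main() {
	monitors := map[string]*uint64{"A": nil, "B": ptr(500), "C": ptr(1500), "D": ptr(2500)}
	cfg := generate(ptr(1000), ptr(2000), monitors)
	fmt.Println("global:", show(cfg.Global)) // global: 1000
	for _, name := range []string{"A", "B", "C", "D"} {
		fmt.Printf("%s: %s\n", name, show(cfg.Jobs[name]))
	}
}
```

Leaving job "A" without an explicit value is the point of the proposal: Prometheus applies the global `sample_limit` itself, so the operator does not need to copy it into every scrape config.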

@nicolastakashi
Contributor Author

@simonpasquier this works fine for my use case, and I'll open another PR implementing it, but it will not solve the use case where a Prometheus admin would like to set different default sample limits for different groups of targets.

Do you think this PR is still valid?

@ArthurSens
Member

ArthurSens commented May 17, 2024

Yeah, I see ScrapeClasses as a great ally for Platform teams that use Prometheus-Operator to offer Prometheus as a Service.

What I envision the most is using ScrapeClasses to automatically add security and default relabeling configuration, but also to offer "Scrape Tiers", where consumers of these Prometheus as a Service could choose their appropriate tiers while negotiating budgets with the Platform Team.

We have a few examples out there, e.g. Cloudflare establishes basic limits for all scrape configurations and allows teams to manually override them. The problem here is that this approach requires consumers of this API to understand Prometheus' limits, and this can easily become a barrier.

A much simpler abstraction would be to allow Platform teams to set limits in scrape classes and just offer tiers like:

  • Basic scrape tier
  • Medium scrape tier
  • High scrape tier
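
The tier idea could be modelled roughly as follows. Everything in this Go sketch is hypothetical: the `scrapeClass` type, the `limitForTier` helper, and the numeric limits are invented for illustration, and only the three tier names come from the list above.

```go
package main

import "fmt"

// scrapeClass is a hypothetical, simplified view of a scrape class that
// carries a default sample limit.
type scrapeClass struct {
	Name        string
	SampleLimit uint64
}

// tiers uses invented limits; a platform team would pick its own values.
var tiers = []scrapeClass{
	{Name: "basic", SampleLimit: 1000},
	{Name: "medium", SampleLimit: 5000},
	{Name: "high", SampleLimit: 20000},
}

// limitForTier returns the sample limit of the named tier, and false
// when no such tier exists.
func limitForTier(classes []scrapeClass, name string) (uint64, bool) {
	for _, c := range classes {
		if c.Name == name {
			return c.SampleLimit, true
		}
	}
	return 0, false
}

func main() {
	// A consumer only picks a tier name and never handles raw limits.
	if limit, ok := limitForTier(tiers, "medium"); ok {
		fmt.Println("medium tier sample limit:", limit)
	}
}
```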

@simonpasquier
Contributor

I'm definitely not against adding limits to scrape classes but, as stated in the Cloudflare article, global limits would probably work for > 90% of users.
I'm also not sure how a Prometheus admin would prevent a team from referencing a more generous scrape class than what they're supposed to? But maybe it can be considered a feature :)
