Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics generation for on-call schedule #4016

Open
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

shurkys
Copy link

@shurkys shurkys commented Mar 5, 2024

What this PR does

This PR adds functionality to generate metrics for on-call schedules. It introduces a method _get_schedule that retrieves on-call schedules for specified organization IDs, iterates through each schedule, and collects metrics for each user and team combination in the schedule. This enhancement improves the monitoring and analysis capabilities of the application by providing insights into the on-call rotation and team responsibilities.

Output example:
oncall_schedule{schedule="Schedule1",team="test_team1",user="oncall"} 1.0

Which issue(s) this PR fixes

Closes #3427

Checklist

  • Unit, integration, and e2e (if applicable) tests updated
  • Documentation added (or pr:no public docs PR label added if not required)

@shurkys shurkys requested a review from a team as a code owner March 5, 2024 22:09
@CLAassistant
Copy link

CLAassistant commented Mar 5, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ joeyorlando
❌ shurkys


shurkys seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@joeyorlando joeyorlando added the release:enhancement PR will be added to "Exciting New Features 🎉" section of release notes label Mar 6, 2024
Copy link
Collaborator

@joeyorlando joeyorlando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the contribution @shurkys! Do you mind adding a small section about the new metric to the docs here. Thank you! 🙏

@shurkys shurkys requested a review from a team as a code owner March 7, 2024 09:43
@shurkys
Copy link
Author

shurkys commented Mar 7, 2024

Added documentation and renamed the user label to the username for better readability

Copy link
Collaborator

@matiasb matiasb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This seems a good idea 👍 , adding some suggestion. It would be also great to add a few tests.

)

# Iterate over organization IDs
for org_id in org_ids:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this loop may be too expensive to run for multiple orgs/schedules (and make the scrape timeout). Have you considered using a similar approach to the other metrics keeping data in cache and just fetching from there when running the collector? Note that we already have a periodical task checking who is on-call for each schedule (refresh_ical_file) which already updates cache (update_cached_oncall_users_for_schedule). I think it should be possible to plug this there?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. I don’t know how to optimize it even more.

@shurkys shurkys requested a review from a team as a code owner March 14, 2024 18:11
for schedule, users in oncall_users.items():
# Add metrics for each user and team combination in the schedule
for user in users:
metrics_schedule.add_metric([schedule.name, schedule.team.name, user.username], 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, schedule.team may be None, so you would need something like:
schedule.team.name if schedule.team else "No team"
(to match the value used in the other metrics linking to teams)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Corrected

# Retrieve on-call schedules data
schedules = OnCallSchedule.objects.all()
# Process on-call schedule data
oncall_users = get_cached_oncall_users_for_multiple_schedules(schedules)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good update to make things more efficient, but this could still trigger requests to fetch imported ical files from ical-based schedules, making the method too slow in this case.

I think the best path forward for this would be to implement the logic here using cached data only, which potentially means reworking what data we cache for schedules in the refresh task (ie. the data you use here) and implement some additional flexibility to only use information we have in cache (otherwise fetching all schedules from DB and iterate on them, potentially triggering external http requests, and later also getting user details from DB, will make the exporter API timeout).

Hope that makes sense, I can take a look and try to draft something if I get some time available.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised the code. API calls also use queryset = OnCallSchedule.objects.all(). There was no caching in the schedulers itself and OnCallSchedule.objects.all().query says that a query is constantly being generated into the database.
Now /api/v1/schedules/ is in use and does not cause any problems with load or delay in calls. I don’t see any restrictions why the same approach cannot be used now for metrics, since there is no other way.
As for optimization, it seems the focus should be on improving the scheduler itself and implementing caching mechanisms. I looked in there, got scared, and closed it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.. I see, that queryset definition there shouldn't be needed and I think it is not being used (note the API class defines and uses the get_queryset method, where the query is limited per org).
In any case, the main issue here is that for schedules not in cache, the get_cached_oncall_users_for_multiple_schedules may trigger http requests to refresh imported ical files, making this exporter method take too long and then the prometheus scrape would timeout (FWIW, I confirmed this manually running the logic in our stack).
I agree this could work ok in a local setup, but we would need some additional work and optimization before merging and enabling it in cloud (which as I said I would be interested to enable, so I will take a look when I get some time). Makes sense?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, it seems like it indeed makes sense.

Copy link
Contributor

github-actions bot commented May 3, 2024

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 30 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the pr:stale Added to a PR that has been deemed "stale". Managed by the actions/stale GitHub Action label May 3, 2024
@github-actions github-actions bot removed the pr:stale Added to a PR that has been deemed "stale". Managed by the actions/stale GitHub Action label May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release:enhancement PR will be added to "Exciting New Features 🎉" section of release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Grafana panel for who's on call
5 participants