Add metrics generation for on-call schedule #4016

shurkys · 2024-03-05T22:09:41Z

What this PR does

This PR adds functionality to generate metrics for on-call schedules. It introduces a method _get_schedule that retrieves on-call schedules for specified organization IDs, iterates through each schedule, and collects metrics for each user and team combination in the schedule. This enhancement improves the monitoring and analysis capabilities of the application by providing insights into the on-call rotation and team responsibilities.

Output example:
oncall_schedule{schedule="Schedule1",team="test_team1",user="oncall"} 1.0

Which issue(s) this PR fixes

Closes #3427

Checklist

Unit, integration, and e2e (if applicable) tests updated
Documentation added (or pr:no public docs PR label added if not required)

CLAassistant · 2024-03-05T22:09:47Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ joeyorlando
❌ shurkys

shurkys does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

joeyorlando

Thanks for the contribution @shurkys! Would you mind adding a small section about the new metric to the docs here? Thank you! 🙏

shurkys · 2024-03-07T09:47:28Z

Added documentation and renamed the user label to username for better readability.

matiasb

Nice! This seems like a good idea 👍, adding some suggestions. It would also be great to add a few tests.

matiasb · 2024-03-11T12:41:51Z

engine/apps/metrics_exporter/metrics_collectors.py

+        )
+
+        # Iterate over organization IDs
+        for org_id in org_ids:


I think this loop may be too expensive to run for multiple orgs/schedules (and could make the scrape time out). Have you considered an approach similar to the other metrics, keeping the data in cache and just fetching it from there when the collector runs? Note that we already have a periodic task checking who is on-call for each schedule (refresh_ical_file), which already updates the cache (update_cached_oncall_users_for_schedule). I think it should be possible to plug this in there?

Updated. I don’t know how to optimize it even more.

matiasb · 2024-03-14T18:29:10Z

engine/apps/metrics_exporter/metrics_collectors.py

+        for schedule, users in oncall_users.items():
+            # Add metrics for each user and team combination in the schedule
+            for user in users:
+                metrics_schedule.add_metric([schedule.name, schedule.team.name, user.username], 1)


FYI, schedule.team may be None, so you would need something like:
schedule.team.name if schedule.team else "No team"
(to match the value used in the other metrics linking to teams)

Good point. Corrected
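
As a self-contained illustration of the fallback suggested above, the label list can be built in a small helper. This is only a sketch: the `Team` and `Schedule` dataclasses are simplified stand-ins for the real models, and `schedule_labels` is a hypothetical helper name; only the `"No team"` value and the conditional expression come from the review comment.

```python
from dataclasses import dataclass
from typing import List, Optional

NO_TEAM = "No team"  # fallback value quoted in the review comment


@dataclass
class Team:  # simplified stand-in for the real team model
    name: str


@dataclass
class Schedule:  # simplified stand-in for the real schedule model
    name: str
    team: Optional[Team] = None


def schedule_labels(schedule: Schedule, username: str) -> List[str]:
    """Build the [schedule, team, username] label values, tolerating a missing team."""
    team_name = schedule.team.name if schedule.team else NO_TEAM
    return [schedule.name, team_name, username]
```

The resulting list has the same shape as the first argument passed to add_metric in the diff above.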

matiasb · 2024-03-14T18:29:56Z

engine/apps/metrics_exporter/metrics_collectors.py

+        # Retrieve on-call schedules data
+        schedules = OnCallSchedule.objects.all()
+        # Process on-call schedule data
+        oncall_users = get_cached_oncall_users_for_multiple_schedules(schedules)


This is a good update to make things more efficient, but this could still trigger requests to fetch imported ical files from ical-based schedules, making the method too slow in this case.

I think the best path forward would be to implement the logic here using cached data only. That potentially means reworking what data we cache for schedules in the refresh task (i.e. the data you use here) and adding some flexibility to use only the information we have in cache. Otherwise, fetching all schedules from the DB and iterating over them, potentially triggering external HTTP requests, and later also getting user details from the DB, will make the exporter API time out.

Hope that makes sense, I can take a look and try to draft something if I get some time available.

Revised the code. The API views also use queryset = OnCallSchedule.objects.all(). There is no caching in the schedulers themselves, and OnCallSchedule.objects.all().query shows that a database query is generated every time.
/api/v1/schedules/ is in use today and does not cause any problems with load or delays in calls. I don’t see any reason why the same approach cannot be used for metrics now, since there is no other way.
As for optimization, it seems the focus should be on improving the scheduler itself and implementing caching mechanisms there. I looked in there, got scared, and closed it.

Hm.. I see. That queryset definition shouldn't be needed there, and I think it is not being used (note the API class defines and uses the get_queryset method, where the query is limited per org).
In any case, the main issue here is that for schedules not in cache, get_cached_oncall_users_for_multiple_schedules may trigger HTTP requests to refresh imported ical files, making this exporter method take too long, and then the Prometheus scrape would time out (FWIW, I confirmed this by manually running the logic in our stack).
I agree this could work OK in a local setup, but we would need some additional work and optimization before merging and enabling it in cloud (which, as I said, I would be interested in enabling, so I will take a look when I get some time). Makes sense?

Understood, it seems like it indeed makes sense.
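
To make the concern in this thread concrete, a cache-only lookup would report cache misses rather than refreshing them during the scrape. The sketch below is hypothetical and does not reflect how get_cached_oncall_users_for_multiple_schedules is implemented: the function name, the key format, and the plain dict used as a cache are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def oncall_users_from_cache_only(
    schedule_ids: Iterable[int],
    cache: Dict[str, List[str]],
) -> Tuple[Dict[int, List[str]], List[int]]:
    """Return cached on-call usernames per schedule plus the IDs that were not cached.

    Misses are only reported; nothing is recomputed or fetched here, so the
    call never waits on an external ical download.
    """
    found: Dict[int, List[str]] = {}
    missing: List[int] = []
    for schedule_id in schedule_ids:
        key = f"schedule_oncall_users_{schedule_id}"  # hypothetical key format
        users = cache.get(key)
        if users is None:
            missing.append(schedule_id)  # left for the periodic refresh task
        else:
            found[schedule_id] = users
    return found, missing
```

A cached empty list (nobody on call) is kept distinct from a miss, so a schedule with no one on call is not mistaken for one that still needs refreshing.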

github-actions · 2024-05-03T01:46:24Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 30 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Generate metrics for on-call schedule

652a0e9

shurkys requested a review from a team as a code owner March 5, 2024 22:09

joeyorlando added the release:enhancement label (PR will be added to "Exciting New Features 🎉" section of release notes) Mar 6, 2024

Merge branch 'dev' into dev

d2a7d27

joeyorlando reviewed Mar 6, 2024

joeyorlando temporarily deployed to github-pages March 6, 2024 19:18 — with GitHub Actions Inactive

Rename user to username and add docs for oncall_schedule

6278bc7

shurkys requested a review from a team as a code owner March 7, 2024 09:43

joeyorlando requested review from Ferril and matiasb March 7, 2024 12:36

matiasb reviewed Mar 11, 2024

retrieves all on-call schedules using and processes the data using the function

df5b1ae

shurkys requested a review from a team as a code owner March 14, 2024 18:11

matiasb reviewed Mar 15, 2024

shurkys and others added 2 commits March 15, 2024 20:11

Fix handling of 'No team' value for schedule.team

635224a

Merge branch 'dev' into dev

69a5d1b

joeyorlando temporarily deployed to github-pages April 2, 2024 15:47 — with GitHub Actions Inactive

github-actions bot added the pr:stale label (added to a PR that has been deemed "stale"; managed by the actions/stale GitHub Action) May 3, 2024

iskhakov added needs triage and removed needs triage labels May 21, 2024

github-actions bot removed the pr:stale label (added to a PR that has been deemed "stale"; managed by the actions/stale GitHub Action) May 22, 2024

shurkys commented Mar 5, 2024 •

edited by joeyorlando

CLAassistant commented Mar 5, 2024 •

edited

joeyorlando left a comment

shurkys commented Mar 7, 2024

matiasb left a comment

matiasb Mar 11, 2024

shurkys Mar 14, 2024

matiasb Mar 14, 2024

shurkys Mar 15, 2024

matiasb Mar 14, 2024

shurkys Mar 15, 2024

matiasb Mar 18, 2024

shurkys Mar 18, 2024

github-actions bot commented May 3, 2024
