-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metrics generation for on-call schedule #4016
base: dev
Are you sure you want to change the base?
Conversation
shurkys seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added documentation and renamed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This seems a good idea 👍 , adding some suggestion. It would be also great to add a few tests.
) | ||
|
||
# Iterate over organization IDs | ||
for org_id in org_ids: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this loop may be too expensive to run for multiple orgs/schedules (and make the scrape timeout). Have you considered using a similar approach to the other metrics keeping data in cache and just fetching from there when running the collector? Note that we already have a periodical task checking who is on-call for each schedule (refresh_ical_file
) which already updates cache (update_cached_oncall_users_for_schedule
). I think it should be possible to plug this there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I don’t know how to optimize it even more.
for schedule, users in oncall_users.items(): | ||
# Add metrics for each user and team combination in the schedule | ||
for user in users: | ||
metrics_schedule.add_metric([schedule.name, schedule.team.name, user.username], 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, schedule.team
may be None
, so you would need something like:
schedule.team.name if schedule.team else "No team"
(to match the value used in the other metrics linking to teams)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Corrected
# Retrieve on-call schedules data | ||
schedules = OnCallSchedule.objects.all() | ||
# Process on-call schedule data | ||
oncall_users = get_cached_oncall_users_for_multiple_schedules(schedules) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good update to make things more efficient, but this could still trigger requests to fetch imported ical files from ical-based schedules, making the method too slow in this case.
I think the best path forward for this would be to implement the logic here using cached data only, which potentially means reworking what data we cache for schedules in the refresh task (ie. the data you use here) and implement some additional flexibility to only use information we have in cache (otherwise fetching all schedules from DB and iterate on them, potentially triggering external http requests, and later also getting user details from DB, will make the exporter API timeout).
Hope that makes sense, I can take a look and try to draft something if I get some time available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised the code. API calls also use queryset = OnCallSchedule.objects.all(). There was no caching in the schedulers itself and OnCallSchedule.objects.all().query
says that a query is constantly being generated into the database.
Now /api/v1/schedules/
is in use and does not cause any problems with load or delay in calls. I don’t see any restrictions why the same approach cannot be used now for metrics, since there is no other way.
As for optimization, it seems the focus should be on improving the scheduler itself and implementing caching mechanisms. I looked in there, got scared, and closed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.. I see, that queryset
definition there shouldn't be needed and I think it is not being used (note the API class defines and uses the get_queryset
method, where the query is limited per org).
In any case, the main issue here is that for schedules not in cache, the get_cached_oncall_users_for_multiple_schedules
may trigger http requests to refresh imported ical files, making this exporter method take too long and then the prometheus scrape would timeout (FWIW, I confirmed this manually running the logic in our stack).
I agree this could work ok in a local setup, but we would need some additional work and optimization before merging and enabling it in cloud (which as I said I would be interested to enable, so I will take a look when I get some time). Makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, it seems like it indeed makes sense.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 30 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
What this PR does
This PR adds functionality to generate metrics for on-call schedules. It introduces a method _get_schedule that retrieves on-call schedules for specified organization IDs, iterates through each schedule, and collects metrics for each user and team combination in the schedule. This enhancement improves the monitoring and analysis capabilities of the application by providing insights into the on-call rotation and team responsibilities.
Output example:
oncall_schedule{schedule="Schedule1",team="test_team1",user="oncall"} 1.0
Which issue(s) this PR fixes
Closes #3427
Checklist
pr:no public docs
PR label added if not required)