
Should IRQHandler caps have a way to set whether Acknowledgement happens automatically? #1185

Open
kent-mcleod opened this issue Feb 2, 2024 · 16 comments

Comments

@kent-mcleod
Member

Context: Currently interrupt handling can be delegated to userlevel through the use of IRQHandler capabilities. These capabilities allow a Notification capability (with send rights) to be selected for delivering IRQ events to user level. Currently every IRQ delivered to user level has to be acknowledged via an Ack invocation on the IRQHandler capability before another one will be delivered.
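For reference, the per-IRQ pattern this implies at user level looks roughly like the following sketch (cap names and the device handler are illustrative; error handling is omitted):

    /* Minimal sketch of the current user-level IRQ handling pattern.
     * `irq_handler` (an IRQHandler cap) and `ntfn` (a Notification cap) are
     * illustrative names; `handle_device` is a hypothetical device handler. */
    #include <sel4/sel4.h>

    extern void handle_device(seL4_Word badge);

    void irq_loop(seL4_IRQHandler irq_handler, seL4_CPtr ntfn)
    {
        /* Select the notification that receives IRQ events for this handler. */
        seL4_IRQHandler_SetNotification(irq_handler, ntfn);

        for (;;) {
            seL4_Word badge;
            seL4_Wait(ntfn, &badge);          /* block until the IRQ is signalled */
            handle_device(badge);
            seL4_IRQHandler_Ack(irq_handler); /* no further IRQ is delivered until this */
        }
    }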

Problem: There are cases where the Ack invocation requests an operation that the kernel could perform immediately after sending the IRQ event on to the notification. Software-generated interrupts and other IRQ types that support edge-triggering are examples of cases where there's already a mechanism for not delivering additional interrupts until the cause of the first IRQ is handled: the level of the interrupt line has to be reset in time to generate the next edge.

Solution: A way to configure whether an interrupt can be auto-acknowledged by the kernel could avoid the need to perform an Ack invocation for every delivered IRQ.

A configuration is likely needed because there are examples of edge triggered IRQs that still need explicit acking, eg when the input signal is a clock or something.
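Purely as an illustration of what such a configuration could look like, a hypothetical invocation might be (seL4_IRQHandler_SetAutoAck does not exist in the current API; the flag could equally be an extra argument when selecting the notification):

    /* Hypothetical sketch only: seL4_IRQHandler_SetAutoAck is NOT part of the
     * current seL4 API; it just illustrates one possible shape. */
    seL4_Error err = seL4_IRQHandler_SetAutoAck(irq_handler, /* auto_ack: */ 1);
    /* With auto-ack enabled, the kernel would ack the IRQ itself right after
     * signalling the notification, and the explicit seL4_IRQHandler_Ack in the
     * driver loop would no longer be needed for this IRQ. */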

This change may need to go through an RFC.

@lsf37
Member

lsf37 commented Feb 2, 2024

That would need to go through an RFC, yes, but it sounds reasonably simple to do and like a good idea to support.

How would the config mechanism work? An additional invocation of the IRQ handler cap to set some flag?

@kent-mcleod
Member Author

How would the config mechanism work? An additional invocation of the IRQ handler cap to set some flag?

That, or maybe part of the SetHandler invocation. It can only have a single handler, right?

@gernotheiser
Member

If we're talking config options, wouldn't it also make sense to have the option to implicitly ACK the IRQ when waiting on the next one, saving one syscall?

@kent-mcleod
Member Author

kent-mcleod commented Feb 2, 2024

If we're talking config options, wouldn't it also make sense to have the option to implicitly ACK the IRQ when waiting on the next one, saving one syscall?

Your suggestion would apply to level triggered interrupts also, but what if there are multiple pending IRQs? The kernel would need to ack several of them at the end of the thread's slice.

@Indanz
Contributor

Indanz commented Feb 2, 2024

For level triggered interrupts you need to tell the device you handled them before ACKing. If the kernel ACKs them automatically, the same interrupt will just trigger immediately.

So this feature only makes sense for edge-triggered interrupts. For those it would make a lot of sense to automatically ACK them, as they need to be ACKed before being handled at the device level anyway (to avoid race conditions and missing interrupts). The only downside is an occasional extra interrupt to handle if it arrives again before user space has handled the previous one, but this should be rare.

However, edge-triggered interrupts are not performance critical; by definition they don't happen often (they would be level triggered otherwise). So although I agree with the proposal, I think it is not really worth the effort, as it would introduce extra complexity in the general IRQ handling path and thus slightly slow it down for level-triggered interrupts.

@kent-mcleod
Member Author

However, edge-triggered interrupts are not performance critical; by definition they don't happen often (they would be level triggered otherwise).

This seems a surprising claim to me? Do you have a longer explanation or a link to something that explains this more?

@Indanz
Contributor

Indanz commented Feb 2, 2024

However, edge-triggered interrupts are not performance critical; by definition they don't happen often (they would be level triggered otherwise).

This seems a surprising claim to me? Do you have a longer explanation or a link to something that explains this more?

Okay, I may have worded it a bit too strongly, and strictly speaking it doesn't need to be true. But in practice I think it is; at least, I've only seen edge triggered interrupts on non-performance-sensitive hardware peripherals. But maybe I'm biased by my experience with simpler hardware.

With an edge-triggered interrupt you need to ACK the IRQ before handling the interrupt, or at least double check there wasn't a new event after ACKing. If you don't, you may miss events. If you get many interrupts, this early ACKing can be inefficient because the chance of getting a new interrupt is relatively high. The extra checking for new events after ACKing late is less efficient or isn't always possible.
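Roughly, that ack-early ordering looks like this (a sketch reusing the cap names from the loop earlier in the thread; device_has_work and process_one_event are hypothetical placeholders):

    /* Edge-triggered sketch: ack before draining the device, so an event that
     * arrives mid-drain produces a fresh edge instead of being lost. */
    for (;;) {
        seL4_Word badge;
        seL4_Wait(ntfn, &badge);
        seL4_IRQHandler_Ack(irq_handler);  /* ack early, before touching the device */
        while (device_has_work()) {        /* hypothetical device query */
            process_one_event();           /* hypothetical handler */
        }
    }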

With a level-triggered interrupt the interrupt depends on a condition, e.g. FIFO not empty, and usually you can handle the interrupt and, when done, ACK the interrupt on a CPU level. Then if a new event happened in the race window between the handling and the ACKing, the interrupt will be set immediately again instead of missing it.
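The level-triggered counterpart, for contrast (same hypothetical placeholders):

    /* Level-triggered sketch: drain the condition first, then ack. If a new
     * event lands in the race window, the line is still asserted and the
     * interrupt fires again immediately instead of being lost. */
    for (;;) {
        seL4_Word badge;
        seL4_Wait(ntfn, &badge);
        while (device_has_work()) {        /* e.g. FIFO not empty */
            process_one_event();
        }
        seL4_IRQHandler_Ack(irq_handler);  /* ack late, after the condition is cleared */
    }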

That said, with seL4 and deferred IRQ handling the difference is mostly gone, because IRQs get disabled while they're being handled.

@kent-mcleod
Member Author

With an edge-triggered interrupt you need to ACK the IRQ before handling the interrupt,

When this ack is unconditional and happens before handling the interrupt, it's redundant for the kernel to require a separate call for the ack.

What's motivating this from my side is when forwarding SGIs to userlevel (they're edge triggered), the requirement for an ack means that the thread receiving events on the notification object needs to know that it has to ack events that come from threads on a different core, but doesn't have to ack events that come from threads on the same core. But the same pattern exists for PCI devices with edge triggered MSIs on x86. (In fact the kernel already just silently ignores the ack that userlevel performs)

@gernotheiser
Member

gernotheiser commented Feb 3, 2024

With an edge-triggered interrupt you need to ACK the IRQ before handling the interrupt, or at least double check there wasn't a new event after ACKing. If you don't, you may miss events. If you get many interrupts, this early ACKing can be inefficient because the chance of getting a new interrupt is relatively high. The extra checking for new events after ACKing late is less efficient or isn't always possible.

That's why I'd like to see implicit ACKing on waiting on the next IRQ. I.e. the driver code should be something like:

    while (1) {
        while (device.work_available) {
            process(...);
        }
        NBSendWait(Client, IRQ);
    }

... where the Send part of the syscall signals a notification of the client and the Wait part implicitly ACKs the IRQ. There is no point in ACKing until there are no queued events from the device, and having to explicitly ACK just adds an otherwise unnecessary system call.

@Indanz
Contributor

Indanz commented Feb 3, 2024

When this ack is unconditional and happens before handling the interrupt, it's redundant for the kernel to require a separate call for the ack.

Agreed it's just a hassle when the ACK happens immediately anyway. (At startup one ACK may be needed to start receiving interrupts again, depending on platform and launch.)

What's motivating this from my side is when forwarding SGIs to userlevel (they're edge triggered), the requirement for an ack means that the thread receiving events on the notification object needs to know that it has to ack events that come from threads on a different core, but doesn't have to ack events that come from threads on the same core.

In this case SGIs are used for cross-core notification propagation and there is no other peripheral involved, so you're not really dealing with traditional interrupts and you'd ideally want a one-to-one mapping with the notification state.

However, you may still want to tell the other core not to send new SGIs while you haven't handled the current one yet in user space, to avoid unnecessary IPIs. So some form of acknowledgement could be wanted; otherwise one core can easily flood another with IRQs. Doing this automatically via notification state seems too hard to implement.

But the same pattern exists for PCI devices with edge triggered MSIs on x86. (In fact the kernel already just silently ignores the ack that userlevel performs)

x86's maskInterrupt() has as a comment "currently MSI interrupts can not be disabled", so I think it's a side effect of that instead of a design choice.

RISC-V on QEMU also immediately ACKs all interrupts, as a workaround for a QEMU bug, or so the comment claims.

This really should be documented in the manual. How does this affect verification? Can it deal with deviating interrupt handling, or does it assume interrupts are disabled till seL4_IRQHandler_Ack() is called?

There is still a difference between ACKing immediately on receive and ACKing when the task actually unblocks. ACKing immediately is trivial and then there is no need to ACK from user space at all. ACKing on actual unblock instead of when becoming active would be optimal, but seems unrealistic with the current code and verification.

@Indanz
Contributor

Indanz commented Feb 3, 2024

That's why I'd like to see implicit ACKing on waiting on the next IRQ.

But this introduces a race condition between the last device check and the implicit ACK. If a new event happened in that window, you will miss it.

There is no point in ACKing until there are no queued events from the device, and having to explicitly ACK just adds an otherwise unnecessary system call.

If you do late ACKing, you do need to do another check, so an extra system call is unavoidable.

The best you can do is automatic ACKing on task return from the wait call. That at least covers the scheduling delay, which can be much bigger than the actual handling time.

@kent-mcleod
Member Author

... where the Send part of the syscall signals a notification of the client and the Wait part implicitly ACKs the IRQ. There is no point in ACKing until there are no queued events from the device, and having to explicitly ACK just adds an otherwise unnecessary system call.

@gernotheiser this is already possible with the NBSendWait syscall in MCS. Although that approach can only acknowledge 1 interrupt. If the driver thread has handled multiple interrupts within its slice, or it needs to signal another thread, then it can no longer use the single syscall and has to perform multiple.

The best you can do is automatic ACKing on task return from the wait call. That at least covers the scheduling delay, which can be much bigger than the actual handling time.

@Indanz, After thinking a bit more, I realize that acking earlier than the receiving thread becoming scheduled would introduce an unbounded denial-of-service ability. Just set a level triggered interrupt to auto-ack and nothing else would get to run again. To deal with this we'd need to do one of the following:

  • Trust the IRQ sources to not DoS,
  • Have some way of detecting DoS and disabling the IRQ,
    (Maybe on MCS this could be done by requiring that IRQHandler caps are used to assign scheduling contexts, so that an IRQ can only be enabled if it has a scheduling context with budget)
  • Set a priority filter on the interrupt controller CPU interface with the current seL4 thread priority
    (but this wouldn't solve the case where everything is at the same priority level)
  • Have some way of tracking the thread destined to receive the IRQ and only deactivate the IRQ once that thread gets scheduled
    (I'm referring here to what you've previously suggested)

For my intended use case of user-level SGIs none of these options seem feasible without a large amount of kernel changes, so I think I'll revisit acking when the thread is unblocked at userlevel without kernel changes for now.

In general I think starting to add more configuration to each IRQ in the kernel is likely, as we at least need it to implement more efficient VCPU interrupt injection.

@Indanz
Contributor

Indanz commented Feb 4, 2024

@Indanz, After thinking a bit more, I realize that acking earlier than the receiving thread becoming scheduled would introduce an unbounded denial-of-service ability

Well, that is always the problem of interrupts: They fall outside of your carefully arranged CPU budget, and you can't control who pays the price for handling them.

Just set a level triggered interrupt to auto-ack and nothing else would get to run again.

Sure, but in practice this isn't a problem, as this would be a trusted process creating the IRQ handlers, and with that level of trust there are many other ways of breaking the system.

The bigger problem is that a buggy driver or peripheral can cause a DoS. And then enabling auto ACKing for convenience makes the difference between a locked up system and a slow system.

To deal with this we'd need to do one of the following:

* Trust the IRQ sources to not DoS,

If you have to write a register before you get more interrupts from the device it's okay, because then the ACK is done in user space and the kernel ACK would be truly redundant.

* Have some way of detecting DoS and disabling the IRQ,

Unless you know for sure the maximum rate of interrupts, this is impossible to get right, and even then it would only catch hardware problems. For the user it doesn't matter why the system doesn't work; both a DoS and wrongly disabled interrupts can have the same effect.

  (Maybe on MCS this could be done by requiring that IRQHandler caps are used to assign scheduling contexts, so that an IRQ can only be enabled if it has a scheduling context with budget)

You could charge the SC bound to the IRQ's NF for this time. But what do you do when it runs out of budget? Then it becomes awfully complicated to get all the details right and to verify the correctness of this code.

However, currently x86 can't disable MSI at all, so x86 already has the DoS problem with no way to prevent it, so any solution is a bit theoretical till the model matches the actual implementations.

* Set a priority filter on the interrupt controller CPU interface with the current seL4 thread priority
  (but this wouldn't solve the case where everything is at the same priority level)

This only makes seL4 handle the more important interrupts first, it doesn't prevent the DoS. User space still won't be able to run.

For my intended use case of user-level SGIs none of these options seem feasible without a large amount of kernel changes, so I think I'll revisit acking when the thread is unblocked at userlevel without kernel changes for now.

I think automatic ACKing would be a good feature to have, for the use cases that are safe. But it's a nice-to-have feature, and not critical.

In general I think starting to add more configuration to each IRQ in the kernel is likely, as we at least need it to implement more efficient VCPU interrupt injection.

Agreed.

@kent-mcleod
Member Author

Well, that is always the problem of interrupts: They fall outside of your carefully arranged CPU budget, and you can't control who pays the price for handling them.

No, it's different here because currently seL4 bounds interrupt inter-arrival time by how often the receiving thread can acknowledge it. So while there is ambient authority to preempt a higher priority thread to deliver an interrupt to a lower priority thread's notification object, the lower priority thread won't be able to acknowledge the IRQ until the higher priority thread has released the CPU. Allowing an auto-ack would mean removing this bound.

This only makes seL4 handle the more important interrupts first, it doesn't prevent the DoS. User space still won't be able to run.

My suggestion here referred to using the thread priorities for setting the interrupt priority mask. E.g. interrupts of priority 10 wouldn't be delivered while a thread of priority 9 was running, but switching to a lower priority thread of priority 11 would result in the interrupt priority mask changing from 9 to 11, then causing the interrupt to get delivered.
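As a rough illustration of the mechanism (not kernel code; GICC_PMR and its offset are from the GICv2 architecture spec, while the priority translation is an assumed policy, since GIC priorities are "lower value = more urgent" whereas seL4 thread priorities are the reverse, and the example above is phrased in the GIC's numbering):

    /* Sketch: on a GICv2 CPU interface, GICC_PMR (offset 0x004) masks any
     * interrupt whose GIC priority value is numerically >= the mask, with lower
     * values being more urgent. seL4 thread priorities run the other way
     * (higher value = more urgent), so this assumed policy inverts them. */
    #include <stdint.h>

    #define GICC_PMR_OFFSET 0x004u

    static inline void set_irq_priority_mask(volatile uint8_t *gicc_base, uint8_t sel4_prio)
    {
        /* seL4 prio 255 (highest) -> mask 0   (block all less urgent IRQs)
         * seL4 prio 0   (lowest)  -> mask 255 (allow everything) */
        *(volatile uint32_t *)(gicc_base + GICC_PMR_OFFSET) = (uint32_t)(255u - sel4_prio);
    }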

Anyway, I'll give it a chance for more replies and then close this issue soon, as I've resolved my initial question for now. Thanks for the discussions.

@Indanz
Contributor

Indanz commented Feb 4, 2024

No, it's different here because currently seL4 bounds interrupt inter-arrival time by how often the receiving thread can acknowledge it.

Yes, the worst case is much worse, but higher priority tasks are still interrupted at any moment to handle interrupts, and that time is charged from their budget.

Also, many bounded interrupts behave the same as a single unbounded interrupt.

My suggestion here referred to using the thread priorities for setting the interrupt priority mask.

Ah, that's actually a great idea and makes a lot of sense. Seems to be ARM specific though, and not sure how expensive it is to change the interrupt priority mask at runtime like this.

@kent-mcleod
Member Author

Yes, the worst case is much worse, but higher priority tasks are still interrupted at any moment to handle interrupts, and that time is charged from their budget.

Yeah. It's another reason why it'd be good to set the interrupt priority filter to the active thread's priority. If your interrupt only preempts when the thread it's activating is a higher priority than the currently running thread, then the budget for interrupting and delivering the interrupt to the notification object is charged to the thread that's scheduled at the end of that kernel action, which would be the target thread of the interrupt. Currently, high priority threads need enough budget for the worst-case interruptions from the interrupts of all active lower priority threads, which isn't great.

Ah, that's actually a great idea and makes a lot of sense. Seems to be ARM specific though, and not sure how expensive it is to change the interrupt priority mask at runtime like this.

It's a good thing for Arm though. It's set on the CPU interface and described in the reference manual: "Writes to this register must be high performance and must ensure that no interrupt of lower priority than the written value occurs after the write, without requiring an ISB or an exception boundary."
