x86 VM + SMP Issue #1148
Could you show your kernel and vm.c diffs? It's a bit unclear where exactly the prints are done.
(For performance you want your IRQ to arrive on the same CPU core as the task handling it, but that doesn't solve the bug, though.)
Here's where the prints are:
For the
I recommend keeping counts of each event and printing out the totals now and then. Sharing a UART between multiple cores is unreliable in the sense that the output of one core may be overwritten by the output of another, depending on details (my UART driver checked for FIFO empty before it would start writing anything, which reduced the problems with concurrent writes a lot). Multiple notifications are coalesced, so counting the number of notifications is not a good idea if you want a reliable timer server.
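For illustration, a self-contained sketch of that coalescing behaviour (plain C modelling seL4 notification semantics, not kernel code): each signal ORs its badge into a single word, so several signals delivered before one wait collapse into a single wake-up.

```c
/* Toy model of seL4-style notification coalescing (hypothetical, not
 * kernel source): a signal ORs its badge into the notification word and
 * marks it active; a poll/wait returns the accumulated badge and clears
 * it. Several signals before one wait are observed as ONE wake-up. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t badge;   /* accumulated badge bits */
    bool     active;  /* notification pending */
} notification_t;

static void ntfn_signal(notification_t *n, uint64_t badge)
{
    n->badge |= badge;  /* badges coalesce: individual signals are lost */
    n->active = true;
}

/* Returns true if a notification was pending; *badge_out receives the
 * OR of all badges signalled since the last successful poll. */
static bool ntfn_poll(notification_t *n, uint64_t *badge_out)
{
    if (!n->active) {
        return false;
    }
    *badge_out = n->badge;
    n->badge = 0;
    n->active = false;
    return true;
}
```

A server that increments its tick count once per wake-up records one tick here even though three signals were sent, which is exactly why counting wake-ups drifts behind the true event count.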
But then it should stay in
This can be verified by printing something on every wake-up.
Maybe, or an unexpected corner case.
I put a
Also, the
I'm just failing to see a state where I'd expect to see this behavior.
Agreed. To me
In
One other thing - we're using MSIs, so I was able to route the interrupts directly to the same core as the expected interrupt handler in the guest. That removes the
I'm hoping we can push up an example application showcasing the problem soon.
That is not true: Please add a check for pending notifications to handleVmxFault() to see if it overwrites the notification. You can add an assertion for task state being
Incidentally, I intend to change the code so that if an IPI is handled when trying to take the lock, we return to user space instead of continuing, as that seems too surprising and subtle a behaviour to have. It's not worth the small performance gain of avoiding trapping twice. |
Ah, you may be right!
During runtime:
Now the question is: how do we address this?
Let's try a
That didn't work. Both seemingly got stuck in a
Bummer, that's unexpected. Is it broken immediately, or does it break after a while?
It seems to break right away. Here is the scheduler dump:
And the objdump:
Oh, and here is my
Yes, it is, but… oops: we can only do the restore after we have released the lock. So set an ipi flag and check the flag at the end of
Like this?
The VM does print, and FWIW, before it locks up, it does seem to avoid the choppy printing I was seeing.
No,
I'm sorry, but I'm not too familiar with the BKL. Can you send me a diff of what you're thinking? I can't get it to work.
So the below diff is not exactly what I said, but the main thing is that once we request the lock, we are committed and need to finish getting it, so we also need to release it before we return to user space. We didn't do anything visible to other cores in

```diff
diff --git a/include/smp/lock.h b/include/smp/lock.h
index 9f9717e3d..ddd2c4df5 100644
--- a/include/smp/lock.h
+++ b/include/smp/lock.h
@@ -76,6 +76,9 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
          * are async and could be delayed. 'handleIPI' may not return
          * based on value of the 'irqPath'. */
         handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
+        /* Go back to user space: */
+        restore_user_context();
+        UNREACHABLE();
     }
     arch_pause();
@@ -94,6 +97,7 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
 {
     clh_qnode_t *prev;
     big_kernel_lock.node_owners[cpu].node->value = CLHState_Pending;
+    bool_t got_ipi = false;

     prev = sel4_atomic_exchange(&big_kernel_lock.head, irqPath, cpu, __ATOMIC_ACQ_REL);
@@ -110,6 +114,8 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
          * are async and could be delayed. 'handleIPI' may not return
          * based on value of the 'irqPath'. */
         handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
+        /* We are committed to get the lock, can't return here, we have to wait till we get it: */
+        got_ipi = true;
         /* We do not need to perform a memory release here as we would have only modified
          * local state that we do not need to make visible */
     }
@@ -118,6 +124,13 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
     /* make sure no resource access passes from this point */
     __atomic_thread_fence(__ATOMIC_ACQUIRE);
+    if (got_ipi) {
+        /* Release the lock we just got: */
+        clh_lock_release(cpu);
+        /* Go back to user space: */
+        restore_user_context();
+        UNREACHABLE();
+    }
 }

 static inline void FORCE_INLINE clh_lock_release(word_t cpu)
```
Alright, I applied that patch, and I'm getting the same behavior. Each VM is stuck in the syscall instruction.
What if in the |
Worth a try, to see if it works at least. I really want to fix all similar bugs by exiting the kernel after an IPI, but I'll keep working on it in #871.
Okay, so that kind of ended up working. My final solution was to add a case where we handle both a fault and a notification, and then keep the notification active during this edge case. Unfortunately, that doesn't fix the time drift issue, but it does allow us to not break from the VMEnter every time a notification is received in the VM state. Once your PR is merged, I'll integrate those commits and see if it fixes the problem without the extra state. Thanks for your help! I'll leave this issue open for others to review / comment, since my fix may not be the best approach here.
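A toy sketch of that workaround, under stated assumptions (plain C, hypothetical names, not the actual patch): when a fault and a bound notification coincide, the notification is left Active so the next VMEnter still observes and reports it instead of silently dropping it.

```c
/* Hypothetical model of the described workaround: if a VM fault and a
 * bound notification arrive together, the notification stays Active so
 * the next VMEnter still reports it rather than losing it. */
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool ntfn_active;    /* bound notification pending */
    bool fault_pending;  /* a VM fault is being delivered */
} vm_state_t;

/* Returns true if this VMEnter must exit with a notification result. */
static bool vmenter_check(vm_state_t *vm)
{
    if (!vm->ntfn_active) {
        return false;
    }
    if (vm->fault_pending) {
        /* Edge case: deliver the fault first, but keep the notification
         * Active so it is not lost; the next VMEnter reports it. */
        vm->fault_pending = false;
        return false;
    }
    vm->ntfn_active = false;  /* consumed: reported to the vm_run loop */
    return true;
}
```

The design choice here is simply to defer, not drop: the fault takes priority once, and the still-Active notification guarantees a notification exit on the following VMEnter.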
Hello,
A while back, I made this PR that fixed an issue we were having with passthrough interrupts on an x86 SMP VM configuration. We've tried with MCS and the Round Robin scheduler, and the issue is present with both.
Here is a simplified version of our configuration (the actual version gives multiple hardware resources and cores to each VM):
Each VM is isolated on its own core with a fault handler:
We noticed that one of the VMs would begin to experience a time drift. Naturally, I assumed it was a problem with the Time Server or the HPET emulation layer. However, I noticed that there were a bunch of times when the calls to `init_timer_completed()` would return 0. The time server guards all shared resources with a mutex, so this would only make sense if the Time Server notification was being received multiple times.

Looking at @kent-mcleod's comments on the closed PR, he mentioned that pending events would be delivered multiple times. This seems to match the behavior that I'm currently seeing. However, when I remove that patch, the VMs print sporadically, and passthrough interrupts don't work at all, which causes our system to break.
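For illustration, a minimal sketch of that mutex-guarded pattern (the names `timer_irq_delivered` and `consume_timer_completion` are hypothetical, not the actual Time Server API): if the consumer is woken twice for one event, the second consume finds nothing pending under the lock and returns 0, matching the symptom described.

```c
/* Hypothetical model of a mutex-guarded timer server: the completion is
 * consumed under the lock, so a duplicated wake-up for the same event
 * observes "nothing pending" and the check returns 0. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t timer_lock = PTHREAD_MUTEX_INITIALIZER;
static bool completion_pending = false;

/* Called when the timer notification arrives. */
static void timer_irq_delivered(void)
{
    pthread_mutex_lock(&timer_lock);
    completion_pending = true;
    pthread_mutex_unlock(&timer_lock);
}

/* Consumes one completion; returns 1 on success, 0 if nothing was
 * pending (e.g. the notification woke the consumer a second time). */
static int consume_timer_completion(void)
{
    pthread_mutex_lock(&timer_lock);
    int ok = completion_pending ? 1 : 0;
    completion_pending = false;
    pthread_mutex_unlock(&timer_lock);
    return ok;
}
```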
So I began to look into the problem a bit. I'm not an expert in the (Bound) Notification path, so please correct me if I'm wrong.
Specifically for SMP VMs, there are two places in the kernel where the Notification is delivered: `VMCheckBoundNotification`:

The first case is triggered when the Bound Notification is set to the `Active` state, which, in our case, occurs when the VMM is active, either initializing the VM or handling a VM Exit / Interrupt. In this case, the `VMEnter` call immediately returns with the `SEL4_VMENTER_RESULT_NOTIF` fault set, which triggers the `vm_run` loop.

The second case is triggered when a thread with a VCPU is running in its VM state, and the notification occurs on a different core than the VM, which happens a lot in our use case.
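A toy model of the first delivery path as described above (ordinary C, not kernel source; the names mirror the kernel's but the logic is simplified): delivery only happens while the thread is in `RunningVM`, and delivering flips the state to `Running`, so a second kernel-side delivery without an intervening `VMEnter` should be impossible.

```c
/* Toy model (not seL4 source) of the delivery rule described above:
 * the bound-notification check fires only when the thread state is
 * RunningVM, and flips it to Running so the VMM sees a
 * SEL4_VMENTER_RESULT_NOTIF-style exit exactly once per VMEnter. */
#include <assert.h>
#include <stdbool.h>

typedef enum {
    ThreadState_Running,     /* VMM (host side) is executing */
    ThreadState_RunningVM    /* guest is executing in its VM state */
} thread_state_t;

typedef struct {
    thread_state_t state;
    bool notif_fault_set;    /* models SEL4_VMENTER_RESULT_NOTIF */
} vcpu_thread_t;

/* Returns true if the notification was delivered as a VM exit. */
static bool vm_check_bound_notification(vcpu_thread_t *t)
{
    if (t->state != ThreadState_RunningVM) {
        return false;           /* nothing to do: VMM already active */
    }
    t->state = ThreadState_Running;
    t->notif_fault_set = true;  /* vm_run loop should observe this exit */
    return true;
}
```

In this model a second call without the guest re-entering via `VMEnter` delivers nothing, which is why a kernel-side print with no matching `vm_run` print points at a race rather than the documented path.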
I'm assuming that the Notification is set to `Active` such that any additional interrupts or signals cause the badge to be updated. In this case, the `possibleSwitchTo` command is called after the VM is switched to its VMM state, which should trigger the same exit with the `SEL4_VMENTER_RESULT_NOTIF` fault set.

To test this, I added some print statements in each of these cases. I know printing affects the timing of the system, but I don't have access to a debugger, so it's the best I can do.
To start, there seems to be a 1:1 correlation between the kernel and userspace prints, with the badge values lining up.
But then I started to notice these events, where the prints in `VMCheckBoundNotification` would print without the corresponding print in the `vm_run` loop.

I can only think of a few reasons why this would be happening:
- The `msgInfoRegister` isn't being set with the `SEL4_VMENTER_RESULT_NOTIF` value

The first issue doesn't seem to be the case, as the `VMCheckBoundNotification` only triggers when the Thread State is `RunningVM`, and the function sets the thread state to `Running`. That would only happen if the guest calls the `VMEnter` syscall. My "fix" was to leave the Notification in the `Active` state, which would trigger the `VMEnter` conditional, returning with a notification fault, even if the `vm_run` loop had already been triggered by the `VMCheckBoundNotification` function.

For the second case, I explicitly set the `SEL4_VMENTER_RESULT_NOTIF` flag in the `VMCheckBoundNotification` function, but that didn't work either.

That leaves the third issue, a data race that I just don't know about.
Did I miss something with my investigation? Has anyone else run into this problem? I'll reply to this issue if I do find a fix.