Many people report scheduler panics recently #6424
Comments
Getting the rustc versions and lockfiles from reporters would be a good start.
Sadly, I cannot ping people from the other threads in here.
Summary of our instance of this issue (neondatabase/neon#7162):
@jcsp what rustc version was this?
@Noah-Kennedy rustc version was 1.76.0. Adding to what John said above, we also hit this assertion in the scheduler. I can provide the stack trace, but I doubt it would be of much use.
@VladLazar did you do any recent bumps other than to tokio in your dependency tree? I'd like to dig in further with you folks; you seem to be the best repro rn.
We currently suspect that this originates in a dependency, and are trying to track down which one.
Also @jcsp @VladLazar what platform are you guys seeing this on?
So, I am currently trying to narrow down the list of external dependencies that could be causing this. My current ask to anyone who comes across this issue is that you look through any other recent changes in your dependency tree and bring that information to this thread.
@Noah-Kennedy thanks for looking into this.
x86_64, Linux 5.10.0-14-amd64 Debian
Here's the list of recent changes which predate the panics from our lockfile:
Let us know if there's anything else we can help with.
tysm! Next steps here are to go through any other changes in your codebase that stand out, to confirm we aren't conflating things, and to go through the changes in this list. I'm about to grab lunch, but will go through these crates after.
@VladLazar looking at neondatabase/neon#7048, I suspect there may be more to go through.
The listed deps are all pretty old for the most part; I feel like we would have heard about issues with them by now.
@VladLazar you or someone else with Neon will need to dig in more, possibly using tools like valgrind. I am not seeing anything noteworthy.
For #6416, we were on Rust version 1.76. I'm still working out any recent changes in dependencies. We deploy frequently and this issue is rare, so who knows how far back a change was deployed that could have resulted in this.
@Noah-Kennedy Yeah, I did that initial check manually. I've scripted it and went through all
common_deps.json: List of common dependencies between
The Rust compiler had also been updated pretty recently, on the 8th of Feb: 1.75.0 -> 1.76.0
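The script mentioned above is not shown in the thread. As a rough illustration only, a check for dependencies shared between two lockfiles could look something like the sketch below. It assumes the usual `Cargo.lock` layout of `[[package]]` tables with `name` and `version` keys, and uses a simplified line-based parser rather than a real TOML parser. The function names `parse_lock` and `common_deps` are hypothetical and do not come from any project in this thread.

```rust
use std::collections::BTreeSet;

/// Collect (name, version) pairs from the text of a Cargo.lock file.
/// Assumption: each package is a `[[package]]` table in which `name`
/// appears before `version`, as cargo normally writes it.
fn parse_lock(text: &str) -> BTreeSet<(String, String)> {
    let mut packages = BTreeSet::new();
    let mut in_package = false;
    let mut name: Option<String> = None;
    for line in text.lines() {
        let line = line.trim();
        if line.starts_with('[') {
            // A new table header starts; only `[[package]]` tables matter here.
            in_package = line == "[[package]]";
            name = None;
            continue;
        }
        if !in_package {
            continue;
        }
        if let Some(value) = line.strip_prefix("name = ") {
            name = Some(value.trim_matches('"').to_string());
        } else if let Some(value) = line.strip_prefix("version = ") {
            if let Some(n) = name.take() {
                packages.insert((n, value.trim_matches('"').to_string()));
            }
        }
    }
    packages
}

/// Packages present at exactly the same version in both lockfiles,
/// sorted by name and then version.
fn common_deps(a: &str, b: &str) -> Vec<(String, String)> {
    let a = parse_lock(a);
    let b = parse_lock(b);
    a.intersection(&b).cloned().collect()
}

fn main() {
    // Hypothetical usage: common_deps <first Cargo.lock> <second Cargo.lock>
    let args: Vec<String> = std::env::args().collect();
    if args.len() != 3 {
        eprintln!("usage: {} <Cargo.lock> <Cargo.lock>", args[0]);
        return;
    }
    let a = std::fs::read_to_string(&args[1]).expect("cannot read first lockfile");
    let b = std::fs::read_to_string(&args[2]).expect("cannot read second lockfile");
    for (name, version) in common_deps(&a, &b) {
        println!("{name} {version}");
    }
}
```

Running this over the lockfiles of two affected projects would print the crates they share at identical versions, which is one way to shortlist suspects. It says nothing about when each crate was last bumped, so it would still need to be combined with the lockfile history.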
@hallmaxw have you seen this again since then, and how frequently has this occurred?
It only happened once. We haven't seen it since then.
Same situation. It panicked three days ago and now runs with no problem.
@Noah-Kennedy don't mean to impose, but have you looked into this further by any chance? We hit the same class of life-cycle assertion again yesterday on tokio 1.37.0 (this one more specifically).
What is the message (
I don't know if this is relevant, but I'm getting an illegal opcode that might be of interest? Context: I'm spawning ~400 connections to various database clusters, like so:
What's interesting is that when I run the binary in debug mode, it works, but if I run it through valgrind (or on k8s), it doesn't:
It's the same error (although I've not checked the core dump from k8s). I initially suspected rustls, but I've since changed my dependencies to use native-tls, to narrow down the suspects. Here's my complete Cargo.toml:
Let me know if there's anything else I can do to help. Or, indeed, if I'm in the wrong thread/issue!
Edit: I forgot that you wanted the rustc version too (and the lockfile). The above was compiled with
The above was built in debug mode. Building it in release mode with the same settings yields a binary that doesn't panic, but eventually times out:
Building it with
Debug Release
It should be noted that the code is a refactor of a simpler approach that looked like this:
Rebuilding the above with rustc 1.77.2 (release) yields a slow but working binary. It's fairly clear that valgrind affects performance, but all else being equal, the refactored solution is 10x slower. I don't know how much use this data is to you, but I figured I could at least leave it here.
Over the past few days, we've received an abnormal number of issues reporting a panic in the scheduler that really shouldn't happen. I'm wondering whether a recent change in a dependency or Rust itself is causing memory corruption.
If this is also happening to you, please add a comment to this issue. Please make sure to include your Tokio version and Rust version, and the versions of other relevant dependencies (e.g. include your lock file).
I will be closing the other issues as duplicates of this one, so see the references below for the list.