In some circumstances, TCP receive queue fills up despite sockets being closed #2519
Comments
Very strange, I've never heard of any similar report, even on the other versions you mention. Was the "show info" above produced while the problem was happening? Or can't you connect to the stats socket anymore when the problem is happening? Does it recover only by restarting haproxy? If you're able to connect to the stats socket, sending a "show stat" and a "show fd" could help.

What I suspect is that it could be related to the size of the backlog: I'm seeing a sessrate of 446 in your "show info" output, which gives an idea of what connection rate is acceptable with SSL negotiation. Let's assume your server can deal with 2k sessions/s including SSL etc. If you receive an attack with more than that, the accept queue will fill up. It will then take 40s to process the last entry in the queue at 2k/s, and by then the client will have aborted, but there's no way to know, so it costs a handshake calculation for nothing.

In such a case, one approach is to limit the backlog to a much lower value via the "backlog" keyword on the "bind" line. That way, during an attack, you won't be accumulating connections that users have already given up on, and recovery can be much faster. Just set it to 2-3x the max rate you can accept so that users don't needlessly wait more than 2-3s before getting an error.
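For illustration, a minimal sketch of what that could look like, assuming a frontend that can sustain roughly 1k SSL handshakes per second (the frontend/backend names, address and certificate path are placeholders, not taken from this report):

```
frontend fe_app
    # Hypothetical example: cap the accept queue at 2-3x the sustainable
    # handshake rate so that, during a surge, haproxy does not spend time
    # on handshakes for clients that have already given up waiting.
    bind :443 ssl crt /etc/haproxy/certs/example.pem backlog 3000
    default_backend be_app

backend be_app
    server s1 127.0.0.1:8080
```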
Yes
Nope, connecting to the socket is fine
That is correct
Will do as soon as I can identify a host that has the issue again
We set a max SSL rate of 3k sessions (in fact the limits are dynamic; we set
Ok, I will try tinkering with that.
Ok, so on another host with 4 cores, and therefore with our dynamic maxconn set to 160000, we get this:
This time we hit the non-dynamic limit of
And the result of
Maybe what these two instances have in common is that they both hit their
I had not noticed at first that you were using maxsslrate. Pretty interesting. Maybe you're facing a race condition that prevents it from properly recovering when the limit is met. That's something reasonably easy to try to reproduce on our side by setting a lower limit. At least your "show fd" shows the listener is active (thus not disabled) in the poller. Thanks for these, we'll need a bit of time to analyse it more deeply now.
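For anyone wanting to attempt the same reproduction, a rough sketch of forcing the limit with deliberately low values (all names, paths and numbers here are invented for the test, not taken from the reporter's setup):

```
global
    # Deliberately tiny limits so a basic load generator hits them
    # within seconds and the recovery behaviour can be observed.
    maxsslrate 10
    maxconn 1000
    stats socket /tmp/haproxy-test.sock mode 600 level admin

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend fe_test
    bind :8443 ssl crt /tmp/test.pem
    default_backend be_test

backend be_test
    server s1 127.0.0.1:8080
```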
Could it be related to #2476 then?
I checked. At first glance it seems similar, but I doubt it is related, especially because here the listeners don't seem to be limited when the issue occurred.
However, the fix was backported, so it can be tested.
[EDIT: previous version had typo 2.6.9 -> 2.9.6]
Ha! Now that I re-read my message, I realise there is a typo: it's 2.9.6, not 2.6.9. This applies to my last comment as well; adding an [EDIT] note.
Thanks for the confirmation :) So it is indeed another issue.
Just FYI, we have switched from setting
Did you manage to reproduce it on your end?
Hi François. Thanks for the update. No repro on our side for now. What surprises me is that the code used to deal with maxsslrate is exactly the same (and uses the same code paths) as the code dealing with the global rate limit. So if something is broken there (and it's fairly possible that a race remains), it should affect all limits, not just SSL.
[EDIT: previous version had typo 2.6.9 -> 2.9.6]
Detailed Description of the Problem
In some circumstances, which I have not been able to establish clearly, HAProxy stops accepting new connections (connections time out).
We run haproxy in a number of clusters, all configured exactly the same, and it is always the same couple of them that have this recurring issue. It could be client-behaviour related, as these clusters are used by different clients.
As you can see here, the receive queue is full, and maxconn is set to 80000 for that particular frontend. According to show info, CurrConns: 5378
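For reference, a sketch of the kind of commands used to capture this (the stats socket path is an assumption; adjust it to whatever your configuration binds):

```
# Recv-Q on the listening socket (a full receive queue shows up here)
ss -ltn 'sport = :443'

# Counters from the running process over the stats socket
echo "show info" | socat stdio /var/run/haproxy/admin.sock | grep -E 'CurrConns|Maxconn|SslRate'
echo "show stat" | socat stdio /var/run/haproxy/admin.sock
echo "show fd"   | socat stdio /var/run/haproxy/admin.sock
```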
Other debug information:
This bug happens with 2.8.3 on Amazon Linux 2023. This bug report is based on haproxy-next 2.9.6, with patches up to c788ce33af85a28fa66f591cb65a7ea6c0f92007, which I tried in order to see whether it fixed the issue.
Expected Behavior
If CurrConns < maxconn, haproxy should keep accepting new connections.
Steps to Reproduce the Behavior
In the case of the problematic clusters, there doesn't seem to be any particular trigger...
Do you have any idea what may have caused this?
No
Do you have an idea how to solve the issue?
No
What is your configuration?
haproxy.cfg
conf.d/app.cfg
conf.d/stats.cfg
conf.d/app_be.cfg
/etc/sysctl.d/10-haproxy.conf
Last Outputs and Backtraces
No response
Additional Information
Any local patches applied
haproxy-next 2.9.6, with patches up to c788ce33af85a28fa66f591cb65a7ea6c0f92007
Environment specificities
Arm64, ECDSA certificate
No backend servers defined in configuration, backends are added and removed via the socket using
Weights are recalculated every minute, and set using
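For context, a hedged sketch of how this is typically done over the runtime API (the backend/server names and socket path are illustrative only, not our actual ones):

```
# Add or remove a backend server at runtime (dynamic servers)
echo "add server be_app/srv1 10.0.0.11:8080 check" | socat stdio /var/run/haproxy/admin.sock
# (the server must be in maintenance before 'del server')
echo "del server be_app/srv1" | socat stdio /var/run/haproxy/admin.sock

# Push recalculated weights once a minute
echo "set weight be_app/srv1 50" | socat stdio /var/run/haproxy/admin.sock
```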