HTTP/2 + TLS spends a lot of time in recv (buffering issue) #13416
Comments
@icing's comment when this was previously brought up on the mailing list: https://curl.se/mail/lib-2024-04/0000.html
If I understand the reported issue correctly, you would prefer some receive buffering in the socket filter, so that the TLS filter above can do its small reads without involving socket operations. Is that correct?
While it's a tradeoff, and not exactly what the other thread is about, I believe that when TLS is used, the lack of buffering on recv is likely to increase latency quite a bit: TLS issues a 5-byte header read and then the payload read, and that extra syscall overhead adds up. If we look at the SSL_read call, the upper layer is expecting to read 16kb (the chunk size for H2); that 16kb requires at least 4 `::recvfrom` calls, and maybe a lot more: header:5, body:8k, header:5, body:8k. From my understanding, OpenSSL has a buffering BIO mechanism that could be added in; alternatively, there is the existing curl bufq available.
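For illustration, here is a minimal sketch of the buffering-BIO idea mentioned above: a `BIO_f_buffer()` filter is pushed in front of the socket BIO on the read side, so the record-header and payload reads hit a user-space buffer rather than the kernel. This is not how curl's connection filters are actually wired; the function name and buffer size are assumptions for the example.

```c
/* Sketch: interpose OpenSSL's buffering filter BIO on the read side, so
 * the 5-byte TLS record-header read and the payload read are served from
 * a user-space buffer instead of two separate recv() syscalls.
 * Illustrative only: attach_buffered_read_bio() is a made-up name. */
#include <openssl/bio.h>
#include <openssl/ssl.h>

static int attach_buffered_read_bio(SSL *ssl, int sockfd)
{
  BIO *sock = BIO_new_socket(sockfd, BIO_NOCLOSE);
  BIO *buf = BIO_new(BIO_f_buffer());   /* the buffering filter BIO */
  if(!sock || !buf) {
    BIO_free(sock);                     /* BIO_free(NULL) is a no-op */
    BIO_free(buf);
    return 0;
  }
  BIO_set_buffer_size(buf, 16384);      /* roughly one full TLS record */
  BIO_up_ref(sock);                     /* sock is referenced twice below */
  /* Reads go through buf -> sock; writes bypass the buffer so that
     handshake bytes are not held back waiting for a flush. */
  SSL_set_bio(ssl, BIO_push(buf, sock), sock);
  return 1;
}
```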
Citing https://google.github.io/tensorstore/: "Natively supports multiple storage drivers, including Google Cloud Storage". Are we talking about improving performance for Google Cloud Storage?
Not really? Somewhat? Tensorstore supports a lot of potential backends. The HTTP/2 stack is shared across our "http", "aws", and "gcs" access methods (it uses neither the google-cloud-cpp libraries nor the AWS C++ client libraries), and for most Google uses the gRPC backend outperforms the HTTP/2 backend. It just happens that testing against GCS is pretty convenient; I could alternatively test against AWS or a generic HTTP/2 dataset. The project itself is developed for our connectomics research. It is open source so our collaborators have access to it, and it has uses beyond connectomics for many large datasets. See the blog post: https://research.google/blog/tensorstore-for-high-performance-scalable-array-storage
To be clear, we are observing about 50% of CPU time spent in calls to recv in profiles. Since ultimately recv just performs a memcpy from a kernel buffer, in principle this could be reduced to almost zero. The remainder of the time is split between decryption in OpenSSL and other memory copies, so we expect that single-thread throughput could be nearly 2x the current throughput if the excessive small recv calls were eliminated. It would be best to avoid the excessive recv calls without introducing additional memory copies, though it is not clear how easily that can be accomplished. But even if additional copies are needed, performance is still likely to improve, because a memcpy is much cheaper than a recv call. Indeed, failing to use buffered I/O when performing small reads from a file descriptor is a classic performance pitfall.
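To make the pitfall concrete, here is a minimal sketch of user-space read buffering in front of recv(), in the spirit of the bufq idea mentioned earlier. All names here are illustrative, not curl's actual API.

```c
/* Minimal user-space read buffer in front of recv(): small reads become
 * memcpy()s and the kernel is entered only when the buffer needs a refill. */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

struct rbuf {
  unsigned char data[16384];  /* roughly one TLS record of ciphertext */
  size_t pos, len;            /* consumed vs. filled */
};

static ssize_t buffered_recv(int fd, struct rbuf *b, void *out, size_t n)
{
  if(b->pos == b->len) {             /* buffer drained: one big recv() */
    ssize_t got = recv(fd, b->data, sizeof(b->data), 0);
    if(got <= 0)
      return got;                    /* error or EOF passes through */
    b->pos = 0;
    b->len = (size_t)got;
  }
  size_t avail = b->len - b->pos;
  if(n > avail)
    n = avail;                       /* short reads are fine for TLS */
  memcpy(out, b->data + b->pos, n);  /* memcpy instead of a syscall */
  b->pos += n;
  return (ssize_t)n;
}
```

With something like this at the socket layer, the 5-byte header read and the payload read that follows both hit memory, and the kernel is entered at most once per buffer refill.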
I agree that curl can be improved here for better performance in high-traffic applications. I propose to add this to the project's TODO list, so that anyone willing to invest time can see it. Maybe such an individual will come forward in the future, or some organisation will decide to make a financial contribution to push this. Does this sound good?
In line 1456 in 86805bf, if I change …
I did this
As in #13403: in tensorstore we (currently) use a single multi-handle to multiplex a lot of HTTP/2 transfers. Those transfers appear to be slower than expected.
The following benchmark, which downloads 1000 files over HTTP/2 (limited to 32 concurrent transfers), where each file is 64MB, demonstrates an issue with an excess of `::recvfrom` calls.

At the lowest level of the stack, `nw_in_read` calls `recv` with a sequence of buffers of (typically) <5>, <1403> ... repeatedly.

There is special handling for "small" buffers in `cf_socket_recv`, however that is defined as < 1024 bytes and relies on `ctx->buffer_recv` being set, and I cannot see how it is ever TRUE.

Relevant stack trace & buffer lengths:
Repro:
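The actual repro script did not survive the copy above. As a stand-in, here is a hedged sketch of a comparable benchmark: a single multi handle with HTTP/2 multiplexing, 1000 downloads, and a 32-connection cap standing in for the 32-transfer limit. The URL pattern and limits are placeholders, not the reporter's setup; run it under `strace -e trace=recvfrom` to observe the call pattern.

```c
/* Hypothetical repro sketch (placeholders, not the original benchmark):
 * download many files over HTTP/2 through one multi handle, then inspect
 * the recvfrom() pattern with strace. Error checking and easy-handle
 * cleanup via curl_multi_info_read() are omitted for brevity. */
#include <curl/curl.h>
#include <stdio.h>

#define NFILES 1000

static size_t sink(char *data, size_t size, size_t nmemb, void *userp)
{
  (void)data; (void)userp;
  return size * nmemb;               /* discard the payload */
}

int main(void)
{
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURLM *multi = curl_multi_init();
  curl_multi_setopt(multi, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX);
  curl_multi_setopt(multi, CURLMOPT_MAX_TOTAL_CONNECTIONS, 32L);

  for(int i = 0; i < NFILES; i++) {
    CURL *h = curl_easy_init();
    char url[128];
    snprintf(url, sizeof(url),
             "https://example.com/data/chunk-%04d", i);  /* placeholder */
    curl_easy_setopt(h, CURLOPT_URL, url);
    curl_easy_setopt(h, CURLOPT_HTTP_VERSION,
                     (long)CURL_HTTP_VERSION_2TLS);
    curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, sink);
    curl_multi_add_handle(multi, h);
  }

  int running;
  do {
    curl_multi_perform(multi, &running);
    curl_multi_poll(multi, NULL, 0, 1000, NULL);
  } while(running);

  curl_multi_cleanup(multi);
  curl_global_cleanup();
  return 0;
}
```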
Example strace output:
Here we clearly see the TLS header read and then the TLS body read as separate `::recv` calls.

I took a histogram of the `recvfrom` calls; here are all the call sizes with more than 500 occurrences in the download. Notice that about half of the calls are for 5 bytes (TLS headers), while the others vary, with a large number reading actual 8k blocks. The average recv call size is about 4090 bytes, which is consistent with roughly every other call being a 5-byte header read: (5 + 8192) / 2 ≈ 4098.
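For context on why half of the reads are exactly 5 bytes: each TLS record begins with a 5-byte header carrying the payload length, so an unbuffered reader must issue a small read before every large one. A sketch of the framing (per RFC 8446, nothing curl-specific; shown as a plain struct for illustration, though on the wire the length is a big-endian 16-bit integer):

```c
/* TLS record framing (RFC 8446, section 5.1): a reader must first fetch
 * this 5-byte header to learn the record length, then read that many
 * bytes, which is why unbuffered reads alternate tiny and large recvs. */
#include <stdint.h>

struct tls_record_header {
  uint8_t  type;          /* content type; 0x17 = application data */
  uint8_t  major, minor;  /* legacy version, 0x03 0x03 */
  uint16_t length;        /* payload length, at most 2^14 + 256 bytes */
};
```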
I expected the following

Fewer calls to `recvfrom`; buffering at the lowest layer for fewer context switches.

curl/libcurl version
curl 8.7.1
operating system
5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux