utf8 parsing performance #4

Open
ConnyOnny opened this issue Jan 11, 2017 · 4 comments

@ConnyOnny

Hi, I was eager to benchmark your table-based utf8 parsing approach against the standard library implementation, so I did:
https://github.com/ConnyOnny/utf8perf

If my testing setup is not wrong (see main.rs), it seems that avoiding branches is not everything.

@jwilm
Collaborator

jwilm commented Jan 11, 2017

Thanks for putting this together! I've been wanting to do some benchmark work.

There were a few problems with your test setup. I opened a PR. That said, the results aren't much better, but at least they are correct!

Read 21078000 bytes.
Parser "tbl" needed a median 0.055256400 seconds to parse 11431500 characters.
Parser "std" needed a median 0.029445756 seconds to parse 11431500 characters.

Going to mark this as a bug because we should be able to beat std easily.
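
For scale, the medians reported above can be turned into throughput figures. The snippet below is only a small arithmetic sketch: the helper names `throughput_mb_per_s` and `slowdown` are made up for illustration and are not part of utf8parse or of the benchmark repository; the constants are copied from the output above.

```rust
// Figures copied from the benchmark output quoted in this comment.
const BYTES_READ: u64 = 21_078_000;
const TBL_MEDIAN_S: f64 = 0.055256400;
const STD_MEDIAN_S: f64 = 0.029445756;

/// Hypothetical helper: throughput in MB/s, taking 1 MB = 10^6 bytes.
fn throughput_mb_per_s(bytes: u64, seconds: f64) -> f64 {
    bytes as f64 / seconds / 1e6
}

/// Hypothetical helper: how many times slower `slow_s` is than `fast_s`.
fn slowdown(slow_s: f64, fast_s: f64) -> f64 {
    slow_s / fast_s
}
```

Calling `throughput_mb_per_s(BYTES_READ, TBL_MEDIAN_S)` and `throughput_mb_per_s(BYTES_READ, STD_MEDIAN_S)` gives roughly 381 MB/s for "tbl" and roughly 716 MB/s for "std", and `slowdown(TBL_MEDIAN_S, STD_MEDIAN_S)` comes out a little under 1.9.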

@jwilm added the bug label Jan 11, 2017
@carl-erwin

carl-erwin commented Jan 12, 2017

Hi, some years ago I implemented a UTF-8 decoder with the same kind of table,
and used Björn Höhrmann's article http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ as a reference for benchmarking.
In his version the state/mask table is more compact than the 8*256 bytes used by utf8parse, and is therefore more cache friendly.
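
To illustrate why a class-based table can be so much smaller, here is a minimal sketch of a UTF-8 validator that first maps each byte to one of 12 classes and then indexes a 9-state transition table. It is neither Höhrmann's actual table nor utf8parse's code: the names `Dfa`, `class_of` and `step`, the class numbering and the state numbering are all assumptions made for this example, with byte ranges taken from the well-formed UTF-8 table in the Unicode Standard. It only validates; it does not decode code points.

```rust
const ACCEPT: u8 = 0;
const REJECT: u8 = 1;
const NUM_CLASSES: usize = 12;
const NUM_STATES: usize = 9;

/// Map a byte to one of 12 classes (illustrative numbering).
fn class_of(b: u8) -> u8 {
    match b {
        0x00..=0x7F => 0,               // ASCII
        0x80..=0x8F => 1,               // continuation, low range
        0x90..=0x9F => 2,               // continuation, middle range
        0xA0..=0xBF => 3,               // continuation, high range
        0xC2..=0xDF => 4,               // lead byte of a 2-byte sequence
        0xE0 => 5,                      // 3-byte lead, must avoid overlongs
        0xE1..=0xEC | 0xEE..=0xEF => 6, // ordinary 3-byte lead
        0xED => 7,                      // 3-byte lead, must avoid surrogates
        0xF0 => 8,                      // 4-byte lead, must avoid overlongs
        0xF1..=0xF3 => 9,               // ordinary 4-byte lead
        0xF4 => 10,                     // 4-byte lead, must stay <= U+10FFFF
        _ => 11,                        // C0, C1, F5..FF: never valid
    }
}

/// Next state for a (state, class) pair; anything not listed rejects.
fn step(state: u8, class: u8) -> u8 {
    match (state, class) {
        (0, 0) => ACCEPT,
        (0, 4) => 2,          // one continuation byte still needed
        (0, 6) => 3,          // two continuation bytes still needed
        (0, 5) => 4,          // after E0: next must be A0..BF
        (0, 7) => 5,          // after ED: next must be 80..9F
        (0, 9) => 6,          // three continuation bytes still needed
        (0, 8) => 7,          // after F0: next must be 90..BF
        (0, 10) => 8,         // after F4: next must be 80..8F
        (2, 1..=3) => ACCEPT, // final continuation byte
        (3, 1..=3) => 2,
        (4, 3) => 2,
        (5, 1..=2) => 2,
        (6, 1..=3) => 3,
        (7, 2..=3) => 3,
        (8, 1) => 3,
        _ => REJECT,
    }
}

/// Both tables together: 256 + 9 * 12 = 364 bytes.
struct Dfa {
    classes: [u8; 256],
    transitions: [u8; NUM_STATES * NUM_CLASSES],
}

impl Dfa {
    fn new() -> Self {
        let mut classes = [0u8; 256];
        for b in 0..=255u8 {
            classes[b as usize] = class_of(b);
        }
        let mut transitions = [REJECT; NUM_STATES * NUM_CLASSES];
        for s in 0..NUM_STATES {
            for c in 0..NUM_CLASSES {
                transitions[s * NUM_CLASSES + c] = step(s as u8, c as u8);
            }
        }
        Dfa { classes, transitions }
    }

    /// Two table lookups per byte; true if `bytes` is well-formed UTF-8.
    fn is_valid(&self, bytes: &[u8]) -> bool {
        let mut state = ACCEPT;
        for &b in bytes {
            let class = self.classes[b as usize] as usize;
            state = self.transitions[state as usize * NUM_CLASSES + class];
            if state == REJECT {
                return false;
            }
        }
        state == ACCEPT
    }
}
```

With `let dfa = Dfa::new();`, a call such as `dfa.is_valid("héllo".as_bytes())` returns `true`, while `dfa.is_valid(&[0xC0, 0x80])` returns `false`. The point of the sketch is the footprint: 364 bytes of tables here, against the 8*256 = 2048 bytes mentioned above, so the whole automaton fits in a handful of cache lines.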

@jwilm
Collaborator

jwilm commented Jul 11, 2017

I've done some minimal optimization effort in #8. When I've got a bit more time, I plan to look into Björn Höhrmann's article mentioned by @carl-erwin to see if we can do better.

As for why the std parser does so much better, it seems to be due to optimizations that become available when the parser can look at multiple bytes at once, rather than being fed one byte at a time.
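
As a rough illustration of what "viewing multiple bytes at once" can buy, the sketch below skips over ASCII eight bytes at a time by testing the high bit of every byte in a 64-bit word with a single AND. This shows the general technique only; it is not std's actual code, and `ascii_prefix_len` is a hypothetical helper name.

```rust
/// Hypothetical helper: length of the leading all-ASCII run in `bytes`.
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    // A byte is non-ASCII exactly when its top bit is set.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;
    let mut i = 0;

    // Word at a time: one load and one AND cover 8 bytes.
    while i + 8 <= bytes.len() {
        let mut word = [0u8; 8];
        word.copy_from_slice(&bytes[i..i + 8]);
        if u64::from_ne_bytes(word) & HIGH_BITS != 0 {
            break; // some byte in this word is non-ASCII
        }
        i += 8;
    }

    // Byte at a time: the tail, or the word that contained a non-ASCII byte.
    while i < bytes.len() && bytes[i] < 0x80 {
        i += 1;
    }
    i
}
```

For example, `ascii_prefix_len("plain ascii, then ü".as_bytes())` returns 18, the number of bytes before the two-byte `ü`. A parser that receives its input one byte per call cannot take this kind of shortcut, which may account for part of the gap in the benchmark.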

@luser
Contributor

luser commented Dec 6, 2017

You might also be interested in encoding_rs, which is currently shipping in Firefox.
