Improve WHATWG Url Spec Conformance #802
Comments
Sounds all fine, but a word of caution regarding the use of "standard": WHATWG is not a standard, but rather browser vendors working on what could become a standard. For URL they are trying to replace the previous RFCs (some details: https://en.wikipedia.org/wiki/WHATWG). The reason AngleSharp follows WHATWG is that AngleSharp is interested in unlocking the same potential of the web that is usually only available to web browsers. As noted, URLs play an important role. The other thing is that I'm glad "maintenance" was mentioned. I don't think it has been fully realized at this point, but WHATWG provides a living standard. Many things change (some just slightly, but even the smallest change may have some impact, e.g., unlocking a new (edge) case). Hence when Besides proper IDN mapping (which we now have) the only thing missing from the previous spec was IPv4/IPv6 parsing. Again, that does not exclude making improvements or being up-to-date - that was just to put what is there in historical perspective and use it as an example for expectations with this issue (whatever is done now will also be invalid / outdated in weeks / months / years if it's not maintained continuously). |
I would like to use the term Living Standard to describe WHATWG. Even if it is not a formal standard, it is the de facto standard today.
We are looking for a solution that follows WHATWG for a similar reason. To serve my customers, who are using modern browsers today, I need to pay the cost of catching up with the browsers.
I did not realize this point at the beginning. That's why I used the bugfix prefix for my PRs. I would like to apologize for that. After working on some changes, looking at the changelog of wpt-tests, and discussing with you, I am now aware of it.
Being aware of this, I would like to make some changes so that it is easier to catch up with the living standard in the future (the test change I proposed).
I have this need as well, but I did not find any time-frozen URLs for the living standard. Do you have any solution for this? Thanks! |
Not really - there is the Internet Archive and the possibility to link against the Git repo (e.g., here https://github.com/whatwg/url - GitHub allows links also with specific commit hashes). Maybe we should build up our own mirror of the published spec (and others / specs we use) which could be then linked (i.e., having something like https://anglesharp.github.io/whatwg/url -> snapshot of the URL spec we follow; would make it easy to compare against the real / current thing). Best thing would be an annotated mirror with references / entry points of uses within AngleSharp / aux libraries. I feel this will be some effort and we should align / have a good strategy first. I could cover the HTML DOM parts and maybe you can cover URL? |
If we only want to link to the source file, we don't need to mirror the spec. We can link to the source file with a commit hash, although that will be inconvenient to read. We cannot get a link by mirroring the spec, since it does not seem to use GitHub Pages but is deployed somewhere else. If we want a link to a particular version, I think we need to change the build file and publish the compiled pages to
Yes, after we decide the strategy I will cover |
Preview of my test snapshot of URL spec:
Steps I used after fork:
$ git checkout --orphan gh-pages
$ git rm -rf --dry-run . # preview files to be deleted
$ git rm -rf . # actually delete the files
$ git add .
$ git commit -m'init'
$ git push origin gh-pages
Problem:
Any idea? |
When you fork (or clone) the existing repository, you also get the full history, so you can go to any point in time. Hence running
git checkout af93f92a7b937ddacfaa7ce8c158a18a83c9c9d7
make
mv url.html index.html
# ...
would check out the state right after some IPv6 clarifications were made (Feb 2017). Does that help? |
I implemented the latest IPv6, which should be the current version. Why should we worry about the previous version now?
I know how to check out a historical version. The problem is, when you generate If you are fine with my solution, please fork the URL standard to This cannot be done with a PR since there is no I can rename the HTML file with a date so that the time-frozen URL will look like After this is done I will refer to this snapshot when implementing IPv4 or updating any logic in |
Okay now I understand - well I would just run a
Sounds great - I'll do it right away. Thanks for your contribution and ongoing efforts! Much appreciated! 🍻 |
You should have received an invite - the new repo can be found at https://github.com/AngleSharp/Specification-Url. Hope I've set all permissions correctly. Just let me know if there are any problems! |
It's active and already published! Looks great! |
I have finished IPv6 parsing, but I will hold it back for now. I will try to improve the test cases first, so that we know how many test cases are currently failing. Fixing some of the test cases may require structural changes. Another example is that IPv4 and IPv6 serialization needs an extra property to store the binary form of the IP address, so we need an I plan to keep track of the progress in my fork. |
An early test implementation And current result Trying to have better output format |
Stumbled upon this thread via a reference from wpt: https://url.spec.whatwg.org/commit-snapshots/ exists. |
Great URL, thanks @annevk! |
New Feature Proposal
Description
This proposal contains 2 parts:
Url to catch up with the latest WHATWG URL Living Standard
Background
After reading the source code, I found that Url is not aligned with the latest WHATWG URL Living Standard.
The difference is trivial, but it is blocking me from using Url directly in my project.
The main difference is that Url is not returning parse failure for some invalid input. In the current implementation, this shows up as a wrong value of the IsInvalid property after parsing.
For example,
Port number validation is not returning failure when the port number is larger than 65535.
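The port check mentioned above can be sketched as follows. This is a minimal, language-neutral illustration in Python, not AngleSharp's C# code: the function name parse_port and the use of ValueError to signal a parse failure are assumptions made for this example. It follows the rule, as I read the URL Standard, that a port made of ASCII digits is a failure once its value exceeds 65535.

```python
from typing import Optional

MAX_PORT = 65535  # 2**16 - 1, the largest valid port number
ASCII_DIGITS = "0123456789"


def parse_port(buffer: str) -> Optional[int]:
    """Parse the port part of an authority (hypothetical helper).

    Returns None for an empty port (the scheme default applies),
    the numeric port otherwise, and raises ValueError on failure.
    """
    if buffer == "":
        return None
    # Only ASCII digits are accepted; str.isdigit() would also allow
    # non-ASCII digits, so the check is written out explicitly.
    if any(ch not in ASCII_DIGITS for ch in buffer):
        raise ValueError(f"port contains a non-digit: {buffer!r}")
    port = int(buffer)
    if port > MAX_PORT:
        raise ValueError(f"port out of range: {port}")
    return port
```

With this sketch, parse_port("8080") returns 8080, while parse_port("65536") raises ValueError, which a caller would map to a parse failure (for Url, presumably IsInvalid being true).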
I heard from @FlorianRappl in another thread about this:
I agree it is highly optional for most use cases, but for me it is not. I'm trying to get all links from web pages using AngleSharp.Html and to log all invalid links for security-related research. By "invalid" I mean that the link cannot be opened in a browser. If you try the example in a browser's address bar, it will redirect you to a search engine with the invalid URL as the keyword. If you have such a link in an HTML file like this:
you will see it links to an empty page.
I don't see any negative side effects of making this example URL return parse error here, but I do see that I'm getting wrong results with the current Url, and I see "potential negative side-effects" of not following the standard. Contrary to your statement, I found that current browsers follow the standard pretty well, and even where they fail some test cases, I can find information in issues like this
I understand that it may not be worth implementing some parts of the standard for performance reasons. But in that case developers need to understand which parts of the standard we are currently not following and why, so they can make decisions more easily. Today, I have to find out which parts we are not following by myself, case by case.
My solution is to catch up with the "latest living standard as much as possible", to show which parts of the standard we are not following by reporting the "failing test cases" in web-platform-tests, and to document why we are not following them.
Some of the changes will be trivial, like the port number validation example, and will not affect performance at all. Some of the changes may be bigger and will have an impact on performance; for those we can do profiling and then decide.
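The idea of reporting "failing test cases" could look roughly like the sketch below. It is written in Python purely for illustration and is not the test change proposed for AngleSharp. The entry shape (comment strings mixed with objects that carry "input", "base", and either "failure": true or an expected "href") is modeled on the URL test data in web-platform-tests as I understand it; the two sample entries, failing_cases, and lenient_parse are made up for this example.

```python
import json
from typing import Callable, List, Optional

# Hypothetical sample in the assumed shape of the wpt URL test data.
SAMPLE_CASES = json.loads("""[
  "Plain strings are comments and are skipped",
  {"input": "http://example.com:65536/", "base": null, "failure": true},
  {"input": "http://example.com:8080/", "base": null,
   "href": "http://example.com:8080/"}
]""")

# A parser under test: returns the serialized href, or None on failure.
Parser = Callable[[str, Optional[str]], Optional[str]]


def failing_cases(cases: list, parse: Parser) -> List[str]:
    """Return the inputs for which the parser disagrees with the data."""
    failing = []
    for case in cases:
        if isinstance(case, str):
            continue  # comment entry
        href = parse(case["input"], case.get("base"))
        if case.get("failure", False):
            ok = href is None
        else:
            ok = href == case["href"]
        if not ok:
            failing.append(case["input"])
    return failing


def lenient_parse(url: str, base: Optional[str]) -> Optional[str]:
    """Stand-in parser that never reports failure (for demonstration)."""
    return url
```

Running failing_cases(SAMPLE_CASES, lenient_parse) lists only the out-of-range port case, which is the kind of list that could be published next to the documentation of known deviations.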
This change may be painful since AngleSharp focuses on HTML parsing, not URLs. But URLs are closely related, and similar libraries in other languages also have their own URL implementation, like jsdom.
@FlorianRappl mentioned "what if port number can have more space in future". If that happens, it will be reflected in the Standard first. The WHATWG Standard's principle is to provide backward compatibility (so it would not directly increase the range; there would surely be some extra marker to declare the new range), so existing implementations should still be fine.
Today, I believe AngleSharp.Url is already the closest implementation of the WHATWG URL Standard, so why not make it even better so that more C# developers can use it? More effort may be needed to maintain the Url class, but I think it is worth it since this is the only option in the C# community today.
I talked to the .NET community about the need for a WHATWG Standard Uri implementation. They will not consider it in the next milestone, but maybe in the future. I would like to push for it internally within Microsoft as well, and I hope we can have them maintain the URL implementation in the future.
Specification
I plan to make the following changes: