
Make HTML4/XHTML1 Strict doctypes non-conforming #2048

Merged
domenic merged 3 commits into master from doctypes-html4-xhtml1-nonconforming on Nov 17, 2016

@sideshowbarker sideshowbarker commented Nov 16, 2016

Member

It was never intended that the HTML4 Strict and XHTML1/1.1 Strict doctypes would remain conforming forever. Given that HTML4 is nearly 20 years old (and XHTML1 is just a reformulation of HTML4 in XML), it’s time to consider making the HTML4 Strict and XHTML1/1.1 Strict doctypes non-conforming, just as all other HTML4 and XHTML1/1.1 doctypes (and the HTML 3.2, etc., doctypes) already are.

The spec currently defines the HTML4 Strict and XHTML1/1.1 Strict doctypes as obsolete but still conforming (“obsolete permitted DOCTYPEs”) and says that “Authors should not use obsolete permitted DOCTYPEs, as they are unnecessarily long”.

The reason the spec gives for allowing them in conforming documents is to “help authors transition from HTML4 and XHTML1”.

But at this point continuing to allow HTML4 Strict and XHTML1/1.1 Strict doctypes as conforming isn’t helping authors transition; instead it seems to be having the effect of continuing to proliferate use of those doctypes long past what rightly should have been their proper expiration date.

For the HTML checker I still get issue reports from authors requesting that if they put an HTML4 or XHTML1 doctype on a document, the checker should evaluate it using HTML4/XHTML1 requirements (as the SGML/DTD-based legacy W3C validator does and as validator.nu used to do) instead of requirements in the current HTML spec.

In other words, some authors are continuing to intentionally use the HTML4/XHTML1 doctypes so that their documents can be “valid” even though they contain markup that the current HTML spec defines as non-conforming.

So, it’d be helpful if we made the spec clearly disallow use of all legacy HTML doctypes, including the HTML4 Strict and XHTML1/1.1 Strict doctypes (the only remaining legacy doctypes still allowed).

Comment thread source

<hr>

<!-- see the parser section before changing this bit -->

Member

What was meant by this?

Also, in XHTML you still need to use a DOCTYPE kinda like this for entities and we still don't have a replacement. But I guess we shouldn't really let that influence what is okay for text/html.

Member Author

> <!-- see the parser section before changing this bit -->
>
> What was meant by this?

Dunno for sure. Maybe @zcorpan knows better. But anyway I took it as a statement about effects as far as changing the contents of that section—not about dropping the whole thing entirely.

> Also, in XHTML you still need to use a DOCTYPE kinda like this for entities and we still don't have a replacement

hmm yeah I had not thought about that, because the spec doesn’t give it as a reason
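
To illustrate the entity point above: an XML parser only knows XML's five predefined entities, so a named reference such as `&nbsp;` is an error unless a DTD declares it. The sketch below uses Python's standard `xml.etree.ElementTree`. The internal DTD subset is an assumption standing in for the entity declarations that an XHTML1 doctype brings in; it is not what browsers actually load, and `parse_text` and `parses` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# No DTD: &nbsp; is not one of XML's five predefined entities.
WITHOUT_DTD = "<p>a&nbsp;b</p>"

# Hypothetical stand-in for an XHTML1 doctype: declare the entity inline.
WITH_DTD = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'


def parse_text(xml_source: str) -> str:
    """Parse the source as XML and return the root element's text."""
    return ET.fromstring(xml_source).text


def parses(xml_source: str) -> bool:
    """Return True if this parser accepts the source as well-formed."""
    try:
        parse_text(xml_source)
        return True
    except ET.ParseError:
        return False
```

Here `parses(WITHOUT_DTD)` is false because the entity is undefined, while `parses(WITH_DTD)` is true and the reference expands to a no-break space.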

@zcorpan

zcorpan commented Nov 16, 2016

Member

The parser has parse errors for doctypes other than the permitted ones I believe.

The spec has this:

> Conformance checkers may, based on the values (including presence or lack thereof) of the DOCTYPE token's name, public identifier, or system identifier, switch to a conformance checking mode for another language (e.g. based on the DOCTYPE token a conformance checker could recognize that the document is an HTML4-era document, and defer to an HTML4 conformance checker.)

Does that not address the issue for the checker?

I'd like to check the reasons we permitted these doctypes in the first place, and why they are no longer relevant. Or what the effects will be if we change this (and the behavior of the checker). Will people replace all instances of "new" elements with div instead of switching to <!doctype html>?

@annevk

annevk commented Nov 16, 2016

Member

> Does that not address the issue for the checker?

I think we basically should not allow that kind of behavior. There should be only one path for checking HTML. Not version-dependent paths.

@sideshowbarker

Member Author

> The spec has this:
>
> > Conformance checkers may, based on the values (including presence or lack thereof) of the DOCTYPE token's name, public identifier, or system identifier, switch to a conformance checking mode for another language (e.g. based on the DOCTYPE token a conformance checker could recognize that the document is an HTML4-era document, and defer to an HTML4 conformance checker.)
>
> Does that not address the issue for the checker?

No, because the checker does not (any longer) switch into any different modes based on the doctype—because I agree with what @annevk said:

> There should be only one path for checking HTML. Not version-dependent paths.
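
The "one path" position can be sketched as follows. This is a hypothetical toy, not how the HTML checker is implemented: `check`, `OBSOLETE_ELEMENTS`, and the message strings are invented for illustration. The point is that a legacy doctype contributes a message of its own but never selects a different rule set.

```python
# Small illustrative subset of elements the current HTML spec lists as obsolete.
OBSOLETE_ELEMENTS = {"center", "font"}


def check(doctype_is_conforming: bool, element_names: list[str]) -> list[str]:
    """Check a document against one rule set, whatever its doctype says.

    `doctype_is_conforming` is assumed to be computed elsewhere; it only
    adds a message and does not switch to version-specific rules.
    """
    messages = []
    if not doctype_is_conforming:
        messages.append("error: obsolete doctype")
    for name in element_names:
        if name in OBSOLETE_ELEMENTS:
            messages.append(f"error: obsolete element <{name}>")
    return messages
```

With this shape, a document that declares an HTML4 doctype and uses `<font>` gets both messages, rather than being treated as valid under HTML4 rules.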

@zcorpan

zcorpan commented Nov 16, 2016

Member

OK, so then we should remove that paragraph as well. And change the HTML parser to emit more parse errors.

Are you going to remove support for checking HTML4 from the checker completely?

@sideshowbarker

Member Author

> OK, so then we should remove that paragraph as well.

Good point—made it so.

> And change the HTML parser to emit more parse errors.

I’d prefer to do that in a separate follow-up PR—since changing the parsing algorithm potentially affects browsers and all other parser implementations, while this PR as currently scoped only affects document conformance/authors and conformance checkers.

@sideshowbarker

Member Author

> Are you going to remove support for checking HTML4 from the checker completely?

Yes, from the HTML checker I’d like to remove any traces of HTML4-related checking that still remain. However, I guess the vnu source still needs to contain an HTML4-checking path as long as the https://validator.nu/ Web UI continues to offer an HTML4-checking option (which https://checker.html5.org/ and https://validator.w3.org/nu/ do not).

The W3C will continue to offer HTML4 and XHTML1 checking through the legacy backend that https://validator.w3.org/ relies on for those. That is anyway what most people who want HTML4/XHTML1 checking actually use (not the https://validator.nu/ HTML4-checking option).

@domenic

domenic commented Nov 16, 2016

Member

I think we should do the parse errors in this PR too? They won't affect browsers, just checkers, and it seems good for them to be consistent with the requirements changed here.

@sideshowbarker

Member Author

> [changes to parse errors] won't affect browsers, just checkers

The Gecko HTML parser exposes parse errors in its View Source, but yeah, changes to parse errors otherwise don’t affect Gecko parsing behavior, or behavior in any other browsers.

That said, we do have other parsers that do error reporting—at least two of them I can think of.

> I think we should do the parse errors in this PR too… it seems good for them to be consistent with the requirements changed here.

OK, I can add them here.

(FWIW my thinking had been that it would not be ideal to conflate into one PR both (A) document-conformance changes that have no normative requirements for parser implementors and (B) parser changes that do have normative requirements for implementors who have implemented the error-reporting parts of the parsing algorithm).
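
A minimal sketch of why parse-error changes matter to error-reporting parsers but not to parsing results. It assumes the doctype rule is "name is `html`, no public identifier, and a system identifier that is either missing or `about:legacy-compat`"; that rule, the `DoctypeToken` class, and the message strings are illustrative assumptions, not text taken from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DoctypeToken:
    name: Optional[str]
    public_id: Optional[str] = None
    system_id: Optional[str] = None


def doctype_parse_errors(token: DoctypeToken) -> list[str]:
    """Collect parse errors for a DOCTYPE token under the assumed rule."""
    errors = []
    if token.name != "html":
        errors.append("unexpected doctype name")
    if token.public_id is not None:
        errors.append("public identifier present")
    if token.system_id not in (None, "about:legacy-compat"):
        errors.append("unexpected system identifier")
    return errors


def process_doctype(token: DoctypeToken) -> tuple[DoctypeToken, list[str]]:
    """Return the token unchanged plus its errors: errors are report-only."""
    return token, doctype_parse_errors(token)
```

An HTML4 Strict token comes back from `process_doctype` identical to what went in; only the accompanying error list differs, which is the part a checker or a View Source display would surface.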

@annevk

annevk commented Nov 16, 2016

Member

I tend to agree that we want to land conformance changes on both sides. The parser and syntax section ought to be updated together since they rely on each other to some extent. The specification would be inconsistent otherwise.

@sideshowbarker

Member Author

> change the HTML parser to emit more parse errors.

See 34c4d1b and lemme know if anything more beyond that needs changing in the parsing algorithm.

@sideshowbarker

Member Author

> Also, in XHTML you still need to use a DOCTYPE kinda like this for entities and we still don't have a replacement

See #2056, which removes the need for authors to keep putting obsolete XHTML1 doctypes in HTML documents that are served with XML MIME types.

Instead it just changes the spec to say:

> If the document element of a Document is in the HTML namespace, user agents should attempt to retrieve the URL given by this link (this URL is a DTD containing the entity declarations for the names listed in the named character references section), and should not attempt to retrieve any other external entity's content. [XML]

@domenic

domenic commented Nov 17, 2016

Member

Looks great, but can you or someone help work on a nice explanatory commit message for this? To avoid misunderstandings, I think we should stress exactly what this does and does not do, i.e. it removes the legacy XHTML and HTML 4 doctypes as conformant, so that only <!DOCTYPE html> and <!DOCTYPE html SYSTEM "about:legacy-compat"> are conformant. But it does not remove XHTML support or impact the browser processing model.

@annevk

annevk commented Nov 17, 2016

Member

Remove obsolete permitted DOCTYPEs

From now on conformance checkers can only allow <!doctype html> and <!doctype html SYSTEM "about:legacy-compat"> as doctypes in HTML syntax. The HTML4 and XHTML1 DOCTYPEs are no longer allowed.

(XHTML syntax continues to be supported and is not influenced by this change.)
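
As a rough illustration of what "only these two doctypes" means for a checker, here is a simplified sketch. It assumes the DOCTYPE keyword, `html`, and `SYSTEM` match case-insensitively while `about:legacy-compat` must match exactly inside matching quotes; `DOCTYPE_RE` and `is_conforming_doctype` are hypothetical names, and a real checker works on parser tokens rather than on a regular expression.

```python
import re

# ASCII whitespace as assumed for HTML syntax: space, tab, LF, FF, CR.
DOCTYPE_RE = re.compile(
    r"(?i:<!DOCTYPE)[ \t\n\f\r]+(?i:html)"
    # Optional legacy-compat form; \1 forces the closing quote to match.
    r"(?:[ \t\n\f\r]+(?i:SYSTEM)[ \t\n\f\r]+([\"'])about:legacy-compat\1)?"
    r"[ \t\n\f\r]*>"
)


def is_conforming_doctype(doctype: str) -> bool:
    """Return True if the string is one of the two allowed doctype forms."""
    return DOCTYPE_RE.fullmatch(doctype) is not None
```

Under these assumptions `<!doctype html>` and `<!DOCTYPE html SYSTEM "about:legacy-compat">` pass, while any doctype carrying a `PUBLIC` identifier, such as the HTML4 Strict or XHTML1 Strict ones, does not.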

@domenic domenic merged commit 31c20af into master Nov 17, 2016
@domenic domenic deleted the doctypes-html4-xhtml1-nonconforming branch November 17, 2016 15:22

4 participants