Extract attributes for HTML elements #1515

Draft. Wants to merge 1 commit into main.

wbamberg

I don't know if this is viable really but thought it was worth asking. From a conversation in the MDN Discord I thought we could try to add attributes to the extracts for HTML elements.

This PR thinks a <dd> item under "Content attributes:" is an attribute if:

  • its first child is a <code> element
  • whose first child is an <a> element
  • whose href has a fragment beginning with "attr"

This is intended to be quite strict, as I thought it's better to exclude legit attributes than to include nonsense.
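
The three conditions above can be sketched roughly as follows. This is a minimal Python illustration, not the code from this PR (which is not shown in this thread): the function name `attribute_name`, the use of `xml.etree.ElementTree`, and the simplified `<dd>` markup are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def attribute_name(dd):
    """Return the attribute name if the <dd> passes the strict test, else None."""
    # Text before the first child element means <code> is not the first child.
    if (dd.text or "").strip():
        return None
    children = list(dd)
    if not children or children[0].tag != "code":
        return None
    code = children[0]
    # Same check one level down: <a> must be the first child of <code>.
    if (code.text or "").strip():
        return None
    links = list(code)
    if not links or links[0].tag != "a":
        return None
    # The fragment is the part of the href after "#".
    fragment = urlsplit(links[0].get("href", "")).fragment
    if not fragment.startswith("attr"):
        return None
    return "".join(links[0].itertext()).strip()


# Simplified, hypothetical markup modelled on the cases discussed in this PR.
accepted = ET.fromstring(
    '<dd><code><a href="#attr-link-href">href</a></code>: Address of the hyperlink</dd>'
)
ping = ET.fromstring('<dd><code><a href="#ping">ping</a></code>: URLs to ping</dd>')
li_value = ET.fromstring(
    "<dd>If the element is not a child of an <code>ul</code> or <code>menu</code> element: "
    '<code><a href="#attr-li-value">value</a></code>: Ordinal value of the list item</dd>'
)

print(attribute_name(accepted))  # href
print(attribute_name(ping))      # None: fragment is "ping", not "attr..."
print(attribute_name(li_value))  # None: <code> is not the first child
```

Because every condition must hold, anything with leading prose or an unexpected fragment is dropped rather than guessed at, which matches the "exclude rather than include nonsense" intent.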

Anyway, with this test I find the following items excluded (apart from "Global attributes", which they all have):


<link>

  • "Also, the title attribute has special semantics on this element: Title of the link; CSS style sheet set name\n "

<style>

  • "Also, the title attribute has special semantics on this element: CSS style sheet set name\n "

<body>

  • "onafterprint",
  • "onbeforeprint",
  • "onbeforeunload",
  • "onhashchange",
  • "onlanguagechange",
  • "onmessage",
  • "onmessageerror",
  • "onoffline",
  • "ononline",
  • "onpageswap",
  • "onpagehide",
  • "onpagereveal",
  • "onpageshow",
  • "onpopstate",
  • "onrejectionhandled",
  • "onstorage",
  • "onunhandledrejection",
  • "onunload"

<li>

  • "If the element is not a child of an ul or menu element: value — Ordinal value of the list item\n "

<a>

  • "ping — URLs to ping\n ",

<dfn>

  • "Also, the title attribute has special semantics on this element: Full term or expansion of abbreviation\n "

<abbr>

  • "Also, the title attribute has special semantics on this element: Full term or expansion of abbreviation\n "

<bdi>

  • "Also, the dir global attribute has special semantics on this element."

<bdo>

  • "Also, the dir global attribute has special semantics on this element."

<embed>

  • "Any other attribute that has no namespace (see prose)."

<area>

  • "ping — URLs to ping\n ",

<input>

  • "Also, the title attribute has special semantics on this element: Description of pattern (when used with pattern attribute)\n "

<fencedframe>

  • "allow — Permissions policy to be applied to the fencedframe's contents\n "

All the "Also, the XYZ attribute" items seem right to exclude: they're not adding a new attribute but qualifying an existing one. The event handler attributes on <body> seem right to exclude, unless there's some reason the spec lists these only for <body>.

That leaves us with a few that are a bit trickier:

  • <a> and <area> list ping, but the fragment ID does not start with "attr", it's just "ping". I'm not sure if there's a good reason for this or if it is fixable in the spec.
  • <fencedframe> lists allow but again its fragment does not start with "attr", it's "element-attrdef-fencedframe-allow".
  • <li> does not list the attribute first - perhaps it could be rewritten like "value - If the element is not a child of an ul or menu element, represents the ordinal value of the list item"?
  • <embed> has "Any other attribute that has no namespace (see prose)." I didn't really understand this, even after referring to the prose. Perhaps it would be OK to omit this? The MDN page for <embed> makes no mention of any extra attributes, either.

wbamberg changed the title from "Extract HTML attributes for each element" to "Extract attributes for HTML elements" on Mar 16, 2024
@tidoust
Member

tidoust commented Mar 21, 2024

Thanks for the code and exploration, @wbamberg!

Intuitively, I would have started with a dfn-based approach to extract the list of content attributes: extract all the <dfn> that have a data-dfn-type="element-attr", and look at the data-dfn-for attribute to associate the content attribute with the element(s) it is defined for. The data already exists in the dfns extracts but could be added to the elements extracts as well.

I think this approach would capture all the attributes that you manage to extract with your code, although as usual it's going to be interesting to compare the exact results of both approaches to reveal problems worth fixing in the specs ;)

The dfn-based approach would also help capture additional content attributes, some of them you mentioned in your "trickier" list:

  • The content attributes of <portal>, whose definition does not use an element definition table at all.
  • The ping attribute of <a> and <area>
  • The allow attribute of <fencedframe>
  • The value attribute of <li>

This wouldn't be perfect though. For example:

  • This would miss the content attributes of the <model> element because they are correctly defined as element-attr but don't have a data-dfn-for attribute. That should be fixed in the spec.
  • This would also miss the content attributes of MathML Core elements for the same reason.
  • This would not work for SVG-related specs that don't follow the usual dfn contract (SVG, SVG Animations, Filter Effects).
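
As a rough illustration of the dfn-based approach and of the gap it leaves, the sketch below groups definitions of type `element-attr` by the elements named in their `for` list, and separately reports definitions that have no `for` at all. It assumes entries shaped like those in the dfns extracts, with `linkingText`, `type`, and `for` fields; the sample entries and the function name `attributes_by_element` are made up for the example.

```python
from collections import defaultdict


def attributes_by_element(dfns):
    """Group element-attr definitions by the element(s) they are defined for.

    Returns (by_element, missing_for), where missing_for lists attribute
    names whose definition has no "for" and so cannot be attached to an element.
    """
    by_element = defaultdict(list)
    missing_for = []
    for dfn in dfns:
        if dfn.get("type") != "element-attr":
            continue
        names = dfn.get("linkingText") or []
        if not names:
            continue
        targets = dfn.get("for") or []
        if not targets:
            missing_for.append(names[0])
            continue
        for element in targets:
            if names[0] not in by_element[element]:
                by_element[element].append(names[0])
    return dict(by_element), missing_for


# Illustrative entries only, not copied from an actual extract.
sample_dfns = [
    {"linkingText": ["ping"], "type": "element-attr", "for": ["a", "area"]},
    {"linkingText": ["allow"], "type": "element-attr", "for": ["fencedframe"]},
    {"linkingText": ["value"], "type": "element-attr", "for": ["li"]},
    # element-attr without "for": cannot be tied to an element
    {"linkingText": ["src"], "type": "element-attr", "for": []},
    # not a content attribute definition: ignored
    {"linkingText": ["hyperlink"], "type": "dfn", "for": []},
]

by_element, missing_for = attributes_by_element(sample_dfns)
print(by_element)   # {'a': ['ping'], 'area': ['ping'], 'fencedframe': ['allow'], 'li': ['value']}
print(missing_for)  # ['src']
```

The `missing_for` list is the kind of output that could be turned into spec issues for definitions lacking a `data-dfn-for` attribute.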

> The event handler attributes on <body> seem right to exclude, unless there's some reason the spec lists these only for <body>.

As far as I can tell, the HTML spec lists them for <body> because they are specific to <body> (all other event handler content attributes are common to all elements, see Event handlers section in HTML), so I would argue they ought to be included. Interestingly, the HTML spec has one definition for these attributes that is both for the event handlers and the content attributes. The <dfn> defines an event handler in practice (data-dfn-type="attribute"), and not a content attribute. The dfn-based approach won't work for them...

In the end, to create reliable data, I think we'll have to combine the approaches (as already done for the elements extraction itself) and improve the dfn markup in a couple of specs.

I'm not quite sure what to do with global attributes. In HTML, they come with a data-dfn-for="html-global" which does not target an actual element. It would make sense to include them in the elements extracts somehow, perhaps with a flag that explains that they are not specific to the element?

@wbamberg
Author

Thank you @tidoust for such a helpful response.

> > The event handler attributes on <body> seem right to exclude, unless there's some reason the spec lists these only for <body>.
>
> As far as I can tell, the HTML spec lists them for <body> because they are specific to <body> (all other event handler content attributes are common to all elements, see Event handlers section in HTML), so I would argue they ought to be included.

Fair enough, and I see that MDN already lists these things as <body> attributes: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/body#onafterprint.

> Interestingly, the HTML spec has one definition for these attributes that is both for the event handlers and the content attributes. The <dfn> defines an event handler in practice (data-dfn-type="attribute"), and not a content attribute. The dfn-based approach won't work for them...
>
> In the end, to create reliable data, I think we'll have to combine the approaches (as already done for the elements extraction itself) and improve the dfn markup in a couple of specs.

By "combine the approaches" do you mean consider a thing an attribute if it meets either the test in this PR or your dfn-based approach?

I'm happy to have a go at that in this PR if you like.

> I'm not quite sure what to do with global attributes. In HTML, they come with a data-dfn-for="html-global" which does not target an actual element. It would make sense to include them in the elements extracts somehow, perhaps with a flag that explains that they are not specific to the element?

So a follow-up for me was: as a consumer of this data, I would prefer to distinguish between HTML, MathML, and SVG, rather than just have "elements". For instance, MDN makes this distinction. Do you think it would be desirable to have webref/html as well as webref/elements? And if so, it feels to me like including all the globals in each element would be unnecessary, and we could instead have a separate globalAttributes property, and it can just be understood by consumers that by definition all HTML elements include the global attributes.
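
One possible shape for this, purely as a sketch: keep the global attributes in a single list next to the per-element data, and let consumers merge the two when they need the full set. The class and field names below (`ElementsExtract`, `global_attributes`, `elements`) and the sample values are hypothetical and not part of Webref.

```python
from dataclasses import dataclass, field


@dataclass
class ElementsExtract:
    # Attributes understood to apply to every element in the extract.
    global_attributes: list = field(default_factory=list)
    # Element name -> attributes specific to that element.
    elements: dict = field(default_factory=dict)

    def attributes_for(self, element):
        """Element-specific attributes first, then globals not already listed."""
        if element not in self.elements:
            raise KeyError(element)
        own = self.elements[element]
        return own + [name for name in self.global_attributes if name not in own]


# Hypothetical data for illustration.
html = ElementsExtract(
    global_attributes=["class", "id", "title"],
    elements={"a": ["href", "ping"], "bdo": []},
)

print(html.attributes_for("a"))    # ['href', 'ping', 'class', 'id', 'title']
print(html.attributes_for("bdo"))  # ['class', 'id', 'title']
```

Storing the globals once keeps each element entry small, at the cost of requiring consumers to know that the merge is implied.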

@tidoust
Member

tidoust commented Mar 22, 2024

> By "combine the approaches" do you mean consider a thing an attribute if it meets either the test in this PR or your dfn-based approach?

Yes. I would start with the dfn-based approach because content attributes ought to follow this convention, and fall back to parsing the table structure only when that fails (or to get the onxxx attributes for <body>). Note that we may also handle exceptions through a "patch", as already done to capture obsolete elements in HTML.
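
A minimal sketch of that combination, assuming each approach yields a mapping from element name to a list of attribute names (the function name `combine` and the sample data are illustrative only). An attribute is kept if either approach finds it, with dfn-based results listed first. The patch step is left out.

```python
def combine(dfn_based, table_based):
    """Union of both result sets per element, dfn-based names first."""
    combined = {}
    for element in list(dfn_based) + list(table_based):
        names = combined.setdefault(element, [])
        for name in dfn_based.get(element, []) + table_based.get(element, []):
            if name not in names:
                names.append(name)
    return combined


# Hypothetical inputs. The <body> entry assumes a table parser that has been
# loosened enough to pick up the onxxx attributes.
dfn_based = {"a": ["href", "ping"], "li": ["value"]}
table_based = {"a": ["href"], "body": ["onafterprint"]}

print(combine(dfn_based, table_based))
# {'a': ['href', 'ping'], 'li': ['value'], 'body': ['onafterprint']}
```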

> I'm happy to have a go at that in this PR if you like.

Go go go :)

> Do you think it would be desirable to have webref/html as well as webref/elements?

You mean: split the package into multiple packages? I would prefer to avoid that because we manually review the packages before publication (data is curated through the patches I mentioned and we sometimes detect things that look weird at that step). Splitting the data wouldn't create more review work, but would introduce additional manual steps to release the packages. That said, the data in webref/elements is already per spec. Or are you looking at a specific grouping that the current data does not capture? (e.g. <model> is in its own spec but is an HTML element in practice).

It seems fine to break the structure of elements extracts if there's no good place to put the global attributes and/or additional info if needed.

@tidoust
Member

tidoust commented Sep 5, 2024

I recently bumped into a discussion in the issue tracker of the Pulsar text editor (which already uses Webref packages) that mentions the need for an easy-to-use list of elements with their attributes, for autocomplete purposes: pulsar-edit/pulsar#393 (comment)

(As noted above, the data already exists in the dfns extracts of Webref, but it is not part of an npm package, which makes it harder to integrate, and a few exceptions need to be accounted for.)

@dontcallmedom
Member

re pulsar, their autocomplete list provides us with a comparison point (for completeness verification) and a possible data model.
