Extract attributes for HTML elements #1515
base: main
Thanks for the code and exploration, @wbamberg! Intuitively, I would have started with a dfn-based approach to extract the list of content attributes: extract all the I think this approach would capture all the attributes that you manage to extract with your code, although as usual it's going to be interesting to compare the exact results of both approaches to reveal problems worth fixing in the specs ;) The dfn-based approach would also help capture additional content attributes, some of them you mentioned in your "trickier" list:
This wouldn't be perfect though. For example:
As far as I can tell, the HTML spec lists them for
In the end, to create reliable data, I think we'll have to combine the approaches (as already done for the elements extraction itself) and improve the dfn markup in a couple of specs.
I'm not quite sure what to do with global attributes. In HTML, they come with a
Thank you @tidoust for such a helpful response.
Fair enough, and I see that MDN already lists these things as
By "combine the approaches" do you mean consider a thing an attribute if it meets either the test in this PR or your dfn-based approach? I'm happy to have a go at that in this PR if you like.
So a follow-up for me was: as a consumer of this data, I would prefer to distinguish between HTML, MathML, and SVG, rather than just have "elements". For instance, MDN makes this distinction. Do you think it would be desirable to have webref/html as well as webref/elements? And if so, it feels to me like including all the globals in each element would be unnecessary, and we could instead have a separate
Yes. I would start with the dfn-based approach because content attributes ought to follow this convention, and fall back to parsing the table structure only when that fails (or to get the
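A minimal sketch of that "dfn first, table as fallback" combination, in Python for illustration only. It assumes dfn entries shaped like the Webref dfns extracts (`linkingText`, `type`, `for`); the sample entries, the function names, and the table-derived data are all hypothetical.

```python
from collections import defaultdict

# Hypothetical dfn entries, hand-written for this sketch.
# Assumption: content attributes carry type "element-attr" and list
# the elements they apply to in "for".
SAMPLE_DFNS = [
    {"linkingText": ["href"], "type": "element-attr", "for": ["a"]},
    {"linkingText": ["ping"], "type": "element-attr", "for": ["a", "area"]},
    {"linkingText": ["HTMLAnchorElement"], "type": "interface", "for": []},
]


def attrs_from_dfns(dfns):
    """Group attribute names by element, keeping only element-attr dfns."""
    by_element = defaultdict(set)
    for dfn in dfns:
        if dfn.get("type") != "element-attr":
            continue
        for element in dfn.get("for", []):
            by_element[element].update(dfn.get("linkingText", []))
    return by_element


def combine(dfn_attrs, table_attrs):
    """Prefer dfn-based attributes; use table-derived ones only when
    the dfn-based approach found nothing for an element."""
    combined = {}
    for element in sorted(set(dfn_attrs) | set(table_attrs)):
        from_dfns = dfn_attrs.get(element)
        if from_dfns:
            combined[element] = sorted(from_dfns)
        else:
            combined[element] = sorted(table_attrs.get(element, []))
    return combined


# Hypothetical result of parsing the "Content attributes:" tables.
table_attrs = {"a": {"href"}, "embed": {"src", "type"}}
result = combine(attrs_from_dfns(SAMPLE_DFNS), table_attrs)
print(result)
```

Comparing `attrs_from_dfns(...)` with `table_attrs` element by element before combining them would also surface the discrepancies between the two approaches.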
Go go go :)
You mean: split the package into multiple packages? I would prefer to avoid that because we manually review the packages before publication (data is curated through the patches I mentioned and we sometimes detect things that look weird at that step). Splitting the data wouldn't create more review work, but would introduce additional manual steps to release the packages. That said, the data in webref/elements is already per spec. Or are you looking at a specific grouping that the current data does not capture? (e.g.
It seems fine to break the structure of elements extracts if there's no good place to put the global attributes and/or additional info if needed.
As I bumped into it recently, I note a discussion in the issue tracker of the Pulsar text editor (which already leverages Webref packages) that mentions the need for an easy-to-use list of elements with attributes, for autocomplete purposes: pulsar-edit/pulsar#393 (comment). (As said above, the data already exists in the dfns extracts of Webref, but that data is not part of an npm package, so it is harder to integrate, and there are a few exceptions that need to be accounted for.)
Re Pulsar: their autocomplete list provides us with a comparison point (for completeness verification) and a possible data model.
I don't know if this is viable really but thought it was worth asking. From a conversation in the MDN Discord I thought we could try to add attributes to the extracts for HTML elements.
This PR thinks a <dd> item under "Content attributes:" is an attribute if it starts with a <code> element that contains an <a> element whose fragment ID starts with "attr".
This is intended to be quite strict, as I thought it's better to exclude legit attributes than to include nonsense.
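The strict test can be sketched as follows. This is an illustrative Python version, not the code in this PR: the function name is hypothetical, the `<dd>` markup is simplified and hand-written, and the exact conditions (item starts with `<code>`, which starts with `<a>`, whose fragment starts with "attr") are my reading of the description above.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlparse


def attribute_name(dd_markup: str) -> Optional[str]:
    """Return the attribute name if the <dd> passes the strict test, else None."""
    dd = ET.fromstring(dd_markup)
    # The item must start with an element, not with text.
    if (dd.text or "").strip() or len(dd) == 0:
        return None
    code = dd[0]
    # That first element must be <code>, itself starting with an element.
    if code.tag != "code" or (code.text or "").strip() or len(code) == 0:
        return None
    link = code[0]
    if link.tag != "a":
        return None
    # The link target's fragment must start with "attr".
    fragment = urlparse(link.get("href", "")).fragment
    if not fragment.startswith("attr"):
        return None
    return (link.text or "").strip() or None


# Simplified, hand-written samples (assumptions, not copied from the spec).
accepted = '<dd><code><a href="#attr-link-href">href</a></code> - Address of the hyperlink</dd>'
bad_fragment = '<dd><code><a href="#ping">ping</a></code> - URLs to ping</dd>'
not_first = '<dd>If the element is not a child of an ul or menu element: <code><a href="#attr-li-value">value</a></code></dd>'
global_attrs = '<dd><a href="#global-attributes">Global attributes</a></dd>'

for markup in (accepted, bad_fragment, not_first, global_attrs):
    print(attribute_name(markup))
```

Each rejection path returns `None` rather than raising, which matches the intent of excluding doubtful items instead of guessing.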
Anyway, with this test I find the following items excluded (apart from "Global attributes", which they all have):
<link>
<style>
<body>
<li>
<a>
<dfn>
<abbr>
<bdi>
<bdo>
<embed>
<area>
<input>
<fencedframe>
All the "Also, the XYZ attribute" seem right to exclude: they're not adding a new attribute but qualifying an existing one. The event handler attributes on <body> seem right to exclude, unless there's some reason the spec lists these only for <body>.
That leaves us with a few that are a bit trickier:
<a> and <area> list ping, but the fragment ID does not start with "attr", it's just "ping". I'm not sure if there's a good reason for this or if it is fixable in the spec.
<fencedframe> lists allow but again its fragment does not start with "attr", it's "element-attrdef-fencedframe-allow".
<li> does not list the attribute first - perhaps it could be rewritten like "value - If the element is not a child of an ul or menu element, represents the ordinal value of the list item"?
<embed> has "Any other attribute that has no namespace (see prose)." I didn't really understand this, even after referring to the prose. Perhaps it would be OK to omit this? The MDN page for <embed> makes no mention of any extra attributes, either.
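To make the fragment problem concrete, here is a small Python sketch that runs the two fragment IDs quoted above through the strict prefix check, next to a purely hypothetical relaxed check that also accepts an "element-attrdef-" prefix. The relaxed rule is an assumption for illustration, not something this PR implements; "attr-link-href" is included only as an example of a fragment that passes.

```python
def passes_strict_test(fragment: str) -> bool:
    """The check described in this PR: the fragment must start with "attr"."""
    return fragment.startswith("attr")


def passes_relaxed_test(fragment: str) -> bool:
    """Hypothetical relaxation: also accept "element-attrdef-" fragments."""
    return fragment.startswith(("attr", "element-attrdef-"))


fragments = ["attr-link-href", "ping", "element-attrdef-fencedframe-allow"]
report = {f: (passes_strict_test(f), passes_relaxed_test(f)) for f in fragments}
for fragment, (strict, relaxed) in report.items():
    print(f"{fragment}: strict={strict}, relaxed={relaxed}")
```

Even the relaxed check does not catch "ping", so that case would need either a spec fix or another source of information.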