The preview of the new Python 3 port has broken HTML escaping in the XML feeds #582

berrange · 2024-10-24T13:28:20Z

I am using:
O.S: Fedora 40
Browser: Firefox 131.0.2
Platform: desktop

Problem

The preview of the new Python 3 port has broken HTML escaping in the XML feeds

e.g. try to view this in the browser:

and it will complain about undefined entities, due to having raw unescaped HTML in the XML document

By comparison the original Python 2 code escaped HTML in the feed

$ wget https://planetpython.org/rss10.xml
$ grep "content:encoded" rss10.xml | head -1
	<content:encoded>&lt;p&gt;As is probably apparent from the sequence of blog posts about the topic in the
$ wget https://planetpython.org/3/rss10.xml
$ grep "content:encoded" rss10.xml.1 | head -1
	<content:encoded><p>As is probably apparent from the sequence of blog posts about the topic in the
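To make the "undefined entities" failure concrete, here is a minimal sketch using only the Python standard library. The entry text and the simplified element names are made up for illustration (the real feed uses namespaced elements such as content:encoded); the point is only that HTML entities like &nbsp; are not defined in XML, so raw HTML breaks parsing while escaped HTML does not.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical entry body; not taken from the real feed.
entry_html = "<p>Tom&nbsp;&amp;&nbsp;Jerry</p>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Raw HTML dropped straight into the XML document (what the py3 port emits).
raw_doc = f"<item><encoded>{entry_html}</encoded></item>"
# HTML escaped first, so it becomes plain character data (py2 behaviour).
escaped_doc = f"<item><encoded>{escape(entry_html)}</encoded></item>"

print(is_well_formed(raw_doc))      # False: &nbsp; is an undefined entity
print(is_well_formed(escaped_doc))  # True
```

A feed reader that parses the escaped document gets the original HTML string back as the element text, which is what consumers of content:encoded expect.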

Details

This problem is caused by a mistake in the Python 3 conversion done in #577. Specifically, commit 86e31f9 replaced code patterns like:

feed[key] = sanitize.HTML(feed[key])

with

feed[key] = Markup(feed[key])

which is not functionally equivalent.

The sanitize.HTML method would parse the HTML and strip out various undesirable elements and attributes, and escaping was later performed by the template processor.

The Markup constructor does not parse anything; it just wraps the str in a Markup class, as a way to designate it as being safe to use as-is without further escaping. As a result, when you later try to escape the variable in Jinja using ... | e, it does nothing at all, so raw HTML is put into the XML document, which leads to the later parsing errors.
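The no-op can be modelled in a few lines. This is a stdlib-only sketch, not the real library: SafeStr and the escape function below are hypothetical stand-ins that mimic the __html__ convention Markup and the e filter rely on, where any object exposing __html__ is treated as already safe and passed through untouched.

```python
from html import escape as html_escape


class SafeStr(str):
    """Hypothetical stand-in for Markup: a str flagged as already safe."""

    def __html__(self) -> str:
        return self


def escape(value) -> str:
    """Stand-in for the template escape filter.

    Objects that declare themselves safe via __html__ are returned as-is;
    everything else is HTML-escaped.
    """
    if hasattr(value, "__html__"):
        return value.__html__()
    return html_escape(str(value))


raw = "<p>Fish &amp; chips</p>"

print(escape(raw))           # &lt;p&gt;Fish &amp;amp; chips&lt;/p&gt;
print(escape(SafeStr(raw)))  # <p>Fish &amp; chips</p>  (escaping skipped)
```

Wrapping a value in the safe type is therefore a promise that it has already been made safe, not a step that makes it safe.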

I think either the original sanitizer code needs to be reinstated and made to work with py3, or perhaps an external library such as https://github.com/matthiask/html-sanitizer/ could be leveraged?
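For illustration only, the two-stage behaviour described above (sanitize first, escape at output time) could look roughly like the following. This is a rough stdlib sketch under stated assumptions: it is neither the original sanitize.HTML code nor the html-sanitizer API, the allowlists are invented, and it does no URL-scheme checking, so it is not production-grade.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlists, chosen for the example only.
ALLOWED_TAGS = {"p", "a", "em", "strong", "ul", "ol", "li", "code", "pre"}
ALLOWED_ATTRS = {"a": {"href"}}
DROP_CONTENT = {"script", "style"}  # removed together with their content


class AllowlistSanitizer(HTMLParser):
    """Re-emit only allowlisted tags and attributes; drop everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            allowed = ALLOWED_ATTRS.get(tag, set())
            kept = "".join(
                f' {name}="{escape(value or "")}"'
                for name, value in attrs
                if name in allowed
            )
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def sanitize(html_text: str) -> str:
    """Stage 1: strip undesirable elements and attributes."""
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


dirty = ('<p onclick="x()">Hi <script>alert(1)</script>'
         '<a href="https://example.org/">link</a></p>')
clean = sanitize(dirty)
print(clean)                       # sanitized HTML, still markup
print(escape(clean, quote=False))  # stage 2: escaped for the XML feed
```

The important design point is that the sanitized result stays an ordinary str, so the template's later escaping still applies to it.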


hugovk · 2024-10-24T14:48:57Z

cc @offbyone

offbyone · 2024-10-24T15:02:10Z

Thanks! I'll try have a look at this on the weekend; work and life have eaten my brain.

(there are several issues with the Python 3 version at this time, including that it can't use the caching layer from the old version, and currently doesn't really have a working cache)

berrange · 2024-10-29T19:35:11Z

> Thanks! I'll try have a look at this on the weekend; work and life have eaten my brain.

FYI, we copied the py3 port changes into libvirt's planet code repo, which is how I discovered the mistake. For now, I've made the following changes to fix up the problems described: https://gitlab.com/libvirt/virttools-planet/-/merge_requests/7/diffs?commit_id=4b5e6df409bf4e56139e7acf8d2fc97b54f2bfaa It appeared to be sufficient to make the XML feeds well-formed, but I didn't examine the code too closely. Feel free to copy this solution back, or not, as suits your needs.
