Here’s an example of unescaped & characters in an A HREF tag attribute.
http://jsfiddle.net/32zbogfw/
It’s working fine.
I know it might break XML and possibly XHTML but who uses that still?
And I know an unescaped & in a href shows as red in the View Source color highlighting.
What can go wrong? Why is it important? Perhaps it mattered in 2009, but is that no longer the case?
This all started because I was reviewing some code that uses Python's urllib.urlencode(...)
and inserts the results into a Django template with href="{{ result_of_that_urlencode }}"
which would mean you get un-escaped & characters and then I tried to find how and why that is bad but couldn't find any examples of it.
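To make the situation concrete, here is a minimal sketch in Python 3, where `urllib.urlencode` lives at `urllib.parse.urlencode`. The parameter names are made up for illustration. It shows the bare & that `urlencode` produces and what the HTML-escaped form looks like. Note that Django's template autoescaping, which is on by default, applies the same escaping to `{{ ... }}` values unless they are marked safe.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical query parameters, not taken from the code under review.
params = {"page": 2, "sort": "name"}

query = urlencode(params)          # pairs are joined with a bare "&"
escaped_query = escape(query)      # "&" becomes "&amp;" for use inside HTML

raw_link = f'<a href="?{query}">next</a>'
escaped_link = f'<a href="?{escaped_query}">next</a>'

print(raw_link)
print(escaped_link)
```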
Comments
If I make a blog post that has the following url: http://www.example.com/checkoutmyguitar&they'reawesome/
I NEED to escape the & as &amp; or else the & will get processed as the start of an entity.
Invalid entities will get ignored, which is what you're seeing. It's the edge cases that are the concern. I think.
But in that case, the & is in the pathname part of the URL. E.g. http://jsfiddle.net/c5b5L4w1/
So not a problem.
The issue is that browsers will close whatever they consider incomplete entity references automatically. I don't know the specific algorithm but href="?foo=1&quot" still causes Firefox to add a quotation mark to the end of the URL - and that's what you get instead of a parameter named "quot". Now this doesn't happen for parameters that actually have a value but I wouldn't be so sure about browsers other than Firefox.
It really depends on what you have in that attribute.
If you have href="?something&=whatever" you run into a problem if you don't escape the '&'.
If you have href="?something& whatever" you also run into a problem.
Or if you have href="?something&,something" for that matter.
So if you know for a fact that the thing after your maybe-entity-name is an equals char, you're probably OK. Otherwise, likely not.
So as long as I always bundle the key and value with a = in between I'm safe.
Not if the unquoted thing is in the value. "?something=&&" behaves identically to "?something=&&".
Not to mention the fact that, of course, the unquoted '&' will terminate the key-value pair.
I don't think I've ever heard of such a thing... escaping ampersands in tag attributes. I mean, I see what you mean about view-source highlighting them as invalid, but I've never written them that way (unless using an XML-based generation tool), nor seen any framework (JSF, etc) that ever renders them that way...
For me it's the opposite. I've always been über careful turning & into &amp; in attributes' values. This is because we used to be so strict when XHTML was all the rage.
Now I've stopped to think: is it still important at all?
There was a time when Gecko used to allow HTML entities without the trailing semicolon. (I don't know what the current parsing rules are here.) That meant that if you had a form parameter named e.g. "macroname" then tried to use it in a hardcoded link e.g. "update.php?action=delete&macro=test" the &macr would get interpreted as a ¯ character.
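That legacy decoding can be reproduced, as a rough sketch, with Python's `html.unescape`. It applies the rules for ordinary text rather than the special rule for attributes, so it behaves like the old parsers described here: a known entity name is decoded even without its semicolon. The URLs are illustrative only.

```python
from html import unescape

# "&macr" is a known legacy entity name, so it is decoded even though the
# author meant a query parameter whose name merely starts with "macr".
legacy = unescape("update.php?action=delete&macro=test")

# "&quot" at the end of the value turns into a literal quotation mark.
quot = unescape("?foo=1&quot")

# "&b" does not start any entity name, so this value is left untouched.
plain = unescape("?a=1&b=2")

print(legacy)
print(quot)
print(plain)
```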
"There was a time".
I'm guessing that goes way back. Even before people switched from HTML4 to XHTML doctypes.
https://html.spec.whatwg.org/multipage/syntax.html#consume-a-character-reference
If the character reference is being consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next character is either a U+003D EQUALS SIGN character (=) or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a U+003D EQUALS SIGN character (=), then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
--
Basically, your link throws a parse error only because of the equals sign that follows. I did some more testing ( http://jsfiddle.net/8b1h3bqw/ ) and noted that Firefox seems to ignore the rules about ampersands in attributes, showing a warning even in the valid case. But then, that's also only in view source, which for some reason, I cannot access via the developer tools. Chrome doesn't report any parse errors anywhere as far as I can tell.
Anyways, it's important because of those legacy user agents only, and then, only if your parameter has the same name as an HTML entity character reference. In all other cases, there's no problem, probably.
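The attribute rule quoted from the spec can be sketched in a few lines of Python. This is a simplified illustration, not a full implementation of the parsing algorithm: the function name `decode_attribute_value` is made up, numeric references are ignored, and parse errors are not reported. It relies only on the standard `html.entities.html5` table.

```python
import re
from html.entities import html5  # maps names such as "amp;" and legacy "amp"

# An ampersand followed by an ASCII name and an optional semicolon.
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*;?)")


def decode_attribute_value(value: str) -> str:
    """Decode named references the way the quoted attribute rule describes."""

    def replace(match):
        name = match.group(1)
        # Try the longest known entity name first.
        for end in range(len(name), 0, -1):
            candidate = name[:end]
            if candidate not in html5:
                continue
            rest = name[end:]
            if not candidate.endswith(";"):
                # Character right after the matched name, if there is one.
                following = rest[:1] or value[match.end():match.end() + 1]
                if following == "=" or (following.isascii() and following.isalnum()):
                    return match.group(0)  # historical exception: leave as-is
            return html5[candidate] + rest
        return match.group(0)  # not an entity name at all

    return _NAMED_REF.sub(replace, value)


for sample in ("?foo=1&quot", "?action=delete&macro=test", "?x&amp,y"):
    print(sample, "->", decode_attribute_value(sample))
```

Under this rule a bare & followed by `name=` survives, while an entity name followed by anything other than `=` or an ASCII alphanumeric is still decoded.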
Awesome. Just like others have mentioned in comments here; this means that as long as you follow with a = you're fine.
As your example points out; the really big risk is the example of `href="&amp"` where you might hope that the server is going to pick that up as a {'amp': ''} or something. It won't. Instead you'd get nothing from the query string.
It would if it was `href="?ampsomething"` then you could get {'ampsomething': ''}
(NB: different servers accept or simply reject CGI params without a =)
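As one illustration of that server-side difference, Python's `urllib.parse.parse_qs` drops parameters without a `=` by default and keeps them only when `keep_blank_values=True` is passed. Other servers and frameworks may behave differently; this only shows one parser.

```python
from urllib.parse import parse_qs

# A query string that is nothing but "&" contains no parameters at all.
bare_amp = parse_qs("&", keep_blank_values=True)

# A name without "=" is kept with an empty value only if asked for...
no_equals_kept = parse_qs("ampsomething", keep_blank_values=True)

# ...and silently dropped by default.
no_equals_dropped = parse_qs("ampsomething")

print(bare_amp, no_equals_kept, no_equals_dropped)
```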
You should not URL-encode URLs before inserting them into a href attribute: actually, if you URL-encode them they'll likely break.
But you must HTML-escape them, which is what & turned into &amp; is about. Django templates may be configured to do it automatically anyway, see https://docs.djangoproject.com/en/dev/ref/templates/builtins/
If you don't HTML-escape URLs and other variables before merging them in your HTML (especially if they ultimately come from user input) you risk making your website vulnerable to cross-site scripting (XSS).
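A small sketch of that XSS risk, using Python's standard `html.escape`. The hostile value is made up, and the example only covers the HTML-escaping layer: a real query value would normally be percent-encoded as well.

```python
from html import escape

# Hypothetical hostile input that tries to break out of the attribute.
user_value = '"><script>alert(1)</script>'

# Merged as-is, the quote closes href and the script tag becomes markup.
unsafe = f'<a href="?q={user_value}">search</a>'

# HTML-escaped, the same characters stay inside the attribute value.
safe = f'<a href="?q={escape(user_value, quote=True)}">search</a>'

print(unsafe)
print(safe)
```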
P.S.: why in the hell does this blog require JavaScript to be enabled, for extra 3rd party sources too, in order to protect your comment form against CSRF? :(