[conspire] URL sanity

Michael Paoli Michael.Paoli at cal.berkeley.edu
Fri Jul 3 21:19:33 PDT 2020


> From: "paulz at ieee.org" <paulz at ieee.org>
> Subject: [conspire] URL sanity
> Date: Fri, 3 Jul 2020 17:44:28 +0000 (UTC)

> I received an email which included the following line:
>
> listed here: http://ncpa.n0ary.org/ncpabandplan.html.  Please have a look.
> I clicked on the link and got error 404.
(thwap the sender on the nose with a rolled up newspaper?)

Once upon a time (no later than sometime in 1995), the convention
for disambiguating URLs, and where they start and end within text such
as email, was to enclose them between "<URL:" and ">", e.g.:
<URL:https://tools.ietf.org/html/rfc1866>
That was the case at least by the time RFC 1866, the specification
for HTML 2.0, was published.  I thought I read that within
that RFC itself ... perhaps it is in there, but I'm not easily spotting
exactly where at present.  Regardless, one can clearly see it
used that way in numerous places within RFC 1866 itself.
Somewhere between then and RFC 2854, the practice changed to:
<https://tools.ietf.org/html/rfc2854> - as can well be seen
numerous times throughout that RFC - that was 2000.  As far as
I'm aware, that's still the common practice and standard or,
if not standard, de facto standard.
Even if we look at the most recent of RFCs, we see that, in 2020-06,
this is still the (de facto?) standard:
<https://www.rfc-editor.org/rfc/rfc8820.txt>

So, if one uses the form:
<URL:https://tools.ietf.org/html/rfc1866>
(obsoleted, but still good for backwards compatibility) or:
<https://www.rfc-editor.org/rfc/rfc8820.txt>
Then all should be fine.  E.g. in most any reasonably well behaved
client, where it attempts to automagically link, these should
all work just fine (and especially the newer forms):
...<URL:https://tools.ietf.org/html/rfc1866>...
xxx<URL:https://tools.ietf.org/html/rfc1866>xxx
...<https://www.rfc-editor.org/rfc/rfc8820.txt>...
xxx<https://www.rfc-editor.org/rfc/rfc8820.txt>xxx
If one has a client that tries to automagically link URLs,
especially https/http ones (typically the most common among them),
and it can't handle those - especially the newer form thereof -
I'd say one has a seriously drain bamaged client.
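
A minimal sketch of how a client might recognize only the two delimited
forms above.  The pattern name BRACKETED_URL and the function
extract_bracketed_urls are made up for illustration, and the pattern
assumes http/https URLs containing no whitespace or angle brackets; it
is not taken from any particular mail client.

```python
import re

# Matches the older <URL:...> form and the newer <...> form.
# Assumption: http/https only, no whitespace or <> inside the URL.
BRACKETED_URL = re.compile(r"<(?:URL:)?(https?://[^\s<>]+)>")

def extract_bracketed_urls(text):
    """Return every URL found between <...> or <URL:...> in text."""
    return BRACKETED_URL.findall(text)

samples = [
    "...<URL:https://tools.ietf.org/html/rfc1866>...",
    "xxx<URL:https://tools.ietf.org/html/rfc1866>xxx",
    "...<https://www.rfc-editor.org/rfc/rfc8820.txt>...",
    "xxx<https://www.rfc-editor.org/rfc/rfc8820.txt>xxx",
]
for sample in samples:
    # Each sample yields exactly one URL, with no surrounding text attached.
    print(extract_bracketed_urls(sample))
```

Because the delimiters mark both ends, the surrounding dots or x's never
leak into the extracted URL, and nothing outside <> gets linked at all.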

Now, for better and/or worse, lots of clients will attempt to
automagically link what might appear (at least to the client)
to be hyperlinks - and even when folks (typically, egad, dang
humans) fail to use proper forms as noted above,
to disambiguate when and where the URLs start and
end within text.  So, results will tend to vary, and this can,
when using non-standard forms, result in anything from
inconveniences, inconsistencies, and hazards, to great
embarrassments and security problems.  (Can't trust the dang
humans ... ugh).  So, many clients quite aggressively try to
hyperlink stuff.  E.g. some, if anyone types a mere www or
www., will hyperlink that - regardless of context (or nearly
so) - and will try to guess where the URL ends.  Likewise for
any occurrence of http or https, especially when such
is followed by :, or perhaps by that and //.
Many will also aggressively hyperlink when they see anything that
looks like it ends in a TLD - and there are a huge and increasing number
of those - and one can't exactly predict what TLD will come into
existence in the future.
<https://tld-list.com/tlds-from-a-z>
So, with such aggressive hyperlinking, much text that was never intended
to be hyperlinked will be hyperlinked by such clients ... or even
if not today, text of the past may become hyperlinked in the future, as yet
more TLDs are added.
So, you know,
www.test
http://www.test/
https://www.test/
www.test
Many clients will aggressively hyperlink stuff like that - even
when those can never be valid Internet URLs/domains.  It might make
sense for the client to hyperlink them (even though they'd still
not be valid) were they properly enclosed and formatted, e.g.:
<https://www.test/>
but some clients go hog wild and will automagically hyperlink
anything that looks like it has a . followed by any
valid TLD ... and there are zillions of them now (well, not quite
that many ... yet, ... since TLDs were basically busted wide open
to most anyone that can put up the cash) ... where that TLD is followed by
something that wouldn't be part of a TLD, such as whitespace or
a punctuation mark.
So, this can cause, e.g. embarrassments, where someone tweets out
something that happens to have ... well ...
Rudy Giuliani tweets text containing:
for G-20.In July
Note no whitespace before the "In".
So, what does Twitter do?
Hyperlinks it:
<https://G-20.In>
And then what does someone do to take advantage of that?
They purchase, and put to immediate use, the domain:
g-20.in.
And turn a hyperlink to nowhere into a fully functional link in
the tweet ... "oops" ... not at all what the tweet author intended.
So, yeah, overzealous hyperlinking by clients can be problematic.
If clients only hyperlinked that bracketed within <> (or even <URL:>),
there would be far fewer cases (like almost none, save for typos and
link rot and the like) of things being hyperlinked where that was
not at all intended or desired.  But, oh my gosh, for some reasons(?)
a bunch 'o humans find that inconvenient, and just want bloody dang
near everything hyperlinked ... except when they don't, without
any good standard as to when or when not to have things automagically
hyperlinked.  So, yeah, things won't always go as desired.  "Oops".
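
To make the difference concrete, here is a rough sketch comparing an
overzealous linkifier with one that only honors <> and <URL:>.  The
tiny KNOWN_TLDS list, both patterns, and the function names are
assumptions invented for this illustration; real clients use much
larger TLD lists and their own (varying) rules.

```python
import re

# Hypothetical, deliberately tiny TLD list; a real client would carry far more.
KNOWN_TLDS = ("com", "org", "net", "in")

# Overzealous: any run of dot-separated labels ending in a known TLD,
# with no scheme and no delimiters required.
AGGRESSIVE = re.compile(
    r"\b((?:[A-Za-z0-9-]+\.)+(?:" + "|".join(KNOWN_TLDS) + r"))\b",
    re.IGNORECASE,
)

# Conservative: only what the author explicitly put within <> or <URL:>.
BRACKETED = re.compile(r"<(?:URL:)?([^\s<>]+)>")

def aggressive_links(text):
    """Return everything the overzealous pattern would turn into a link."""
    return [m.group(1) for m in AGGRESSIVE.finditer(text)]

def bracketed_links(text):
    """Return only the explicitly delimited links."""
    return BRACKETED.findall(text)

tweet = "for G-20.In July"          # a sentence boundary missing its space
print(aggressive_links(tweet))      # ['G-20.In']
print(bracketed_links(tweet))       # []
```

The conservative version links nothing here, since the author never
asked for a link; the aggressive one manufactures one out of a typo.
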
I often wish there was a sending option that would specify to clients
how to (not) hyperlink, e.g.:
X-hyperlink: <>, <URL:>
X-hyperlink: go batsh*t crazy and link everything
X-hyperlink: RFC-nnnn #reasonably sane compromise, +- options: ...
But I don't think there's such a standard ... yet, ... nor clients that
would know about and follow such.  Sure, with some clients, the user might
be able to customize that a bit ... but even where that's the case, it
doesn't necessarily follow the intent of the sender/author at all.

So, behavior varies wildly.

E.g. if a proper URL is immediately followed by ) then whitespace, is
the ) included?  A literal ) is legal in a URL, but to be unambiguously
part of it, it's best encoded.  Okay, if it's encoded, include that in the URL.
But what about other non-alphanum characters? ... Some should be
encoded, but others needn't be, and are commonly used for punctuation
or the like - should those be included in the URL or delimit the URL?
Practices vary, e.g.:
https://en.wikipedia.org/wiki/Parenthesis_(rhetoric)
https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric)
https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29
https://en.wikipedia.org/wiki/Parenthesis_(rhetoric).
https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric).
https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29.
https://en.wikipedia.org/wiki/Parenthesis_(rhetoric)!
https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric)!
https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29!
https://en.wikipedia.org/wiki/Parenthesis_(rhetoric);
https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric);
https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29;

URL: https://en.wikipedia.org/wiki/Parenthesis_(rhetoric).  And then ...
URL: https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric).  And then ...
URL: https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29.  And then ...
URL: https://en.wikipedia.org/wiki/Parenthesis_(rhetoric)!  And then ...
URL: https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric)!  And then ...
URL: https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29!  And then ...
URL: https://en.wikipedia.org/wiki/Parenthesis_(rhetoric);  And then ...
URL: https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric);  And then ...
URL: https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29;  And then ...

URL:https://en.wikipedia.org/wiki/Parenthesis_(rhetoric).And then ...
URL:https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric).And then ...
URL:https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29.And then ...
URL:https://en.wikipedia.org/wiki/Parenthesis_(rhetoric)!And then ...
URL:https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric)!And then ...
URL:https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29!And then ...
URL:https://en.wikipedia.org/wiki/Parenthesis_(rhetoric);And then ...
URL:https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric);And then ...
URL:https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29;And then ...
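
As a rough illustration of why results vary on the cases above, here
are three plausible guessing strategies for where a bare URL ends.
None is taken from a real client; the function names and the set of
"trailing punctuation" characters are assumptions for this sketch.

```python
import re

BARE_URL = re.compile(r"https?://\S+")

def greedy(text):
    """Take everything from the scheme up to the next whitespace."""
    m = BARE_URL.search(text)
    return m.group(0) if m else None

def strip_trailing_punct(text):
    """Greedy match, then drop any trailing . ! ; , ) characters."""
    url = greedy(text)
    return url.rstrip(".!;,)") if url else None

def balance_parens(text):
    """Greedy match, then drop trailing punctuation, but keep a final ')'
    when the URL has at least as many '(' as ')'."""
    url = greedy(text)
    if url is None:
        return None
    while url and url[-1] in ".!;,)":
        if url[-1] == ")" and url.count("(") >= url.count(")"):
            break
        url = url[:-1]
    return url

spaced = "URL: https://en.wikipedia.org/wiki/Parenthesis_(rhetoric).  And then ..."
print(greedy(spaced))                # ends in "(rhetoric)."  (swallows the period)
print(strip_trailing_punct(spaced))  # ends in "(rhetoric"    (loses the parenthesis)
print(balance_parens(spaced))        # ends in "(rhetoric)"

jammed = "URL:https://en.wikipedia.org/wiki/Parenthesis_(rhetoric).And then ..."
print(balance_parens(jammed))        # ends in "(rhetoric).And"
```

Three guesses, three different links from the same text, and with no
whitespace after the period all three swallow the "And".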

So, the moral of the story is:
disambiguate your dang URLs (and properly encode them):
<https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29>
(well, () don't have to be encoded, but to defend against stupid
clients ...)
and short of putting them within <>,
at least delimiting by whitespace will sort-of-kind-of-mostly work,
with most clients.  But beware too that, especially without
<>, the client may take characters intended to be part of the URL
(especially non-alphanum) as delimiters and truncate the URL
prematurely.  Or conversely, it may take characters not intended to be
part of the URL as part of the URL.  E.g. these
non-alphanums don't require encoding:
$-_.+!*'(),
However, it would be prudent to encode them, to defend against
drain bamaged clients, when they're intended to be part of
the URL (and most especially within the path portion or at the end,
where they might otherwise get interpreted as delimiting the
URL).  Likewise, other characters which are or may be reserved
in the URL scheme:
;/?:@=&
... uhm, good luck, especially within or ending the path portion.
Enclosing within <> is much more likely to work as expected,
when those characters are to be literally in or ending the
path portion.
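
One way to do the defensive encoding and bracketing together, sketched
with Python's standard urllib.parse.  The helper name defensive_url is
made up here, and the choice to leave "/" and "%" alone is an
assumption: it keeps path separators and avoids double-encoding URLs
that already contain %XX escapes, at the cost of not encoding a stray
literal "%".

```python
from urllib.parse import quote, urlsplit, urlunsplit

def defensive_url(url):
    """Percent-encode the path beyond what is strictly required,
    then wrap the whole URL in <>."""
    parts = urlsplit(url)
    # quote() leaves letters, digits and _ . - ~ alone; safe="/%" also
    # keeps path separators and any existing %XX escapes untouched.
    path = quote(parts.path, safe="/%")
    rebuilt = urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
    return "<" + rebuilt + ">"

print(defensive_url("https://en.wikipedia.org/wiki/Parenthesis_(rhetoric)"))
# <https://en.wikipedia.org/wiki/Parenthesis_%28rhetoric%29>
```

Note that quote() never encodes ".", "-" or "_", so a URL whose path
ends in "." still relies on the <> to mark where it ends.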

So ... there's nothing illegal about a URL ending in ".".
E.g. (at least while they last, as they're temporary),
these are both perfectly valid URLs:
<https://www.balug.org/tmp/dot.html.>
<https://www.balug.org/tmp/dot.html>
And they are distinct, and have different content.
If your client is reasonably sane and does hyperlinking, it should
handle the two cases above just fine.
And, what does your client do with, e.g.:
https://www.balug.org/tmp/dot.html.
https://www.balug.org/tmp/dot.html
www.balug.org/tmp/dot.html.
www.balug.org/tmp/dot.html
Now is the time to https://www.balug.org/tmp/dot.html. and then we'll ...
Now is the time to https://www.balug.org/tmp/dot.html.  And then we'll ...
Now is the time to https://www.balug.org/tmp/dot.html and then we'll ...
Now is the time to www.balug.org/tmp/dot.html. and then we'll ...
Now is the time to www.balug.org/tmp/dot.html.  And then we'll ...
Now is the time to www.balug.org/tmp/dot.html and then we'll ...
But it should well handle these:
Now is the time to <https://www.balug.org/tmp/dot.html.> and then we'll ...
Now is the time to <https://www.balug.org/tmp/dot.html.>  And then we'll ...
Now is the time to <https://www.balug.org/tmp/dot.html> and then we'll ...
Now is the time to <www.balug.org/tmp/dot.html.> and then we'll ...
Now is the time to <www.balug.org/tmp/dot.html.>  And then we'll ...
Now is the time to <www.balug.org/tmp/dot.html> and then we'll ...

So, prudence would dictate:
put your dang URLs within <> (and encode as needed/prudent).
It only costs you a whopping two characters per URL - a very
small price to pay - for much safety and sanity.
Prudence, however, alas, is not a dictator.

> After a couple of tries, I discovered the problem was the URL
> included the period to end the sentence.  Manually deleting the
> period got to to a real web page.
> When explaining the situation to the sender of the original email, he
> questioned my browser software.
>
> I'd appreciate a few of you trying the above link and telling me your
> results.
> Paul



