Standard URI

From @christianbundy on Wed Nov 28 2018 16:41:05 GMT+0000 (UTC)

I am reporting: a bug or unexpected behavior.

Bug Report

I’m working on resolving https://github.com/ssbc/ssb-markdown/issues/25 and noticed that some Dat URIs don’t currently conform to URI syntax. URLs like dat://example.com are great, but when linking to a hash the // should be removed, otherwise it’s parsed as a host (which has a maximum of 63 characters).

Instead, I think hash links should be presented (and parsed) as a path section rather than as a host. The only required change is to remove // from the URI when there’s no host.

Expected behavior

64-character hash links should start with dat: rather than dat://.

Actual behavior

64-character hash links start with dat://, so URI parsers throw an error as that’s not a valid host.
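To make the difference between the two forms concrete, here is a minimal sketch using Python's `urllib.parse.urlsplit` as one example of a generic URI splitter. The key below is a made-up placeholder, not a real Dat key. Note that `urlsplit` does not validate host length, so it does not reproduce the error described above; it only shows which component receives the key.

```python
from urllib.parse import urlsplit

# Hypothetical 64-character hex key, used only for illustration.
KEY = "ab" * 32

with_slashes = urlsplit("dat://" + KEY)
without_slashes = urlsplit("dat:" + KEY)

# With "//", the generic syntax puts the key in the authority (host) component.
print("netloc:", with_slashes.netloc, "| path:", repr(with_slashes.path))

# Without "//", the same key lands in the path component instead.
print("netloc:", repr(without_slashes.netloc), "| path:", without_slashes.path)
```

A stricter parser that enforces DNS label rules on the host would reject the first form and accept the second.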


Thanks, I appreciate your help on this! Please let me know if there’s any other information I can provide.

Copied from original issue: https://github.com/datproject/dat/issues/1050

From @christianbundy on Wed Nov 28 2018 17:55:32 GMT+0000 (UTC)

This was lightly discussed in the IRC channel and I was informed that “there’s been a thought to split the 64 into two 32 character pairs as host”, which hacks around this particular bug but still doesn’t conform to the URI semantics. My hope is that we can use the path section (just like every other URI without a network-accessible host) rather than continue digging a deeper hole.
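For reference, the split idea from IRC might look roughly like the sketch below. The function name and the choice of a dot as separator are assumptions on my part; the IRC comment did not spell out the details.

```python
def split_key_into_labels(key: str) -> str:
    """Split a 64-character key into two 32-character, dot-separated labels.

    Hypothetical helper: the exact format proposed in IRC is not known.
    """
    if len(key) != 64:
        raise ValueError("expected a 64-character key")
    return key[:32] + "." + key[32:]


# Made-up placeholder key, not a real Dat key.
KEY = "0123456789abcdef" * 4
host = split_key_into_labels(KEY)
labels = host.split(".")

print(host)
print([len(label) for label in labels])
```

Each label then fits under the 63-character limit, but the result still is not a name that resolves to a network-reachable host.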

The point was made that a hypothetical user could technically have a strange resolv.conf where each Dat URI is a network-reachable host, but I’m having trouble understanding the benefits of keeping the hash digest in the host section rather than just conforming to the standard.

From @vitorio on Wed Nov 28 2018 19:22:09 GMT+0000 (UTC)

Is dat:///hash also an option? The triple-slash is used with file: protocol URLs, where the “missing” host is then assumed to be localhost. This would also move the hash into the “path” portion, at the expense of the theoretical semantics of what your “host” is.

From @christianbundy on Wed Nov 28 2018 20:35:56 GMT+0000 (UTC)

This was discussed a bit more in IRC and it sounds like there are other deviations from the URI spec (like dat://hash+version, which is an invalid hostname). From my perspective, the main takeaway was:

> at some point, we just have to make a choice and live with the warts. I’ll need much bigger warts before I suggest changing the URL spec

It sounds like parse-dat-url has a way of parsing Dat identifiers, but since they aren’t URIs they aren’t going to be compatible with any standard parsers. I’m a bit bummed that Dat identifiers aren’t URIs or URLs but I’ll cross my fingers that we have the resources to make that change down the road.

Really excited to see all of the great work being done with Dat, thanks! :heart: :tada:

From @mwarrenus on Thu Jun 06 2019 22:09:54 GMT+0000 (UTC)

Is the “dat:” URI scheme ready to register officially (or provisionally) here https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml ?

From @RangerMauve on Fri Jun 07 2019 14:56:06 GMT+0000 (UTC)

A worry I have regarding putting the dat key in the path rather than the host is that it’ll be harder to use for setting the ‘origin’ for stuff like CORS in browsers.

Would using a Base36 encoding instead of hex help here?

From @RangerMauve on Fri Jun 07 2019 14:57:56 GMT+0000 (UTC)

I like how the :// looks, but I guess I’m not married to the concept. Having dat:{key here} looks just fine.

I’m really only concerned about making it easier for browsers to treat each dat key as a separate origin. :sweat_smile:

From @pfrazee on Fri Jun 07 2019 15:06:47 GMT+0000 (UTC)

@RangerMauve what’s your concern? The length of the key?

The :// and the use of hex was to ensure that existing http URL parsers don’t have too much trouble with the URL format. Domains have to be case insensitive, which iirc base36 is but then base36 has a lot of competing standards (and obviously isn’t as cyberpunk as hex).
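On the length question, a quick sketch of what Base36 would buy: a 32-byte key (64 hex characters) needs at most 50 Base36 digits, which fits within a 63-character label. The alphabet (`0-9` then `a-z`), the big-endian byte order, and the zero-padding are assumptions here, since there are competing Base36 conventions.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # assumed alphabet: 0-9, then a-z
ENCODED_LENGTH = 50  # digits needed for the largest 256-bit value in base 36


def base36_encode(key: bytes) -> str:
    """Encode a 32-byte key as a fixed-width, lowercase Base36 string."""
    if len(key) != 32:
        raise ValueError("expected a 32-byte key")
    value = int.from_bytes(key, "big")
    digits = []
    while value:
        value, remainder = divmod(value, 36)
        digits.append(ALPHABET[remainder])
    # Pad with leading zeros so every key encodes to the same length.
    return "".join(reversed(digits)).rjust(ENCODED_LENGTH, "0")


def base36_decode(text: str) -> bytes:
    """Decode a Base36 string back to 32 bytes; int() accepts either letter case."""
    return int(text, 36).to_bytes(32, "big")


# Made-up placeholder key, not a real Dat key.
key = bytes.fromhex("ab" * 32)
encoded = base36_encode(key)
print(len(encoded), encoded)
```

Because decoding ignores letter case, the encoded form survives parsers that lowercase the host.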

From @RangerMauve on Fri Jun 07 2019 15:13:18 GMT+0000 (UTC)

I guess I’ve just seen a lot of mixed results with URL parsers in JS.

Some of them can parse the key as a hostname (Node), some of them place it in the pathname (Firefox), some of them error out like @christianbundy said, some of them have different behavior if I replace the protocol with HTTP beforehand. It’s all really confusing.

I think in dat-webext, the origin is set to something weird when you use Dat which might mess up the security model.

All that to say, Dat using a non-standards-compliant URL format makes it a bit hard to integrate with different places without a bunch of hacking. :sweat_smile:

From @pfrazee on Fri Jun 07 2019 15:15:11 GMT+0000 (UTC)

The other engine that works well with these URLs is chromium, so obviously I’m biased (beaker is on chromium). But I do think we’ll have less pain in our future this way.

From @RangerMauve on Fri Jun 07 2019 15:17:07 GMT+0000 (UTC)

Are you sure? It’s putting the key into the pathname in the latest Chrome Canary on Windows 10.

From @pfrazee on Fri Jun 07 2019 15:23:17 GMT+0000 (UTC)

It might be dependent on registering the protocol with chromium’s engine, which beaker does.

From @christianbundy on Fri Jun 07 2019 19:23:33 GMT+0000 (UTC)

> Is the “dat:” URI scheme ready to register officially (or provisionally)

No, we still don’t conform to the URI standard. The two biggest issues:

  • keys are too long to be placed in the host position dat://${host}
  • versions are specified with dat://${host}+${version} which is an invalid hostname

From @pfrazee on Fri Jun 07 2019 19:33:10 GMT+0000 (UTC)

> keys are too long to be placed in the host position dat://${host}

Is it?

> versions are specified with dat://${host}+${version} which is an invalid hostname

I was going by RFC 3986 which includes + in the sub-delims which can be part of the reg-name which can be part of host.
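As a rough check of that reading, here is a sketch of the RFC 3986 `reg-name` rule (`*( unreserved / pct-encoded / sub-delims )`) written as a regular expression. The key and the version number `5` are made-up placeholders.

```python
import re

# reg-name = *( unreserved / pct-encoded / sub-delims )   (RFC 3986, section 3.2.2)
#   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
#   sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
REG_NAME = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*")


def is_reg_name(host: str) -> bool:
    """Return True if the whole string matches the RFC 3986 reg-name grammar."""
    return REG_NAME.fullmatch(host) is not None


# Made-up placeholder key, not a real Dat key.
KEY = "ab" * 32

print(is_reg_name(KEY))         # True: the grammar itself has no length limit
print(is_reg_name(KEY + "+5"))  # True: "+" is a sub-delim
print(is_reg_name(KEY + " 5"))  # False: a space is not allowed
```

This only covers the generic URI grammar. Whether the name is also valid DNS syntax is a separate question.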

From @christianbundy on Fri Jun 07 2019 20:28:05 GMT+0000 (UTC)

I was wrong, thanks for correcting me! It looks like we’re complying with everything the standard says we must do, it’s just that we’re going against what the standard says we should do. (Obligatory RFC 2119.)

RFC 3986

A host identified by a registered name is a sequence of characters
usually intended for lookup within a locally defined host or service
name registry, though the URI’s scheme-specific semantics may require
that a specific registry (or fixed name table) be used instead. The
most common name registry mechanism is the Domain Name System (DNS).
A registered name intended for lookup in the DNS uses the syntax
defined in Section 3.5 of [RFC1034]
and Section 2.1 of [RFC1123].
Such a name consists of a sequence of domain labels separated by “.”,
each domain label starting and ending with an alphanumeric character
and possibly also containing “-” characters. The rightmost domain
label of a fully qualified domain name in DNS may be followed by a
single “.” and should be if it is necessary to distinguish between
the complete domain name and some local domain.

[…] URI producers should use names
that conform to the DNS syntax, even when use of DNS is not
immediately apparent, and should limit these names to no more than
255 characters in length.

RFC 1034

<domain> ::= <subdomain> | " "

<subdomain> ::= <label> | <subdomain> "." <label>

The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen. There are also some
restrictions on the length. Labels must be 63 characters or less.

TL;DR: Each label in the host should be 63 or fewer characters of alphanumerics and hyphens, and if you’re using the host for a DNS lookup then you must conform to that standard. We aren’t required to use the 63-char limit or remove the +, but we should be mindful about the fact that what we’re doing is not recommended:

  4. SHOULD NOT This phrase, or the phrase “NOT RECOMMENDED” mean that
    there may exist valid reasons in particular circumstances when the
    particular behavior is acceptable or even useful, but the full
    implications should be understood and the case carefully weighed
    before implementing any behavior described with this label.

In this case the “full implications” are mostly that URI parsers may not recognize Dat URIs or parse them as links, which is the reason I opened this issue (see https://github.com/markdown-it/mdurl/issues/2). Technically our URI scheme doesn’t violate any standards, but it has some properties that aren’t recommended and that hurts interoperability with URI parsers that don’t support our edge case.
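To show what a stricter parser might be checking, here is a sketch of the DNS-style rules quoted above: RFC 1034 labels of at most 63 characters, plus the 255-character total that RFC 3986 recommends. The function names and the key are made up for illustration, and this does not handle a trailing dot. RFC 1123 also relaxes the first character to allow a digit, which this sketch deliberately ignores in order to follow the RFC 1034 text as quoted.

```python
import re

# RFC 1034 label: starts with a letter, ends with a letter or digit,
# and has only letters, digits, and hyphens in between.
LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_rfc1034_label(label: str) -> bool:
    """Check one label against the RFC 1034 syntax and 63-character limit."""
    return len(label) <= 63 and LABEL.fullmatch(label) is not None


def is_dns_style_host(host: str) -> bool:
    """Check a dot-separated host against the label rules and a 255-character total."""
    return len(host) <= 255 and all(
        is_rfc1034_label(label) for label in host.split(".")
    )


# Made-up placeholder key, not a real Dat key.
KEY = "ab" * 32

print(is_dns_style_host("example.com"))  # True
print(is_dns_style_host(KEY))            # False: one label of 64 characters
print(is_dns_style_host(KEY + "+5"))     # False: too long, and "+" is not allowed
```

A parser applying checks like these would accept dat://example.com and reject a bare 64-character key in the host position, which matches the behavior that prompted this issue.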

From @vitorio on Sat Jun 08 2019 04:53:10 GMT+0000 (UTC)

Just a reminder that adding a slash (dat:///hash) is valid, making the host “local” (arguably so in a P2P system) and moving the hash into the path (where length and parsing can be more flexible). You could also remove the slashes (dat:hash).

IMO the current URL format is the correct way of handling this. The alternatives proposed here all would result in a null origin, and remove the hostname part. This has implications for browsers, as without an origin the browser security model does not work correctly. This is the issue IPFS has, as all sites share the gateway as origin, meaning data can be leaked across all sites via web storage APIs.
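A rough illustration of the origin concern, using a deliberately simplified model where an origin is just the (scheme, host, port) tuple and a missing host means no usable origin. This is an assumption for the sake of the example and not how any particular browser computes origins. The keys are made-up placeholders.

```python
from urllib.parse import urlsplit


def simplified_origin(url: str):
    """Return (scheme, host, port), or None when the URL has no host.

    Simplified stand-in for a browser origin; None plays the role of a null origin.
    """
    parts = urlsplit(url)
    if not parts.hostname:
        return None
    return (parts.scheme, parts.hostname, parts.port)


# Made-up placeholder keys, not real Dat keys.
KEY_A = "aa" * 32
KEY_B = "bb" * 32

# Key in the host position: each key gets its own origin.
print(simplified_origin("dat://" + KEY_A + "/index.html"))
print(simplified_origin("dat://" + KEY_B + "/index.html"))

# Key in the path (dat:hash or dat:///hash): no host, so no per-key origin.
print(simplified_origin("dat:" + KEY_A))
print(simplified_origin("dat:///" + KEY_A))
```

Under this model the path-based forms cannot be told apart by origin, which is the isolation problem described above.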

In Firefox, their URL parser was correctly parsing dat URLs, once the protocol had been registered. This has just regressed in the latest nightly, but the discussion of how to deal with dweb URL schemes is open in this ticket. Parsing of unknown protocols is quite inconsistent across parser implementations.
