How to encourage seed discovery?

I’ve been experimenting with using dat to sync data from one machine to a cluster of machines. If it matters, the cluster is on a LAN and the seed machine is connected to that LAN over a VPN. Occasionally, the seed machine will go down for a few hours. During this time, the cluster successfully maintains the health of the swarm as long as at least one of the machines is up. The problem I have is that when the seed machine comes back online, it is often never discovered by the cluster. If I kill one of the cluster machines to trigger a fresh sync, the cluster occasionally notices the seed machine and reconnects to it. Lately, however, even that hasn’t been working. The only thing that seems reliable is killing all of the cluster dat nodes, forcing a re-download of the dat archive. Because all of the cluster nodes are starting from scratch, they are forced to connect to the seed machine.

Is there some other way I can reliably encourage the cluster to reconnect to the seed machine without nuking the entire thing? Perhaps there is some way to hard-code the IP address of a dat node?

It seems unlikely that this is a networking problem, since the nodes have no trouble establishing connections with each other when starting from a fresh dat sync.

Hi, and welcome to the forums!

It’s difficult to know for sure what’s going wrong. You can try running dat doctor and see if it can identify the problem. If not, you can try running with debug enabled (e.g. DEBUG=dat,dat-node /usr/local/bin/dat share) and see if there is any activity from the missing machine.

You’re saying that all the machines are on the LAN, right? They should discover each other over the local network. However, could you also try manually giving each of them a different port (--port) and see if the problem goes away? This will let your nodes discover each other via the peer-discovery tracking server. You may need to enable port forwarding and pinhole firewall rules in your router. At the very least, changing the ports will make it easier to distinguish the machines in your debug logs.
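As a rough sketch (the directory path and port numbers below are made up; the --port flag is the one mentioned above), that could look like:

```shell
# Hypothetical layout: pin each node to its own port so the machines are
# distinguishable in logs and can each get their own forwarding rule.
# On cluster machine A:
dat sync /data/archive --port 3282
# On cluster machine B:
dat sync /data/archive --port 3283
# On the seed machine (over the VPN):
dat share /data/archive --port 3284
```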

Please let us know how it goes!

Thanks for the suggestions.

I have a cluster of machines that are all on a LAN. I also have a single machine, which I am sharing from, which has a VPN connection to that LAN. Both the cluster and the single machine have access to the public internet.

Running debug mode didn’t show much:

$ DEBUG=dat,dat-node dat sync . 2>&1 | grep -v IMPORT
2020-03-26T02:09:20.700Z dat Dat DEBUG mode engaged, enabling quiet mode
2020-03-26T02:09:20.710Z dat dat 13.13.1
2020-03-26T02:09:20.710Z dat node v11.11.0
2020-03-26T02:09:21.089Z dat dat sync
2020-03-26T02:09:21.118Z dat-node archive ready. version: 3059
2020-03-26T02:09:21.126Z dat-node importFiles() { watch: true,
  dereference: true,
  count: true,
  indexing: true,
  _: [ '.' ],
  utp: true,
  debug: false,
  quiet: true,
  sparse: false,
  import: true,
  ignoreHidden: true,
  'ignore-hidden': true,
  'show-key': false,
  k: false,
  logspeed: 400,
  selectFromFile: '.datdownload',
  'select-from-file': '.datdownload',
  select: false,
  key: null,
  dir: '.',
  showKey: false,
  createIfMissing: false,
  exit: false,
  selectedFiles: null,
  ignore: [Function: ignore] }

The machines on the LAN all find each other right away and reliably. It is the machine that is connected via the VPN that is unreliable.

P2P tests didn’t show any surprises:

Running a new Peer-to-Peer test

To check connectivity with another computer, run the command:

  dat doctor 312ce633918bc6a512e48d875abc68f109443ddc9b61c4c7ade83cb0e7e58d7f

68.38.151.136:46837 (UTP) SUCCESS!
68.38.151.136:46837 (UTP) SUCCESS!

Trying to Connect:
  68.38.151.136:46837
  68.45.22.55:62039

and

Welcome to Dat Doctor!

Software Info:
  linux x64
  Node v11.15.0
  Dat Doctor v2.1.2
  dat v13.13.1


✖ Who am I?
  The default Dat port (3282) in use, using random port.
  This may impact Dat's connectivity if you have a firewall.
  ERROR: Could not detect public ip / port
✔ Loaded native modules


Joining existing Peer-to-Peer test:

  dat doctor 312ce633918bc6a512e48d875abc68f109443ddc9b61c4c7ade83cb0e7e58d7f

68.45.22.55:62039 (UTP) SUCCESS!
68.45.22.55:62039 (UTP) SUCCESS!

Trying to Connect:
  68.45.22.55:62039
  68.38.151.136:46837

The only reliable error is ERROR: Could not detect public ip / port. Do all nodes need to have a public IP/port? It works most of the time; how is it dealing with this issue when it does work?

I’m not seeing much useful debug info. Is there something I’m missing? Perhaps if I get that working, it will be easier to experiment with changing ports and seeing what happens.

I just ran a P2P dat doctor test again while my dat sync was failing to make any connections. The dat doctor test succeeded, yet the whole time my original dat sync between the same machines never connected.

Hm. I don’t remember the right debug flags. You can get everything with DEBUG=* dat. You can make it less verbose by supplying a comma-separated list of components to log once you’ve run with * and found the names of the components that contain useful info.
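For instance (a sketch; the component names listed are just ones that commonly appear, such as discovery-swarm and dns-discovery):

```shell
# First pass: log everything (very noisy) and save it for skimming.
DEBUG=* dat sync . 2> debug-all.log
# Second pass: keep only the components that looked relevant.
DEBUG=dat,dat-node,discovery-swarm,dns-discovery dat sync .
```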

I’ve got a more helpful error message now, though it is quite verbose. Does this mean anything to anyone?

2020-03-30T20:48:43.026Z bittorrent-dht [3c0765e] visited 14 nodes
2020-03-30T20:48:43.026Z bittorrent-dht [3c0765e] announce ae23fa2d4dda055874bc9684fbde828c57fbe940 3282
2020-03-30T20:48:43.524Z discovery-swarm connecting 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e retries=1
2020-03-30T20:48:46.024Z discovery-swarm timeout 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
Mon, 30 Mar 2020 20:48:46 GMT dns-discovery retrying probe of undefined at secondary port 53
2020-03-30T20:48:46.524Z discovery-swarm onclose utp+tcp 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
2020-03-30T20:48:46.525Z discovery-swarm timeout 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:48:47.025Z discovery-swarm onclose utp+tcp 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
2020-03-30T20:48:47.526Z discovery-swarm connecting 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e retries=2
2020-03-30T20:48:48.026Z discovery-swarm connecting 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e retries=2
2020-03-30T20:48:50.530Z discovery-swarm timeout 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:48:51.030Z discovery-swarm timeout 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:48:51.037Z discovery-swarm onclose utp+tcp 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
2020-03-30T20:48:51.537Z discovery-swarm onclose utp+tcp 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
Mon, 30 Mar 2020 20:48:53 GMT dns-discovery probe of undefined:NaN failed
2020-03-30T20:48:56.040Z discovery-swarm connecting 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e retries=3
2020-03-30T20:48:56.538Z discovery-swarm connecting 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e retries=3
2020-03-30T20:48:59.040Z discovery-swarm timeout 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:48:59.538Z discovery-swarm timeout 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:48:59.586Z discovery-swarm onclose utp+tcp 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
2020-03-30T20:49:00.090Z discovery-swarm onclose utp+tcp 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
...
2020-03-30T20:48:59.040Z discovery-swarm timeout 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:48:59.538Z discovery-swarm timeout 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:48:59.586Z discovery-swarm onclose utp+tcp 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
2020-03-30T20:49:00.090Z discovery-swarm onclose utp+tcp 68.45.22.55:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
Mon, 30 Mar 2020 20:49:01 GMT dns-discovery MDNS query 192.168.1.157:5353 21Q 0A +0
Mon, 30 Mar 2020 20:49:01 GMT dns-discovery MDNS query 192.168.2.126:5353 22Q 0A +0
...
Mon, 30 Mar 2020 20:49:42 GMT discovery-channel chan=ae23fa..6e dns announce { port: 3282, publicPort: 0, multicast: true }
Mon, 30 Mar 2020 20:49:42 GMT dns-discovery announce() ae23fa2d4dda055874bc9684fbde828c57fbe940
Mon, 30 Mar 2020 20:49:42 GMT dns-discovery probing discovery2.datprotocol.com:5300
Mon, 30 Mar 2020 20:49:42 GMT dns-discovery MDNS query 192.168.1.157:5353 1Q 0A +0
Mon, 30 Mar 2020 20:49:42 GMT dns-discovery Replying known peers via TXT to 192.168.1.157:5353
Mon, 30 Mar 2020 20:49:42 GMT dns-discovery MDNS response 192.168.1.157:5353 1A +0

Here is an example of it not working for a while, and then starting to work when I reboot the cluster:

Mon, 30 Mar 2020 20:55:03 GMT discovery-channel chan=ae23fa..6e dns announce { port: 3282, publicPort: 0, multicast: true }
Mon, 30 Mar 2020 20:55:03 GMT dns-discovery announce() ae23fa2d4dda055874bc9684fbde828c57fbe940
Mon, 30 Mar 2020 20:55:03 GMT dns-discovery probing discovery2.datprotocol.com:5300
Mon, 30 Mar 2020 20:55:03 GMT dns-discovery MDNS query 192.168.1.157:5353 1Q 0A +0
Mon, 30 Mar 2020 20:55:03 GMT dns-discovery Replying known peers via TXT to 192.168.1.157:5353
Mon, 30 Mar 2020 20:55:03 GMT dns-discovery MDNS response 192.168.1.157:5353 1A +0
Mon, 30 Mar 2020 20:55:03 GMT discovery-channel chan=ae23fa..6e dns discovery peer=68.38.151.136:3282
2020-03-30T20:55:03.631Z discovery-swarm connecting 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e retries=0
2020-03-30T20:55:06.635Z discovery-swarm timeout 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:55:06.722Z discovery-swarm onclose utp+tcp 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
2020-03-30T20:55:07.725Z discovery-swarm connecting 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e retries=1
2020-03-30T20:55:10.731Z discovery-swarm timeout 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:55:11.242Z discovery-swarm onclose utp+tcp 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
Mon, 30 Mar 2020 20:55:11 GMT dns-discovery retrying probe of undefined at secondary port 53
2020-03-30T20:55:12.243Z discovery-swarm connecting 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e retries=2
2020-03-30T20:55:15.248Z discovery-swarm timeout 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:55:15.758Z discovery-swarm onclose utp+tcp 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
Mon, 30 Mar 2020 20:55:19 GMT dns-discovery probe of undefined:NaN failed
2020-03-30T20:55:20.764Z discovery-swarm connecting 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e retries=3
2020-03-30T20:55:23.765Z discovery-swarm timeout 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:55:24.298Z discovery-swarm onclose utp+tcp 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
2020-03-30T20:55:39.300Z discovery-swarm connecting 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e retries=4
2020-03-30T20:55:42.301Z discovery-swarm timeout 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e
2020-03-30T20:55:42.404Z discovery-swarm onclose utp+tcp 68.38.151.136:3282@ae23fa2d4dda055874bc9684fbde828c57fbe9406c09d392aae7e17a8c5d696e will-requeue=1
2020-03-30T20:55:43.944Z discovery-swarm inbound connection type=utp ip=68.38.151.136:3282
2020-03-30T20:55:44.919Z dat-network Uploaded data: 34980
2020-03-30T20:55:45.419Z dat-network Uploaded data: 435245
2020-03-30T20:55:45.922Z dat-network Uploaded data: 690016
2020-03-30T20:55:46.422Z dat-network Uploaded data: 1878036
2020-03-30T20:55:46.922Z dat-network Uploaded data: 2745510
2020-03-30T20:55:47.423Z dat-network Uploaded data: 4616289
2020-03-30T20:55:47.922Z dat-network Uploaded data: 6507718

Multicast peer-discovery (mDNS/DNS-SD) won’t work over a VPN without additional configuration. Multicast broadcasts aren’t repeated over the VPN tunnel unless either the tunnel is configured to repeat them or the broadcasting software is configured to target it specifically. Try looking online for how to configure this with your VPN setup.
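As one illustration (assuming a Linux host that sees both the LAN and the VPN interface and runs Avahi; the interface names are placeholders, and other VPN stacks need different configuration), an mDNS reflector can repeat multicast traffic between the two networks:

```
# /etc/avahi/avahi-daemon.conf on the host bridging LAN and VPN
[server]
# restrict Avahi to the two interfaces being bridged (example names)
allow-interfaces=eth0,tun0

[reflector]
# re-broadcast mDNS queries and answers between the listed interfaces
enable-reflector=yes
```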

However, globally reachable addresses are the path of least resistance. Dat doesn’t yet support IPv6, so you’ll have to work around the limitations of IPv4.

Your log shows that you’re connecting out via the public internet and not via the local network. I’m assuming one of the other hosts on your network has snagged port 3282 from your router, so any connection to your public IP will be directed to the wrong host. Assigning different ports and poking separate holes in your firewall for each instance should help with that.
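A sketch of what that could look like (the port number and iptables usage are assumptions; adapt them to your router and firewall):

```shell
# Router: forward a distinct external port to each machine,
# e.g. 3283 -> machine B (configured in the router's admin UI).
# On machine B, open that port for both TCP and UDP (uTP runs over UDP):
sudo iptables -A INPUT -p tcp --dport 3283 -j ACCEPT
sudo iptables -A INPUT -p udp --dport 3283 -j ACCEPT
# Then pin dat to the same port:
dat sync . --port 3283
```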

Dat doesn’t really auto-configure networking behind multi-instance NATs. Everything is hard-coded to use the same port. Ideally, it would use UPnP to request an available port from the network gateway and configure itself to use that port, which is how similar software auto-configures itself. This may have changed very recently, though.
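As a manual stand-in for that missing auto-configuration (assuming the gateway has UPnP/IGD enabled), a tool such as upnpc from the miniupnpc package can request the mapping by hand:

```shell
# Ask the gateway to map external port 3283 to this host (UDP for uTP,
# TCP for the fallback transport), then run dat pinned to that port.
upnpc -e "dat node" -r 3283 udp
upnpc -e "dat node" -r 3283 tcp
dat sync . --port 3283
```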

Could you please file a bug for that in dns-discovery? Attempting to connect to an undefined host on port NaN is definitely a bug.

Hi dwiel, you may be suffering from a bug I discovered in the discovery-swarm package, which Dat uses internally.

Please take a look at this conversation for more details:

I sent a patch that fixed the bug in discovery-swarm, but I don’t think anyone updated the Dat command-line utility.

I’ll try to send a patch this weekend; let’s see if that solves anything.

Hello dwiel

I see that you are using dat v13.13.1. That version suffered from the bug I mentioned before. The thing is, the newest dat version uses a different low-level library that completely replaces discovery-swarm, so I don’t see much sense in going against the current and building a new dat command line with discovery-swarm patched.

I recommend that you try the newest version, v14.0.2, and see if that solves your problems.

Regards.

First off, thanks for all the help!

I tried changing the port while still using dat version 13.13.1, and that helped most of the time, though again it isn’t always working. To be clearer: sometimes it works, sometimes it never connects, and sometimes it flickers back and forth between 0 connections and 1 connection without the dat actually syncing.

I’ve switched to dat version 14.0.2 and now I’m getting all kinds of odd behavior:

  • dat doctor just shows the help page, no doctor menu
  • dat sync between two processes on the same machine results in slow transfers, often less than 1 MB/s
  • dat sync between two processes on the same machine produces status messages like this: 0 connections | Download 786 KB/s Upload 0 B/s. The same thing appears on the upload side: 0 connections, yet data is still transferring.
  • I haven’t gotten 14.0.2 working at all across my cluster, where 13.13.1 works reliably.

Is 14.0.2 a stable release? Should I be filing bugs for each of these? I’m not sure I have time to track each of these down unless they’re all somehow related.