Best practices for keeping a Dat archive from bloating

Append-only logs and version histories are all good in theory. But how do you keep a Dat archive from ballooning in size over time?

Say you’ve got a static website and change something about the main layout file. That leaves you with one new revision per HTML file in your archive, and the total Dat archive has roughly doubled in size. Revert that change a week later and the archive grows by another 50%, since yet another full revision of every page gets appended.

I’ve noticed that Dat will add revisions of identical files as new files. Say you’ve moved an identical copy of a file on top of a file in your Dat archive. Dat will see that as a modification and rewrite the entire file into the archive even though the contents are unchanged (only the filesystem metadata changed). Static site generators sometimes do this inadvertently.
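One way to guard against this is to compare file contents before copying anything into the Dat folder, so unchanged files never get touched and never trigger a new revision. A minimal sketch, not Dat-specific at all; the directory names are made up for illustration:

```ts
// sync-changed.ts — copy only files whose bytes actually changed, so Dat
// never sees a spurious "modification" caused by metadata-only rewrites.
// Assumes you build into OUT_DIR and your Dat archive lives in DAT_DIR.
import { createHash } from "node:crypto";
import { promises as fs } from "node:fs";
import * as path from "node:path";

const OUT_DIR = "./public";   // hypothetical build output
const DAT_DIR = "./site-dat"; // hypothetical dat folder

async function sha256(file: string): Promise<string | null> {
  try {
    return createHash("sha256").update(await fs.readFile(file)).digest("hex");
  } catch {
    return null; // file does not exist yet
  }
}

async function syncChanged(src: string, dest: string): Promise<void> {
  for (const entry of await fs.readdir(src, { withFileTypes: true })) {
    const from = path.join(src, entry.name);
    const to = path.join(dest, entry.name);
    if (entry.isDirectory()) {
      await fs.mkdir(to, { recursive: true });
      await syncChanged(from, to);
    } else if ((await sha256(from)) !== (await sha256(to))) {
      await fs.copyFile(from, to); // only genuinely changed bytes reach the archive
    }
  }
}

syncChanged(OUT_DIR, DAT_DIR).catch(console.error);
```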

Archives also start to have performance problems in Bunsen and Beaker once they pass 3000 revisions. The only workaround I’ve found is to delete the archive and start a new one.

So what can you do to keep archives as small as possible without committing to never changing them at all?


Hey @Daniel, I’m not sure about this, but it reminded me of the part of the whitepaper that says:

Because Dat is designed for larger datasets, if it stored all previous file versions in .dat, then the .dat folder could easily fill up the users hard drive inadvertently. Therefore Dat has multiple storage modes based on usage.

So, is this a case of choosing a “storage mode”? I’m unsure how that works in practice: whether you need to specify something when you create your Dat, or use some specific tool (perhaps someone else can chime in on this?).

There are a few approaches to this problem:

  1. Purge old data in the dat. You don’t have to host all versions of the content feed. When you overwrite a file, you can discard the old data and host only the current version. This means older versions may become unavailable.
  2. Periodically create a new archive when the size becomes too great (as you mentioned). This is obviously not ideal.
  3. Incremental file updates. In hyperdrive 10 (part of the Dat 2 stack), I believe it will be possible to do partial updates to files. This could help with some of the cases you mention.

How do you do this, though? I’m not seeing any options for it in Beaker or the Dat CLI.

As far as I know, this is not available in the CLI. It is possible using internal hypercore APIs, though. I might write a tool to do this.
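Roughly, such a tool could look something like the sketch below. It assumes a hyperdrive 9-style archive where `archive.content` exposes the underlying hypercore, and it uses hypercore’s `clear()` to drop local copies of old content blocks; the storage path and block count are made up for illustration.

```ts
// prune-content.ts — rough sketch of the kind of tool mentioned above:
// clear old blocks from a Dat archive's content feed with hypercore's clear().
// Assumes hyperdrive 9, where `archive.content` is the underlying hypercore.
const hyperdrive = require("hyperdrive");

const KEEP_LAST = 1000; // keep roughly the most recent 1000 content blocks

const archive = hyperdrive("./my-site/.dat"); // hypothetical storage location

archive.on("ready", () => {
  const content = archive.content;
  if (!content) {
    console.error("content feed not loaded yet");
    return;
  }

  const end = Math.max(0, content.length - KEEP_LAST);
  // clear(start, end, cb) drops the local copies of blocks [start, end).
  // Caveat: a naive prefix clear like this can also drop blocks that current
  // files still reference if they haven't changed recently; a real tool would
  // walk the current metadata first and only clear blocks no live file uses.
  content.clear(0, end, (err: Error | null) => {
    if (err) console.error("clear failed:", err);
    else console.log(`cleared blocks 0..${end} of ${content.length}`);
  });
});
```

Peers that already hold the cleared blocks can keep serving those old versions; this only frees your local storage and stops you from hosting them.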


That would be greatly appreciated. The current tools for managing archives assume infinite storage and that absolutely every byte of data is worth preserving.

Ideally, I’d want to either prune data from the archive older than x months or limit the archive to x revisions of each file.

Or maybe a backup-utility-style policy: “keep max 1 revision per hour for the last week, max 1 per day for the last month, max 1 per week for the last quarter, and max 1 per month for the last 3 years.”
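As pure policy logic (no Dat APIs, since no such tool exists yet), that tiered rule could look something like this sketch; the `Revision` shape is assumed for illustration:

```ts
// retention.ts — sketch of the tiered retention rule described above:
// given revisions with timestamps, decide which ones to keep.
interface Revision {
  version: number; // archive version that introduced this revision
  mtime: Date;     // when it was written
}

// Tiers: within each age window, keep at most one revision per bucket.
const HOUR = 3600_000;
const DAY = 24 * HOUR;
const TIERS = [
  { maxAge: 7 * DAY, bucket: HOUR },           // last week: one per hour
  { maxAge: 30 * DAY, bucket: DAY },           // last month: one per day
  { maxAge: 91 * DAY, bucket: 7 * DAY },       // last quarter: one per week
  { maxAge: 3 * 365 * DAY, bucket: 30 * DAY }, // last 3 years: one per month
];

function revisionsToKeep(revisions: Revision[], now = new Date()): Revision[] {
  const seen = new Set<string>(); // one keeper per (bucket size, bucket index)
  const keep: Revision[] = [];
  // Newest first, so the most recent revision in each bucket wins.
  const sorted = [...revisions].sort((a, b) => b.mtime.getTime() - a.mtime.getTime());
  for (const rev of sorted) {
    const age = now.getTime() - rev.mtime.getTime();
    const tier = TIERS.find((t) => age <= t.maxAge);
    if (!tier) continue; // older than every tier: prune
    const key = `${tier.bucket}:${Math.floor(rev.mtime.getTime() / tier.bucket)}`;
    if (!seen.has(key)) {
      seen.add(key);
      keep.push(rev);
    }
  }
  return keep;
}
```

Everything not returned by `revisionsToKeep` would be a candidate for pruning.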

Being able to specify which file paths to prune and which to keep might enable use cases where most of the archive’s history is left intact. A blog’s /index.html or /feed.atom changes all the time, for example.

This just stores the differences between the previous and latest revision, right? That alone might reduce the storage requirements enough for this to stop being a problem, assuming it will be available in the CLI.

Would pruning old archive data leave you with partial files in this mode?

I brought this up at the Dat comm comm, and unfortunately this is not in yet, but it is on the roadmap.