Overview
I currently have around 500GB of data in my backup sets, including photos, personal documents, and client data. Incremental backups run every night.
One of the more challenging aspects is storing data from docker containers.
Borgmatic
I’ve been using BorgBackup for several years now. It offers good security, with data encrypted in transit and at rest, but the main advantage is that BorgBackup doesn’t require periodic full backups. On a residential connection, uploading all 500GB of data would take several days. With BorgBackup I only need to do that once; every nightly run after that uploads only new or changed chunks, yet each archive in the repository still acts as a full backup, with duplicate data shared between archives.
Deduplication based on content-defined chunking is used to reduce the number of bytes stored: each file is split into a number of variable length chunks and only chunks that have never been seen before are added to the repository. A chunk is considered duplicate if its id_hash value is identical. A cryptographically strong hash or MAC function is used as id_hash, e.g. (hmac-)sha256. To deduplicate, all the chunks in the same repository are considered, no matter whether they come from different machines, from previous backups, from the same backup or even from the same single file.

Compared to other deduplication approaches, this method does NOT depend on:

- file/directory names staying the same: So you can move your stuff around without killing the deduplication, even between machines sharing a repo.
- complete files or time stamps staying the same: If a big file changes a little, only a few new chunks need to be stored - this is great for VMs or raw disks.
- The absolute position of a data chunk inside a file: Stuff may get shifted and will still be found by the deduplication algorithm.
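To make that concrete, here's what chunk-level deduplication looks like with the plain borg CLI; the repository path and source directory below are placeholders:

# Illustration only: repository path and source directory are placeholders
borg create --stats /path/to/repo::'backup-{now}' /srv/data
# ...change a small part of one large file, then back up again:
borg create --stats /path/to/repo::'backup-{now}' /srv/data
# The stats for the second archive show a "Deduplicated size" covering only the
# new chunks, not the whole changed file, let alone the whole data set.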
That said, the BorgBackup CLI is somewhat arcane, so I use Borgmatic as an abstraction layer. Borgmatic is configured through files and adds conveniences like failure notifications.
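As a rough sketch of what that looks like, setup amounts to dropping a YAML file into /etc/borgmatic/ and putting borgmatic on a nightly schedule. The repository, paths, and healthchecks URL below are placeholders, and the exact section layout varies between borgmatic versions:

#!/bin/bash
# Sketch only: write a minimal borgmatic config; repository, paths, and the
# healthchecks URL are placeholders, and section names vary by borgmatic version.
cat > /etc/borgmatic/config.yaml <<'EOF'
location:
    source_directories:
        - /srv/onedrive/data
        - /srv/postgres/dump
    repositories:
        - ssh://user@backup.example.com/./backups.borg
retention:
    keep_daily: 7
    keep_weekly: 4
    keep_monthly: 6
hooks:
    # ping a health-check service so a missed or failed backup raises an alert
    healthchecks: https://hc-ping.com/your-uuid-here
EOF

A nightly cron entry then just runs borgmatic with no arguments, which by default prunes old archives, creates a new one, and checks the repository.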
Container Data
Borgmatic allows you to simply provide a list of directories to include in the backup set, but getting data into simple directories takes some doing.
There’s no single approach for every docker container.
Simple Volumes
In some cases you can just point borgmatic at the directories mounted as volumes in containers.
For example, my onedrive container exposes all the files it's managing in a /data directory, and borgmatic just includes everything in that directory without any interaction with the container itself.
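In practice that just means bind-mounting the container's /data onto a host path and listing that path under borgmatic's source_directories. A sketch, with a placeholder image name and host path:

# Sketch only: image name and host path are placeholders
docker run -d --name onedrive \
    -v /srv/onedrive/data:/data \
    some/onedrive-image
# borgmatic's source_directories then includes /srv/onedrive/data;
# no interaction with the running container is needed at backup time.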
The big caveat here is that the onedrive container might write to that directory while a backup is in progress, so you won’t have a consistent snapshot. In the case of onedrive this doesn’t matter, because I’m only storing the files, not onedrive’s cache or database of changes.
Databases
With a database, the container's /data directory will contain files which are interdependent. For example, say you have files A, B, and C: borgmatic uploads A, is working on B, and the database then writes a change affecting A and C. You'll end up with a copy of A from before the change and a copy of C from after the change.
The way to manage this is to dump the database just before the backup runs, for example from a cron job scheduled ahead of it or from borgmatic's before_backup hook. That's pretty easy to do with a simple bash script.
#!/bin/bash
# Dump every database in the postgres container to a single SQL file.
# /dump is assumed to be a bind mount that borgmatic can read on the host (see the sketch below).
docker exec postgres pg_dumpall --username=postgres -f /dump/all.sql
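For borgmatic to pick that dump up, /dump inside the container has to be reachable on the host, i.e. a bind mount. A sketch, with placeholder paths, password, and image tag:

# Sketch only: paths, password, and image tag are placeholders
docker run -d --name postgres \
    -v /srv/postgres/dump:/dump \
    -v postgres-data:/var/lib/postgresql/data \
    -e POSTGRES_PASSWORD=change-me \
    postgres:16
# /srv/postgres/dump on the host is what goes into borgmatic's source_directories.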
The caveat with database-level dumps is that to restore from a backup you need to re-create the entire docker stack and then load the data back in.
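Once the stack is back up, loading the dump back in is roughly the reverse of the dump step, again assuming the same container name and bind-mounted /dump:

# Sketch only: after re-creating the stack, feed the dump back to postgres
docker exec postgres psql --username=postgres -d postgres -f /dump/all.sql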
Project Specific
Some containers or projects offer built-in backup strategies. For example, gitea has a dump command which will write everything to a .zip file.
#!/bin/bash
# Delete dumps older than 14 days from the dump/ directory next to this script
find $(dirname $(realpath $0))/dump/* -mtime +14 -delete
# Create a fresh dump inside the gitea container (drop -it if this runs non-interactively, e.g. from cron)
docker exec -u git -it -w /dump gitea bash -c '/app/gitea/gitea dump -c /data/gitea/conf/app.ini'
This script purges the dump directory of anything older than 14 days before creating a new dump.
The caveat with this approach is that borgmatic is going to have trouble deduplicating the zip file: compressed archives rarely produce matching chunks from one run to the next, even when the underlying data has barely changed. So even a tiny, insignificant change in gitea, like a few new log lines, means the entire contents of all my git repos get uploaded again in the next backup.