Discussion:
[rsnapshot-discuss] Parallelism and deduplication
Gandalf Corvotempesta
2016-04-25 10:08:54 UTC
Permalink
Hi to all.
I'm trying to implement a new backup server based on rsnapshot
by adding parallelism (through "parallel") and deduplication.

I just need some confirmation; this is what I have right now (and it seems
to work perfectly):

/etc/rsnapshot.d/hosts/server1.{conf,passwd}
/etc/rsnapshot.d/hosts/server2.{conf,passwd}
/etc/rsnapshot.d/hosts/server3.{conf,passwd}
/etc/rsnapshot.d/hosts/serverN.{conf,passwd}
/etc/rsnapshot.d/rsnapshot.conf

/etc/rsnapshot.d/rsnapshot.conf is the base configuration,
included from each host's config:

$ cat /etc/rsnapshot.d/hosts/server1.conf
include_conf /etc/rsnapshot.d/rsnapshot.conf
snapshot_root /var/backups/rsnapshot/server1/
logfile /var/log/rsnapshot/server1.log
lockfile /var/run/rsnapshot/server1.pid
backup rsync://***@server1/everything/ ./ +rsync_long_args=--password-file=/etc/rsnapshot.d/hosts/server1.passwd


$ cat rsnapshot.conf
config_version 1.2
no_create_root 0
cmd_cp /bin/cp
cmd_rm /bin/rm
cmd_rsync /usr/bin/rsync
#cmd_ssh /usr/bin/ssh
cmd_logger /usr/bin/logger
cmd_du /usr/bin/du
#cmd_preexec /path/to/preexec/script
#cmd_postexec /path/to/postexec/script
verbose 3
loglevel 5
lockfile /var/run/rsnapshot.pid
sync_first 1
rsync_short_args -a
rsync_long_args --delete --numeric-ids --relative --delete-excluded --stats
link_dest 1
rsync_numtries 3
retain daily 15
retain weekly 2
exclude /backups/
exclude /admin_backups/
exclude /reseller_backups/
exclude /user_backups/
exclude /tmp/
exclude /proc
exclude /sys
exclude /var/cache
exclude /var/log/lastlog
exclude /var/log/rsync*
exclude /var/lib/mlocate
#exclude /var/spool
exclude /media
exclude /mnt
exclude tmp/


Up to this point, it's pretty easy.
Now I added some parallelism, through GNU parallel.
Just a single entry in crontab:

$ cat /etc/cron.d/rsnapshot
0 0 * * * root /usr/local/bin/parallel_rsnapshot 2>&1 > /dev/null


$ cat /usr/local/bin/parallel_rsnapshot
#!/bin/bash
RSNAPSHOT_SCHEDULER="/usr/local/bin/rsnapshot_scheduler"
HOSTS_PATH="/etc/rsnapshot.d/hosts"
PARALLEL_JOBS=5
# Run rsnapshot in parallel
parallel --jobs ${PARALLEL_JOBS} "${RSNAPSHOT_SCHEDULER} {}" ::: ${HOSTS_PATH}/*.conf



And a custom rsnapshot scheduler that chooses which backup level to run
(this is slightly simplified for posting to the list):

$ cat /usr/local/bin/rsnapshot_scheduler
#!/bin/bash
RSNAPSHOT="/usr/bin/rsnapshot -v"
CONFIG=$1
HOST=$(/usr/bin/basename ${CONFIG} | sed 's/\.conf$//g')
LOG_PATH="/var/log/rsnapshot"
LOG_FILE="${LOG_PATH}/$(/usr/bin/basename ${CONFIG} | sed 's/\.conf$/\.log/g')"

function rsnap {
${RSNAPSHOT} -c $1 $2
}

if [ "$(date +%j)" -eq "001" ]; then
rsnap ${CONFIG} yearly
fi
if [ $(date +%d) -eq 1 ]; then
rsnap ${CONFIG} monthly
fi
if [ $(date +%w) -eq 0 ]; then
rsnap ${CONFIG} weekly
fi

rsnap ${CONFIG} sync

# Check if sync was OK. Run daily only if sync is OK.
SUCCESS=$(grep -ci "$(date +%d/%b/%Y).*sync: completed successfully" ${LOG_FILE})
if [ ${SUCCESS} -ne 1 ]; then
EMAIL_SUBJECT="Backup FAILED per ${HOST}"
else
EMAIL_SUBJECT="Backup OK per ${HOST}"
rsnap ${CONFIG} daily
fi

# Send full log report
grep -i "$(date +%d/%b/%Y)" ${LOG_FILE} | mailx -s "${EMAIL_SUBJECT}" ***@mydomain.tld
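
(A minimal alternative for the success check above, shown only as a sketch:
it relies on the exit status of "rsnapshot sync" rather than grepping the
log file, under the assumption that a non-zero exit status means the sync
did not complete cleanly.)

# Sketch: replace the grep-based check with an exit-status check.
if ${RSNAPSHOT} -c ${CONFIG} sync; then
    EMAIL_SUBJECT="Backup OK for ${HOST}"
    rsnap ${CONFIG} daily
else
    EMAIL_SUBJECT="Backup FAILED for ${HOST}"
fi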





Now, some questions:

1) Should I run weekly, monthly and yearly before or after the sync
process? If run before, I'll rotate some backups with no confirmation that
the following sync will be OK. This could lead to some missing backups
(the ones deleted by rotation). Probably the first thing to do is a
sync; if all is OK, then rotate everything else. Right?

2) On the remote server, I have a "pre-xfer" script that runs some actions
and outputs some debug text (like "echo Running xy....").
Is it possible to get that output logged by rsnapshot?

3) I would like to add some deduplication. Currently I can run
"hardlink" over the latest backup (for example, daily.0), but this
"deduplicates" only files within the same backup pool. How can I deduplicate
common files across all pools? Is it safe to run "hardlink" across the
whole rsnapshot directory (hardlink /var/backups/rsnapshot/server1/*)?
This would save much more space.

Thanks in advance.
Gandalf Corvotempesta
2016-04-29 18:08:22 UTC
Permalink
no one ?
Scott Hess
2016-04-30 00:52:43 UTC
Permalink
On Mon, Apr 25, 2016 at 3:08 AM, Gandalf Corvotempesta <
Post by Gandalf Corvotempesta
I'm trying to implement a new backup server based on rsnapshot
by adding parallelism (through "parallel") and deduplication.
Meta-comment: Often I/O bandwidth or seeks are the limiting factor, and
running things in parallel will convert 5 serial 10-minute jobs into 5
parallel 60-minute jobs. Make sure you're testing for that in your case.
Especially the cp -al and rm -rf phases are unlikely to enjoy competing
with each other for resources, whereas rsync phases could plausibly
timeshare with each other.
Post by Gandalf Corvotempesta
1) should I run weekly, monthly and yearly before or after the sync
process? If run before, i'll rotate some backups with no confirm that
the followind sync would be ok. This could lead to some missing backups
(the ones deleted by rotation). Probably, the first thing to do is a
sync. If all is OK, then rotate everything else. Right?
Only the first level benefits from sync. I always run the longer-period
rsnapshot calls first, because when combined with use_lazy_deletes, those
runs can be relatively tightly constrained in runtime (though I leave a gap
to my next sync, to let all the lazy deletes clear out).
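
A rough sketch of that ordering for a single host, reusing the per-host
config layout from the original post and assuming "use_lazy_deletes 1"
has been added to the shared rsnapshot.conf (times are illustrative, not
a recommendation):

# Rotations first: with sync_first they run only on the backup server
# (renames plus the cp -al of .sync), and use_lazy_deletes defers the
# big rm -rf; leave a gap for the lazy deletes, then start the
# rsync-heavy sync pass.
55 22 * * 0  root  /usr/bin/rsnapshot -c /etc/rsnapshot.d/hosts/server1.conf weekly
0  23 * * *  root  /usr/bin/rsnapshot -c /etc/rsnapshot.d/hosts/server1.conf daily
0  0  * * *  root  /usr/bin/rsnapshot -c /etc/rsnapshot.d/hosts/server1.conf sync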

2) On remote server, I have a "pre-xfer" script that run some actions
Post by Gandalf Corvotempesta
and output some debug text (like "echo Running xy....")
Is possible to get the output logged by rsnapshot ?
How and where? Why not just have the script log somewhere directly?
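
For what it's worth, a minimal sketch of that approach, assuming the
"pre-xfer" script is an rsync daemon "pre-xfer exec" hook on the
backed-up server (the log path is made up):

#!/bin/bash
# Hypothetical pre-xfer script: append all output to a local log file
# instead of trying to route it back through rsnapshot.
# RSYNC_MODULE_NAME and RSYNC_HOST_ADDR are environment variables the
# rsync daemon sets for "pre-xfer exec" scripts.
exec >> /var/log/rsync-prexfer.log 2>&1
echo "$(date '+%F %T') pre-xfer for module ${RSYNC_MODULE_NAME} from ${RSYNC_HOST_ADDR}"
echo "Running xy...."
# ... the real pre-transfer actions go here ...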

3) I would like to add some deduplication feature. Actually, I can run
Post by Gandalf Corvotempesta
"hardlink" over the latest backup (for example, daily.0), but this will
"deduplicate" only files in the same backup pool. How can I deduplicate
common files acroll all pools ? Is safe to run "hardlinks" across the
whole rsnapshot directory? (hardlink /var/backups/rsnapshot/server1/*) ?
This would save much more space....
Running hardlink every time is probably going to take a long time and not
do much most of the time. If you are frequently generating duplicate data
across hosts (like from a script or something), you might either explore
whether you actually need to backup that data, or whether you can dedupe in
targeted directories. I will dedupe periodically either when I'm doing
some other maintenance or when I notice disk usage seems up.

I just dedupe across my entire volume. The premise of rsnapshot means that
hardlinks are fine; if they weren't, the entire thing would fall over. Just
make sure you aren't deduping while rsnapshot is running. On most systems
you cannot hardlink across volumes.

-scott
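
A minimal sketch of such a periodic dedupe pass (not part of the thread's
setup): it refuses to run while any per-host rsnapshot lockfile from the
original post's configs exists, then lets "hardlink" link identical files
across every snapshot of every host; the snapshot root is an assumption.

#!/bin/bash
# Skip the dedupe entirely if any rsnapshot instance still holds a lock.
if ls /var/run/rsnapshot/*.pid >/dev/null 2>&1; then
    echo "rsnapshot still running, skipping dedupe" >&2
    exit 1
fi
# Link identical files across all hosts and all snapshot levels.
hardlink /var/backups/rsnapshot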
Gandalf Corvotempesta
2016-04-30 10:21:44 UTC
Permalink
Post by Scott Hess
Meta-comment: Often I/O bandwidth or seeks are the limiting factor, and
running things in parallel will convert 5 serial 10-minute jobs into 5
parallel 60-minute jobs. Make sure you're testing for that in your case.
Especially the cp -al and rm -rf phases are unlikely to enjoy competing with
each other for resources, whereas rsync phases could plausibly timeshare
with each other.
This is something that I've not considered, you are right.
What if I use "link_dest" instead of "cp -al"? It should avoid the cp phase.

My biggest issue is finishing the whole rsync phase as fast as possible, to
keep the load on the backed-up servers low. I'm running rsync in the middle
of the night; running the backups serially (I have to back up 98 servers)
would result in some rsyncs still running during the day. This is not what I want.

"cp -al" is run on the backup server, so I don't care about its load.

As an improvement, I can run "rsnapshot sync" in parallel and, after all syncs,
run the daily rotation, even sequentially.

But with "link_dest" the whole "cp" phase should be avoided, right?
Post by Scott Hess
Only the first level benefits from sync. I always run the longer-period
rsnapshot calls first, because when combined with use_lazy_deletes, those
runs can be relatively tightly constrained in runtime (though I leave a gap
to my next sync, to let all the lazy deletes clear out).
OK, but rotating before sync could delete backups (based on retention)
even if the following sync doesn't complete.
Post by Scott Hess
How and where? Why not just have the script log somewhere directly?
Yes, that is the best solution.
Post by Scott Hess
Running hardlink every time is probably going to take a long time and not do
much most of the time. If you are frequently generating duplicate data
across hosts (like from a script or something), you might either explore
whether you actually need to backup that data, or whether you can dedupe in
targeted directories. I will dedupe periodically either when I'm doing some
other maintenance or when I notice disk usage seems up.
I don't want to dedupe across hosts (at the moment).
I would like to dedupe across multiple levels for the same host, for example:

daily.3 shares almost 90% of its files with daily.2

I tried running "hardlink" across all backups for one host and saved 12GB.
Post by Scott Hess
I just dedupe across my entire volume. The premise of rsnapshot means that
hardlinks are fine, if they weren't the entire thing would fall over. Just
make sure you aren't deduping while rsnapshot is running. On most systems
you cannot hardlink across volumes.
So, do you dedupe across daily.1, then daily.2 and so on, or across all
backup levels at
the same time? For example:

hardlink myserver/daily.1
hardlink myserver/daily.2

or

hardlink myserver/*
Scott Hess
2016-04-30 22:04:53 UTC
Permalink
On Sat, Apr 30, 2016 at 3:21 AM, Gandalf Corvotempesta <
Post by Gandalf Corvotempesta
Post by Scott Hess
Meta-comment: Often I/O bandwidth or seeks are the limiting factor, and
running things in parallel will convert 5 serial 10-minute jobs into 5
parallel 60-minute jobs. Make sure you're testing for that in your case.
Especially the cp -al and rm -rf phases are unlikely to enjoy competing
with
Post by Scott Hess
each other for resources, whereas rsync phases could plausibly timeshare
with each other.
This is something that I've not considered, you are right.
What if i'll use "link-dest" and not "cp -al" ? it should avoid the cp phase.
I/O is I/O, you can shift it around, but this won't get rid of it.
Post by Gandalf Corvotempesta
My biggest issue is to finish all rsync phase as fast as possible, to keep
load on backupped server low. I'm running rsync in the middle of the night,
running backup serially (I have to backup 98 servers) will result having
some rsync running during the day. This is not what I want.
Do you want to run as fast as possible on the system being backed up?
Because in that case it will be faster to back up a single server at a
time, so that backups don't contend with each other on I/O.

Do you want to run with as little load as possible? Because in that case
running as fast as possible is likely to run at a higher load than
spreading things out.

Your last couple lines make it sound like you're having an issue where the
overall backup is taking too long. Rather than trying to parallelize 98
backups, I'd scan rsnapshot.log and look to see if there aren't five or six
backups which are causing the majority of the problems. Probably if you
take those five or six and put them in a separate rsnapshot flow, you can
end up with 2 parallel systems which modestly compete with each other,
rather than trying to have a single system with weird performance issues.
In that case you could even consider doing things like putting one set of
servers on a different volume so they aren't stepping on each other's toes
I/O wise.

"cp -al" is run on backup server, thus I don't care about load.
The "cp -al" pass runs before the rsync, so on the backup server in
isolation. Using link-dest would push the "cp -al" I/O into the rsync
itself, so the rsync will likely take longer.

As improvement, I can run "rsnapshot sync" in parallel, and after all syncs,
Post by Gandalf Corvotempesta
I can run the daily rotation, even sequeantially.
Running the sync in parallel should mostly mean running all of the rsync
calls in parallel, so that sounds like what you need.
Post by Gandalf Corvotempesta
But with "link-dest" the whole "cp" phase should be avoided, right ?
You avoid the point-in-time "cp -al" phase. You cannot avoid the I/O
overhead of creating the directory structure and populating it with
hardlinks.
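
Roughly what link_dest changes, expressed as a plain rsync call (a sketch
with assumed paths, not rsnapshot's exact command line): unchanged files
get hard-linked against the previous snapshot by rsync itself rather than
being pre-populated with "cp -al".

rsync -a --delete --numeric-ids \
    --link-dest=/var/backups/rsnapshot/server1/daily.0/ \
    rsync://***@server1/everything/ \
    /var/backups/rsnapshot/server1/.sync/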
Post by Gandalf Corvotempesta
Only the first level benefits from sync. I always run the longer-period
Post by Scott Hess
rsnapshot calls first, because when combined with use_lazy_deletes, those
runs can be relatively tightly constrained in runtime (though I leave a
gap
Post by Scott Hess
to my next sync, to let all the lazy deletes clear out).
Ok, but rotating before sync could lead to deleted backups (based on retention)
even if the following sync is not completed.
Don't know what you're getting at. If sync is failing, you aren't backing
up current data, but you won't lose material amounts of data until sync has
been failing for weeks; don't let yourself get into that situation.
Post by Gandalf Corvotempesta
Running hardlink every time is probably going to take a long time and not do
Post by Scott Hess
much most of the time. If you are frequently generating duplicate data
across hosts (like from a script or something), you might either explore
whether you actually need to backup that data, or whether you can dedupe
in
Post by Scott Hess
targeted directories. I will dedupe periodically either when I'm doing
some
Post by Scott Hess
other maintenance or when I notice disk usage seems up.
I don't want to dedup across hosts (at the moment).
daily.3 has almost 90% of shared files with daily.2
I've tried to run "hardlink" across all backups for 1 host and saved 12GB
If you have a lot of identical files between daily.3 and daily.2 and they
aren't being hardlinked, then there's something wrong with your system.
The entire point of rsnapshot is that if a file doesn't change between two
backups, then it should be hardlinked.
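
Two quick diagnostics for that, as a sketch using the paths from the
original post (GNU find and du assumed):

# Files in daily.2 whose data is not shared with any other snapshot at
# all (link count 1); a large number here suggests hardlinking between
# snapshots is not working as expected.
find /var/backups/rsnapshot/server1/daily.2 -type f -links 1 | wc -l
# du counts each hard-linked file only once per invocation, so the
# second figure is roughly the data unique to daily.3 versus daily.2.
du -sh /var/backups/rsnapshot/server1/daily.2 /var/backups/rsnapshot/server1/daily.3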
Post by Gandalf Corvotempesta
I just dedupe across my entire volume. The premise of rsnapshot means that
Post by Scott Hess
hardlinks are fine, if they weren't the entire thing would fall over.
Just
Post by Scott Hess
make sure you aren't deduping while rsnapshot is running. On most
systems
Post by Scott Hess
you cannot hardlink across volumes.
So, you are dedupe across daily.1 then daily.2 and so on or across all
backup leves at
hardlink myserver/daily.1
hardlink myserver/daily.2
or
hardlink myserver/*
I dedupe across the entire volume, so that if hosts share files (say they
are comparable Ubuntu installs or something), they don't duplicate.

-scott
Nico Kadel-Garcia
2016-05-01 02:31:23 UTC
Permalink
Post by Scott Hess
On Sat, Apr 30, 2016 at 3:21 AM, Gandalf Corvotempesta
Post by Gandalf Corvotempesta
"cp -al" is run on backup server, thus I don't care about load.
The "cp -al" pass runs before the rsync, so on the backup server in
isolation. Using link-dest would push the "cp -al" I/O into the rsync
itself, so the rsync will likely take longer.
"cp -al" is lightning fast: it's talking only to the local filesystem,
not to any seprate "rsync" process which may have to intelligently
parse and complain file systems. Don't discard this unless you have
to.
Post by Scott Hess
Post by Gandalf Corvotempesta
As improvement, I can run "rsnapshot sync" in parallel, and after all syncs,
I can run the daily rotation, even sequeantially.
Running the sync in parallel should mostly mean running all of the rsync
calls in parallel, so that sounds like what you need.
Post by Gandalf Corvotempesta
But with "link-dest" the whole "cp" phase should be avoided, right ?
You avoid the point-in-time "cp -al" phase. You cannot avoid the I/O
overhead of creating the directory structure and populating it with
hardlinks.
Which is typically significantly faster than doing sync funkiness later.
Post by Scott Hess
Post by Gandalf Corvotempesta
Post by Scott Hess
Only the first level benefits from sync. I always run the longer-period
rsnapshot calls first, because when combined with use_lazy_deletes, those
runs can be relatively tightly constrained in runtime (though I leave a gap
to my next sync, to let all the lazy deletes clear out).
Ok, but rotating before sync could lead to deleted backups (based on retention)
even if the following sync is not completed.
Do the fast ones first. If the slow ones run long, the short ones will
be missed or delayed, and if parallelized they will overlap in I/O,
bandwidth, and CPU resources. If you do the fast ones first, they're
out of the way and the long ones can use reasonably configured lock
files to say "I haven't finished, I'll do the next update!!"
Post by Scott Hess
Don't know what you're getting at. If sync is failing, you aren't backing
up current data, but you won't lose material amounts of data until sync has
been failing for weeks, don't let yourself get into that situation.
Post by Gandalf Corvotempesta
Post by Scott Hess
Running hardlink every time is probably going to take a long time and not do
much most of the time. If you are frequently generating duplicate data
across hosts (like from a script or something), you might either explore
whether you actually need to backup that data, or whether you can dedupe in
targeted directories. I will dedupe periodically either when I'm doing some
other maintenance or when I notice disk usage seems up.
I don't want to dedup across hosts (at the moment).
daily.3 has almost 90% of shared files with daily.2
I've tried to run "hardlink" across all backups for 1 host and saved 12GB
If you have a lot of identical files between daily.3 and daily.2 and they
aren't being hardlinked, then there's something wrong with your system. The
entire point of rsnapshot is that if a file doesn't change between two
backups, then it should be hardlinked.
Try hardlinking *one* backup of one host. You may find considerable
amounts of file duplication, such as duplicate copies of multiple
source trees for the same person working on the same host. I've had
this happen when working on kernels or other bulky source trees.

It can also be dangerous as futz to restore from this, because you can
create hard links *inside* source trees which *should* be separate, and
where writing to one copy modifies both source trees. Been there, done
that, even seen it with identical files inside MySQL databases where
multiple databases had the same MyISAM tables copied into them
directly. Hilarity ensued.....
Post by Scott Hess
Post by Gandalf Corvotempesta
Post by Scott Hess
I just dedupe across my entire volume. The premise of rsnapshot means that
hardlinks are fine, if they weren't the entire thing would fall over.
Just
make sure you aren't deduping while rsnapshot is running. On most systems
you cannot hardlink across volumes.
So, you are dedupe across daily.1 then daily.2 and so on or across all
backup leves at
hardlink myserver/daily.1
hardlink myserver/daily.2
or
hardlink myserver/*
I dedupe across the entire volume, so that if hosts share files (say they
are comparable Ubuntu installs or something), they don't duplicate.
-scott
See above for some of the risks. I *hope* whatever restoration
techniques you use are intelligent enough to break, or set, hardlinks
as you need, because there are dangers.
Gandalf Corvotempesta
2016-05-01 16:32:46 UTC
Permalink
Post by Nico Kadel-Garcia
"cp -al" is lightning fast: it's talking only to the local filesystem,
not to any seprate "rsync" process which may have to intelligently
parse and complain file systems. Don't discard this unless you have to.
The rsnapshot man page advises using link_dest.
BTW I can try using "cp -al" and see if the next rsync takes less time.
Post by Nico Kadel-Garcia
Try hardlinking *one* backup of one host. You may find considerable
amounts of file duplication, such as duplicate copies of multiple
source trees for the same person working on the same host. I've had
this happen when working on kernels or other bulky source trees. It
can also be dangerous as futz to restore from this, because you can
hardcode links *inside* source trees which *should* be broken, and
where copying to one modifies both source trees. Been there, done
that, even seen it with identical files inside mysql databases where
they multiple databases had the same MyISAM tables copied into them
directly. Hilarity ensued.....
I don't see the issue.
If identical MyISAM tables are copied, they can be hardlinked.
On the next backup, if the tables differ, the new MyISAM files are
backed up and not hardlinked.
Post by Nico Kadel-Garcia
See above for some of the risks. I *hope* whatever restoration
techniques you use are intelligent enough to break, or set, hardlinks
as you need, because there are dangers.
Isn't rsync smart enough to restore the file *content* and not the
hardlink?
How can I distinguish a hardlink coming from the server from a hardlink
made by rsnapshot during the backup?
Hardlinks coming from the servers should be restored as-is; hardlinks made
by rsnapshot should
be resolved and restored as single files.

On the backup server they are both hard links....
Gandalf Corvotempesta
2016-05-02 07:33:00 UTC
Permalink
Post by Scott Hess
The "cp -al" pass runs before the rsync, so on the backup server in
isolation. Using link-dest would push the "cp -al" I/O into the rsync
itself, so the rsync will likely take longer.
I don't think it's lightning fast.
I tried yesterday and "cp -al" took more than 2 hours, plus the
rsync phase (about 1 hour).
rsync with link-dest took 1 hour and 20 minutes.

I'll see tonight; yesterday's backup ran "cp -al" from daily.0 to
.sync and, after everything else, "cp -al" from .sync to daily.0.

Probably that's because, with link-dest, there was no .sync directory.
Let's see tonight.
Gandalf Corvotempesta
2016-05-01 16:21:06 UTC
Permalink
Post by Scott Hess
Do you want to run as fast as possible on the system being backed up?
Because in that case it will be faster to back up a single server at a
time, so that backups don't contend with each other on I/O.
Do you want to run with as little load as possible? Because in that
case running as fast as possible is likely to run at a higher load
than spreading things out.
In my use case, I've seen that running one backup at a time would require
the whole night plus half the day.
Running 5 backups in parallel, starting at midnight, finishes at about
7:30-8:00.
Post by Scott Hess
Your last couple lines make it sound like you're having an issue where
the overall backup is taking too long. Rather than trying to
parallelize 98 backups,
I don't parallelize 98 backups. I parallelize 5 to 10 backups (still
trying to find the best value).
I have 98 backups to do with this server, not 98 backups in parallel.
Post by Scott Hess
The "cp -al" pass runs before the rsync, so on the backup server in
isolation. Using link-dest would push the "cp -al" I/O into the rsync
itself, so the rsync will likely take longer.
Running "cp -al" before the rsync would lead to a delay running rsync,
thus i'll end with backups running in the morning.
Post by Scott Hess
Don't know what you're getting at. If sync is failing, you aren't
backing up current data, but you won't lose material amounts of data
until sync has been failing for weeks, don't let yourself get into
that situation.
Not really.
The standard rsnapshot process will start rotating all backups; rotation
will remove the expired ones.
I need 2 weekly and 14 daily backups.
When rsnapshot moves daily.13 to weekly.0, the older weekly.1 is
removed.
If the next sync fails, weekly.1 is still gone.
With my scheduler, if sync fails, all rotations are stopped and thus I
preserve the whole backup tree.
Post by Scott Hess
I dedupe across the entire volume, so that if hosts share files (say
they are comparable Ubuntu installs or something), they don't duplicate.
Good idea. 90% of my servers are Debian 7.
Scott Hess
2016-05-01 17:12:34 UTC
Permalink
On Sun, May 1, 2016 at 9:21 AM, Gandalf Corvotempesta <
Post by Scott Hess
Do you want to run as fast as possible on the system being backed up?
Because in that case it will be faster to back up a single server at a
time, so that backups don't contend with each other on I/O.
Do you want to run with as little load as possible? Because in that case
running as fast as possible is likely to run at a higher load than
spreading things out.
In my use-case, i've seen that running 1 backup per time would require the
whole night and half day
Running 5 backups in parallels, starting at midnight, will end at about
7:30-8:00
So your goal is to get the backups done before a particular time, but you
don't actually care about load on the target server if that happens?

For most reasonable backup servers, you should have more than enough CPU to
run parallel backups, but you might want to pay attention to having enough
I/O capacity and memory. I'd be nervous about running parallel backups on
the same spindles, unless you have something which nicely spreads the load
across spindles.

The "cp -al" pass runs before the rsync, so on the backup server in
Post by Scott Hess
isolation. Using link-dest would push the "cp -al" I/O into the rsync
itself, so the rsync will likely take longer.
Running "cp -al" before the rsync would lead to a delay running rsync,
thus i'll end with backups running in the morning.
Then run it earlier?

Using sync_first, the "cp -al" phase happens during rotation (hourly,
daily, etc), while the rsync happens during sync.

Don't know what you're getting at. If sync is failing, you aren't backing
Post by Scott Hess
up current data, but you won't lose material amounts of data until sync has
been failing for weeks, don't let yourself get into that situation.
Not really.
standard rsnapshot process will start rotating all backups. rotation will
remove the expired ones.
I need 2 weekly and 14 daily backups.
On daily.13, when rshapshot move that to weekly.0, the older weekly.1 is
removed.
If next sync will fail, weekly.1 is still gone.
With my scheduler, if sync fails, all rotation are stopped and thus i
preserve the whole backup tree.
If everything was successful, you'd have deleted the oldest backup, so
obviously you don't really care about any data which is only present in
that backup. Having 14 backup directories rather than 13 isn't more
successful or less successful; having sync breaking means you are not
backing up current data, which is definitely a problem.

Put another way, if you have things like "rotate then sync" and the sync
fails, you should raise all the alarms you have and deal with that
IMMEDIATELY. If you have things like "sync then rotate" and the sync
fails, you should raise all the alarms you have and deal with that
IMMEDIATELY. There's really not much difference, here, IMHO. If someone
wants to restore something from yesterday and you find that you have a
complete set of 14 daily backups that are from 6 weeks ago, nobody is going
to be happy about that.

-scott
Gandalf Corvotempesta
2016-05-01 17:35:46 UTC
Permalink
Post by Scott Hess
So your goal is to get the backups done before a particular time, but
you don't actually care about load on the target server if that happens?
Yes and no.
Having the backups run until 07:00 in the morning results in low
load on the servers and almost zero I/O contention with other services.
Running during the day, even with "nice 19" and "ionice idle", results
in the backup taking the whole day while still causing high load.
Post by Scott Hess
For most reasonable backup servers, you should have more than enough
CPU to run parallel backups, but you might want to pay attention to
having enough I/O capacity and memory. I'd be nervous about running
parallel backups on the same spindles, unless you have something which
nicely spreads the load across spindles.
I'm using hardware RAID-10
I/O is spread across 12 disks (6+6)
Post by Scott Hess
Then run it earlier?
From 08:00 to 23:00 i can't.
Post by Scott Hess
Put another way, if you have things like "rotate then sync" and the
sync fails, you should raise all the alarms you have and deal with
that IMMEDIATELY. If you have things like "sync then rotate" and the
sync fails, you should raise all the alarms you have an deal with that
IMMEDIATELY. There's really not much difference, here, IMHO. If
someone wants to restore something from yesterday and you find that
you have a complete set of 14 daily backups that are from 6 weeks ago,
nobody is going to be happy about that.
Let's assume I'm on vacation or unable to deal immediately with the
issue for whatever reason.
In the first case, I'm still losing restore points: rotation still
happens, older backups are deleted and so on.
In the second case, new backups are not made due to the issue, but the
older ones are still there.

But, yes, due to the nature of rsnapshot, having 13 days or 14 doesn't
make any difference.

One question: is daily.13 a "standalone" backup? Can I move it to
another server, to a DVD, or somewhere else and still have the whole
backup available? Can I remove everything up to daily.13
(daily.0...12) and still have daily.13 available and complete?

In Bacula, removing the "full" backup means breaking everything else
after it. Probably this is why I'm trying to stop the rotation if a backup
is failing. Rotating/purging a Bacula volume could leave all backups
broken, for example:

full
incr
incr
incr
incr
incr
differential
incr
incr
incr
incr
full not executed properly and backup stopped here for some days

The last full is not completed, but Bacula still expires the older
ones based on the retention period.
When the first full is purged, you have lost ALL backups for that server.
If you made a mistake in the rotation period, you could lose all backups
with just one failure. Probably this can't happen with rsnapshot, as all
backups are hardlinked and not based on the previous one.
Scott Hess
2016-05-01 18:30:56 UTC
Permalink
On Sun, May 1, 2016 at 10:35 AM, Gandalf Corvotempesta <
Post by Gandalf Corvotempesta
Post by Scott Hess
For most reasonable backup servers, you should have more than enough CPU
to run parallel backups, but you might want to pay attention to having
enough I/O capacity and memory. I'd be nervous about running parallel
backups on the same spindles, unless you have something which nicely
spreads the load across spindles.
I'm using hardware RAID-10
I/O is spread across 12 disks (6+6)
Read I/O across 12, write I/O across 6, that probably gives you headroom to
run a certain number of parallel rsync processes without too much
backpressure between them. You may want to monitor timestamps on your
.sync directories, though, to make certain they don't suddenly start rising
all at the same time.

Then run it earlier?
Post by Gandalf Corvotempesta
From 08:00 to 23:00 i can't.
You can't run the sync pass (the one full of rsyncs), or you can't run the
rotation (cp -al and rm -rf)?

If this is driven by a need to maintain 14 days of backups, then why not
run the rotation a little earlier and retain 15 days? When using
sync_first, the rotation only operates on the backup server, no rsync to
remote servers.

Put another way, if you have things like "rotate then sync" and the sync
Post by Gandalf Corvotempesta
Post by Scott Hess
fails, you should raise all the alarms you have and deal with that
IMMEDIATELY. If you have things like "sync then rotate" and the sync
fails, you should raise all the alarms you have an deal with that
IMMEDIATELY. There's really not much difference, here, IMHO. If someone
wants to restore something from yesterday and you find that you have a
complete set of 14 daily backups that are from 6 weeks ago, nobody is going
to be happy about that.
Let's assume I'm on vacation or i'm unable to deal immediatly with the
issue for whatever reason.
In the first case, i'm still loosing restore points. rotation still
happens, older backups are deleted and so on.
In the second case, new backups are not made due to the issue, but the
older one are still there.
But, yes, due to the nature of rsnapshot, having 13 days or 14 doesn't
make any difference.
I guess in the end the easiest approach, having the system grind to a stop
if sync fails, is the way you should go with.

BUT, keep in mind that with that many systems to backup, it seems likely
that sync is going to fail sometimes for some systems, so the question is
what to do when one system's sync is failing but the rest are fine?

One question: is daily.13 a "standalone" backup? Can I move it to another
Post by Gandalf Corvotempesta
server, to a dvd, or something else and still have the whole backup
available ? Can I remove everything up to daily.13 (daily.0....12) and
still have the daily.13 available and complete?
Each snapshot should be a complete copy of the directory structure, with
all unique files uniquely present, and all files shared with the previous
backup hardlinked. So in my snapshot root, I can do:

# ls -li daily.?/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.0/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.1/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.2/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.3/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.4/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.5/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.6/nbackup/bin/ls

They're all the same file (other refs are in 4x hourly.?, 5x weekly.?, and
.sync). I could copy or move daily.3 to a different disk, and the files in
all the others would still be present. There are no incrementals.

-scott
Gandalf Corvotempesta
2016-05-02 07:43:53 UTC
Permalink
Post by Scott Hess
I guess in the end the easiest way to have the system grind to a stop if
sync fails is the way you should go with.
BUT, keep in mind that with that many systems to backup, it seems likely
that sync is going to fail sometimes for some systems, so the question is
what to do when one system's sync is failing but the rest are fine?
If you read the code that I've posted, you can see that I'm looking
for "sync: completed successfully" for every
host. In case of failure, only that host is "frozen". That's why I'm
using a wrapper script to run rsnapshot in parallel,
rather than running rsnapshot directly.

After each backup (the script is run for each host), I look for
"sync: completed successfully" in the log file.
If found, I start the rotations. If not, I exit and send an email,
but just for that single host.
Post by Scott Hess
Each snapshot should be a complete copy of the directory structure, with all
unique files uniquely present, and all files shared with the previous backup
# ls -li daily.?/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.0/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.1/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.2/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.3/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.4/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.5/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.6/nbackup/bin/ls
They're all the same file (other refs are in 4x hourly.?, 5x weekly.?, and
.sync). I could copy or move daily.3 to a different disk, and the files in
all the others would still be present. There are no incrementals.
That's clear.
What I would like to know (or better, would like to have confirmed) is:

Can I copy "daily.3" somewhere else? In that case, daily.3 would be
a complete copy of my backed-up server, like a "full" backup, right?
Can I delete EVERYTHING except daily.3 and still have daily.3
fully available?

Like my example with Bacula: deleting a full in Bacula makes
all subsequent backups unavailable. With hardlinks this should
never happen, as
hardlinked files are removed only when the link count reaches 0. Deleting
everything but one directory leaves the link count >= 1, and thus
the whole backup is still available.
Nico Kadel-Garcia
2016-05-04 13:45:10 UTC
Permalink
On Mon, May 2, 2016 at 3:43 AM, Gandalf Corvotempesta
Post by Gandalf Corvotempesta
Post by Scott Hess
I guess in the end the easiest way to have the system grind to a stop if
sync fails is the way you should go with.
BUT, keep in mind that with that many systems to backup, it seems likely
that sync is going to fail sometimes for some systems, so the question is
what to do when one system's sync is failing but the rest are fine?
If you read the code that i've posted, you can see that I'm looking
for "sync: completed successfully" for every
host. In case of failure, only that host is "freezed". That's why i'm
using a wrapper script to run rsnapshot in parallel
and not directly rsnapshot
After each backup (the script is run for each host), i'll look for
"sync: completed successfully" in log file.
If found, i'll start rotations. If not, i'll exit and send an email.
But just for that single host.
Post by Scott Hess
Each snapshot should be a complete copy of the directory structure, with all
unique files uniquely present, and all files shared with the previous backup
# ls -li daily.?/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.0/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.1/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.2/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.3/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.4/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.5/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.6/nbackup/bin/ls
They're all the same file (other refs are in 4x hourly.?, 5x weekly.?, and
.sync). I could copy or move daily.3 to a different disk, and the files in
all the others would still be present. There are no incrementals.
That's clear.
can I copy "daily.3" to somewhere else? In that case, daily.3 would be
a total copy of my backupped server, like a "full" backup, right?
can I delete EVERYTHING except daily.3 and still having the daily.3
fully available ?
Yup. If you're a weasel, you can even move it aside and put it back
later, as long as you put it back in the right order with any newly
rotated backups.

Copying a large daily.3 snapshot aside, and making sure it's
consistent, can be tricky if the copy rotates under you. This is why
I've long wanted to change the numbering scheme from "daily.0",
"daily.1", etc. to "daily.20160401010203", "daily.20160402113433", to
use full UTC-compatible YYYYMMDDhhmmss date stamped names. But I've
never gotten the traction to write and submit a patch.
Post by Gandalf Corvotempesta
Like my example with "Bacula": deleting a full in bacula will result
in all subsequent backups unavailable. With hardlinks this should
never happens, as
hardlinked files are removed only when link count reach to 0. Deleting
everything but 1 directory, will result in link count >= 1 and thus
the whole backup is still available.
Not a problem with rsnapshot and hardlinked copies.
Patrick O'Callaghan
2016-05-04 15:22:25 UTC
Permalink
Post by Nico Kadel-Garcia
Copying a large daily.3 snapshot aside, and making sure it's
consistent, can be tricky if the copy rotates under you. This is why
I've long wanted to change the numbering scheme from "daily.0",
"daily.1", etc. to "daily.20160401010203", "daily.20160402113433", to
use full UTC compatbile YYYYMMDDhhmmss date stamped names. But I've
never gotten the traction to write and submit a patch.,
That would be a great idea. I always have to stop and remember if daily.0
is the most recent or the least recent. And having a timestamp would add
useful information that's otherwise buried in log files.

poc
Rolf Muth
2016-05-04 17:01:13 UTC
Permalink
On Wednesday, 4 May 16 at 15:45:10, "Nico Kadel-Garcia" wrote
Post by Nico Kadel-Garcia
...
I've long wanted to change the numbering scheme from "daily.0",
"daily.1", etc. to "daily.20160401010203", "daily.20160402113433", to
use full UTC compatbile YYYYMMDDhhmmss date stamped names. But I've
never gotten the traction to write and submit a patch.,
For this purpose I wrote do-timed-folders and do-del-timed-folders ...

http://www.heise.de/download/do-rsnapshots-1184971.html
--
Kind regards!
Rolf Muth
My addresses may not be used for advertising!
OpenPGP Public Key:
http://pgp.uni-mainz.de:11371/pks/lookup?op=index&search=0x5544C89A
Tapani Tarvainen
2016-05-05 05:36:57 UTC
Permalink
Post by Nico Kadel-Garcia
Copying a large daily.3 snapshot aside, and making sure it's
consistent, can be tricky if the copy rotates under you.
One solution is to use a filesystem snapshot, if you're using LVM
or a filesystem that supports snapshots itself (ZFS, Btrfs):
create a snapshot, copy daily.3 (or whatever) out, destroy the
snapshot.
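
A minimal sketch of that, assuming the snapshot root lives on an LVM
logical volume /dev/vg0/backups mounted at /var/backups (all names and
sizes here are made up):

lvcreate --snapshot --size 10G --name backups_snap /dev/vg0/backups
mkdir -p /mnt/backups_snap
mount -o ro /dev/vg0/backups_snap /mnt/backups_snap
# Copy one frozen snapshot tree out, preserving hard links inside it.
rsync -aH /mnt/backups_snap/rsnapshot/server1/daily.3/ /mnt/export/server1-daily.3/
umount /mnt/backups_snap
lvremove -f /dev/vg0/backups_snap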
Post by Nico Kadel-Garcia
This is why I've long wanted to change the numbering scheme from
"daily.0", "daily.1", etc. to "daily.20160401010203",
"daily.20160402113433", to use full UTC compatbile YYYYMMDDhhmmss
date stamped names.
That would have its advantages, but it'd not be a trivial change.
--
Tapani Tarvainen
Gandalf Corvotempesta
2016-05-05 09:52:15 UTC
Permalink
Post by Scott Hess
Each snapshot should be a complete copy of the directory structure, with all
unique files uniquely present, and all files shared with the previous backup
# ls -li daily.?/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.0/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.1/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.2/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.3/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.4/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.5/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.6/nbackup/bin/ls
They're all the same file (other refs are in 4x hourly.?, 5x weekly.?, and
.sync). I could copy or move daily.3 to a different disk, and the files in
all the others would still be present. There are no incrementals.
Let's assume that everything goes very badly: a bad sysadmin deleted
multiple backups and so on.
I end up with only a single, older backup, for example daily.5.
No ".sync" or anything else. Only daily.5.

"daily.5" should be fully usable, as all hardlinked files are
still retained (link count > 0, so the files are still there).
What would happen if I copied daily.5 to a brand new ".sync" dir and
ran rsnapshot again? Only the differences between daily.5 and the current
data would be transferred, right?
Thus, in this case, I'd still be able to do an incremental backup
starting from a very old snapshot. Is this true?

This could be very useful with larger servers. If a full backup
would be 300GB and an incremental just a few hundred MB, the
ability to
run an incremental from an older backup would let me transfer
just a few MB instead of 300GB.
Scott Hess
2016-05-05 14:33:47 UTC
Permalink
I'm thinking that at some point you're just going to have to dig in there
and figure out how rsync works. You've thrown out a number of hypothetical
situations on this thread, and at this point I'm worried that in three
months you'll come back and say "But you said _this_ would work, and you
said _that_ would work, but we lost our backups!"

The basic operation of rsnapshot is not super complicated, you can see what
it's doing right in the log files. I have in the past manually constructed
a new backup target out of pieces of other backup targets, because I wanted
to move a volume between servers and couldn't afford the space to backup
the volume in two places. It's not that hard. If you're worried about a
particular case, spin up a Linux server, build a trivial example structure
or three to backup, setup cron to run backups every five minutes for a few
hours, then get in there and experiment.

-scott


On Thu, May 5, 2016 at 2:52 AM, Gandalf Corvotempesta <
Post by Gandalf Corvotempesta
Post by Scott Hess
Each snapshot should be a complete copy of the directory structure, with
all
Post by Scott Hess
unique files uniquely present, and all files shared with the previous
backup
Post by Scott Hess
# ls -li daily.?/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.0/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.1/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.2/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.3/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.4/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.5/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.6/nbackup/bin/ls
They're all the same file (other refs are in 4x hourly.?, 5x weekly.?,
and
Post by Scott Hess
.sync). I could copy or move daily.3 to a different disk, and the files
in
Post by Scott Hess
all the others would still be present. There are no incrementals.
Let's assume that everything is going very bad, a bad sysadmin deleted
multiple backups and so on.
I'll end with having only a single, older, backup, in example: daily.5.
No ".sync" or anything else. Only daily.5
"daily.5" should be totally available as all hardlinked files are
still retained (due to link count > 0 thus files are still there)
What would happen by copying daily.5 to brand new ".sync" dir and
running rsnapshot again ? Only differences between daily.5 and current
backups are transfered, right?
Thus, in this case, i'll still able to do an incremental backup
starting from a very old "snapshot". Is this true?
This could be very usefull, with larger servers. If a full backup
would be 300GB and incremental would be just some hundreds of MB, the
ability to
run an incremental from an older backup will allow me to not transfer
300GB but just some MB.
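
For reference, the scenario above would look something like this (a sketch
only, using the paths from the original post and assuming sync_first is
still enabled; as Scott says, try it on a throwaway setup first):

cd /var/backups/rsnapshot/server1
# Re-seed .sync from the surviving snapshot using hard links, so the
# next sync only has to transfer what changed since daily.5 was taken.
cp -al daily.5 .sync
/usr/bin/rsnapshot -c /etc/rsnapshot.d/hosts/server1.conf sync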
Gandalf Corvotempesta
2016-05-06 08:53:35 UTC
Permalink
Post by Scott Hess
I'm thinking that at some point you're just going to have to dig in there
and figure out how rsync works. You've thrown out a number of hypothetical
situations on this thread, and at this point I'm worried that in three
months you'll come back and say "But you said _this_ would work, and you
said _that_ would work, but we lost our backups!"
Absolutely not. This is not the case.

I'm trying to replace a Bacula backup environment (too many things to
check: huge databases (mine is 980GB), too many backup levels, too
prone to failures, ...) with
a smarter system like rsnapshot, where everything can be summarized
as "rsync + cp + mv". No backup levels, no databases, files always
available for restore (and even for inspection, like "cat my_lost_file").

I would like to copy the weekly backups to a tape library. Obviously I
need to resolve all hardlinks when copying, or I'll end up with
inconsistent data in case of a restore.
My question is: how can I distinguish a hardlink made by rsnapshot from a
hardlink that was on the source server? rsnapshot's hardlinks must be
dereferenced when copying to tape; the original hardlinks must be preserved to
restore the system in its original state.

For example:

on source server:
$ ln file1 file2
$ echo 'test' > file3

When running multiple backups, file3 will be hardlinked between
".sync" and "daily.N". "file1" and "file2" are already hardlinked.

When copying, I have to copy the *content* of "file3", dereferencing
it, but still preserve the hardlink between file1 and file2.

I hope this is clear.
Martin Schröder
2016-05-06 09:30:34 UTC
Permalink
2016-05-06 10:53 GMT+02:00 Gandalf Corvotempesta
Post by Gandalf Corvotempesta
My question is: how can I detect an hardlink made by rsnapshot from an
hardlink that was on the source server ? rsnapshot's hardlink must be
You can not: rsnapshot does not preserve hard links _in_ the data to backup.

Best
Martin
Gandalf Corvotempesta
2016-05-06 09:52:46 UTC
Permalink
Post by Martin Schröder
You can not: rsnapshot does not preserve hard links _in_ the data to backup.
So, in my example, "file1" and "file2" would be copied as separate files and
not hardlinked together?
Thus, all hardlinks that I can see in a backup directory are from
rsnapshot and not from the source server.

Doing a restore would result in much more used space than on the original
server, as all hardlinked files are resolved and not preserved. Right?
Martin Schröder
2016-05-06 09:55:51 UTC
Permalink
2016-05-06 11:52 GMT+02:00 Gandalf Corvotempesta
Post by Gandalf Corvotempesta
Post by Martin Schröder
You can not: rsnapshot does not preserve hard links _in_ the data to backup.
So, in my example, "file1" and "file2" would be copied as files and
not hardlinked together ?
Thus, all hardlinks that I can see in a backup directory, are from
rsnapshot and not from the source server.
AFAIK yes.
Post by Gandalf Corvotempesta
Doing a restore would result in much more used space than the original
server, as all hardlinked files are resolved and not preserved. Right
?
s/would/could/, but yes.

Best
Martin
Patrick O'Callaghan
2016-05-06 10:17:55 UTC
Permalink
On 6 May 2016 at 10:52, Gandalf Corvotempesta <
Post by Gandalf Corvotempesta
Post by Martin Schröder
You can not: rsnapshot does not preserve hard links _in_ the data to
backup.
So, in my example, "file1" and "file2" would be copied as files and
not hardlinked together ?
Thus, all hardlinks that I can see in a backup directory, are from
rsnapshot and not from the source server.
Doing a restore would result in much more used space than the original
server, as all hardlinked files are resolved and not preserved. Right
?
Yes. I suspect this is a problem with all backup systems that rely on hard
links to eliminate duplicates (including Apple's Time Machine AFAIK). They
work reasonably well as long as the source filesystem doesn't have a lot of
hard links in it, but restores will produce additional data copies unless
you run a separate dedupe process.

poc
Nico Kadel-Garcia
2016-05-06 10:33:44 UTC
Permalink
On Fri, May 6, 2016 at 6:17 AM, Patrick O'Callaghan
Post by Patrick O'Callaghan
On 6 May 2016 at 10:52, Gandalf Corvotempesta
Post by Gandalf Corvotempesta
Post by Martin Schröder
You can not: rsnapshot does not preserve hard links _in_ the data to backup.
So, in my example, "file1" and "file2" would be copied as files and
not hardlinked together ?
Thus, all hardlinks that I can see in a backup directory, are from
rsnapshot and not from the source server.
Doing a restore would result in much more used space than the original
server, as all hardlinked files are resolved and not preserved. Right
?
Yes. I suspect this is a problem with all backup systems that rely on hard
links to eliminate duplicates (including Apple's Time Machine AFAIK). They
work reasonably well as long as the source filesystem doesn't have a lot of
hard links in it, but restores will produce additional data copies unless
you run a separate dedupe process.
A "restore" of an individual snapshot would have few internal
snapshots, anywhere. Restoring multiple snapshots..... gets a bit
adventuresome to try and recreate the hardlink infrastructure.
Nico Kadel-Garcia
2016-05-06 10:32:06 UTC
Permalink
Post by Martin Schröder
2016-05-06 10:53 GMT+02:00 Gandalf Corvotempesta
Post by Gandalf Corvotempesta
My question is: how can I detect an hardlink made by rsnapshot from an
hardlink that was on the source server ? rsnapshot's hardlink must be
You can not: rsnapshot does not preserve hard links _in_ the data to backup.
Best
Martin
I beg your pardon? That is the "-H" option for rsync. It's not enabled
by default, but it's certainly available in the rsync_short_args.
Martin Schröder
2016-05-06 10:38:24 UTC
Permalink
Post by Nico Kadel-Garcia
I beg your pardon? That is the "-H" option for rsync. It's not enabled
by default, but it's certainly available in the rsync_short_args.
And how would it differentiate between hardlinks because of multiple
versions and because they are in the original data?

If we have three generations of file1 and file2 (which are hardlinked in
the original data), do file1 and file2 have three or six links?

Best
Martin
Tapani Tarvainen
2016-05-06 10:46:45 UTC
Permalink
Post by Martin Schröder
Post by Nico Kadel-Garcia
I beg your pardon? That is the "-H" option for rsync. It's not enabled
by default, but it's certainly available in the rsync_short_args.
And how would it differentiate between hardlinks because of multiple
versions and because they are in the original data?
Links in original data appear in the same directory tree
(.../daily.0/... &c), links created by rsnapshot are outside it.
Nothing else is needed to distinguish them.
Post by Martin Schröder
If we have three generations of file1 and file2 (which are hardlinked in
the original data), do file1 and file2 have three or six links?
Six, of course.

But when you restore from there, it'll only create two - those
that are in the directory tree you're restoring from.

Suggestion: try it.
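For example, a minimal sketch with plain rsync -H and --link-dest (roughly
what rsnapshot does when link_dest is enabled; all paths are made up):

mkdir -p src && echo data > src/file1 && ln src/file1 src/file2
rsync -aH src/ gen.2/
rsync -aH --link-dest=../gen.2 src/ gen.1/
rsync -aH --link-dest=../gen.1 src/ gen.0/
stat -c '%h %n' gen.*/file*    # every entry should show a link count of 6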
--
Tapani Tarvainen
Tapani Tarvainen
2016-05-06 10:25:42 UTC
Permalink
Post by Martin Schröder
rsnapshot does not preserve hard links _in_ the data to backup.
With default settings it does not, but you can explicitly ask it
to do so with, e.g.,

rsync_short_args -aH

in the configuration file.
--
Tapani Tarvainen
Patrick O'Callaghan
2016-05-06 11:12:56 UTC
Permalink
Post by Tapani Tarvainen
Post by Martin Schröder
rsnapshot does not preserve hard links _in_ the data to backup.
With default settings it does not, but you can explicitly ask it
to do so with, e.g.,
rsync_short_args -aH
in the configuration file.
However I recommend reading the rsync man page on the -H option. It lists
several cases in which the result may not be what you expect (not sure if
any of them apply to rsnapshot).

poc
Martin Schröder
2016-05-06 11:22:28 UTC
Permalink
Post by Patrick O'Callaghan
However I recommend reading the rsync man page on the -H option. It lists
several cases in which the result may not be what you expect (not sure if
any of them apply to rsnapshot).
"If you specify a --link-dest directory that contains hard links, the
linking of the destination files against the --link-dest files can
cause some paths in the destination to become linked together due to
the --link-dest associations."

That would apply to rsnapshot, right?

Best
Martin
Patrick O'Callaghan
2016-05-06 11:38:37 UTC
Permalink
Post by Martin Schröder
Post by Patrick O'Callaghan
However I recommend reading the rsync man page on the -H option. It lists
several cases in which the result may not be what you expect (not sure if
any of them apply to rsnapshot).
"If you specify a --link-dest directory that contains hard links, the
linking of the destination files against the --link-dest files can
cause some paths in the destination to become linked together due to
the --link-dest associations."
That would apply to rsnapshot, right?
I would imagine so, but it's not clear whether this is a bug or a
feature.

Note that none of this addresses the problem of *restoring* hard links. A
restore would have to use rsync -H again (or a dedupe process) to recover
the exact same state, and I suspect someone could find a counterexample
even then. Whether this matters or not depends on what we mean by "restore".

poc
David Cantrell
2016-05-09 12:57:52 UTC
Permalink
Post by Gandalf Corvotempesta
Absolutely not. This is not the case.
I'm trying to replace a Bacula backup environment (too many things to
check: huge databases (mine is 980GB), too many backup levels, too
prone to failures, ...) with
a smarter system like rsnapshot, where everything can be summarized
as: "rsync + cp + mv". No backup levels, no databases, files always
available for restore (and even for inspection, like "cat my_lost_file").
I would like to copy the weekly backups to a tape library. Obviously I
need to resolve all hardlinks when copying or I'll end up with
inconsistent data in case of a restore.
My question is: how can I tell a hardlink made by rsnapshot from a
hardlink that was on the source server? rsnapshot's hardlinks must be
dereferenced when copying to tape, while original hardlinks must be
preserved to restore the system to its original state.
Hard links will be preserved within a snapshot if you tell rsync to use
-S. Apart from that rsnapshot will never create a hard link between two
files within a snapshot.

Unfortunately the only way to tell the difference between those and the
hard links between snapshots is to grovel over the filesystem and
compare every single file's inode number and path.

Consider this:
inode number    filename
1               snapshot_root/weekly.0/a
1               snapshot_root/weekly.0/b
2               snapshot_root/weekly.0/c
3               snapshot_root/weekly.0/d
1               snapshot_root/weekly.1/a
1               snapshot_root/weekly.1/b
2               snapshot_root/weekly.1/c
4               snapshot_root/weekly.1/d

What this means is that on your source, a and b are hardlinks to each
other. You want to preserve that hard link. c has only one link on the
source, but hasn't changed recently so has extra links in your backups.
d also has only one link on the source, and has changed recently so only
has one link in your backups.
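A minimal sketch of that grovelling, assuming GNU find/awk and snapshots made
with rsync -H (so source hard links survive): any inode that appears more than
once *within* a single snapshot is a hard link that came from the source,
while inodes that only repeat across snapshots are rsnapshot's own links.

find snapshot_root/weekly.0 -xdev -type f -printf '%i %p\n' | sort -n |
awk '{
    # print the groups of paths that share an inode inside this one snapshot
    if ($1 == prev) { if (!shown[prev]++) print prevline; print }
    prev = $1; prevline = $0
}'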
--
David Cantrell | Minister for Arbitrary Justice

I remember when computers were frustrating because they did
exactly what you told them to. That seems kinda quaint now.
-- JD Baldwin, in the Monastery
David Cantrell
2016-05-09 13:50:38 UTC
Permalink
Post by David Cantrell
Hard links will be preserved within a snapshot if you tell rsync to use
-S.
That should of course be -H. -S is to tell it to create sparse files
when possible.
--
David Cantrell | A machine for turning tea into grumpiness

" In My Egotistical Opinion, most people's ... programs should be
indented six feet downward and covered with dirt. "
--Blair P. Houghton
Christopher Barry
2016-05-01 04:06:06 UTC
Permalink
On Sat, 30 Apr 2016 12:21:44 +0200
Post by Gandalf Corvotempesta
Post by Scott Hess
Meta-comment: Often I/O bandwidth or seeks are the limiting factor,
and running things in parallel will convert 5 serial 10-minute jobs
into 5 parallel 60-minute jobs. Make sure you're testing for that
in your case. Especially the cp -al and rm -rf phases are unlikely
to enjoy competing with each other for resources, whereas rsync
phases could plausibly timeshare with each other.
This is something that I've not considered; you are right.
What if I use "link-dest" instead of "cp -al"? It should avoid the cp phase.
My biggest issue is to finish all rsync phases as fast as possible, to
keep the load on the backed-up servers low. I'm running rsync in the middle of
the night; running the backups serially (I have to back up 98 servers) would
result in some rsync runs continuing into the day. This is not what I
want.
"cp -al" is run on the backup server, thus I don't care about load there.
As an improvement, I can run "rsnapshot sync" in parallel and, after all
syncs, run the daily rotation, even sequentially.
But with "link-dest" the whole "cp" phase should be avoided, right?
Post by Scott Hess
Only the first level benefits from sync. I always run the
longer-period rsnapshot calls first, because when combined with
use_lazy_deletes, those runs can be relatively tightly constrained
in runtime (though I leave a gap to my next sync, to let all the
lazy deletes clear out).
Ok, but rotating before sync could lead to deleted backups (based on
retention) even if the following sync is not completed.
Post by Scott Hess
How and where? Why not just have the script log somewhere
directly?
Yes, that is the best solution.
Post by Scott Hess
Running hardlink every time is probably going to take a long time
and not do much most of the time. If you are frequently generating
duplicate data across hosts (like from a script or something), you
might either explore whether you actually need to backup that data,
or whether you can dedupe in targeted directories. I will dedupe
periodically either when I'm doing some other maintenance or when I
notice disk usage seems up.
I don't want to dedupe across hosts (at the moment).
daily.3 shares almost 90% of its files with daily.2.
I've tried running "hardlink" across all backups for one host and saved 12GB.
Post by Scott Hess
I just dedupe across my entire volume. The premise of rsnapshot
means that hardlinks are fine, if they weren't the entire thing
would fall over. Just make sure you aren't deduping while rsnapshot
is running. On most systems you cannot hardlink across volumes.
So, do you dedupe across daily.1, then daily.2, and so on, or across all
backup levels at once:
hardlink myserver/daily.1
hardlink myserver/daily.2
or
hardlink myserver/*
The most important thing is to understand what truly needs to be backed
up. Looking at your exclusion lists (in a prior thread element) leads me
to believe that you might be backing up a bunch of stuff that's really
OS package data, and not your own specific unique data. Understand that
you will never restore an OS from this backup. Indeed, having any OS
data could create major issues for you during a restore operation later.

That said, if you are pulling a bunch of similar nodes to a central
host, maybe you should rethink what you really need. Adding another
disk to each box that only handles backups seriously ups the security
of the data, reduces backup time and resource requirements, while
eliminating network bandwidth consumption. From that backup, the
'previous' node numerically (current == node0008, previous == node0007)
could then pull current's backup, niced way down (bwlimit in rsync),
over the network.

Now, the data exists in three places. In this model there is no
centralized server or point of failure. 3 separate physical disks
scattered between 2 boxes would need to fail before you'd lose data.
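A minimal sketch of that peer pull (host names, paths and the bandwidth limit
are made up):

# node0007 pulls node0008's local backup disk, niced and bandwidth-limited
nice -n 19 ionice -c3 rsync -aH --bwlimit=5000 \
    node0008:/backup/local/ /backup/peers/node0008/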

Centralization is a double-edged sword. Sure, single point of
management, but if the central backup server craps out, everything is
at risk.

In almost every analysis of ensuring redundancy, a distributed system
wins.
--
Regards,
Christopher
Gandalf Corvotempesta
2016-05-01 16:44:40 UTC
Permalink
Post by Christopher Barry
The most important thing is to understand what truly needs to be
backed up. Looking at your exclusion lists (in a prior thread element)
leads me to believe that you might be backing up a bunch of stuff
that's really OS package data, and not your own specific unique data.
Understand that you will never restore an OS from this backup. Indeed,
having any OS data could create major issues for you during a restore
operation later.
Why?
Almost all of my servers are virtual.
In the past, when I had to restore a whole server, I cloned it from
another one and restored the whole backup, even the OS.

One of Linux's advantages is that everything is a file, thus everything
can be backed up and restored.
Post by Christopher Barry
That said, if you are pulling a bunch of similar nodes to a central
host, maybe you should rethink what you really need. Adding another
disk to each box that only handles backups seriously ups the security
of the data, reduces backup time and resource requirements, while
eliminating network bandwidth consumption. From that backup, the
'previous' node numerically (current == node0008, previous ==
node0007) could then pull current's backup, niced way down (bwlimit in
rsync), over the network. Now, the data exists in three places. In
this model there is no centralized server or point of failure. 3
separate physical disks scattered between 2 boxes would need to fail
before you'd lose data. Centralization is a double-edged sword. Sure,
single point of management, but if the central backup server craps
out, everything is at risk. In almost every analysis of ensuring
redundancy, a distributed system wins.
In my environment I have 6 backup servers, not 1.
What I've posted here is a simplification, just a POC of what I'm
working on: parallelism and dedup, not an exact mirror of my environment.

I'm backing up tons of small VMs, about 80-90 per backup server.
Most of them are almost identical (for example, varnish nodes 1,2,3,4,5,6
are all the same, DNS1,2,3,4 are identical, MX1,2,3,4 are identical, and
so on).
Backing them up entirely is, in my environment, easier than choosing what
to back up and what to ignore, because I have the same VM template with
the same rsync configuration cloned multiple times, and rsnapshot
automatically detects a new server (my host list in
/etc/rsnapshot.d/hosts/*.conf is created dynamically) and runs the backup
every night.

Filtering out some servers is a waste of time. It's much easier to run
something like:
$ ping -c1 ${ip}
across my subnets once a day and create the configuration file
automatically if the host is up
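A minimal sketch of such a generator, assuming a per-host template with a
@HOST@ placeholder (the template path, subnet and placeholder are made up):

#!/bin/bash
TEMPLATE=/etc/rsnapshot.d/host.conf.template
OUTDIR=/etc/rsnapshot.d/hosts
for ip in $(seq -f '10.0.0.%g' 1 254); do
    ping -c1 -W1 "$ip" >/dev/null 2>&1 || continue        # host down: skip
    host=$(getent hosts "$ip" | awk '{print $2}')         # reverse lookup
    [ -n "$host" ] || continue
    [ -e "${OUTDIR}/${host}.conf" ] && continue           # already configured
    sed "s/@HOST@/${host}/g" "$TEMPLATE" > "${OUTDIR}/${host}.conf"
done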
Christopher Barry
2016-05-01 19:35:46 UTC
Permalink
On Sun, 1 May 2016 18:44:40 +0200
Post by Gandalf Corvotempesta
Post by Christopher Barry
The most important thing is to understand what truly needs to be
backed up. Looking at your exclusion lists (in a prior thread
element) leads me to believe that you might be backing up a bunch of
stuff that's really OS package data, and not your own specific
unique data. Understand that you will never restore an OS from this
backup. Indeed, having any OS data could create major issues for you
during a restore operation later.
Why?
because you're unwittingly asking for pain.
Post by Gandalf Corvotempesta
Almost all of my servers are virtual.
In the past, when I had to restore a whole server, I cloned it from
another one and restored the whole backup, even the OS.
One of Linux's advantages is that everything is a file, thus everything
can be backed up and restored.
open files will bite you. versions will bite you.
Post by Gandalf Corvotempesta
Post by Christopher Barry
That said, if you are pulling a bunch of similar nodes to a central
host, maybe you should rethink what you really need. Adding another
disk to each box that only handles backups seriously ups the
security of the data, reduces backup time and resource requirements,
while eliminating network bandwidth consumption. From that backup,
the 'previous' node numerically (current == node0008, previous ==
node0007) could then pull current's backup, niced way down (bwlimit
in rsync), over the network. Now, the data exists in three places.
In this model there is no centralized server or point of failure. 3
separate physical disks scattered between 2 boxes would need to fail
before you'd lose data. Centralization is a double-edged sword.
Sure, single point of management, but if the central backup server
craps out, everything is at risk. In almost every analysis of
ensuring redundancy, a distributed system wins.
In my environment I have 6 backup servers, not 1.
and it sounds like you need them, given your use case and methodology.
Post by Gandalf Corvotempesta
What I've posted here is a simplification, just a POC of what I'm
working on: parallelism and dedup, not an exact mirror of my
environment.
I'm backing up tons of small VMs, about 80-90 per backup server.
Most of them are almost identical (for example, varnish nodes
1,2,3,4,5,6 are all the same, DNS1,2,3,4 are identical, MX1,2,3,4 are
identical, and so on).
Backing them up entirely is, in my environment, easier than choosing what
to back up and what to ignore, because I have the same VM template with
the same rsync configuration cloned multiple times, and rsnapshot
automatically detects a new server (my host list in
/etc/rsnapshot.d/hosts/*.conf is created dynamically) and runs the
backup every night.
Just a very 'sledgehammer' approach IMHO.


Initially I assumed we were dealing with a lot of baremetal servers.
Either way, my point is that your approach is not really scalable.

Consider your use of VMs here. You copy the same template OS image a
bunch of times, then change the configuration data inside each copy to
'personalize' them. Sans data, these images are likely 99.95%
identical. Apparently, you're also storing the node's data inside these
image files. All you really care about is what makes each node
unique.

A better way to have a bunch of identical varnish, or dns, or mail, or
whatever servers would be to PXE boot a single read-only generic vm
image as many times as required, likely using iPXE[1], with a writable
disk file as an overlay fs for configuration and/or data per node. This
writable overlay is the node's 'personality' if you will. You'd get the
node's identity via dhcp of course. You might also attach to and/or
mount additional needed unique data space from elsewhere using iscsi or
nfs, or some other method. The writable config images and other mounted
space is the data you'd backup with rsnapshot, never *any* of the OS
image itself this way.

You'd backup that base image just once somewhere. Update the image,
deploy it, and you've just updated *all* vms after they reboot. Try out
a completely new image with one of the existing config images, and
easily roll back if there's an issue. The bootable primary image is
read-only, so to recover from a compromise requires only a simple clean
reboot. Any nefarious code or data is easily viewable in the config
overlay image for forensics.

One good use of centralization though is a log server to catch all the
nodes' logs in a single place. That makes log analysis simple for all
nodes, and keeps the config images light.

You need parallel and de-duplication now because your fundamental
understanding of how best to deploy a lot of similar VM nodes is
insufficient. My suggestion was for you to rethink the underlying
problem, not how to best make up for a flawed design later. I'm not
trying to be confrontational Gandalf, just sharing hard-won knowledge
from a lifetime of systems administration. You are designing a fix to a
problem that is really a symptom of a non-scalable, non-optimal
methodology. I am merely trying to help you view this from a different
perspective. One that I will agree can be difficult to see while you're
embroiled in keeping your head above water.

I wish you only good luck in however you choose to solve your problems.


-C

[1] http://ipxe.org/
Post by Gandalf Corvotempesta
Filtering out some servers is a waste of time. It's much easier to run
$ ping -c1 ${ip}
across my subnets once a day and create the configuration file
automatically if the host is up
--
Regards,
Christopher
Nico Kadel-Garcia
2016-05-02 03:25:02 UTC
Permalink
On Sun, May 1, 2016 at 3:35 PM, Christopher Barry
Post by Christopher Barry
On Sun, 1 May 2016 18:44:40 +0200
Post by Christopher Barry
The most important thing is to understand what truly needs to be
backed up. Looking at your exclusion lists (in a prior thread
element) leads me to believe that you might be backing up a bunch of
stuff that's really OS package data, and not your own specific
unique data. Understand that you will never restore an OS from this
backup. Indeed, having any OS data could create major issues for you
during a restore operation later.
Why?
because you're unwittingly asking for pain.
I've done it, but it took real caution. Database files that may be in
the midst of an atomic transaction, and not yet written to the
filesystem, are a real risk, for relational databases in particular.
Gandalf Corvotempesta
2016-05-02 07:46:26 UTC
Permalink
Post by Nico Kadel-Garcia
I've done it, but it took real caution. Database files that may be in
the midst of an atomic transaction, and not yet written to the
filesystem, are a real risk, for relational databases in particular.
I'm using a pre-xfer script to run a mysqldump on each host.
I don't restore relational DBs directly from files, but from an SQL
dump (taken with a read lock and FLUSH TABLES).
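A minimal sketch of such a dump script, run on the backed-up host before the
transfer (the paths, credentials file and locking options are assumptions;
adjust them to the storage engine in use):

#!/bin/bash
# dump all databases under a global read lock, into a directory the backup picks up
DUMP_DIR=/var/backups/mysql
mkdir -p "$DUMP_DIR"
mysqldump --defaults-extra-file=/etc/mysql/backup.cnf \
          --all-databases --flush-logs --lock-all-tables \
    | gzip > "$DUMP_DIR/all-databases.sql.gz"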
Gandalf Corvotempesta
2016-05-02 07:37:51 UTC
Permalink
Post by Christopher Barry
open files will bite you. versions will bite you.
If I have to restore a whole system, there are no open files to restore.
Obviously I restore using an external system, for example a boot CD with
my (empty) disks mounted on /mnt.
Post by Christopher Barry
You need parallel and de-duplication now because your fundamental
understanding of how best to deploy a lot of similar VM nodes is
insufficient. My suggestion was for you to rethink the underlying
problem, not how to best make up for a flawed design later. I'm not
trying to be confrontational Gandalf, just sharing hard-won knowledge
from a lifetime of systems administration. You are designing a fix to a
problem that is really a symptom of a non-scalable, non-optimal
methodology. I am merely trying to help you view this from a different
perspective. One that I will agree can be difficult to see while you're
embroiled in keeping your head above water.
Your approach is absolutely correct, but you are missing some points:

1) the current architecture wasn't planned by me but by someone else
who is not working here anymore.
2) my CEO would not agree to change the whole architecture (which has
been running stably for many, many years) just to use a backup tool
3) based on point 2, it would be easier, for the CEO, to change the backup
software, or maybe change the current sysadmin (me), than to rethink
the whole environment.
Helmut Hullen
2016-05-04 16:27:00 UTC
Permalink
Hello, Patrick,
Post by Patrick O'Callaghan
Post by Nico Kadel-Garcia
Copying a large daily.3 snapshot aside, and making sure it's
consistent, can be tricky if the copy rotates under you. This is why
I've long wanted to change the numbering scheme from "daily.0",
"daily.1", etc. to "daily.20160401010203", "daily.20160402113433",
to use full UTC-compatible YYYYMMDDhhmmss date-stamped names. But
I've never gotten the traction to write and submit a patch.
That would be a great idea. I always have to stop and remember if
daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried in log
files.
Sorry - the backups _have_ a time stamp. They don't need the same (or
another) date string in the directory name.


If you can't tell your file manager to show this time stamp, then a copy
(made with hard links) needs little space and doesn't disturb
the "rsnapshot" way of rotating backups.

Best regards!
Helmut
Patrick O'Callaghan
2016-05-04 16:46:47 UTC
Permalink
Post by Helmut Hullen
Post by Patrick O'Callaghan
That would be a great idea. I always have to stop and remember if
daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried in log
files.
Sorry - the backups _have_ a time stamp. They don't need the same (or
another) date string in the directory name.
Where?

poc
Scott Hess
2016-05-04 16:58:49 UTC
Permalink
Post by Patrick O'Callaghan
Post by Helmut Hullen
Post by Patrick O'Callaghan
That would be a great idea. I always have to stop and remember if
daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried in log
files.
Sorry - the backups _have_ a time stamp. They don't need the same (or
another) date string in the directory name.
Where?
On Linux, ls -latr is what you want.
-l long output
-a all files (including .sync)
-t sort by time
-r reversed

-scott
Patrick O'Callaghan
2016-05-04 17:12:01 UTC
Permalink
Post by Scott Hess
Post by Patrick O'Callaghan
Post by Helmut Hullen
Post by Patrick O'Callaghan
That would be a great idea. I always have to stop and remember if
daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried in log
files.
Sorry - the backups _have_ a time stamp. They don't need the same (or
another) date string in the directory name.
Where?
On Linux, ls -latr is what you want.
-l long output
-a all files (including .sync)
-t sort by time
-r reversed
I thought you might say that. Unfortunately it's *not* a timestamp. It's
the latest access (or, with other options, modification) time of a set of
files, as seen from the rsnapshot server. There are some problems with this:

1) It's only visible to users with access to the server's filesystem (via
login or remote mount).

2) It doesn't indicate when the backup was run, only when the most recent
file modification happened (and that's assuming you can access all the
files and not just your own). They are not the same thing.

poc
Scott Hess
2016-05-04 17:30:58 UTC
Permalink
Post by Patrick O'Callaghan
On Wed, May 4, 2016 at 9:46 AM, Patrick O'Callaghan <
Post by Patrick O'Callaghan
Post by Helmut Hullen
Post by Patrick O'Callaghan
That would be a great idea. I always have to stop and remember if
daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried in log
files.
Sorry - the backups _have_ a time stamp. They don't need the same (or
another) date string in the directory name.
Where?
On Linux, ls -latr is what you want.
-l long output
-a all files (including .sync)
-t sort by time
-r reversed
I thought you might say that. Unfortunately it's *not* a timestamp. It's
the latest access (or, with other options, modification) time of a set of
1) It's only visible to users with access to the server's filesystem (via
login or remote mount).
How do users get access to the in-the-directory-name timestamps if they
don't have access to the filesystem where those directories live?

2) It doesn't indicate when the backup was run, only when the most recent
Post by Patrick O'Callaghan
file modification happened (and that's assuming you can access all the
files and not just your own). They are not the same thing.
Rsnapshot sets the directory timestamp using touch, I think as the last
thing it does. It's intentional, not some sort of side effect. I'm not
sure how renaming the directory would somehow be stronger than that. If
the actual timestamp changes after that, it's because someone is messing
around with the files, which in my experience should be avoided at all cost.

For my rsnapshot root, the timestamps are ~40s after the time when the cron
job is scheduled to run. They were ~20m until I changed to sync&&hourly,
with a sync pass run about an hour ahead of time.

Personally, I wouldn't mind having a .timestamp file in the snapshot, which
would contain a printed form of the timestamp rsnapshot used for the touch,
and perhaps with the same timestamp as the directory using touch -r. That
file would never need to be renamed after the snapshot is taken, so it
would be like a two-line patch. Something like that would be more reliable
for scripting, as the script could either use the filesystem timestamp or
parse the printed timestamp, and the printed timestamp could include high
precision without needing config options.

-scott
Patrick O'Callaghan
2016-05-04 21:58:25 UTC
Permalink
Post by Patrick O'Callaghan
I thought you might say that. Unfortunately it's *not* a timestamp. It's
Post by Patrick O'Callaghan
the latest access (or, with other options, modification) time of a set of
1) It's only visible to users with access to the server's filesystem (via
login or remote mount).
How do users get access to the in-the-directory-name timestamps if they
don't have access to the filesystem where those directories live?
That's a fair point. Nevertheless, being able to list a set of directory
names requires very little rights compared to listing all files with their
access times. The only requirement is read access to the top-level
directory where the backups live.
Post by Patrick O'Callaghan
2) It doesn't indicate when the backup was run, only when the most recent
Post by Patrick O'Callaghan
file modification happened (and that's assuming you can access all the
files and not just your own). They are not the same thing.
Rsnapshot sets the directory timestamp using touch, I think as the last
thing it does. It's intentional, not some sort of side effect. I'm not
sure how renaming the directory would somehow be stronger than that. If
the actual timestamp changes after that, it's because someone is messing
around with the files, which in my experience should be avoided at all cost.
Perhaps, but that behaviour is not documented as far as I know.
Post by Patrick O'Callaghan
For my rsnapshot root, the timestamps are ~40s after the time when the
cron job is scheduled to run. They were ~20m until I changed to
sync&&hourly, with a sync pass run about an hour ahead of time.
Personally, I wouldn't mind having a .timestamp file in the snapshot,
which would contain a printed form of the timestamp rsnapshot used for the
touch, and perhaps with the same timestamp as the directory using touch
-r. That file would never need to be renamed after the snapshot is taken,
so it would be like a two-line patch. Something like that would be more
reliable for scripting, as the script could either use the filesystem
timestamp or parse the printed timestamp, and the printed timestamp could
include high precision without needing config options.
That would work, though with slightly more access rights than already
mentioned (+x permission on the backup directories and +r on the .timestamp
files).

Of course if user backup directories are generally readable by their
owners, the permissions question is not an issue. That may be the normal
case.

poc
Christopher Barry
2016-05-04 23:20:26 UTC
Permalink
On Wed, 4 May 2016 10:30:58 -0700
On Wed, May 4, 2016 at 10:12 AM, Patrick O'Callaghan
Post by Patrick O'Callaghan
On Wed, May 4, 2016 at 9:46 AM, Patrick O'Callaghan <
Post by Patrick O'Callaghan
Post by Helmut Hullen
Post by Patrick O'Callaghan
That would be a great idea. I always have to stop and remember
if daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried
in log files.
Sorry - the backups _have_ a time stamp. They don't need the same
(or another) date string in the directory name.
Where?
On Linux, ls -latr is what you want.
-l long output
-a all files (including .sync)
-t sort by time
-r reversed
I thought you might say that. Unfortunately it's *not* a timestamp.
It's the latest access (or, with other options, modification) time
of a set of files, as seen from the rsnapshot server. There are some
1) It's only visible to users with access to the server's filesystem
(via login or remote mount).
How do users get access to the in-the-directory-name timestamps if they
don't have access to the filesystem where those directories live?
2) It doesn't indicate when the backup was run, only when the most recent
Post by Patrick O'Callaghan
file modification happened (and that's assuming you can access all
the files and not just your own). They are not the same thing.
Rsnapshot sets the directory timestamp using touch, I think as the last
thing it does. It's intentional, not some sort of side effect. I'm
not sure how renaming the directory would somehow be stronger than
that. If the actual timestamp changes after that, it's because
someone is messing around with the files, which in my experience
should be avoided at all cost.
For my rsnapshot root, the timestamps are ~40s after the time when the
cron job is scheduled to run. They were ~20m until I changed to
sync&&hourly, with a sync pass run about an hour ahead of time.
Personally, I wouldn't mind having a .timestamp file in the snapshot,
which would contain a printed form of the timestamp rsnapshot used for
the touch, and perhaps with the same timestamp as the directory using
touch -r. That file would never need to be renamed after the snapshot
in cron, do a:

date > /etc/rsnapshot/.timestamp && rsnapshot <increment>

where /etc/rsnapshot is one of the backed up directories in every
backup.

and bada-bing, Bob's your Uncle. :)
is taken, so it would be like a two-line patch. Something like that
would be more reliable for scripting, as the script could either use
the filesystem timestamp or parse the printed timestamp, and the
printed timestamp could include high precision without needing config
options.
-scott
--
Regards,
Christopher
Scott Hess
2016-05-05 00:29:23 UTC
Permalink
On Wed, May 4, 2016 at 4:20 PM, Christopher Barry <
Post by Christopher Barry
Post by Scott Hess
Personally, I wouldn't mind having a .timestamp file in the snapshot,
which would contain a printed form of the timestamp rsnapshot used for
the touch, and perhaps with the same timestamp as the directory using
touch -r. That file would never need to be renamed after the snapshot
date > /etc/rsnapshot/.timestamp && rsnapshot <increment>
where /etc/rsnapshot is one of the backed up directories in every
backup.
and bada-bing, Bob's your Uncle. :)
Derp.

Like:

/usr/bin/rsnapshot sync && date -R -r .sync >.sync/.timestamp && touch -r
.sync .sync/.timestamp

Then next rotation will just carry it along, forever. Create
.sync/.timestamp ahead of time so that the first creation doesn't screw up
the timestamp on .sync.

Like I said, two-line change :-).

Or cmd_postexec, which runs after backup for the lowest interval, and the
backup finishes by updating the directory timestamp.
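A minimal sketch of that cmd_postexec variant (the script path, snapshot root
and interval name are made up):

#!/bin/bash
# called via cmd_postexec: drop a readable .timestamp into the newest snapshot
# without disturbing the directory mtime that rsnapshot just set
ROOT=/var/backups/rsnapshot
NEWEST=$(ls -1dt "$ROOT"/.sync "$ROOT"/daily.* 2>/dev/null | head -n1)
REF=$(mktemp)
touch -r "$NEWEST" "$REF"                      # remember the directory mtime
date -R -r "$NEWEST" > "$NEWEST/.timestamp"    # human-readable copy of it
touch -r "$REF" "$NEWEST" && rm -f "$REF"      # put the mtime back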

-scott
Nico Kadel-Garcia
2016-05-05 03:25:15 UTC
Permalink
Post by Scott Hess
On Wed, May 4, 2016 at 4:20 PM, Christopher Barry
Post by Christopher Barry
Post by Scott Hess
Personally, I wouldn't mind having a .timestamp file in the snapshot,
which would contain a printed form of the timestamp rsnapshot used for
the touch, and perhaps with the same timestamp as the directory using
touch -r. That file would never need to be renamed after the snapshot
date > /etc/rsnapshot/.timestamp && rsnapshot <increment>
where /etc/rsnapshot is one of the backed up directories in every
backup.
and bada-bing, Bob's your Uncle. :)
Which does you *zero* good if you've got a tar, an rsync, or a simple
NFS-mounted cp going on from the "daily.0" snapshot and it rotates out
from under you in the midst of the replication. This is an old
problem. Stuffing a legible timestamp inside the numbered snapshot
does not help much when the path to that version of the file changes
out from under you.
Post by Scott Hess
Derp.
/usr/bin/rsnapshot sync && date -R -r .sync >.sync/.timestamp && touch -r
.sync .sync/.timestamp
Then next rotation will just carry it along, forever. Create
.sync/.timestamp ahead of time so that the first creation doesn't screw up
the timestamp on .sync.
Like I said, two-line change :-).
Which does not solve the underlying problem I referred to.
Ken Woods
2016-05-05 03:32:46 UTC
Permalink
Nico, the horse is dead.
Seriously.
Post by Nico Kadel-Garcia
Post by Scott Hess
On Wed, May 4, 2016 at 4:20 PM, Christopher Barry
Post by Christopher Barry
Post by Scott Hess
Personally, I wouldn't mind having a .timestamp file in the snapshot,
which would contain a printed form of the timestamp rsnapshot used for
the touch, and perhaps with the same timestamp as the directory using
touch -r. That file would never need to be renamed after the snapshot
date > /etc/rsnapshot/.timestamp && rsnapshot <increment>
where /etc/rsnapshot is one of the backed up directories in every
backup.
and bada-bing, Bob's your Uncle. :)
Which does you *zero* good if you've got a tar, an rsync, or a simple
NFS-mounted cp going on from the "daily.0" snapshot and it rotates out
from under you in the midst of the replication. This is an old
problem. Stuffing a legible timestamp inside the numbered snapshot
does not help much when the path to that version of the file changes
out from under you.
Post by Scott Hess
Derp.
/usr/bin/rsnapshot sync && date -R -r .sync >.sync/.timestamp && touch -r
.sync .sync/.timestamp
Then next rotation will just carry it along, forever. Create
.sync/.timestamp ahead of time so that the first creation doesn't screw up
the timestamp on .sync.
Like I said, two-line change :-).
Which does not solve the underlying problem I referred to.
Scott Hess
2016-05-05 05:23:23 UTC
Permalink
Post by Nico Kadel-Garcia
Post by Scott Hess
On Wed, May 4, 2016 at 4:20 PM, Christopher Barry
Post by Christopher Barry
Post by Scott Hess
Personally, I wouldn't mind having a .timestamp file in the snapshot,
which would contain a printed form of the timestamp rsnapshot used for
the touch, and perhaps with the same timestamp as the directory using
touch -r. That file would never need to be renamed after the snapshot
date > /etc/rsnapshot/.timestamp && rsnapshot <increment>
where /etc/rsnapshot is one of the backed up directories in every
backup.
and bada-bing, Bob's your Uncle. :)
Which does you *zero* good if you've got a tar, an rsync, or a simple
NFS-mounted cp going on from the "daily.0" snapshot and it rotates out
from under you in the midst of the replication. This is an old
problem. Stuffing a legible timestamp inside the numbered snapshot
does not help much when the path to that version of the file changes
out from under you.
Write up a patch. Pull the thread and find out how long it is. Maybe it's
a lot shorter than the nay-sayers think it is likely to be. Maybe it's a
lot longer than the people who request the feature think it should be.
Either way we'd learn something.

-scott
Rolf Muth
2016-05-04 16:39:10 UTC
Permalink
Am Mittwoch 4 Mai 16 um 15:45:10 Uhr schrieb "Nico Kadel-Garcia"
Post by Nico Kadel-Garcia
...
I've long wanted to change the numbering scheme from "daily.0",
"daily.1", etc. to "daily.20160401010203", "daily.20160402113433", to
use full UTC-compatible YYYYMMDDhhmmss date-stamped names. But I've
never gotten the traction to write and submit a patch.
For that I wrote do-timed-folders and do-del-timed-folders...

http://www.heise.de/download/do-rsnapshots-1184971.html
--
Kind regards!
Rolf Muth
My addresses may not be used for advertising!
OpenPGP Public Key:
http://pgp.uni-mainz.de:11371/pks/lookup?op=index&search=0x5544C89A
Helmut Hullen
2016-05-05 05:10:00 UTC
Permalink
Hello, Patrick,
Post by Patrick O'Callaghan
Post by Scott Hess
Rsnapshot sets the directory timestamp using touch, I think as the
last thing it does. It's intentional, not some sort of side effect.
[...]
Post by Patrick O'Callaghan
Perhaps, but that behaviour is not documented as far as I know.
It is, in every log file.

Best regards!
Helmut
Helmut Hullen
2016-05-05 13:11:00 UTC
Permalink
Hello, Gandalf,
daily.5. No ".sync" or anything else. Only daily.5
"daily.5" should be totally available as all hardlinked files are
still retained (due to link count > 0 thus files are still there)
Take a look at the "rsnapshot.log".

There you can see that you can rename each backup to some other name
which "rsnapshot" doesn't change, and there you can see that you can
make a "hard linked" copy with the simple command "rsync ..." - just
copy such a logged line to the CLI and adjust the source directory and
the target directory.

By the way: hard linked files and directories need to be on the same
disk/partition. Copying such a directory to another partition can't
hard link back to the source partition.
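For example (snapshot paths made up), a hard-linked copy on the same
filesystem costs almost no space and is immune to rotation:

cp -al /snapshot_root/daily.5 /snapshot_root/keep-20160505
# or, following the rsync form that rsnapshot itself logs:
rsync -aH --link-dest=/snapshot_root/daily.5 \
    /snapshot_root/daily.5/ /snapshot_root/keep-20160505/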

Best regards!
Helmut
Helmut Hullen
2016-05-06 12:13:00 UTC
Permalink
Hello, Tapani,
Post by Tapani Tarvainen
Post by Martin Schröder
rsnapshot does not preserve hard links _in_ the data to backup.
With default settings it does not, but you can explicitly ask it
to do so with, e.g.,
rsync_short_args -aH
in the configuration file.
On a tape?

Best regards!
Helmut
Tapani Tarvainen
2016-05-06 12:48:12 UTC
Permalink
Post by Helmut Hullen
Hello, Tapani,
Post by Tapani Tarvainen
Post by Martin Schröder
rsnapshot does not preserve hard links _in_ the data to backup.
With default settings it does not, but you can explicitly ask it
to do so with, e.g.,
rsync_short_args -aH
in the configuration file.
On a tape?
:-)

Obviously not, but then it would not be rsnapshot that'd
write it to the tape, would it?

Though... I have seen a system that _could_ have hard links
on a tape, along with everything else one could have on a disk:
recovery system on some ancient HP-UX box.
It was rather amusing to watch it swap on tape. :-)
(OK, it stopped being amusing after a few hours.)

And actually I think 'tar' could write and read hard links
to and from tape, although I don't recall ever trying that.
So if you use the -H option with rsnapshot and then archive
the backup tree with tar, hard links could be preserved
in restore.
--
Tapani Tarvainen
Nico Kadel-Garcia
2016-05-07 01:25:29 UTC
Permalink
On Fri, May 6, 2016 at 8:48 AM, Tapani Tarvainen
Post by Tapani Tarvainen
Post by Helmut Hullen
Hello, Tapani,
Post by Tapani Tarvainen
Post by Martin Schröder
rsnapshot does not preserve hard links _in_ the data to backup.
With default settings it does not, but you can explicitly ask it
to do so with, e.g.,
rsync_short_args -aH
in the configuration file.
On a tape?
:-)
Obviously not, but then it would not be rsnapshot that'd
write it to the tape, would it?
Though... I have seen a system that _could_ have hard links
recovery system on some ancient HP-UX box.
It was rather amusing to watch it swap on tape. :-)
(OK, it stopped being amusing after a few hours.)
And actually I think 'tar' could write and read hard links
to and from tape, although I don't recall ever trying that.
So if you use the -H option with rsnapshot and then archive
the backup tree with tar, hard links could be preserved
in restore.
tar, and dump, do hardlinks just fine.
David Cantrell
2016-05-09 13:05:25 UTC
Permalink
Post by Helmut Hullen
On a tape?
There's nothing preventing backup software from keeping track of hard
links when writing to tape. Even Ye Olde tar knows about them:

$ ls -il foo bar
45432431 -rw-r--r-- 2 dc staff 0 9 May 14:00 bar
45432431 -rw-r--r-- 2 dc staff 0 9 May 14:00 foo
$ tar cf baz.tar foo bar
$ tar tvf baz.tar
-rw-r--r-- 0 dc staff 0 9 May 14:00 foo
hrw-r--r-- 0 dc staff 0 9 May 14:00 bar link to foo
--
David Cantrell | top google result for "internet beard fetish club"

Just because it is possible to do this sort of thing
in the English language doesn't mean it should be done
Patrick O'Callaghan
2016-05-09 15:27:46 UTC
Permalink
Post by David Cantrell
Post by Helmut Hullen
On a tape?
There's nothing preventing backup software from keeping track of hard
$ ls -il foo bar
45432431 -rw-r--r-- 2 dc staff 0 9 May 14:00 bar
45432431 -rw-r--r-- 2 dc staff 0 9 May 14:00 foo
$ tar cf baz.tar foo bar
$ tar tvf baz.tar
-rw-r--r-- 0 dc staff 0 9 May 14:00 foo
hrw-r--r-- 0 dc staff 0 9 May 14:00 bar link to foo
Once again: keeping track of hard links is trivial. Distinguishing between
hard links present in the original data and those created by the backup
system is expensive. Tar doesn't create any new links that weren't already
there in the data, so it's not the same as rsnapshot.

poc

Helmut Hullen
2016-05-06 12:11:00 UTC
Permalink
Hello, Gandalf,
Post by Gandalf Corvotempesta
So, in my example, "file1" and "file2" would be copied as files and
not hardlinked together ?
Surely - a tape doesn't use hard links between backups.

Best regards!
Helmut
Helmut Hullen
2016-05-07 05:24:00 UTC
Permalink
Hello, Nico,
Post by Nico Kadel-Garcia
Post by Martin Schröder
rsnapshot does not preserve hard links _in_ the data to backup.
[...]
Post by Nico Kadel-Garcia
tar, and dump, do hardlinks just fine.
And that's far away from "rsnapshot".

Best regards!
Helmut
Nico Kadel-Garcia
2016-05-07 13:39:28 UTC
Permalink
Post by Helmut Hullen
Hello, Nico,
Post by Nico Kadel-Garcia
Post by Martin Schröder
rsnapshot does not preserve hard links _in_ the data to backup.
[...]
Post by Nico Kadel-Garcia
tar, and dump, do hardlinks just fine.
And that's far away from "rsnapshot".
Best regards!
Helmut
One of our previous commenters referred to writing the rsnapshot
copies to tape, and whether internal hardlinks would be preserved.
Both technologies work just fine for just that purpose.

In fact, "dump" is deprecated these days for lots of reasons. If you
need to backup rsnapshot copies to tape, I've had good success with
using the "AMANDA" software, and backup up the "weekly" and moving
aside a properly date-labeled "cp -al" copy of the "daily"
specifically to prevent accidental rsnapshot rotation during the
backup.
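A minimal sketch of that staging step (paths made up):

# freeze a date-labeled, hard-linked copy of the newest daily before the tape
# run, so rsnapshot's rotation can't pull it out from under the tape job
mkdir -p /snapshot_root/tape-staging
cp -al /snapshot_root/daily.0 "/snapshot_root/tape-staging/daily.$(date -u +%Y%m%d)"
# ... run the tape backup (AMANDA, tar, ...) against the staging copy, then remove it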

I'm also reaching back into the "WayBack" machine about this stuff,
that was way back when I first encountered rsnapshot and when I ported
AMANDA to SunOS....
Gandalf Corvotempesta
2016-05-07 15:09:20 UTC
Permalink
Post by Nico Kadel-Garcia
One of our previous commenters referred to writing the rsnapshot
copies to tape, and whether internal hardlinks would be preserved.
Both technologies work just fine for just that purpose.
Ok, I admit I've created much confusion.
Let me try to explain (keep in mind that English is not my native language).

Currently I'm backing up tons of hosts with Bacula. I would like to
replace Bacula with something better (for reasons that I won't repeat)
and easier to maintain.

I'm trying rsnapshot with a subset of these hosts and it is working very well.
Now, I have 4 issues to address:

1) how to restore hardlinks that were present on the source host? This is
not a big deal, just a question, because I'm unable to differentiate
between existing hardlinks and hardlinks created by rsnapshot during the
"cp -al" phase.

2) I have to copy one or more backups to a tape library, for disaster
recovery. Obviously, tapes will be used only when everything has gone bad;
they are a very last resort. There is no need (for me) to preserve
hardlinks. If I have to restore from tapes, everything else is lost, so
preserving the exact hardlinks doesn't matter.

3) with Bacula there are many, many, many points to check and to
administer (MySQL DB, full-level retention, and so on), and many things
could go wrong. For example, if you lose/corrupt one full backup, you lose
ALL backups for that host. I would like to be sure that this is impossible
with rsnapshot, as all backup levels are independent: deleting everything
except one directory doesn't lose all backups, only the deleted ones.

4) hardlinks must be on the same filesystem. This is clear. So,
something like this:
$ cp daily.4 /mnt/other_file_system
would *resolve* all hardlinks and copy the whole backup (without
hardlinks) to the other filesystem, right?
This is something I could try on my own, next Monday at work.
The same should hold when using tar to tape. I have to save the file
content, not a dangling hardlink pointer.
Ken Woods
2016-05-07 15:23:56 UTC
Permalink
Post by Nico Kadel-Garcia
and moving
aside a properly date-labeled
You're going to push this forever, aren't you? It's so small-minded and not scalable.

My snapshots can take weeks to run over InfiniBand. What date would you propose is "proper"--when it completes or when it starts?
Helmut Hullen
2016-05-07 16:43:00 UTC
Permalink
Hello, Gandalf,
Post by Gandalf Corvotempesta
4) hardlinks must be on the same filesystem. This is clear. So,
$ cp daily.4 /mnt/other_file_system
would *resolve* all hardlinks and copy the whole backup (without
hardlinks) to the other filesystem, right?
"That depends!" ...

That depends on the filesystem of the target.

rsync -axH $Source/. $Target

can rebuild/copy all hardlinks, if the filesystem fits.

And newer versions of "cp" also can do this job.

It's necessary if you need a full backup for a bigger (or a new) disk
...
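A minimal sketch of the difference (paths made up; GNU cp and rsync assumed):

rsync -aH /snapshot_root/daily.4/ /mnt/other_fs/daily.4/  # hard links inside daily.4 are recreated
rsync -a  /snapshot_root/daily.4/ /mnt/other_fs/daily.4/  # without -H, each link becomes an independent copy
cp -a     /snapshot_root/daily.4  /mnt/other_fs/          # GNU cp -a should also keep links within the copied tree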

Best regards!
Helmut