Discussion:
[rsnapshot-discuss] rsnapshot not detecting rsync exit code
Gandalf Corvotempesta
2016-10-01 13:32:54 UTC
Hi all,
I'm using "xargs" to run 2 rsnapshot processes in parallel, something like this:

find ${CFG_PATH}/*.conf -printf "%p\n" | xargs -n1 -P 2 -I{} rsnapshot -c{} sync

The parallel jobs run fine, but sometimes rsnapshot fails to
detect that rsync ended properly and runs the sync process 3 times (I
have rsync_numtries set to 3).

AFAIK, xargs cannot report the exit code of each process and instead
returns a single exit code for all of them, but this should not
interfere with rsnapshot, as the return code from xargs is never seen by
rsnapshot.
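
If per-job exit codes are wanted anyway, one option is to wrap each job so that it logs its own status. The following is only a minimal sketch, assuming an xargs that supports -P (GNU and BSD versions do). The function name run_jobs_logged and the log format are hypothetical, and the inner `sh -c "exit N"` is a stand-in for the real `rsnapshot -c FILE sync` call.

```shell
#!/bin/sh
# run_jobs_logged LOGFILE
# Reads one job label per line on stdin, runs at most two jobs at a time,
# and appends "job=<label> exit=<code>" to LOGFILE for every job.
# Hypothetical demo: each job simply exits with its own label as status.
run_jobs_logged() {
    log=$1
    : > "$log"    # start with an empty log
    # -P 2: up to two jobs in parallel; -I{}: one job per input line.
    # Inside the wrapper, $1 is the job label and $2 is the log file.
    xargs -P 2 -I{} sh -c 'sh -c "exit $1"; echo "job=$1 exit=$?" >> "$2"' _ {} "$log"
}
```

For example, `printf '0\n23\n24\n' | run_jobs_logged /tmp/jobs.log` leaves one line per job in /tmp/jobs.log, in whatever order the jobs finished.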

Do you have any idea why rsnapshot is not seeing the return code from
rsync? This doesn't always happen and I don't know how to
troubleshoot it.

Is there any "execution" time limit after which rsnapshot marks the process as failed? In
our case, the failed backups had been running for 15 hours.
Gandalf Corvotempesta
2016-10-01 13:41:01 UTC
Christopher Barry
2016-10-01 14:35:44 UTC
On Sat, 1 Oct 2016 15:41:01 +0200
Post by Gandalf Corvotempesta
2016-10-01 15:32 GMT+02:00 Gandalf Corvotempesta
Post by Gandalf Corvotempesta
Hi all,
I'm using "xargs" to run 2 rsnapshot processes in parallel,
find ${CFG_PATH}/*.conf -printf "%p\n" | xargs -n1 -P 2 -I{} rsnapshot -c{} sync
The parallel jobs run fine, but sometimes rsnapshot fails to
detect that rsync ended properly and runs the sync process 3 times
(I have rsync_numtries set to 3)
# find /etc/rsnapshot.d/hosts/*.conf | xargs -n1 -P2 -I{} /usr/bin/rsnapshot {}
echo 6759 > /var/run/rsnapshot/myserver1.pid
/usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded \
--stats --exclude=var/backups/* --exclude=admin_backups/* \
--exclude=proc/* --exclude=sys/* --exclude=media/* --exclude=mnt/* \
--exclude=tmp/* --exclude=wp-content/cache/object/* \
--exclude=var/spool/exim/* \
--password-file=/etc/rsnapshot.d/hosts/myserver1.passwd \
--link-dest=/var/backups/rsnapshot/myserver1/daily.0/./ \
/var/backups/rsnapshot/myserver1/.sync/./
echo 6760 > /var/run/rsnapshot/myserver2.pid
/usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded \
--stats --exclude=var/backups/* --exclude=admin_backups/* \
--exclude=var/spool/exim/* \
--password-file=/etc/rsnapshot.d/hosts/myserver2.passwd \
--link-dest=/var/backups/rsnapshot/myserver2/daily.0/./ \
/var/backups/rsnapshot/myserver2/.sync/./
I don't see anything strange in this. Two independent processes are spawned
by xargs, and I don't need any return code at the "find" or "xargs" level.
The return code is needed by rsnapshot, but from rsync, not from xargs.
What happens when -P1 is used? Do you still get the error?

Also, have you investigated using gnu parallel instead? It was designed
from the beginning for parallel operation, not as an enhancement.
--
Regards,
Christopher
Gandalf Corvotempesta
2016-10-01 15:33:04 UTC
Post by Christopher Barry
What happens when -P1 is used? Do you still get the error?
No, I haven't tried that.
Post by Christopher Barry
Also, have you investigated using gnu parallel instead? It was designed
from the beginning for parallel operation, not as an enhancement.
Yes, until 2 days ago I was using GNU parallel, but I had an issue:
I have to back up 17 hosts in groups of 2.
parallel doesn't wait for the whole "pool" to finish before the script
moves on to the next command.

Let me try to explain:

I have a "scheduler" script like this (the following is pseudocode):

-------------------------------------------------------------------------
#!/bin/sh
echo "Start backup"
parallel -P2 ..... rsnapshot sync
echo "Finish backup"
echo "Start clean up"
my_command_to_clean_up
--------------------------------------------------------------------------

parallel starts 2 rsnapshot processes at once, but as I have 17 hosts, when it
reaches host 17 it immediately exits (as there isn't any command #18 to run)
and the script moves on to my_command_to_clean_up.
Something like this:

rsnapshot1+rsnapshot2
rsnapshot3+rsnapshot4
rsnapshot5+rsnapshot6
rsnapshot7+rsnapshot8
rsnapshot9+rsnapshot10
rsnapshot11+rsnapshot12
rsnapshot13+rsnapshot14
rsnapshot15+rsnapshot16
rsnapshot17

As there is no "rsnapshot18" to run, parallel exits.

"my_command_to_clean_up" *must* be run after all backups because it's
very I/O intensive and also assumes
that no one is writing to the rsnapshot directory.

With find+xargs this doesn't happen: xargs exits only when ALL
commands have finished.
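
The xargs behaviour described here can be checked with a small experiment. This is a minimal sketch, assuming an xargs that supports -P. The function name run_then_cleanup is hypothetical, and the `sleep` jobs are stand-ins for rsnapshot runs. It uses an odd number of jobs to mirror the 17-host case.

```shell
#!/bin/sh
# run_then_cleanup LOGFILE
# Runs three stand-in "backup" jobs, two at a time, then a stand-in cleanup.
# If xargs returned before the last job finished, "cleanup" would not be
# the final line in LOGFILE.
run_then_cleanup() {
    log=$1
    : > "$log"
    # Each job sleeps briefly, then records that it finished.
    printf '%s\n' 1 2 3 |
        xargs -P 2 -I{} sh -c 'sleep 1; echo "backup $1 done" >> "$2"' _ {} "$log"
    # Reached only after xargs has waited for every child it started.
    echo "cleanup" >> "$log"
}
```

Calling `run_then_cleanup /tmp/pool.log` and then inspecting the file shows whether the cleanup line was written after all three backup lines.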

I've also tried "sem", packaged with GNU parallel, but I don't think it is
working properly. I did something like this:

for bkp in /etc/rsnapshot/hosts/*.conf; do
sem --id backup -J2 rsnapshot $bkp sync
done
sem --id backup --wait
David Cantrell
2016-10-03 14:39:58 UTC
Post by Gandalf Corvotempesta
Post by Christopher Barry
What happens when -P1 is used? Do you still get the error?
No, I haven't tried that.
Post by Christopher Barry
Also, have you investigated using gnu parallel instead? It was designed
from the beginning for parallel operation, not as an enhancement.
I have to back up 17 hosts in groups of 2.
parallel doesn't wait for the whole "pool" to finish before the script
moves on to the next command.
-------------------------------------------------------------------------
#!/bin/sh
echo "Start backup"
parallel -P2 ..... rsnapshot sync
echo "Finish backup"
echo "Start clean up"
my_command_to_clean_up
--------------------------------------------------------------------------
parallel starts 2 rsnapshot processes at once, but as I have 17 hosts, when it
reaches host 17 it immediately exits (as there isn't any command #18 to run)
and the script moves on to my_command_to_clean_up.
rsnapshot1+rsnapshot2
rsnapshot3+rsnapshot4
rsnapshot5+rsnapshot6
...
rsnapshot17
As there is no "rsnapshot18" to run, parallel exits.
Is it acceptable to start jobs 1 and 2, then as soon as the first of
those finishes start job 3 (at which point you might be running jobs 1
and 3 or 2 and 3), then when the first of those finishes start job 4,
and so on, until all jobs are done?

If it is then you may find this script useful:
https://github.com/DrHyde/cpXXXan/blob/master/parallel-builder.pl

You would invoke it something like this ...

./parallel `which rsnapshot` \
"-c firstconfigfile daily" "-c secondconfigfile daily" \
"-c thirdconfigfile daily" ...

It limits itself to running at most two jobs at once (hard-coded at line
13), and also tries to optimise for time by running long-running jobs
first. And it prevents you from accidentally running the parallel script
itself twice at the same time.
--
David Cantrell | Enforcer, South London Linguistic Massive

Do not be afraid of cooking, as your ingredients will know and misbehave
-- Fergus Henderson
Gandalf Corvotempesta
2016-10-03 18:25:27 UTC
Post by David Cantrell
Is it acceptable to start jobs 1 and 2, then as soon as the first of
those finishes start job 3 (at which point you might be running jobs 1
and 3 or 2 and 3), then when the first of those finishes start job 4,
and so on, until all jobs are done?
Yes it is.
Post by David Cantrell
https://github.com/DrHyde/cpXXXan/blob/master/parallel-builder.pl
You would invoke it something like this ...
./parallel `which rsnapshot` \
"-c firstconfigfile daily" "-c secondconfigfile daily" \
"-c thirdconfigfile daily" ...
But isn't that the same as what GNU parallel is supposed to do?

BTW, I've found the issue: https://github.com/rsnapshot/rsnapshot/issues/152
and I'm almost sure this is a bug, because return codes 23 and 24 are
considered OK by rsnapshot but are not treated as OK in the retry loop, so the
sync phase, in my case, is always run 3 times.

I fixed it by replacing the standard while loop with the following:
while ($tryCount < $rsync_numtries && ($result != 0 && $result != 23 && $result != 24)) {

(or by lowering rsync_numtries from 3 to 1)

Gandalf Corvotempesta
2016-10-02 09:59:00 UTC
Post by Christopher Barry
What happens when -P1 is used? Do you still get the error?
Also, have you investigated using gnu parallel instead? It was designed
from the beginning for parallel operation, not as an enhancement.
The issue seems to be related to the rsync return code.
Even today I had 2 backups restarted from scratch.
Luckily, I was stracing an rsync process just as it finished. The
return code was "23" (because some files were in use and thus only partially
transferred).
Then rsnapshot started the whole rsync process again.
Gandalf Corvotempesta
2016-10-02 10:23:18 UTC
Post by Gandalf Corvotempesta
The issue seems to be related to the rsync return code.
Even today I had 2 backups restarted from scratch.
Luckily, I was stracing an rsync process just as it finished. The
return code was "23" (because some files were in use and thus only partially
transferred).
Then rsnapshot started the whole rsync process again.
I think it's related to this, which seems like a bug to me:

while ($tryCount < $rsync_numtries && $result != 0) {
    $result = system(@cmd_stack);
    $tryCount += 1;
}

# now we see if rsync ran successfully, and what to do about it
if ($result != 0) {
    # bitmask return value
    my $retval = get_retval($result);

    # print warnings, and set this backup point to rollback if we're
    # using --link-dest
    handle_rsync_error($retval, $bp_ref);
} else {
    print_msg("rsync succeeded", 5);
}


rsnapshot checks the return code.
If it is != 0, it re-runs rsync up to $rsync_numtries times.
But this is wrong, because some return codes != 0 are OK, like 23 or 24.
These return codes are checked in handle_rsync_error, so rsnapshot should
evaluate the return code first, and retry the whole rsync phase only in
case of a real error.

This is why, when I get return code 23 or 24, rsnapshot starts
everything again.

I think it's a bug, as return codes 23 and 24 are considered OK by
rsnapshot (as you can see in handle_rsync_error)
but are not considered OK in the while loop.
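
The retry behaviour argued for here can be sketched in shell. This is not rsnapshot's code (which is Perl), only an illustration of the idea under stated assumptions. The function names is_acceptable and run_with_retries are hypothetical, and the set of acceptable codes (0, 23, 24) is taken from this thread. One detail worth keeping in mind: Perl's system() returns the raw wait status, with the exit code in the high byte (hence get_retval in the code above), whereas the shell's $? is already the plain exit code.

```shell
#!/bin/sh
# is_acceptable CODE
# Succeeds for exit codes that should NOT trigger a retry:
# 0 (success), 23 (partial transfer due to error),
# 24 (partial transfer due to vanished source files).
is_acceptable() {
    case $1 in
        0|23|24) return 0 ;;
        *)       return 1 ;;
    esac
}

# run_with_retries MAX CMD [ARGS...]
# Runs CMD up to MAX times, stopping as soon as its exit code is acceptable.
# Prints "tries=<n> result=<code>" and returns the last exit code.
run_with_retries() {
    max=$1
    shift
    tries=0
    result=1
    while [ "$tries" -lt "$max" ]; do
        if "$@"; then result=0; else result=$?; fi
        tries=$((tries + 1))
        if is_acceptable "$result"; then
            break
        fi
    done
    echo "tries=$tries result=$result"
    return "$result"
}
```

With this logic, `run_with_retries 3 sh -c 'exit 23'` runs the command once, while `run_with_retries 3 sh -c 'exit 12'` runs it three times before giving up.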