Discussion:
[rsnapshot-discuss] Performance comparison and process
Thierry Lavallee
2015-08-13 12:59:19 UTC
Permalink
Hi,
Here is a shot at my log. I am wondering if this is normal and if there
is a way to optimize this. I find all this copying and removing quite
lengthy. And I am wondering if this is the kind of performance you guys
get and what I should expect on a 260G remote source directory. Should I
post my configuration?
Thanks!

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[12/Aug/2015:03:00:03] /usr/bin/rsnapshot -c
/root/scripts/backup/rsnapshot.Mediaserver02_cpbackup.conf sync: started
[12/Aug/2015:03:00:03] Setting locale to POSIX "C"
[12/Aug/2015:03:00:03] echo 4493 > /var/run/rsnapshot_Mediaserver02new.pid
[12/Aug/2015:03:00:03] /usr/bin/rsync -avx --delete --numeric-ids
--delete-excluded --rsh="/usr/bin/ssh -i /root/.ssh/id_rsa"
***@Mediaserver02.media-hosting.com:/backup/cpbackup/daily/
/media/backupMediaserver02b/home/.sync/

3 hours here to get the diff files.

[12/Aug/2015:06:12:41] touch /media/backupMediaserver02b/home/.sync/
[12/Aug/2015:06:12:45] rm -f /var/run/rsnapshot_Mediaserver02new.pid
[12/Aug/2015:06:12:45] /usr/bin/logger -i -p user.info -t rsnapshot
/usr/bin/rsnapshot -c
/root/scripts/backup/rsnapshot.Mediaserver02_cpbackup.conf sync:
completed successfully
[12/Aug/2015:06:12:45] /usr/bin/rsnapshot -c
/root/scripts/backup/rsnapshot.Mediaserver02_cpbackup.conf sync:
completed successfully
[12/Aug/2015:06:12:46] /usr/bin/rsnapshot -c
/root/scripts/backup/rsnapshot.Mediaserver02_cpbackup.conf daily: started
[12/Aug/2015:06:12:46] Setting locale to POSIX "C"
[12/Aug/2015:06:12:46] echo 15058 > /var/run/rsnapshot_Mediaserver02new.pid
[12/Aug/2015:06:12:48] /bin/rm -rf /media/backupMediaserver02b/home/daily.6/

6 Hours here to cleanout daily.6

[12/Aug/2015:12:08:55] mv /media/backupMediaserver02b/home/daily.5/
/media/backupMediaserver02b/home/daily.6/
[12/Aug/2015:12:08:55] mv /media/backupMediaserver02b/home/daily.4/
/media/backupMediaserver02b/home/daily.5/
[12/Aug/2015:12:08:56] mv /media/backupMediaserver02b/home/daily.3/
/media/backupMediaserver02b/home/daily.4/
[12/Aug/2015:12:08:56] mv /media/backupMediaserver02b/home/daily.2/
/media/backupMediaserver02b/home/daily.3/
[12/Aug/2015:12:08:56] mv /media/backupMediaserver02b/home/daily.1/
/media/backupMediaserver02b/home/daily.2/
[12/Aug/2015:12:08:56] mv /media/backupMediaserver02b/home/daily.0/
/media/backupMediaserver02b/home/daily.1/
[12/Aug/2015:12:08:56] /bin/cp -al
/media/backupMediaserver02b/home/.sync
/media/backupMediaserver02b/home/daily.0

10 hours here to copy from .sync to daily.0

[12/Aug/2015:22:03:50] rm -f /var/run/rsnapshot_Mediaserver02new.pid
[12/Aug/2015:22:03:50] /usr/bin/logger -i -p user.info -t rsnapshot
/usr/bin/rsnapshot -c
/root/scripts/backup/rsnapshot.Mediaserver02_cpbackup.conf daily:
completed successfully
[12/Aug/2015:22:03:50] /usr/bin/rsnapshot -c
/root/scripts/backup/rsnapshot.Mediaserver02_cpbackup.conf daily:
completed successfully

All done 19 hours later...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

------------------------------------------------------------------------------
David Cantrell
2015-08-13 13:19:20 UTC
Permalink
Post by Thierry Lavallee
Here is a shot at my log. I am wondering if this is normal and if there
is a way to optimize this. ... I am wondering if this is the kind of
performance you guys get and what I should expect on a 260G remote
source directory.
...
3 hours here to get the diff files ...
6 Hours here to cleanout daily.6 ...
I presume that you've got lots and lots of small files?

I'm not surprised at those times. 3 hours to read *lots* of directories
and *lots* of inodes (and possibly read the contents of a load of files
too) doesn't seem that unreasonable. It's the sort of work that doesn't
really benefit from caching, so you're mostly limited by disk I/O speed.
On top of that, reading a little bit of information here, a little bit
of information there, means that the disk heads will be thrashing back
and forth all the time, which is sloooooow. Disks are pretty damned fast
at reading big files, because filesystems try to put sequential blocks
next to each other on the disk, but that's pretty much orthogonal to
what rsnapshot needs to do.

Likewise, for the clearout, it has to not only read all those little
bits of information so it can find what to unlink, it also has to
*write* all over the place so it can reduce link counts and possibly
de-allocate blocks altogether.
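For what it's worth, you can see those link counts directly on the snapshot
volume; something like this (the file name here is made up) shows how many
daily.N trees still reference a given file:

# stat -c '%h %n' /media/backupMediaserver02b/home/daily.0/some-file.dat
(the first field is the hard-link count; every unlink in the clearout is a
metadata write that decrements one of these, and the blocks are only freed
when the count reaches zero)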

If you have 'top' running while a backup happens, I expect that you'll
see lots of iowait.
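For example (vmstat comes with procps, iostat with the sysstat package):

# vmstat 5
(the "wa" column is the share of CPU time spent waiting on I/O)
# iostat -dx 5
(per-device stats; %util near 100 with low throughput is the signature of
seek-bound I/O)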
--
David Cantrell | Hero of the Information Age

The Law of Daves: in any gathering of technical people, the
number of Daves will be greater than the number of women.

------------------------------------------------------------------------------
Thierry Lavallee
2015-08-13 14:04:08 UTC
Permalink
Post by David Cantrell
I presume that you've got lots and lots of small files?
This is the source file count... Is this that huge?!
# find /backup -type f | wc -l
# 6044914

In total, this is the time it takes:
# 3 hours to get the diff files from source
# 6 Hours to cleanout daily.6
# 10 hours to copy from .sync to daily.0
total 19 hours.


------------------------------------------------------------------------------
Sean Greenslade
2015-08-13 14:21:15 UTC
Permalink
Post by Thierry Lavallee
Post by David Cantrell
I presume that you've got lots and lots of small files?
This is the source file count... Is this that huge?!
# find /backup -type f | wc -l
# 6044914
In total, this is the time it takes:
# 3 hours to get the diff files from source
# 6 Hours to cleanout daily.6
# 10 hours to copy from .sync to daily.0
total 19 hours.
Consider this. 19 hours = 68,400 seconds. To process 6,000,000 files in that time, you're talking around 87 files per second. If the head is seeking for each one of those files, I'd absolutely believe that time is the best it could possibly do.

--Sean

------------------------------------------------------------------------------
Gordon Messmer
2015-08-14 06:47:57 UTC
Permalink
Post by Sean Greenslade
Consider this. 19 hours = 68,400 seconds. To process 6,000,000 files in that time, you're talking around 87 files per second. If the head is seeking for each one of those files, I'd absolutely believe that time is the best it could possibly do.
Based on what? I have an ext4 filesystem on a single SATA drive, and
it'll copy (cp -al) a directory tree at about 12500 files/second. It'll
remove a directory tree at about 25000 files per second.

Thierry's system is removing about 278 files per second, or about 1% as
fast as my test system.
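If it helps, the same measurement can be reproduced on the backup volume
itself, roughly like this (paths taken from the log above; daily.1 picked
arbitrarily, and the copy needs enough free space and inodes for the links):

# find /media/backupMediaserver02b/home/daily.1 | wc -l
# time cp -al /media/backupMediaserver02b/home/daily.1 \
    /media/backupMediaserver02b/home/test.copy
# time rm -rf /media/backupMediaserver02b/home/test.copy
(dividing the entry count by each elapsed time gives files-per-second figures
directly comparable to the ones above)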

------------------------------------------------------------------------------
Sean Greenslade
2015-08-14 14:13:58 UTC
Permalink
Post by Gordon Messmer
Post by Sean Greenslade
Consider this. 19 hours = 68,400 seconds. To process 6,000,000 files
in that time, you're talking around 87 files per second. If the head
is seeking for each one of those files, I'd absolutely believe that
time is the best it could possibly do.
Based on what?
Gut feel and an assumption of heavy fragmentation.
Post by Gordon Messmer
I have an ext4 filesystem on a single SATA drive, and
it'll copy (cp -al) a directory tree at about 12500 files/second. It'll
remove a directory tree at about 25000 files per second.
Thierry's system is removing about 278 files per second, or about 1% as
fast as my test system.
Then we'd need to know more about the differences between the systems.
If my assumption is correct, the process should be bottlenecking on IO.
Something like iotop should be able to confirm this.
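For example:

# iotop -oP
(-o shows only processes actually doing I/O, -P aggregates per process; watch
the IO> and DISK READ/WRITE columns while the cp/rm phases are running)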

--Sean

------------------------------------------------------------------------------
Arne Hüggenberg
2015-08-13 14:07:27 UTC
Permalink
Post by Thierry Lavallee
Hi,
Here is a shot at my log. I am wondering if this is normal and if there
is a way to optimize this. I find all this copying and removing quite
lengthy. And I am wondering if this is the kind of performance you guys
get and what I should expect on a 260G remote source directory. Should I
post my configuration?
Thanks!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[12/Aug/2015:03:00:03] /usr/bin/rsnapshot -c
/root/scripts/backup/rsnapshot.Mediaserver02_cpbackup.conf sync: started
[12/Aug/2015:03:00:03] Setting locale to POSIX "C"
[12/Aug/2015:03:00:03] echo 4493 > /var/run/rsnapshot_Mediaserver02new.pid
[12/Aug/2015:03:00:03] /usr/bin/rsync -avx --delete --numeric-ids
--delete-excluded --rsh="/usr/bin/ssh -i /root/.ssh/id_rsa"
/media/backupMediaserver02b/home/.sync/
3 hours here to get the diff files.
6 Hours here to cleanout daily.6
10 hours here to copy from .sync to daily.0
All done 19 hours later...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How much RAM does the system doing the rsnapshot run have?
I'm using a ReadNAS6 to back up several servers with lots of small files.

At the default 1GB RAM an rsnapshot run took 18h.

Going to 4GB RAM took that down to 10h.
--
Arne Hüggenberg
System Administrator
_______________________
Sports & Bytes GmbH
Rheinlanddamm 207-209
D-44137 Dortmund
Fon: +49-231-9020-6655
Fax: +49-231-9020-6989

Geschäftsführer: Thomas Treß, Carsten Cramer

Sitz und Handelsregister: Dortmund, HRB 14983
Finanzamt Dortmund - West
Steuer-Nr.: 314/5763/0104
USt-Id. Nr.: DE 208084540

------------------------------------------------------------------------------
Thierry Lavallee
2015-08-13 15:22:01 UTC
Permalink
The system has 12 GB of memory. Yet I don't know if there is a way to
give more RAM or priority to rsnapshot?!

# vmstat -s
12000272 K total memory
11717380 K used memory
5812272 K active memory
4903652 K inactive memory
282892 K free memory
1933016 K buffer memory
6814576 K swap cache
9326584 K total swap
475092 K used swap
8851492 K free swap
How much RAM does the system doing the rsnapshot run have? I'm using a
ReadNAS6 to back up several servers with lots of small files. At the
default 1GB RAM an rsnapshot run took 18h. Going to 4GB RAM took that
down to 10h.
Arne Hüggenberg
2015-08-17 18:19:26 UTC
Permalink
Post by Thierry Lavallee
The system has 12 GB of memory. Yet I don't know if there is a way to
give more RAM or priority to rsnapshot?!
# vmstat -s
12000272 K total memory
11717380 K used memory
5812272 K active memory
4903652 K inactive memory
282892 K free memory
1933016 K buffer memory
6814576 K swap cache
9326584 K total swap
475092 K used swap
8851492 K free swap
No need to give rsnapshot priority; the bottleneck I was seeing was not having
enough RAM for the kernel to properly cache filesystem metadata (e.g. 'what's
in this directory', 'what permissions does this file have', etc.) during the
rsnapshot run.
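Whether that metadata actually stays cached can be checked with something
like:

# slabtop -o | head -15
(look at the dentry and ext4_inode_cache lines - those are the caches in
question)
# cat /proc/sys/vm/vfs_cache_pressure
(default is 100; lower values make the kernel hold on to dentry/inode caches
more aggressively)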

On a 12GB system (with the kernel using ~9GB for filesystem caching) that's most
likely not what you are seeing.

Regards,
--
Arne Hüggenberg
System Administrator

------------------------------------------------------------------------------
Andrew McGlashan
2015-08-17 19:29:01 UTC
Permalink
Post by Arne Hüggenberg
Post by Thierry Lavallee
The system has 12 GB of memory. Yet I don't know if there is a way to
give more RAM or priority to rsnapshot?!
# vmstat -s
12000272 K total memory
11717380 K used memory
5812272 K active memory
4903652 K inactive memory
282892 K free memory
1933016 K buffer memory
6814576 K swap cache
9326584 K total swap
475092 K used swap
8851492 K free swap
No need to give rsnapshot priority; the bottleneck I was seeing was not having
enough RAM for the kernel to properly cache filesystem metadata (e.g. 'what's
in this directory', 'what permissions does this file have', etc.) during the
rsnapshot run.
On a 12GB system (with the kernel using ~9GB for filesystem caching) that's most
likely not what you are seeing.
It's completely normal for Linux to use "as much RAM as available" for
caching data, but that is only so otherwise-idle memory isn't "wasted".
If apps need more RAM, then the caching will be less; if apps don't need
the RAM, the caching will be more. It probably doesn't make that much
difference to rsnapshot one way or the other.

Cheers
AndrewM

------------------------------------------------------------------------------
Arne Hüggenberg
2015-08-19 16:58:10 UTC
Permalink
Post by Andrew McGlashan
Post by Arne Hüggenberg
Post by Thierry Lavallee
The system has 12 GB of memory. Yet I don't know if there is a way to
give more RAM or priority to rsnapshot?!
# vmstat -s
12000272 K total memory
11717380 K used memory
5812272 K active memory
4903652 K inactive memory
282892 K free memory
1933016 K buffer memory
6814576 K swap cache
9326584 K total swap
475092 K used swap
8851492 K free swap
No need to give rsnapshot priority; the bottleneck I was seeing was not
having enough RAM for the kernel to properly cache filesystem metadata
(e.g. 'what's in this directory', 'what permissions does this file have',
etc.) during the rsnapshot run.
On a 12GB system (with the kernel using ~9GB for filesystem caching) that's
most likely not what you are seeing.
It's completely normal for Linux to use "as much RAM as available" for
caching data, but that is only so otherwise-idle memory isn't "wasted".
If apps need more RAM, then the caching will be less; if apps don't need
the RAM, the caching will be more. It probably doesn't make that much
difference to rsnapshot one way or the other.
It matters to rsnapshot if you don't have enough RAM to cache properly and you
end up constantly rereading metadata from disk, because it got evicted from
the cache due to memory pressure.
Post by Andrew McGlashan
Cheers
AndrewM
Regards,
--
Arne Hüggenberg
System Administrator

------------------------------------------------------------------------------
Gordon Messmer
2015-08-13 16:09:04 UTC
Permalink
Post by Thierry Lavallee
10 hours here to copy from .sync to daily.0
It does seem a tiny bit odd that the copy takes so much longer than the
rsync. What filesystems are in use for the rsnapshot volume? What sort
of disk configuration? That is, what model disks, how many disks, what
disk controller, what RAID level?

------------------------------------------------------------------------------
Thierry Lavallee
2015-08-13 16:48:26 UTC
Permalink
hi...
Post by Gordon Messmer
It does seem a tiny bit odd that the copy takes so much longer than
the rsync.
The rsync pulls from a remote server via SSH and only transfers the diff. I
guess/hope. ;)
Post by Gordon Messmer
What filesystems are in use for the rsnapshot volume?
Filesystem is ext4
Just a disk, mounted.
SATA
1TB
7200rpm
Post by Gordon Messmer
what model disks,
Disk size: 931.51 GB | Make and model: ATA TOSHIBA DT01ACA1
| Cylinders: 121601 | Partition format: MSDOS
Post by Gordon Messmer
how many disks,
one
Post by Gordon Messmer
what disk controller,
not sure how to get that info right now. Here is the fdisk -l output

Disk /dev/sdf: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sdf1 1 121601 976760001 83 Linux
Partition 1 does not start on physical sector boundary.
Post by Gordon Messmer
what RAID level?
no raid
Gordon Messmer
2015-08-14 06:42:00 UTC
Permalink
Post by Gordon Messmer
what model disks,
Disk size: 931.51 GB | Make and model: ATA TOSHIBA DT01ACA1
| Cylinders: 121601 | Partition format: MSDOS
Post by Gordon Messmer
what disk controller,
not sure how to get that info right now. Here is the fdisk -l output
...
Device Boot Start End Blocks Id System
/dev/sdf1 1 121601 976760001 83 Linux
Partition 1 does not start on physical sector boundary.
What I meant was, what is the disk connected to? "lspci" should list a
SATA controller. Some are very poorly supported. NVidia chipsets come
to mind. Rotten hardware, those.

But all the same, you do have a big problem there. "Partition 1 does
not start on physical sector boundary." Your disk is listed by the
manufacturer as a 4k sector disk, but your partition doesn't start on a
sector boundary. That's going to cause some very poor IO. You may be
able to shrink and then move the partition to 1M, but it'll take a long
time. It's slightly risky to throw away your backups, but the fastest
resolution would be to delete the partition and create a new one at 1M.
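If the backup disk really is still /dev/sdf and its contents are expendable,
the rebuild could look roughly like this (destructive; double-check the
device name first):

# parted -a optimal /dev/sdf mklabel msdos
# parted -a optimal /dev/sdf mkpart primary ext4 1MiB 100%
# parted /dev/sdf align-check optimal 1
(should report that partition 1 is aligned)
# mkfs.ext4 /dev/sdf1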
Thierry Lavallee
2015-08-14 12:31:22 UTC
Permalink
Post by Gordon Messmer
What I meant was, what is the disk connected to? "lspci" should list
a SATA controller. Some are very poorly supported. NVidia chipsets
come to mind. Rotten hardware, those.
# lspci
00:00.0 Host bridge: Intel Corporation Core Processor DRAM Controller
(rev 02)
00:02.0 VGA compatible controller: Intel Corporation Core Processor
Integrated Graphics Controller (rev 02)
00:16.0 Communication controller: Intel Corporation 5 Series/3400 Series
Chipset HECI Controller (rev 06)
00:16.3 Serial controller: Intel Corporation 5 Series/3400 Series
Chipset KT Controller (rev 06)
00:19.0 Ethernet controller: Intel Corporation 82578DM Gigabit Network
Connection (rev 06)
00:1a.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset
USB2 Enhanced Host Controller (rev 06)
00:1b.0 Audio device: Intel Corporation 5 Series/3400 Series Chipset
High Definition Audio (rev 06)
00:1c.0 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI
Express Root Port 1 (rev 06)
00:1c.4 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI
Express Root Port 5 (rev 06)
00:1d.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset
USB2 Enhanced Host Controller (rev 06)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a6)
00:1f.0 ISA bridge: Intel Corporation 3400 Series Chipset LPC Interface
Controller (rev 06)
00:1f.2 SATA controller: Intel Corporation 5 Series/3400 Series Chipset
6 port SATA AHCI Controller (rev 06)
00:1f.3 SMBus: Intel Corporation 5 Series/3400 Series Chipset SMBus
Controller (rev 06)
02:00.0 SATA controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II
Controller (rev 01)
Post by Gordon Messmer
But all the same, you do have a big problem there. "Partition 1 does
not start on physical sector boundary." Your disk is listed by the
manufacturer as a 4k sector disk, but your partition doesn't start on
a sector boundary. That's going to cause some very poor IO. You may
be able to shrink and then move the partition to 1M, but it'll take a
long time. It's slightly risky to throw away your backups, but the
fastest resolution would be to delete the partition and create a new
one at 1M.
I see it on many drives. Some don't start right, some don't end right :/

/dev/sdb1 1 267350 2147483647+ ee GPT
Partition 1 does not start on physical sector boundary.
WARNING: GPT (GUID Partition Table) detected on '/dev/sdc'! The util
fdisk doesn't support GPT. Use GNU Parted.

/dev/sdc1 1 267350 2147483647+ ee GPT
Partition 1 does not start on physical sector boundary.

/dev/sde1 1 121601 976760001 83 Linux
Partition 1 does not start on physical sector boundary.
WARNING: GPT (GUID Partition Table) detected on '/dev/sdg'! The util
fdisk doesn't support GPT. Use GNU Parted.

/dev/sdf1 * 1 32 248832 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sdf2 32 30402 243947521 5 Extended
/dev/sdf5 32 30402 243947520 8e Linux LVM


How could this have happened? Just so I don't repeat it! :/
Should I simply format a new drive and move the whole repository to this
new haven?

Thanks!
Gordon Messmer
2015-08-14 17:25:52 UTC
Permalink
Post by Thierry Lavallee
I see it on many drives. Some don't start right, some don't end right :/
/dev/sdb1 1 267350 2147483647+ ee GPT
Partition 1 does not start on physical sector boundary.
WARNING: GPT (GUID Partition Table) detected on '/dev/sdc'! The util
fdisk doesn't support GPT. Use GNU Parted.
Ignore any such output on GPT disks. Use parted to view the alignment
of the actual partitions on those disks.
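For example:

# parted /dev/sdb unit s print
# parted /dev/sdb align-check optimal 1
(partition start sectors that are multiples of 2048 are 1MiB-aligned on a
4k-sector drive)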
Post by Thierry Lavallee
/dev/sdf1 * 1 32 248832 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sdf2 32 30402 243947521 5 Extended
/dev/sdf5 32 30402 243947520 8e Linux LVM
How could this have happened? Just so I don't repeat it! :/
That depends on how you partitioned the drive. If you use fdisk and
manually align partitions, you should start at 1M as a habit. If you
use parted newer than 2.2, it should align partitions optimally.
Post by Thierry Lavallee
Should I simply format a new drive and move the whole repository to
this new haven?
Using a new disk may be a good option. Or don't move it at all. Just
keep it for six days until your backups are repopulated, and then wipe it.
Scott Hess
2015-08-14 17:37:41 UTC
Permalink
Post by Thierry Lavallee
Should I simply format a new drive and move the whole repository to
this new haven?
Using a new disk may be a good option. Or don't move it at all. Just
keep it for six days until your backups are repopulated, and then wipe it.

If you have the hardware for it, it might be interesting to spin up an
entirely independent backup system just to see if you get better results.
Then you could go back to the original system and start changing variables
to see if you can isolate the specific problem. Different is better for
such an experiment.

-scott
Scott Hess
2015-08-13 19:54:20 UTC
Permalink
Post by Thierry Lavallee
Here is a shot at my log. I am wondering if this is normal and if there
is a way to optimize this. I find all this copying and removing quite
lengthy. And I am wondering if this is the kind of performance you guys
get and what I should expect on a 260G remote source directory. Should I
post my configuration?
[rsync populating .sync takes 3 hours.]

My .sync runs across seven local systems, covering around 2M files (to your
6M) taking about 225G of space, and only takes a minute or two to run.

[6 hours to remove daily.6]

I'm running delete-after, so deletes aren't blocking other rsnapshot runs.
AFAICT, my deletes are taking around 7 minutes.
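(Presumably this refers to rsnapshot's lazy-deletes option; in rsnapshot.conf
that would be the tab-separated line

use_lazy_deletes	1

which defers removal of the oldest snapshot (it is first renamed to a
_delete.<pid> directory and cleaned up later), so the rotation isn't blocked
by the big rm.)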

[10 hours to copy .sync to daily.0]

I'm seeing around 5 minutes.

So I would say your results are indicating something substantially wrong
somewhere. My backup server is an Asus Vivo-PC all-in-one box; it's beefy
as far as those go, but it's not a super-beefy machine. The machine it
replaced was a pretty old mini-ITX embedded system which was pretty
modest. That system was slower, but the entire process wasn't too much
slower; I seem to remember on the order of 25-30 minutes for everything.
The current machine has 8GB, but the previous one only had 1GB; since you
mention 12GB, I'm inclined to think you have "more than enough" memory
involved.

IMHO this kind of large discrepancy probably indicates something awry in
how your system is running.

Does dmesg say anything? I've seen this kind of huge slowdown on a system
which was throwing errors in the disk or controller, where the slowdown was
basically due to repeated bus resets of some sort. It was fast enough when
running, but every time it hit an error the OS would do something to deal
with the error which injected a pause.

You might check mount options (just run "mount", no args). I don't know
what I'd look for, here, but it's plausible that aggressive journaling
options could result in a slowdown running metadata-heavy operations like
the copy and delete phases. I'd not expect that to have a huge impact on
the rsync phase, though, unless you have a lot of churn. [Note that a lot
of churn in the directories you're backing up may have knock-on effects on
the delete operation. With low churn, most files will be retained and just
have their inodes touched; with high churn, the deleted files require
updates to the bitmaps tracking free blocks and the like.]
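For instance (device and mount point taken from earlier in the thread):

# mount | grep backupMediaserver02b
# tune2fs -l /dev/sdf1 | grep -iE 'features|mount options'
(one common, low-risk tweak is remounting with noatime, e.g.
"mount -o remount,noatime /media/backupMediaserver02b", so reads during the
rsync and cp phases don't trigger atime updates at all)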

I once noticed that one of the external backups I mirror using
rsnapshot-like operations was slower than the other mirrors. AFAICT, the
root cause is that the device exposes some sort of manufactured view of
the disk, different from the other devices I have; I suspect the thing is
controlling some virtual device which maps down to the physical device, but
just isn't fast enough to keep up with seek-heavy operations. Basically it
seems adequate for streaming a file to/from disk, but metadata-heavy
operations like deleting an rsnapshot directory are much slower than
expected. I actually don't know how much slower it is; my eventual
"solution" was to just let it run overnight and ignore the problem. But
that device would not be acceptable as a primary backup device.

-scott
David Keegel
2015-08-14 23:57:23 UTC
Permalink
Post by Scott Hess
So I would say your results are indicating something substantially
wrong somewhere.
IMHO this kind of large discrepancy probably indicates something awry
in how your system is running.
Does dmesg say anything? I've seen this kind of huge slowdown on a
system which was throwing errors in the disk or controller, where the
slowdown was basically due to repeated bus resets of some sort. It
was fast enough when running, but every time it hit an error the OS
would do something to deal with the error which injected a pause.
You might check mount options (just run "mount", no args).
Thierry tried to respond with the output of mount and the entire output
of dmesg, which made it a 170KB message - too large for this list, which
has a limit of 40 KB per message.

I noticed a few lines like the following in the dmesg output:
[168627.749313] INFO: task cp:31503 blocked for more than 120 seconds.
(followed by a call trace) and a similar one for a "rm" process.

Thierry, can you post the lines from dmesg that look like that, and
the following 15-20 lines (including the call trace)?
--
___________________________________________________________________________
David Keegel <***@cyber.com.au> Cyber IT Solutions Pty. Ltd.
http://www.cyber.com.au/~djk/ Linux & Unix Systems Administration


------------------------------------------------------------------------------
Thierry Lavallee
2015-08-15 13:41:16 UTC
Permalink
Can you post the lines from dmesg that look like that, and the
following 15-20 lines (including the call trace)?
Here you go. This is from my original dmesg that we discussed.

[168627.749313] INFO: task cp:31503 blocked for more than 120 seconds.
[168627.749319] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[168627.749323] cp D ffff88000c20fb30 0 31503 26832
0x00000000
[168627.749330] ffffffff814871f0 0000000000000082 ffff8802029e5bb8
ffff8802029e5bb4
[168627.749337] ffff880091ce45b0 ffff8800474d07e0 000000000000f9e0
ffff8802029e5fd8
[168627.749342] 00000000000157c0 00000000000157c0 ffff8803359faa60
ffff8803359fad58
[168627.749348] Call Trace:
[168627.749361] [<ffffffff8110e06f>] ? __getblk+0x26/0x29a
[168627.749369] [<ffffffff812fd183>] ? __mutex_lock_common+0x122/0x192
[168627.749375] [<ffffffff812fd2ab>] ? mutex_lock+0x1a/0x31
[168627.749393] [<ffffffffa02b532e>] ?
jbd2_log_do_checkpoint+0x4e3/0x5f1 [jbd2]
[168627.749402] [<ffffffffa02b2a2d>] ? start_this_handle+0x363/0x3ec
[jbd2]
[168627.749411] [<ffffffffa02b2c16>] ? jbd2_journal_start+0x96/0xc6
[jbd2]
[168627.749428] [<ffffffffa02ebdc7>] ?
ext4_journal_start_sb+0xd4/0x10e [ext4]
[168627.749436] [<ffffffff810fd8c0>] ? d_rehash+0x2a/0x3a
[168627.749449] [<ffffffffa02db182>] ? ext4_link+0x81/0x15e [ext4]
[168627.749455] [<ffffffff810f69f4>] ? generic_permission+0xe/0x8a
[168627.749461] [<ffffffff810f8227>] ? vfs_link+0xd4/0x13e
[168627.749466] [<ffffffff810fa63a>] ? sys_linkat+0xd3/0x12f
[168627.749473] [<ffffffff810f2931>] ? sys_newlstat+0x23/0x30
[168627.749480] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b


But now I have restarted and I do not see these anymore.
I pasted my dmesg output online to avoid being blocked by the list size policy.
See my dmesg here: http://justpaste.it/n1f9

thanks all for your continued help!