Discussion:
[rsnapshot-discuss] Reaching the limit of rsnapshot: Too many files, only few changes
Thomas Güttler
2016-08-01 08:56:37 UTC
Permalink
We have been using rsnapshot for backups for several years.

On some systems we are reaching its limit: there are too many files, and only very few of them change and need to be backed up.

Only about 0.1% of all files change on any given day!

Unfortunately the changed files are scattered across a deep directory tree. Rsync takes a very long time to discover the changes.

I think sooner or later we will need to use a different tool or change the way we use rsnapshot.

Up to now the application which creates the files can't handle a storage API. We need something which can be mounted
like a file system.

Which tool could fit?

Environment:

* 17M files (number of files)
* 2.2TBytes of data.
* one host accessing the data via RAID.
--
Thomas Guettler http://www.thomas-guettler.de/


------------------------------------------------------------------------------
Nico Kadel-Garcia
2016-08-01 11:46:59 UTC
Permalink
On Mon, Aug 1, 2016 at 4:56 AM, Thomas Güttler
Post by Thomas Güttler
We have been using rsnapshot for backups for several years.
On some systems we are reaching its limit: there are too many files, and only very few of them change and need to be backed up.
Only about 0.1% of all files change on any given day!
Unfortunately the changed files are scattered across a deep directory tree. Rsync takes a very long time to discover the changes.
I think sooner or later we will need to use a different tool or change the way we use rsnapshot.
Well, *that* part would be easy. The deep and unchanging section can
be backed up separately, and possibly less frequently, than the more
dynamic section for most configurations. That's easy to set up with
appropriate "--exclude" targets for the bulkier backup, and a more
directory-targeted backup for the parts you mention that have
specific requirements.
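For illustration, a split along those lines might use two rsnapshot configs driven by separate cron entries (the paths, retention and exclude pattern below are hypothetical; rsnapshot.conf fields are tab-separated):

# rsnapshot-dynamic.conf -- run daily, skip the rarely-changing tree
backup	root@fileserver:/data/	fileserver-data/	exclude=archive/

# rsnapshot-archive.conf -- run e.g. weekly for the mostly-static part
backup	root@fileserver:/data/archive/	fileserver-archive/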
Post by Thomas Güttler
Up to now the application which creates the files can't handle a storage API. We need something which can be mounted
like a file system.
This is unclear. The files created by your software need to be on a
filesystem, which is then backed up by rsnapshot?
Post by Thomas Güttler
Which tool could fit?
* 17M files (number of files)
* 2.2TBytes of data.
* one host accessing the data via RAID.
*Why* is it scattering data among 17 million files on multiple
terabytes? Or are the updates more focused?
Post by Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/
------------------------------------------------------------------------------
_______________________________________________
rsnapshot-discuss mailing list
https://lists.sourceforge.net/lists/listinfo/rsnapshot-discuss
------------------------------------------------------------------------------
Thomas Güttler
2016-08-01 12:15:19 UTC
Permalink
Post by Nico Kadel-Garcia
On Mon, Aug 1, 2016 at 4:56 AM, Thomas Güttler
Post by Thomas Güttler
We have been using rsnapshot for backups for several years.
On some systems we are reaching its limit: there are too many files, and only very few of them change and need to be backed up.
Only about 0.1% of all files change on any given day!
Unfortunately the changed files are scattered across a deep directory tree. Rsync takes a very long time to discover the changes.
I think sooner or later we will need to use a different tool or change the way we use rsnapshot.
Well, *that* part would be easy. The deep and unchanging section can
be backed up separately, and possibly less frequently, than the more
dynamic section for most configurations. That's easy to set up with
appropriate "--exclude" targets for the bulkier backup, and a more
directory-targeted backup for the parts you mention that have
specific requirements.
Maintaining include/exclude lists by hand does not work in this context.
Updates can happen in any directory.
Post by Nico Kadel-Garcia
Post by Thomas Güttler
Up to now the application which creates the files can't handle a storage API. We need something which can be mounted
like a file system.
This is unclear. The files created by your software need to be on a
filesystem, which is then backed up by rsnapshot?
Yes. The software needs a big filesystem. And this filesystem gets
backed up by rsnapshot.
Post by Nico Kadel-Garcia
Post by Thomas Güttler
Which tool could fit?
* 17M files (number of files)
* 2.2TBytes of data.
* one host accessing the data via RAID.
*Why* is it scattering data among 17 million files on multiple
terabytes? Or are the updates more focused?
This is out of my sphere of influence. I am here
to improve the backup. Changing the system which creates
these files is out-of-scope.

Regards,
Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/

------------------------------------------------------------------------------
Christopher Barry
2016-08-01 14:13:25 UTC
Permalink
On Mon, 1 Aug 2016 14:15:19 +0200
Post by Thomas Güttler
Post by Nico Kadel-Garcia
On Mon, Aug 1, 2016 at 4:56 AM, Thomas Güttler
Post by Thomas Güttler
We have been using rsnapshot for backups for several years.
On some systems we are reaching its limit: there are too many files,
and only very few of them change and need to be backed up.
Only about 0.1% of all files change on any given day!
Unfortunately the changed files are scattered across a deep directory
tree. Rsync takes a very long time to discover the changes.
I think sooner or later we will need to use a different tool or change
the way we use rsnapshot.
Well, *that* part would be easy. The deep and unchanging section can
be backed up separately, and possibly less frequently, than the more
dynamic section for most configurations. That's easy to set up with
appropriate "--exclude" targets for the bulkier backup, and a more
directory-targeted backup for the parts you mention that have
specific requirements.
Maintaining include/exclude lists by hand does not work in this
context. Updates can happen in any directory.
Post by Nico Kadel-Garcia
Post by Thomas Güttler
Up to now the application which creates the files can't handle a
storage API. We need something which can be mounted like a file
system.
This is unclear. The files created by your software need to be on a
filesystem, which is then backed up by rsnapshot?
Yes. The software needs a big filesystem. And this filesystem gets
backed up by rsnapshot.
Post by Nico Kadel-Garcia
Post by Thomas Güttler
Which tool could fit?
* 17M files (number of files)
* 2.2TBytes of data.
* one host accessing the data via RAID.
*Why* is it scattering data among 17 million files on multiple
terabytes? Or are the updates more focused?
This is out of my sphere of influence. I am here
to improve the backup. Changing the system which creates
these files is out-of-scope.
Regards,
Thomas Güttler
I can think of two ideas off the top of my head:

1. Increase the time between rsnapshot backups.

2. Use something like inotify to monitor and identify directories where
files have been changed.

In the former case, simply change the cron.

In the latter case, do something like exclude all, then selectively
include the changed directories via a script that modifies an rsnapshot
config file, then run rsnapshot based on that config. This is of
course a simplistic overview, but at first blush it seems doable. How that
would play with hard-linking, etc. is not immediately clear to me,
however. Others can add reasons why this idea can or cannot work...
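A rough, untested sketch of that script, using inotifywait from the inotify-tools package (paths and the event list are only illustrative):

#!/bin/sh
# watch the tree and append the directory of every change event to a log
inotifywait -m -r -q -e create,modify,delete,move --format '%w' /data \
    >> /var/log/changed-dirs.log &

# before each rsnapshot run: turn the log into a deduplicated include list,
# then start with a fresh log for the next cycle
sort -u /var/log/changed-dirs.log > /etc/rsnapshot-include.txt
: > /var/log/changed-dirs.log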


Just food for thought...

-C
--
Regards,
Christopher

------------------------------------------------------------------------------
Thomas Güttler
2016-08-01 14:22:28 UTC
Permalink
Post by Christopher Barry
I can think of two ideas off the top of my head;
1. Increase the time between rsnapshot backups.
Yes, this is a work-around. The drawback is clear: More data gets lost
if there is a crash ... (yes we have RAID, but this is not enough ...)
Post by Christopher Barry
2. Use something like inotify to monitor and identify directories where
files have been changed.
We thought about this, too. But inotify has a limitation: you need to watch
every directory. So far I have found no inotify facility in Linux which can
watch the whole filesystem at once.


Regards,
Thomas
--
Thomas Guettler http://www.thomas-guettler.de/

------------------------------------------------------------------------------
Christopher Barry
2016-08-01 14:46:21 UTC
Permalink
On Mon, 1 Aug 2016 16:22:28 +0200
Post by Thomas Güttler
Post by Christopher Barry
I can think of two ideas off the top of my head;
1. Increase the time between rsnapshot backups.
Yes, this is a work-around. The drawback is clear: More data gets lost
if there is a crash ... (yes we have RAID, but this is not enough ...)
Post by Christopher Barry
2. Use something like inotify to monitor and identify directories
where files have been changed.
We thought about this, too. But inotify has a limitation: you need to watch
every directory. So far I have found no inotify facility in Linux which can
watch the whole filesystem at once.
Regards,
Thomas
This may prove useful to you:
http://www.ibm.com/developerworks/library/l-ubuntu-inotify/index.html

It's a bit older, but the concept is the same. You'll need to handle
the recursive actions and updates in user space yourself, but it's
definitely doable.

Good luck, and I (and probably everyone on this list) would love to see
what you come up with.
--
Regards,
Christopher

------------------------------------------------------------------------------
Thomas Güttler
2016-08-02 07:22:45 UTC
Permalink
On 01.08.2016 at 16:46, Christopher Barry wrote:
...
Post by Christopher Barry
http://www.ibm.com/developerworks/library/l-ubuntu-inotify/index.html
inotify does not scale the way we need it to. There are too many directories.

Nevertheless, thank you for trying to help.

Regards,
Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/

------------------------------------------------------------------------------
Christopher Barry
2016-08-02 13:51:11 UTC
Permalink
On Tue, 2 Aug 2016 09:22:45 +0200
Post by Thomas Güttler
...
Post by Christopher Barry
http://www.ibm.com/developerworks/library/l-ubuntu-inotify/index.html
inotify does not scale the way we need it to. There are too many directories.
Nevertheless, thank you for trying to help.
Regards,
Thomas Güttler
OK. Lists are for trying to help :)

Sounds like a hard problem to solve with anything other than simply
more and more time for backing up, unless you change the underlying
filesystem type (ZFS sounded like an interesting idea to explore) or
replace the existing hardware with something much faster (or both).

What options have you come up with so far?
What do you think you'll end up doing to solve this problem?
--
Regards,
Christopher

------------------------------------------------------------------------------
Thomas Güttler
2016-08-02 14:28:46 UTC
Permalink
Post by Christopher Barry
On Tue, 2 Aug 2016 09:22:45 +0200
Post by Thomas Güttler
...
Post by Christopher Barry
http://www.ibm.com/developerworks/library/l-ubuntu-inotify/index.html
inotify does not scale the way we need it to. There are too many directories.
Nevertheless, thank you for trying to help.
Regards,
Thomas Güttler
OK. Lists are for trying to help :)
Sounds like a hard problem to solve with anything other than simply
more and more time for backing up, unless you change the underlying
filesystem type (ZFS sounded like an interesting idea to explore) or
replace the existing hardware with something much faster (or both).
What options have you come up with so far?
So far we have increased the backup interval.
Post by Christopher Barry
What do you think you'll end up doing to solve this problem?
I am unsure. At the moment I see these solutions:

- Wrapping the filesystem with an overlay filesystem which logs all changed files, and using this as an include list
for rsnapshot (see the sketch after this list).

- Moving the data to external storage and mounting it with a tool like s3fs.

- Using a filesystem which can report all changed files.
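For the first option, a very rough sketch with overlayfs (directory names are placeholders, and the application would have to write through the merged mount for this to work):

# lowerdir holds the existing data; every file created or modified afterwards
# ends up as a copy in upperdir
mount -t overlay overlay \
    -o lowerdir=/data,upperdir=/data-changes,workdir=/data-work /data-merged

# the upper layer then doubles as a ready-made list of changed files
find /data-changes -type f > /etc/rsnapshot-include.txt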

Regards,
Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/

------------------------------------------------------------------------------
Oliver Peter
2016-08-02 19:47:33 UTC
Permalink
Post by Christopher Barry
...
What do you think you'll end up doing to solve this problem?
- Wrapping the filesystem with an overlay filesystem which logs all changed files, and using this as an include list
for rsnapshot
Or use ZFS, do filesystem-based snapshots and transfer the incremental
changes to a remote site (zfs send | zfs receive).
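For example, an incremental transfer looks roughly like this (pool/dataset names and the snapshot naming scheme are placeholders):

# take a new snapshot, then send only the delta since the previous one
zfs snapshot tank/data@2016-08-02
zfs send -i tank/data@2016-08-01 tank/data@2016-08-02 | ssh backuphost zfs receive backup/data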
Post by Christopher Barry
- Moving the data to external storage and mounting it with a tool like s3fs.
Or use ZFS and enjoy other benefits like transparent compression and
checksumming.
Post by Christopher Barry
- Using a filesystem which can report all changed files.
Or use ZFS and...

I might sound like a sales drone and/or a one-trick pony, but all of the
options you mentioned sound over-engineered and carry a lot of potential to fail
sooner or later.
ZFS is the most pragmatic and straightforward solution I can think of.
--
Oliver PETER ***@gfuzz.de 0x456D688F

------------------------------------------------------------------------------
Thomas Güttler
2016-08-03 07:49:50 UTC
Permalink
Post by Oliver Peter
Post by Christopher Barry
...
What do you think you'll end up doing to solve this problem?
- Wrapping the filesystem with an overlay filesystem which logs all changed files, and using this as an include list
for rsnapshot
Or use ZFS, do filesystem-based snapshots and transfer the incremental
changes to a remote site (zfs send | zfs receive).
What do you mean by "incremental changes" in the line above?

... let me search ... I guess you mean this: https://en.wikipedia.org/wiki/ZFS#Sending_and_receiving_snapshots

{{{
ZFS file systems can be moved to other pools, also on remote hosts over the network, as the send command creates a
stream representation of the file system's state. This stream can either describe complete contents of the file system
at a given snapshot, or it can be a delta between snapshots. Computing the delta stream is very efficient, and its size
depends on the number of blocks changed between the snapshots. This provides an efficient strategy, e.g. for
synchronizing offsite backups or high availability mirrors of a pool.
}}}

Yes, this sounds good. Sorry, I did not understand it the first time. At first I thought
I should run rsnapshot on the snapshot. That would not help, since the huge
directory tree would still have to be scanned for changes.
Post by Oliver Peter
Post by Christopher Barry
- Moving the data to external storage and mounting it with a tool like s3fs.
Or use ZFS and enjoy other benefits like transparent compression and
checksumming.
Post by Christopher Barry
- Using a filesystem which can report all changed files.
Or use ZFS and...
Is ZFS the only filesystem that is open source, available for Linux, and
able to do this?
Post by Oliver Peter
I might sound like a sales drone and/or a one-trick pony, but all of the
options you mentioned sound over-engineered and carry a lot of potential to fail
sooner or later.
Yes, you are right. I am more a programmer than an admin. But I know
that if I start coding here, I am definitely going in the wrong direction.

Thank you Oliver,

Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/

------------------------------------------------------------------------------
Patrick O'Callaghan
2016-08-03 08:42:58 UTC
Permalink
Post by Thomas Güttler
Is ZFS the only filesystem that is open source, available for Linux, and
able to do this?
Before committing to ZFS, you should know that it belongs to Oracle and is
licensed under the CDDL license, i.e. it is "open source" but not "free",
which is why it's not officially supported in many Linux distros (e.g.
Fedora). That may or may not matter to you.

poc
Oliver Peter
2016-08-03 10:12:29 UTC
Permalink
Post by Thomas Güttler
Post by Oliver Peter
Post by Christopher Barry
...
What do you think you'll end up doing to solve this problem?
- Wrapping the filesystem with an overlay filesystem which logs all changed files, and using this as an include list
for rsnapshot
Or use ZFS, do filesystem-based snapshots and transfer the incremental
changes to a remote site (zfs send | zfs receive).
What do you mean by "incremental changes" in the line above?
... let me search ... I guess you mean this: https://en.wikipedia.org/wiki/ZFS#Sending_and_receiving_snapshots
{{{
ZFS file systems can be moved to other pools, also on remote hosts over the network, as the send command creates a
stream representation of the file system's state. This stream can either describe complete contents of the file system
at a given snapshot, or it can be a delta between snapshots. Computing the delta stream is very efficient, and its size
depends on the number of blocks changed between the snapshots. This provides an efficient strategy, e.g. for
synchronizing offsite backups or high availability mirrors of a pool.
}}}
Yes, this sounds good. Sorry, I did not understand it the first time. At first I thought
I should run rsnapshot on the snapshot. That would not help, since the huge
directory tree would still have to be scanned for changes.
What I meant in my first mail is that having, for example, daily.0 to
daily.14, weekly.0 to weekly.4 and monthly.0 to monthly.6 as rsnapshot
intervals will cause fragmented hard links all over the place, which, as someone
already mentioned here, consumes a lot of RAM and access time.

My idea is to have ZFS on the backup server and only keep daily.0 and
nothing else for the backup. Before rsnapshot fetches the next backup, it
creates a transparent snapshot of daily.0 and then runs rsync. This
way you use the efficient snapshot technology of ZFS and don't waste your
time resolving fragmented hard links and rotating daily backups.
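As an illustration only (the dataset name, retention and helper script below are made up), the rsnapshot side of this could be as small as:

# rsnapshot.conf on the ZFS-backed backup server -- keep a single level
retain	daily	1

# snapshot the previous daily.0 before rsync overwrites it
cmd_preexec	/usr/local/bin/snapshot-daily0.sh

with a one-line helper script:

#!/bin/sh
# /usr/local/bin/snapshot-daily0.sh (hypothetical)
zfs snapshot backup/rsnapshot@$(date +%Y%m%d-%H%M%S)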

If you don't like this idea - and this is going to be off-topic now - you
could switch from rsnapshot to ZFS snapshots, which would mean that you
have to migrate your live server to another filesystem and/or operating
system.
Post by Thomas Güttler
Post by Oliver Peter
Post by Christopher Barry
- Moving the data to external storage and mounting it with a tool like s3fs.
Or use ZFS and enjoy other benefits like transparent compression and
checksumming.
Post by Christopher Barry
- Using a filesystem which can report all changed files.
Or use ZFS and...
Is ZFS the only filesystem that is open source, available for Linux, and
able to do this?
There is also BTRFS, which has a similar snapshot technology, but I would
recommend giving ZoL a try:
http://zfsonlinux.org/
--
Oliver PETER ***@gfuzz.de 0x456D688F

------------------------------------------------------------------------------
Ken Woods
2016-08-02 14:57:35 UTC
Permalink
Post by Christopher Barry
OK. Lists are for trying to help :)
Sounds like a hard problem to solve with anything other than simply
more and more time for backing up, unless you change the underlying
filesystem type (ZFS sounded like an interesting idea to explore) or
replace the existing hardware with something much faster (or both).
What options have you come up with so far?
What do you think you'll end up doing to solve this problem?
.......And I'm curious what the data is and what creates it. That many new/changed files and only 2.x TB has to be interesting


------------------------------------------------------------------------------
Gandalf Corvotempesta
2016-08-02 15:27:53 UTC
Permalink
Post by Ken Woods
.......And I'm curious what the data is and what creates it. That many
new/changed files and only 2.x TB has to be interesting
Any mass-hosting environment where you have thousands of websites and
email accounts can create millions of small files every day.

For example, if you host tons of PrestaShop e-commerce or WordPress sites,
both with disk caching enabled, you will see millions of small files changed
every night, plus the email accounts used by these sites/customers.
Patrick O'Callaghan
2016-08-02 16:26:15 UTC
Permalink
On 2 August 2016 at 16:27, Gandalf Corvotempesta <
Post by Gandalf Corvotempesta
Post by Ken Woods
.......And I'm curious what the data is and what creates it. That many
new/changed files and only 2.x TB has to be interesting
Any mass-hosting environment where you have thousands of websites and
email accounts can create millions of small files every day.
For example, if you host tons of PrestaShop e-commerce or WordPress sites,
both with disk caching enabled, you will see millions of small files changed
every night, plus the email accounts used by these sites/customers.
In such a case the web front end would be the place to log changed files
for later backup.

poc
Gandalf Corvotempesta
2016-08-02 19:02:11 UTC
Permalink
If 200 sites are generating 2tb of changes in a day, I'd think there's
bigger issues at hand, but.......I'm not in the web hosting world, so
perhaps you're right.
I wrote less than 1 GB, not 2 TB.
Thomas Güttler
2016-08-03 07:53:23 UTC
Permalink
Post by Ken Woods
Post by Christopher Barry
OK. Lists are for trying to help :)
Sounds like a hard problem to solve with anything other than simply
more and more time for backing up, unless you change the underlying
filesystem type (ZFS sounded like an interesting idea to explore) or
replace the existing hardware with something much faster (or both).
What options have you come up with so far?
What do you think you'll end up doing to solve this problem?
.......And I'm curious what the data is and what creates it. That many new/changed files and only 2.x TB has to be interesting
We develop and maintain archive, workflow and issue systems.

Data: scans, mails and PDFs coming into or leaving companies.

Regards,
Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/

------------------------------------------------------------------------------
Thomas Güttler
2016-08-03 07:56:21 UTC
Permalink
Post by Ken Woods
Post by Christopher Barry
OK. Lists are for trying to help :)
Sounds like a hard problem to solve with anything other than simply
more and more time for backing up, unless you change the underlying
filesystem type (ZFS sounded like an interesting idea to explore) or
replace the existing hardware with something much faster (or both).
What options have you come up with so far?
What do you think you'll end up doing to solve this problem?
.......And I'm curious what the data is and what creates it. That many new/changed files and only 2.x TB has to be interesting
.. there are only a few changes per day. The trouble is detecting these changes: rsync needs to crawl for a very long
time to find those few changes. The current solution will work for the next few months, but sooner or later
I want to switch.
--
Thomas Guettler http://www.thomas-guettler.de/

------------------------------------------------------------------------------
Arne Hüggenberg
2016-08-03 16:43:41 UTC
Permalink
Post by Christopher Barry
On Tue, 2 Aug 2016 09:22:45 +0200
Post by Thomas Güttler
...
Post by Christopher Barry
http://www.ibm.com/developerworks/library/l-ubuntu-inotify/index.html
inotify does not scale the way we need it to. There are too many directories.
Nevertheless, thank you for trying to help.
Regards,
Thomas Güttler
OK. Lists are for trying to help :)
Sounds like a hard problem to solve with anything other than simply
more and more time for backing up, unless you change the underlying
filesystem type (ZFS sounded like an interesting idea to explore) or
replace the existing hardware with something much faster (or both).
What options have you come up with so far?
What do you think you'll end up doing to solve this problem?
One thing that might buy you some time is tuning the VFS layer to spend more
RAM on caching directory/inode objects.
The assumption is that, with more cached directory/inode objects, rsync will
be able to build its file list faster. On the rsnapshot server it will also
speed up the cp/rm steps.

In order to make Linux prefer the buffer cache (directories/inodes) over the pagecache (file
contents) you can tune /proc/sys/vm/vfs_cache_pressure.
Obviously, this can have severe impacts on other workloads, so you will have
to use your own judgement on whether/how far you can tune this for your specific
use case, especially as regards the servers being backed up.

Here's the kernel doc for that proc parameter:

###
/proc/sys/vm/vfs_cache_pressure
------------------

This percentage value controls the tendency of the kernel to reclaim
the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.

Increasing vfs_cache_pressure significantly beyond 100 may have negative
performance impact. Reclaim code needs to take various locks to find freeable
directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.
###

So, assuming you have enough RAM to make it worthwhile and don't need the
pagecache for throughput reasons, you could incrementally lower that value and
see whether it does anything good for you.
It is obviously best if you can do it on both ends, but it might be worthwhile even
just on the rsnapshot server (especially if it's a dedicated server, just to keep
Linux autotuning from reclaiming needed inode buffers for useless pagecache).
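For example (the value 50 below is just a starting point to experiment with, not a recommendation):

# check the current value (default is 100)
sysctl vm.vfs_cache_pressure

# prefer keeping dentry/inode caches over the page cache
sysctl -w vm.vfs_cache_pressure=50

# persist it via /etc/sysctl.conf only once it has proven itself
echo 'vm.vfs_cache_pressure = 50' >> /etc/sysctl.conf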

I have no experience with *BSD or Solaris etc., but I would assume that they
have similar knobs you can fiddle with.


Regards,
--
Arne Hüggenberg
System Administrator
_______________________
Sports & Bytes GmbH
Rheinlanddamm 207-209
D-44137 Dortmund
Fon: +49-231-9020-6655
Fax: +49-231-9020-6989

Geschäftsführer: Thomas Treß, Carsten Cramer

Sitz und Handelsregister: Dortmund, HRB 14983
Finanzamt Dortmund - West
Steuer-Nr.: 314/5763/0104
USt-Id. Nr.: DE 208084540

------------------------------------------------------------------------------
David Cantrell
2016-08-01 14:18:24 UTC
Permalink
Post by Thomas Güttler
Maintaining include/exclude lists by hand does not work in this context.
Updates can happen in any directory.
Some OSes have mechanisms to tell some piece of code when changes
happen in a filesystem, e.g. inotify on Linux, FSEvents on OS X. Could you
use that to maintain include/exclude lists automagically?
--
David Cantrell

------------------------------------------------------------------------------
Arne Hüggenberg
2016-08-01 13:59:08 UTC
Permalink
Post by Thomas Güttler
We have been using rsnapshot for backups for several years.
On some systems we are reaching its limit: there are too many files and only very
few of them change and need to be backed up.
Only about 0.1% of all files change on any given day!
Unfortunately the changed files are scattered across a deep directory tree.
Rsync takes a very long time to discover the changes.
I think sooner or later we will need to use a different tool or change the way we use rsnapshot.
Up to now the application which creates the files can't handle a storage
API. We need something which can be mounted like a file system.
Which tool could fit?
* 17M files (number of files)
* 2.2TBytes of data.
* one host accessing the data via RAID.
The only time I ever came across something that big, they were using Tivoli
Storage Manager and there was a daemon running in the background tracking
every file change. When the backup ran it would use the daemon's journal instead
of trawling the filesystem.
Of course they also ran traditional incremental filesystem-trawling backups on
the weekends, just in case their daemon missed something.
--
Arne Hüggenberg
System Administrator
_______________________
Sports & Bytes GmbH
Rheinlanddamm 207-209
D-44137 Dortmund
Fon: +49-231-9020-6655
Fax: +49-231-9020-6989

Geschäftsführer: Thomas Treß, Carsten Cramer

Sitz und Handelsregister: Dortmund, HRB 14983
Finanzamt Dortmund - West
Steuer-Nr.: 314/5763/0104
USt-Id. Nr.: DE 208084540

------------------------------------------------------------------------------
Oliver Peter
2016-08-01 14:36:47 UTC
Permalink
Post by Thomas Güttler
We have been using rsnapshot for backups for several years.
On some systems we are reaching its limit: there are too many files, and only very few of them change and need to be backed up.
Only about 0.1% of all files change on any given day!
Unfortunately the changed files are scattered across a deep directory tree. Rsync takes a very long time to discover the changes.
I think sooner or later we will need to use a different tool or change the way we use rsnapshot.
Up to now the application which creates the files can't handle a storage API. We need something which can be mounted
like a file system.
Which tool could fit?
* 17M files (number of files)
* 2.2TBytes of data.
* one host accessing the data via RAID.
Check whether you could reduce the number of interval levels, in other words how many
daily/weekly/monthly backups you keep on the backup server:
too many scattered hard links may slow down the backup.

In the best case, switch the backup server to ZFS, reduce the intervals
to a single level, and create a ZFS snapshot using the rsnapshot
cmd_preexec[¹]. Rotating backups also takes a lot of time; this way you remove
the need for it by moving the logic to ZFS.

[¹] https://github.com/zfsnap/zfsnap
--
Oliver PETER ***@gfuzz.de 0x456D688F

------------------------------------------------------------------------------
David Keegel
2016-08-01 21:45:12 UTC
Permalink
Post by Thomas Güttler
We have been using rsnapshot for backups for several years.
On some systems we are reaching its limit: there are too many files, and only very few of them change and need to be backed up.
Rsync takes a very long time to discover the changes.
* 17M files (number of files)
* 2.2TBytes of data.
* one host accessing the data via RAID.
Are you using -H (rsync argument to preserve hard links)?

When the backup source has many files in it, rsync -H will use a lot
of memory. If rsync uses too much memory to fit in RAM, then some
of that working set will be paged out to disk, and rsync will run
slowly. Other processes running at the time may be affected too.

If this is your situation and you are not too concerned about hard
linked files from the source being duplicated (because you don't have
many hard links in the source or you have plenty of space to backup to)
then you might want to try turning off -H and see whether rsnapshot
finishes more quickly and uses less RAM.
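If the -H is coming in via rsync_short_args in rsnapshot.conf (assuming that is where it was added; adjust to your own config), dropping it is a one-line change:

# before
rsync_short_args	-aH
# after: hard links between source files are no longer preserved in the backup
rsync_short_args	-a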
--
___________________________________________________________________________
David Keegel <***@cyber.com.au> Cyber IT Solutions Pty. Ltd.
http://www.cyber.com.au/~djk/ Linux & Unix Systems Administration


------------------------------------------------------------------------------