Tuesday, March 3, 2020

Liquid Archiver v1.1

I swear I am not trying to post once a year; it's really a coincidence. :-)

I've been using Linux as my main OS for some years now and I still don't like working with TAR, one of the most basic Unix utilities. Every time I think I know how to use TAR, I get hit by some quirk I didn't know about or didn't remember. Like how the order of parameters matters in some cases: tar will ignore "--exclude" arguments placed in the wrong position, and apparently the correct order differs between TAR implementations. Also, I've never managed to get exclusions working on the first try the way I expect them to, especially if I need to combine several of them. Am I trying to match the filename only? The whole path? Is it the relative path or the full path? Etc.

So I had enough and made Lar, an alternative to TAR. Initially the name meant "Lua ARchive" because a) it's written in Lua and b) it uses Lua patterns instead of regexes for inclusions and exclusions. But then I decided that it would be better for the name to say something about the nature of the format rather than the language of the first implementation. So "Lar" now stands for "Liquid ARchive", because of how archives flow in and out of the program without ever requiring a seek back to a previous position in the data stream.

I had a few goals in mind when I started writing Lar. Some of the goals even had unfortunate consequences; more on that below.

The goals were:

1) I love pipes. Other than reading the file and directory structure that you are trying to archive or creating the file and directory structure that you are trying to extract, there should be no direct reading or writing of the archive itself from the filesystem. Archives are read purely from stdin and written out purely to stdout. That means no ability to seek inside the Lar archive that you are reading and no going back to write a header at the start of the archive you are writing.

2) I wanted deduplication. I honestly just wanted to experiment with this. I did not know at the time how useful it would be to me, but I thought that since I'm bothering to make a new archive format, I might as well put something interesting in it.

3) I want something that I find easier to use than TAR. That's subjective af, and I can subjectively say I succeeded. :-P I don't have trouble writing inclusions and exclusions, and I find the switches and the general behaviour of the program easier to predict. Of course I'm biased, but I made this for myself after all, so you can try it and see if you like it as well.

Goal #1 and goal #2 have the unfortunate consequence that it is impossible to extract a subset of the archive without also temporarily extracting useless data. Deduplication is done at a 1MiB block level by default, but when Lar encounters a block in the archive, it doesn't yet know whether that block is useful for the files it wants to extract. The only way to know would be an index at the end of the archive, listing the files and the blocks each one uses; but since Lar cannot seek, it cannot read that index before it encounters the actual blocks. And it cannot write the index at the front of the archive, because that would require seeking back to the start of the archive after writing all the blocks, and seeks are not allowed because of goal #1.

4) I don't care (yet) about allowing custom ordering of the files within the archive, but the default ordering should try to bring similar data close together. So files are ordered primarily by extension, secondarily by their filename, and finally by their path (see the sketch right after this list).

5) No absolute paths. The archive is always anchored to the directory you are in when you run the command to create it. You are free to scan only a specific (sub-)*sub-directory for files to include, and even use patterns to include only specific files (and directories) from your scanned subdirectory.
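Goal #4's ordering is easy to picture as a sort comparator. Here is a minimal sketch in Lua; the names and the exact tie-breaking are my own choices for illustration, not Lar's actual code:

    -- Hypothetical comparator: order by extension, then filename, then path,
    -- so similar files (and therefore similar data) end up adjacent.
    local function key(path)
      local name = path:match("([^/]+)$") or path   -- basename
      local ext = name:match("%.([^.]+)$") or ""    -- extension, "" if none
      return ext, name
    end

    local function by_similarity(a, b)
      local ea, na = key(a)
      local eb, nb = key(b)
      if ea ~= eb then return ea < eb end
      if na ~= nb then return na < nb end
      return a < b                                  -- finally, by full path
    end

    local paths = { "src/main.c", "docs/readme.txt", "include/main.h", "src/util.c" }
    table.sort(paths, by_similarity)
    print(table.concat(paths, "\n"))  -- the two .c files end up next to each other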

After those goals were achieved, I started adding more goals to reach:

6) I wanted differential archives. I wanted to back up my computer one day into a Lar archive, and the next week do the same backup but pass the old archive into Lar's input, so that Lar would create a new archive in its output containing all the information about file metadata (filenames, locations, modification dates etc.) but none of the blocks that already exist in the previous week's backup. In fact, I wanted to be able to pass an arbitrary number of older archives, possibly including archives that were themselves differential, into Lar while creating a new archive which would contain none of the data blocks found in those base archives. Obviously, to extract a differential archive, all the other archives that I used during its creation have to be passed to Lar during the extraction as well. (There's a small sketch of this idea right after goal #7 below.)

7) Once I had differential archives, I got pissed off at how useless they were when backing up MySQL dumps. Think about what happens to the MySQL dump of a 100MiB table when you insert one row in the middle of that table. You have 50MiB of duplicate data before the inserted row, which is exactly the same as the first 50MiB of last week's backup, so it is properly deduplicated. But the other 50MiB coming AFTER the inserted row, while exactly the same as last week, is shifted by a few bytes, and now none of its blocks match any of the blocks from last week, so deduplication doesn't work. I wanted deduplication to work for these cases as well.
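Here is the deduplication/differential idea from goals #2 and #6 as a tiny, runnable Lua toy. Everything in it is my own simplification: real Lar uses 1MiB blocks identified by SHA256 hashes, not 4-byte blocks keyed by their raw content.

    -- Toy block-level deduplication: a block's data is stored only the
    -- first time it is seen; repeats (including blocks already present
    -- in a base archive) are stored as a mere reference.
    local BLOCK = 4                    -- tiny for the demo; Lar defaults to 1MiB
    local function chunks(data)
      local i = 1
      return function()
        if i > #data then return nil end
        local b = data:sub(i, i + BLOCK - 1)
        i = i + BLOCK
        return b
      end
    end

    local base, seen = {}, {}
    -- pretend these blocks came from last week's (base) archive:
    for b in chunks("ABCDEFGHIJKLMNOP") do base[b] = true end

    local out = {}
    for b in chunks("ABCDXXXXEFGHIJKL") do
      if seen[b] or base[b] then
        out[#out + 1] = "ref"          -- duplicate: store only a reference
      else
        seen[b] = true
        out[#out + 1] = "data:" .. b   -- first occurrence: store the block
      end
    end
    print(table.concat(out, "|"))      -- prints: ref|data:XXXX|ref|ref

Note that the second stream above got lucky: the inserted "XXXX" is exactly one block long, so everything after it stays aligned. Insert a single byte instead and every later block boundary shifts, which is exactly the problem goal #7 describes.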

The way I solved that last issue was to find a way for blocks to "resynchronize" after the point where data has been inserted or removed. It was obvious that I needed dynamically-sized blocks where the size depends on the contents of the block. Perhaps I could score each possible place in a datastream by a "cutting" score and try to have a block that ends at the best possible cutting point which is not closer than 0.5MiB to the starting point and no further than 1.5MiB from the starting point of the block. That way two desynchronized datastreams will eventually end a block at the same cutting point, and after that, they will continue to deterministically end the next blocks at the same points. If the above doesn't make sense to you, check the help page of lar. If that doesn't make sense either, leave a comment and I'll try to explain.
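Here is a toy version of those self-synchronizing blocks in Lua, with tiny sizes so it's easy to follow. The scoring function below is something I made up for illustration (Lar's real cutting score is likely different), but the structure matches the description above: every block ends at the best-scoring cut point that is no closer than MIN and no further than MAX from the block's start.

    -- Toy self-synchronizing splitter. Since a cut point's score depends
    -- only on the bytes around it, two streams sharing a long run of data
    -- will, with high probability, pick the same cut point inside that run
    -- and stay aligned from there on.
    local MIN, MAX = 3, 9             -- stand-ins for 0.5MiB and 1.5MiB

    local function score(data, i)     -- made-up, purely local scoring
      return (data:byte(i) * 31 + (data:byte(i + 1) or 0)) % 64
    end

    local function split(data)
      local blocks, start = {}, 1
      while start <= #data do
        local stop = math.min(start + MAX - 1, #data)
        local best, best_s = stop, -1
        for i = start + MIN - 1, stop do
          local s = score(data, i)
          if s > best_s then best, best_s = i, s end
        end
        blocks[#blocks + 1] = data:sub(start, best)
        start = best + 1
      end
      return blocks
    end

    -- Try it on the two streams from the help page's -s example below:
    print(table.concat(split("abdefghijklmnopqrstuvwxyz"), "|"))
    print(table.concat(split("abdefghijklm123nopqrstuvwxyz"), "|"))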

After writing the code for the self-synchronizing blocks, as I call them, I tried to find out whether someone else had done something similar. It seems that rsync has a somewhat similar way of finding common substrings in files, and Microsoft had used a very similar algorithm for copying files with small changes over the network. They use a different "cutting score" function and different minimum/maximum sizes (I think they don't have a minimum size?) but the algorithm is pretty much the same. Also, fuck patents. Seriously. I spent hours trying to sift through all the shitty little patented things related to differential backups that every asshole has patented. I didn't see any patent covering my algorithm per se; I saw patents covering my algorithm or similar ones in specific use cases, but none of them was for differential archives/backups. Then I remembered that I'm in Europe and software patents don't exist here, as far as I know at least. But, still, patents are disgusting. Any law trying to legislate code is disgusting, and people should eventually grow the fuck up and stop feeling entitled to every little shitty idea they come up with, when some weird guy on the internet can reinvent it after giving it two days' worth of thought.

So yeah, that was the list of goals, I guess. Notice how being performant is not one of them. :-D Not that I don't try to speed up the things I can, but since I'm almost always piping my Lar archives through lzip, lzip is the bottleneck, not Lar.

Eventually I may turn this into a C program that provides some core functions, like SHA256 hashing and filesystem operations, to the Lua script, which will be embedded inside the C program. Right now this is a pure Lua script, a single file with no imported libraries, which depends on sha256sum and other standard Linux binaries to do what pure Lua cannot do (e.g. set modification dates and permissions). I could already replace some of them with LuaRocks modules, but I don't like rocks. I'm not sure why, but I never use them. I've tested Lar on Linux Mint, Raspbian and Ubuntu and it seems to work as long as you have sha256sum installed. You should be able to install it with "apt install coreutils".
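For a taste of what "shelling out instead of using a rock" looks like, here is a minimal sketch of hashing a file through sha256sum from Lua. The helper name and the exact invocation are mine, not necessarily how Lar does it:

    -- Hypothetical helper: hash a file by piping it through sha256sum
    -- (from coreutils) and parsing the hex digest out of its output.
    local function sha256_hex(path)
      local p = assert(io.popen("sha256sum < '" .. path .. "'"))
      local line = p:read("*l")       -- e.g. "9f86d081884c...  -"
      p:close()
      return line and line:match("^(%x+)")
    end

    print(sha256_hex("readme.txt"))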

For testing, other than some really minimal unit tests inside the code, I have a suite of weird edge cases that I have encountered, which I run the code against every time I change it, and I'll keep expanding that suite with more cases.

I have also been using Lar to archive and back up pretty much everything I archive and back up, and I haven't seen any data corruption since the very first versions of the program. I've even used Lar to back up a whole Minecraft server with multiple shards that I help admin (deduplication works so nicely for these cases) and restore it to a new box.

To install Lar, save the script that you can find here as ~/bin/lar, make it executable with "chmod a+x ~/bin/lar", and (assuming ~/bin is in your PATH) you should be able to run it by just typing lar.

This is version 1.1, released under the GNU GPL v3. The first number is the file format version, the second is the program version. If I ever advance the file format version I will reset the program version (i.e. there will be a 1.14 and then a 2.1).

Here's the help page of Lar with more info about the currently available features:

lar - Liquid Archive version 1.1
Lar reads archives from stdin and writes archives to stdout. Existing files are
overwritten during extraction, if permissions allow that. Liquid archives are
designed to be streamable both during creation and during extraction. They also
eliminate redundancy on a 1MiB block level (or smaller for files smaller than
1MiB). A side effect of these two design goals is that during extraction with
inclusion or exclusion filters, all the data needs to be extracted to disk and
then partially discarded. Lar ignores file ownership both when archiving and
when extracting. For excellent and efficient compression of lar archives we
suggest piping the output through "mbuffer -q -m 1G" and then through "lzip -9"
if you have these tools available.

Usage: lar -c [-v] [-p|-P] [-B SIZ] [-s SNC] [-d NUM] [-b DIR] [-i INC] [-e EXC]
       lar -x [-v] [-p|-P] [-d NUM] [-i INC] [-e EXC] [-f]
       lar -l [-v] [-p|-P] [-d NUM] [-i INC] [-e EXC]
       lar -m [-v]
       lar -C N,T
       lar -H [-v]

-c   Create archive. Outputs to stdout an archive of DIR as seen from
     the current directory.

-x   Extract archive. Reads an archive from stdin and extracts the
     contents to the current working directory.

-l   List contents. Same as extract but it will not touch your local
     filesystem. You can even combine it with -i and -e to view just
     some of the files. It does not fully test the archive for all
     kinds of inconsistencies though. Some will only be detected when
     you actually extract the archive.

-m   Create an empty archive, useful if you want to create empty base
     archives for a differential archive.

-i   Only include files and directories that match the INC Lua
     pattern. Works with -c, -x and -l. Additionally, "|" may be used
     to separate multiple Lua patterns as a logical OR.
     https://www.lua.org/manual/5.3/manual.html#6.4.1

-e   Exclude files that match the EXC Lua pattern. Works with -c,
     -x and -l.

-v   Be more verbose.

-b   DIR is the directory or file that will be added to the archive.
     It can only be used with -c.

-d   Create or extract a differential archive. You need to
     sequentially pipe NUM base archives into stdin when you create
     a differential archive. The same base archives need to be passed
     on stdin (in any order) when you want to extract the resulting
     differential archive. Differential archives will not repeat
     blocks of data that exist in the base archives, so they are
     ideal for incremental backups. You may use differential archives
     as base archives for a new differential archive. Doing so will
     not cause the new differential archive to require the base
     archives of its base archives during its extraction. This allows
     you to create a sequence of differential archives where each one
     depends on the NUM previous archives.

-p   Create a text-only archive that contains only printable
     characters and whitespace, at the cost of an increased archive
     size and slower archival. It can only be combined with -c.

-P   Same as -p but this time even whitespace is disallowed.

-f   Force extraction of files with missing blocks. This will allow
     you to partially extract a differential archive even if some of
     its base archives are missing.

-B   Change the block size to SIZ. Default is 1048576 (1MiB). Bigger
     blocks are faster, but deduplication is done at the block level,
     so smaller blocks (but not very small) will result in more data
     savings. Don't change this if you are not sure. It can only be
     used with -c.

-s   Enable self-synchronizing block splitting for files matching the
     SNC Lua pattern. Data blocks will not be of a fixed size
     anymore; they will vary between 66% and 134% of the size defined
     by -B (or the default 1MiB). The sizes are picked in such a way
     that if you have two files with different sizes but a
     substantial amount of common data at their end, there is a high
     probability that the blocks will synchronize, improving the
     deduplication of data. This is especially useful if, for
     example, you are trying to make differential backups of a MySQL
     dump file every day when only a little data changes but also the
     size of the file changes. Archival will be much slower. It can
     only be used with -c.

     Suppose you have these two data streams:
     abdefghijklmnopqrstuvwxyz
     abdefghijklm123nopqrstuvwxyz

     If you split them with fixed size blocks they will look like:
     ab|de|fg|hi|jk|lm|no|pq|rs|tu|vw|xy|z
     ab|de|fg|hi|jk|lm|12|3n|op|qr|st|uv|wx|yz

     Deduplication will work fine for the blocks before the digits,
     but after the digits the blocks are offset in such a way that
     they will never resynchronize, and even though the ends of both
     streams are the same, deduplication is impossible.

     Self-synchronizing blocks will split these data streams like:
     ab|def|g|h|ij|kl|mno|pq|rs|t|u|vwx|yz
     ab|def|g|h|ij|kl|m1|23n|op|q|rs|t|u|vwx|yz

     The blocks eventually resynchronize (at the "rs" block in the
     example) so deduplication of the end of this data stream is
     possible.

-C   This is a handy calculator for differential backups. N is the
     number that you intend to pass to -d, i.e. the number of base
     archives that each of your differential archives will be based
     on. T is the number of differential archives that you intend to
     keep. Type those two numbers in (e.g. -C 4,8) and the calculator
     will give you information about how much space they will take
     and how many archives will be recoverable.

-H   If you pipe a Liquid archive through "lar -H", you will get
     a special "thin" version of the archive in the output that
     contains only the hashes of the blocks in the input. This can
     greatly reduce the amount of data transmitted over the wire when
     you are creating a differential backup whose base archives are
     piped over the network: the block hashing is done on the remote
     server instead of transferring the full data to be hashed
     locally.

Examples:

lar -cv -b Images > Images.lar
    Archive all files in the Images folder and name the archive Images.lar.
    Verbose output.

lar -xv < Images.lar
    Recreate the Images folder that was archived in the previous example
    under the current working directory. Verbose output.

lar -xv -i '%.c$|%.h$' -e 'example' < code.lar
    Extract the archive code.lar into the current directory. Only extract
    files and directories that end with ".c" or ".h". Do not extract files
    or directories that include the word "example" in their full path.

cat old.lar older.lar | lar -cd 2 -b Images > new.lar
    Archive all files in the Images folder and name the archive new.lar.
    The output archive will be a differential archive and will not contain
    any blocks of data that exist in old.lar and older.lar.

cat old.lar older.lar new.lar | lar -xd 2
    Extract the archive that was created in the previous example. Only
    new.lar will be extracted but old.lar and older.lar are needed because
    they contain data that was omitted from new.lar during its creation.
    The order of old.lar and older.lar may be reversed but new.lar, the
    archive that you are actually trying to extract, must be the last one.

(cat DB2.lar.gz | gunzip; cat DB.lar.gz | gunzip) | lar -cvb DB -d 2 | gzip > DB3.lar.gz
    Archive all files in the DB folder, pass through gzip to compress the
    archive and name it DB3.lar.gz. The output archive will be a
    differential archive and will not contain any blocks of data that
    exist in DB2.lar.gz and DB.lar.gz.

cat yesterday.lar | lar -d 1 -cvb serverbackup/ -s '%.sql$|%.csv$' > today.lar
    Archive all the files in the serverbackup folder and turn on
    self-synchronizing blocks for files ending in ".sql" or ".csv". Do not
    include the data blocks that exist in yesterday.lar, so the resulting
    today.lar will be a differential archive. Self-synchronization helps
    with redundancy detection in files like SQL dumps and other files that
    may have data inserted or removed from one archive to the next.

(ssh 'user@192.168.0.5' "cat /home/user/backup1.lar.gz | gunzip | lar -H"; ssh 'user@192.168.0.5' "cat /home/user/backup2.lar.gz | gunzip | lar -H") | lar -v -d 2 -c -b . | gzip | ssh 'root@192.168.0.5' "cat > /home/user/backup0.lar.gz"
ssh 'root@192.168.0.5' "rm /home/user/backup5.lar.gz"
ssh 'root@192.168.0.5' "mv /home/user/backup4.lar.gz /home/user/backup5.lar.gz"
ssh 'root@192.168.0.5' "mv /home/user/backup3.lar.gz /home/user/backup4.lar.gz"
ssh 'root@192.168.0.5' "mv /home/user/backup2.lar.gz /home/user/backup3.lar.gz"
ssh 'root@192.168.0.5' "mv /home/user/backup1.lar.gz /home/user/backup2.lar.gz"
ssh 'root@192.168.0.5' "mv /home/user/backup0.lar.gz /home/user/backup1.lar.gz"
    Create a differential archive of the current directory with 2 base
    archives. The base archives are stored on a remote server at
    192.168.0.5 and are remotely processed with "lar -H" so that only the
    hashes of the blocks they contain are transferred over the wire. The
    resulting differential archive is also stored back on the same remote
    server. After the archival is done the archives are renamed and 5 of
    them in total are kept. backup1.lar.gz will be the newest archive and
    backup5.lar.gz will be the oldest.

To see what 5 differential archives (each based on the previous 2
archives) mean, you can run "lar -C 2,5", which will give you the
following output:

If your differential archives are based on the last 2 archives (-d 2)
and you keep a total of 5 archives, then you should expect to have 3
recoverable archives. Archives older than the last 3 will not have all
their base archives available and you will therefore be unable to
extract them. You should expect that all 5 archives together will take
about the same space as 1.7 full size (non-differential) archives, but
in some cases they will take up about the same space as 2 full size
archives.


Verbose output:

3% (input=134B output=111.74MiB) ./readme.txt (regular file) [NNNHP]
|          |              |            |            |           |
|          |   Data written to stdout  |        File type       |
|          |                    Current filename                |
|  Data read from stdin                                         |
|                            Current file's blocks (N=new, P=previously seen in
Percentage of files done     current archive, H=previously seen in base archive)


Copyright 2019-2020 Tritonio (www.inshame.com)

This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License version 3 as published by the Free
Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU General Public License for more details.

If you have not received a copy of the GNU General Public License along with
this program then see https://www.gnu.org/licenses/gpl-3.0.txt or download it
using BitTorrent: magnet:?xt=urn:btih:ftlm2r4zr3eepxypisejceebq7gx3wn7
