Saturday, June 19, 2021

Aperito - Duplicate File Manager

I often have to tidy up files with lots of duplicates. I've tried quite a few duplicate finder programs, but I always found them lacking the features I need. So I made my own and called it Aperito, which means "plain, austere" in Greek, as in "something that doesn't have things you don't need".
Aperito is a somewhat scriptable duplicate file manager. Cleaning up duplicate files from a directory is as simple as running:

    aperito scancleanup MyDirectory

But Aperito also allows much more complex deduplication. For example, suppose you want to remove all files under Dir3 that also exist under Dir1 or Dir2, without touching any files under Dir1 and Dir2 and without deduplicating files that show up multiple times within Dir3. You could run this:

    aperito scan Dir1 scan Dir2 cleanup Dir3
If you wanted to do the same but also deduplicate the files that show up more than once inside Dir3, you slightly change the command:

    aperito scan Dir1 scan Dir2 scancleanup Dir3
Or, let's assume Dir1 is actually an external drive that you don't keep mounted all the time. In that case you could scan Dir1 while it's mounted with:

    aperito scan Dir1 save dir1-files.asd
which would create a file that can later be used like this:

    aperito load dir1-files.asd scan Dir2 scancleanup Dir3
Aperito will try to parallelize operations to some extent if it thinks that the results will be predictable. For example, in the above command, it will load the savefile while scanning Dir2 at the same time.

Aperito will never delete any duplicate files. Instead, it will create a new directory and move them there. For example, if you run:

    aperito scancleanup Dir4
any duplicate file such as Dir4/subdir/filename.ext will be moved to Dir4-Aperito-duplicates/subdir/filename.ext. That way, if you want to revert the deduplication, you can simply move the contents of Dir4 into Dir4-Aperito-duplicates and let your OS handle the merges, then delete the now-empty Dir4 and rename Dir4-Aperito-duplicates to Dir4. You could also merge in the other direction (move the contents of Dir4-Aperito-duplicates into Dir4 and then delete Dir4-Aperito-duplicates). This is simpler, but it may affect directory permissions, because the directories created under Dir4-Aperito-duplicates do not necessarily have the same permissions as the original directories under Dir4.
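
The path mapping described above can be sketched in a few lines of Python. This is only an illustration of the naming rule, not Aperito's actual code: the function name `duplicates_destination` is made up for this example, and it assumes "/"-separated paths.

```python
from pathlib import PurePosixPath


def duplicates_destination(root: str, duplicate: str) -> str:
    """Map a duplicate under `root` to its place under `<root>-Aperito-duplicates`."""
    root_path = PurePosixPath(root)
    # Path of the duplicate relative to the cleaned-up root, e.g. subdir/filename.ext
    relative = PurePosixPath(duplicate).relative_to(root_path)
    # Sibling directory named after the root, with the suffix mentioned in the text.
    target_root = root_path.with_name(root_path.name + "-Aperito-duplicates")
    return str(target_root / relative)


print(duplicates_destination("Dir4", "Dir4/subdir/filename.ext"))
# Dir4-Aperito-duplicates/subdir/filename.ext
```

Because the relative path is preserved, merging the two trees back together puts every file where it originally was.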

Aperito starts up with an empty internal state (no files have been seen) and reads commands from its command line in sequence. These commands may add files to Aperito's internal state as "seen", or they may deduplicate (move away to a separate directory, as described above) files that have been "seen" more than once.
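
To make the "seen" model concrete, here is a toy Python sketch of how scan, cleanup and scancleanup interact with the internal state. It is a simplified model under my own assumptions, not how Aperito is implemented: files are given as (path, hash) pairs instead of being read from disk, the class name `SeenState` is invented, and scancleanup here simply keeps the first copy it meets instead of applying the keep/ask rules.

```python
class SeenState:
    """Toy model: files are (path, content_hash) pairs; nothing touches the disk."""

    def __init__(self):
        self.seen = set()   # content hashes marked as "seen"
        self.moved = []     # paths that would be moved to the duplicates directory

    def scan(self, files):
        # Mark everything as seen; never deduplicate.
        for _path, digest in files:
            self.seen.add(digest)

    def cleanup(self, files):
        # Deduplicate only files seen before this command; do not mark new ones.
        for path, digest in files:
            if digest in self.seen:
                self.moved.append(path)

    def scancleanup(self, files):
        # Deduplicate seen files and mark the rest as seen, so repeats
        # inside this tree are caught too.
        for path, digest in files:
            if digest in self.seen:
                self.moved.append(path)
            else:
                self.seen.add(digest)


dir1 = [("Dir1/a.txt", "h1")]
dir3 = [("Dir3/a-copy.txt", "h1"), ("Dir3/b.txt", "h2"), ("Dir3/sub/b.txt", "h2")]

state = SeenState()
state.scan(dir1)
state.cleanup(dir3)
print(state.moved)   # ['Dir3/a-copy.txt']

state = SeenState()
state.scan(dir1)
state.scancleanup(dir3)
print(state.moved)   # ['Dir3/a-copy.txt', 'Dir3/sub/b.txt']
```

The two runs mirror the difference between "scan Dir1 cleanup Dir3" and "scan Dir1 scancleanup Dir3": only the second one catches the file that is duplicated inside Dir3 itself.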

The available commands are:

    scan "directory"            Scans a directory tree and adds all the files in it to the internal state of Aperito as "seen". It will not deduplicate anything though.
    cleanup "directory"         Scans a directory tree and deduplicates all the files that have already been seen. It will not add the scanned files to the internal state as "seen", though. Therefore, if a file shows up twice in this directory tree, it will not be deduplicated. To be deduplicated, a file under this tree needs to have been "seen" before the cleanup command was run.
    scancleanup "directory"     Like the cleanup command, but it will not only deduplicate "seen" files, it will also add all the files to the internal state of Aperito as "seen". Therefore, if a file shows up twice (or more) under this directory tree, it will be deduplicated. Of course, files "seen" before this command is run will also be deduplicated, even the first time they are encountered within this directory.
    save "savefile.asd"         Saves the internal state of Aperito to a file so that you can load it some other time. Useful for scanning external drives once and then being able to deduplicate files from other drives as if the "saved" drive was present. Can also be used to speed up scanning of directories that you know to be unchanged.
    load "savefile.asd"         Loads a saved state. The saved state is merged with the current internal state of Aperito so you can write this command multiple times to load multiple files.
    reset                       Resets the internal state of Aperito. All "seen" files will be forgotten after this command.
    keep shallow/deep           These two commands affect the behavior of any scancleanup commands that follow. "Keep shallow" will cause scancleanup to keep the file which is closest to the root when one or more duplicates are found, while "keep deep" does the opposite and keeps the most deeply nested file (this is the default behavior).
    ask                         Similar to the previous two commands but this time it will make Aperito ask you which file you want to keep. You will also be given the choice to select any parent directory of each file so that all files under that directory will be kept. If you select two directories so that all files under them will be kept, and then a duplicate file which exists under both of them is found, it will be kept in both directories.
    wait                        Waits for all previous commands to finish before proceeding to the next command(s), even if they could be run in parallel.
    threads n                   Number of threads that will be used to hash the contents of files per command that runs in parallel. By default n=2. Affects commands after it only.
    compare "savefile.asd"      Compare the currently "seen" files with the hashes stored in the given saved state file. It will print out the hashes (and one location for each hash) that exist only on one of the two. Useful for checking if two locations have the same data, without comparing the actual directory tree structure.
    exclude "regex"             Exclude files whose path and filename contain a substring that matches the given regular expression. Matching files will not be scanned at all. This command affects any commands that follow it. Loading a saved state is not affected by exclusions.
    noexclude                   If you have used the exclude command, noexclude can be used to remove all exclusions for all the commands that follow it.
    and                         Not exactly a command by itself, but it can be used right after the directory path of scan, cleanup and scancleanup to make those commands process multiple paths as if they were one. For scan and cleanup, the difference between using "and" and simply repeating the command once per directory is minor: with "and", the configured number of threads is shared across these directories as if they were a single directory, while repeating the command allows Aperito to run the multiple scan or cleanup commands in parallel, multiplying the number of threads used. For scancleanup the effect is more pronounced: with "and", any files duplicated across the two directories are deduplicated according to the rules (deepest, shallowest or by asking the user), while two scancleanup commands (one per directory) cause files that exist in both directories to be deduplicated away from the second directory even if, for example, you have elected to keep the deepest duplicate and the duplicate in the second directory is the deepest. The reason is that scancleanup commands do not run in parallel, and they behave like a regular cleanup command with regard to files seen by previous commands (so files seen by the first scancleanup command will always be removed if seen by following scancleanup commands, regardless of rules).
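
The "keep shallow" and "keep deep" rules from the list above boil down to comparing nesting depth. Here is a minimal Python sketch of that idea. The function name `choose_keeper` is hypothetical, depth is counted as the number of "/"-separated path components, and ties go to the first path in the list, which is my assumption and not necessarily what Aperito does.

```python
def choose_keeper(paths, keep="deep"):
    """Pick which of several duplicate paths to keep, by nesting depth."""

    def depth(path):
        # Number of path components; assumes "/"-separated relative paths.
        return len([part for part in path.split("/") if part])

    if keep == "deep":
        return max(paths, key=depth)      # most deeply nested copy survives
    if keep == "shallow":
        return min(paths, key=depth)      # copy closest to the root survives
    raise ValueError("keep must be 'deep' or 'shallow'")


duplicates = ["Dir/a.txt", "Dir/x/y/a.txt"]
print(choose_keeper(duplicates))                  # Dir/x/y/a.txt
print(choose_keeper(duplicates, keep="shallow"))  # Dir/a.txt
```

Every path other than the one returned would be moved to the duplicates directory.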
Remember that the internal state (which files have been "seen") is not preserved between runs unless you run the save command and then load it with the load command.
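
One detail of the exclude command that is easy to miss is that the regular expression only has to match a substring of the path, not the whole path. The Python sketch below illustrates that matching rule. It is an approximation: the helper `is_excluded` and the example pattern are invented for illustration, and Python's `re` dialect is not guaranteed to be identical to the one Aperito accepts.

```python
import re


def is_excluded(path, patterns):
    """True if any pattern matches somewhere inside the path (search, not full match)."""
    return any(re.search(pattern, path) for pattern in patterns)


patterns = [r"\.git/"]          # hypothetical exclusion, as in: exclude "\.git/"
print(is_excluded("Dir1/.git/config", patterns))   # True
print(is_excluded("Dir1/notes.txt", patterns))     # False

patterns = []                   # what noexclude amounts to in this model
print(is_excluded("Dir1/.git/config", patterns))   # False
```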

Commands that can be run in parallel if they appear sequentially are:

 * Save(s), cleanup(s) and compare(s).
 * Load(s) and scan(s).

Reset, wait, threads and scancleanup are always run atomically. Keep and ask will wait for any pending scancleanup to finish before being run.
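
As a rough illustration of these rules, the Python sketch below groups a sequence of command names into batches that could run in parallel. It is a simplification under my own assumptions: it uses the command names from the command list (so "compare"), it treats every command outside the two parallel groups as a batch of its own, and it ignores arguments entirely. The names `PARALLEL_GROUPS` and `batches` are made up for this example.

```python
PARALLEL_GROUPS = [
    {"save", "cleanup", "compare"},
    {"load", "scan"},
]


def batches(commands):
    """Group a command sequence into batches that could run in parallel."""
    result = []
    current, current_group = [], None
    for command in commands:
        # Find which parallel group (if any) this command belongs to.
        group = next((g for g in PARALLEL_GROUPS if command in g), None)
        if group is not None and group is current_group:
            current.append(command)       # same group: joins the running batch
            continue
        if current:
            result.append(current)        # different group: close the batch
        current, current_group = [command], group
    if current:
        result.append(current)
    return result


print(batches(["load", "scan", "scan", "cleanup", "save", "scancleanup"]))
# [['load', 'scan', 'scan'], ['cleanup', 'save'], ['scancleanup']]
```

In this model a wait command ends up alone in its own batch, which is enough to split two scans that would otherwise run together.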

If you have 3 scan commands one after the other and the default number of threads (i.e. 2), that will give you 2*3=6 threads processing file contents in parallel. If all three directories you are scanning are on the same disk, and the disk is rotational rather than an SSD, this may cause more overhead due to seek time, so you should consider either reducing the threads per scan (threads 1) or putting wait commands between the scan commands.

When Aperito explains why it's moving a file to the duplicates directory, the second path may start with [?] which means that this is a path loaded from a saved state with the load command and therefore may not currently exist or, if it is a relative path, may not be relative to the current working directory.

Aperito is written in Go (my first program in that language) and is freeware for now but I'll think about opening the source code later. I'd like to see it included in Debian's repos one day but until this becomes realistic I'll probably stick with simply freeware. This is still the first version after all, and I have more features planned.

You can download Aperito from here. The zipfile contains binaries for Linux (32/64-bit and 32/64-bit ARM for Raspberry etc), Windows (32/64-bit) and Mac (ARM/AMD). You can download the PGP signature for the zipfile from here. My key should be on the sidebar.

Please take care while using Aperito. Do not run commands that others give you unless you understand them.

And before I go, here are some things that I am thinking of adding in the future:

  • Dry-run command. Not super necessary, since Aperito doesn't delete files anyhow, so to revert whatever it does you just merge directories again. Still, good to have.
  • Check the full script before starting to run it. Right now a mistake in a command won't be discovered until the command is reached.
  • "Forget" command to selectively remove files that match a regular expression from the "seen" memory.
  • "Include" command. I think you can already emulate an include command with a properly crafted "Exclude" regular expression but it may be worth having an actual easier to use include command.
