1.  Symirr 2 user notes

Symirr means "symmetrically mirror". It is a recursive file/directory synchronisation tool for mirroring changes made to duplicate copies of the same directory tree kept in two or more locations. This is done in a symmetrical fashion, i.e. there is no sense of a "master copy" -- changes may be made on either side (or both sides) and then sync'd to the other. This is ideal when it is necessary to keep duplicate synchronised copies of mail folders, project directories, etc, on several different machines.

Some history: Symirr 1 was written in Perl and performed a very optimal and logically-pure symmetrical mirroring operation between directories on two or more machines. However, it required a complete, perfect sync operation every time, and so it was fragile to all kinds of failures. Also, it could not sync a pair of directories which had been transferred or synchronised using other tools.

This new version 2, written in C, is a more realistic symmetrical mirroring operation which is much more robust and can cope with all the situations above that the original could not. It allows the user to check over and, if necessary, modify all the actions it will take -- using the user's favourite editor. This permits a partial sync if the user wants to deal with some files by hand.

Fully-automated syncs are also safe after the first sync because if files are changed on both machines then they generate "clashes", meaning that extra files will appear in the directories with a .CLASH-* extension which the user can then merge (or otherwise deal with) before the next sync to resolve the situation on both machines. Any files deleted or overwritten in the sync can also be saved in per-session .tgz files as a final safeguard by using the -b option.

This version is also designed to work with a small memory footprint. Everything is streamed -- it is never necessary to store the whole tree in memory. This makes it very much more efficient than some other programs for mirroring small changes in immense trees: for example, a tree of MH-style mail folders, or a large set of project directories.

Advantages:

Basis for sync:

Limitations:

The current code of Symirr has been refactored many many times. I'm happier having a small codebase so that I can make sweeping global changes whenever I find a better approach. To me this is much better than having a huge codebase that I'm afraid to touch for fear of breaking something. This means always looking for the small/simple/efficient/optimal way of getting the job done, rather than going for the comprehensive/extensive-feature-set style, which perhaps explains to some extent why Symirr's design is the way it is.

1.1.  Legal

1.2.  Usage and options

Command-line usage:

Command-line options:

The editor to use to confirm the action command-list is selected by the first of these found to work:

Debugging/testing usage:

1.3.  Special files:

1.4.  Sync operation

There are really two distinct sync operations done by Symirr: the first sync and subsequent syncs. (See the IMPLEMENTATION section below for full details of the algorithms.)

The first time Symirr does a sync, it does not know anything about the history of the files it finds there. Ideally the first sync you would do with Symirr would be between the original directory tree and an empty directory (i.e. letting Symirr copy the files across for you), or you would be syncing to an identical copy of the file tree that you have copied across yourself, either using cp -a or tar or some other tool. However, it is possible that you made a duplicate copy of the file tree at some point in the past and you have made changes on both machines and now you want to use Symirr. Since there is no history information in this situation, Symirr cannot detect deleted files, and neither can it detect clashes (files changed on both sides). For this reason, on the first sync it is important to check over the action command-list file carefully to see that Symirr is doing what you expect it to be doing, to avoid any possible problems. In some cases Symirr inserts comments into this file which provide alternatives.

However, after the first sync, Symirr can accurately detect updates and deletions, and recognise clearly when clashes occur. At this point it is safe to allow Symirr to run unattended (e.g. to run automatically from cron). If clashes are detected, i.e. if file changes have been made on both machines, then *.CLASH-######## files are generated which allow the situation to be examined and resolved later on with no loss of information. This is a much more complicated and accurate sync operation (see later for the algorithm).

In addition to the default symmetric basic first-time sync (-B), there are also two asymmetric basic first-time syncs, -S 'local-is-secondary' and -M 'local-is-primary'. These are designed to duplicate exactly the version of the tree on the primary, overwriting or deleting anything on the secondary that disagrees. Even a newer file on the secondary will be overwritten with an older version from the primary, although a warning is given in this case. These are intended for the first-time sync when it is certain that no useful changes have been made to the secondary. One example use is to sync an old copy of an MH-style mail tree with the current version, where there may be many deletions required to the old copy which would result in (incorrect) attempts to copy files the other way if the default -B were used. So, -S and -M provide an alternative way to start Symirr off, better in some situations. After this point the option should be dropped, allowing the extended sync mode to take care of all deletions and updates.

1.5.  For hackers: picking up from _SYMIRR_BASE corruption

If for some reason the _SYMIRR_BASE history gets corrupted, then rather than resorting to a basic (-B/-S/-M) sync to pick things up again, you might prefer to fix things up manually in the history.

Under _SYMIRR_BASE you will find a directory for each machine this machine has synced with, i.e. for each remote machine with which it holds history. Find the directory that applies to the machine you are interested in. In this directory is a time-stamped directory for each session with that remote machine. In each directory that holds valid history data, there will be both a local-pc-tree.gz and a local-pc-changes file. If either of these files is missing, then the directory is invalid and will be ignored when Symirr is looking for history (e.g. maybe it is the result of a broken connection, or of a sync aborted with ^C). Symirr scans backwards through these directories looking for the first directory with valid history. Knowing this, you can stop Symirr reading corrupt history data by simply deleting it. However, you will have to do this on the remote machine as well otherwise Symirr will complain.

It doesn't matter if you have to take the history back several syncs, because files or directories updated more recently will appear as "new-but-identical", which will not cause a re-sync of these files or directories. You can check all this in the actions list anyway before allowing Symirr to make any updates.

2.  IMPLEMENTATION

Internally most of Symirr's operations are performed by various 'operation node' objects which are connected up in different networks depending on the requirements of the activity at hand. These nodes communicate via text strings in various formats described later on, which also appear in the various files generated by Symirr: the trees and action list and so on. In some ways this structure imitates UNIX-style tools and pipes, i.e. small tool-objects taking small well-defined roles in the process and flow of data, but extended to a full network rather than a simple pipe-chain.

2.1.  Operation nodes

Each of these operations is a node that consumes input strings and generates output strings. Input strings are passed via pointers to pointer variables which hold either a pointer to a StrDup'd string, or zero. When the input string is consumed, the pointer variable is set to zero. The output strings operate similarly; they are either 0 or a StrDup'd string. End-of-file is indicated by a StrDup'd string containing an end-of-file marker (see SMEOF and isEOF in the source).

Normally the node input string pointers are set to point to output strings of other nodes, allowing data to be passed from one node to another automatically. The node *_do() call tries to do more work, moving data from inputs to outputs. Due to nodes that consume more data than they produce, or produce more than they consume, at times there may be 0 values in these pointers, which indicate "no data available right now". So, if the output pointer variables still contain unconsumed strings, or if the inputs don't contain strings, then no operation is possible, and the *_do() call returns without doing anything.

The *_do() calls also return an integer to indicate if there has been activity during this call. One way of detecting that everything is done is by waiting until all nodes are reporting no activity. The other way is to wait for an EOF marker to come from the final output, although this assumes that there are no other implied outputs (file writes/etc) that might still be pending.

2.1.1.  SMLoad: Load lines from a file

Loads lines one by one from a file, decompressing if the filename ends in .gz. The smload_sorted() function sets up a binary-sorted load of data, by piping the input data through UNIX 'sort'; no .gz files are permitted in this case.

2.1.2.  SMSave: Save lines to a file

Save lines to a file. If filename ends with .gz, then data is compressed.

2.1.3.  SMTee: Pipe tee-joint

Copies input to two outputs.

2.1.4.  SMTsave: Tee off lines to a file

Tee off lines and save to a file. If filename ends with .gz, then data is compressed. If 'xtr2tr' is non-0, then the data is treated as an extended tree, and the extended flags are removed on output to the file, to make it a normal tree.

2.1.5.  SMTree: Create tree listing

Generate a tree listing. The file system is scanned from 'base'. Symbolic links are followed up to a depth of 'o_follow'. All other symbolic links are copied. Files or directories with .NO_SYMIRR extensions are omitted; *~ backup files are also omitted. Output is a series of lines containing TAB-separated fields, which take the form of a navigation through the directory structure. See later for the tree listing format.

2.1.6.  SMComp: Compare two trees to generate a list of changes

Compares input trees from two different machines to generate the output action list required to synchronise these two machines. This works on the basis that this is the first sync, i.e. that there is no extended flag information available in these trees. This is a much cruder sync than is possible with SMXcomp, which should be applied when flag information is available (extended trees). This does a basic symmetrical sync (o_basic=='B'), a basic sync with local as secondary ('S'), or a basic sync with local as primary ('M'). See later "action list" section for the format of the action list generated.

2.1.7.  SMXcomp: Compare two extended trees to generate a list of changes

Compares two extended input trees from two different machines to generate the output action list required to synchronise these two machines. This is for the second and following syncs, using the extended flag information to accurately detect deletions and updates. The action list generated is as for SMComp.

2.1.8.  SMAct: Carry out a list of actions

Works through the input command-list, performing actions on the local filesystem and remotely (via the already-open command-channel) on the remote machine. 'idts' is the pathname of the ID-timestamp directory under _SYMIRR_BASE to use for dumping the changes made.

2.1.9.  SMMapch: Map the local-pc-changes file contents into a tree format

This takes "local-pc-changes" lines, and maps them into an extended tree format, inserting 'D' and 'X' lines to handle navigation between directories. The input file should have been sorted before being passed to this node; this is checked as a precaution. The output is an extended tree. All lines that come from the input file have 'u' or 'd' flags; the generated lines do not. "DEL" lines in the input file generate a 'Fd' line, with nonsense mug/szt; note that this may in fact indicate deletion of a directory, not just a file. Note that where possible, generated 'D' lines will have a mode/user/group derived from the file they are leading to. This is a reasonable default if directories are missing, but will possibly disagree with what is in the real tree. Otherwise nonsense/default values are used. Tree lines output are as follows:

2.1.10.  SMMerge: Merge a local-pc-changes list into a tree

This is used to merge the old tree from last time (first arg) with the list of changes made last time (second arg) to give as output the pre-changes tree as it was after the last sync. The input on the second arg should be the output of SMMapch, i.e. the "local-pc-changes" file converted into a tree.

2.1.11.  SMXtree: Generate an extended tree from old/new trees

This is used to take the output of SMMerge and the current output of SMTree and create an 'extended tree' which includes flags which show what has changed since the last sync. See later for the format.

2.1.12.  SMDiff: Generate a diff-list from two trees

The purpose of this is to reduce the information transferred from one machine to another when sending a new tree. Just the differences to one already-shared tree are sent. This node generates this difference list. See later for the format of this list.

2.1.13.  SMPatch: Combine a diff-list with a tree

The new tree (second argument to smdiff above) is reconstructed from the old tree and the diff-list.

2.2.  Data formats

2.2.1.  Normal and extended tree listings

These consists of a series of lines containing TAB-separated fields, which take the form of a navigation through the directory structure. Names are ordered in binary sort-order. All file and directory names are relative to the current directory -- no slashes are ever mentioned. Possible items in the list are:

The normal tree (as output by SMTree) does not contain any extended flag flag after the "F" or "D". The extended tree (as output by SMXtree) does include these markers to indicate changes since the last sync. Extended flags are as follows: "n" (new) indicates that this file/directory was not present at the last sync. "u" (updated) indicates that something about the file/directory has changed since the last sync (mode/user/group/mtime/size). "d" (deleted) indicates that this file/directory has been deleted since the last sync; i.e. it no longer exists. No flag indicates that there has been no change to this file/directory since the last sync. These would therefore appear as "Fn", "Fu", "Fd" and "F" for files, or as "Dn", "Du", "Dd" and "D" for directories.

2.2.2.  Action list format

This is the format of the output of SMComp and SMXcomp. This is also the data that is presented to the user to check over before the sync is completed. Note that the fields are TAB-separated (to permit spaces in filenames). Also note that commands all start with either << or >>, to indicate which tree of files or machine is being modified. There may also be comments in the output starting with '#' to indicate things that the user may wish to give attention to, such as warnings or clashes.

2.2.3.  Action-commands sent over connection

In general, the remote server may send informational messages before any given OK/ERR/etc response. These are formatted with a leading "#", like comments, and should be dumped to STDOUT for the user to see. Also note that where a space appears in the commands below, this indicates a TAB (to permit spaces in filenames). All commands and responses are terminated with a newline character.

On initial connection, the remote command should respond with:

At present no capabilities are defined, but the idea is that if new features are added, Symirr can recognise when the other end supports these features by reading its capability list. The version number and machine name (which may be '???' if not known) are used for display purposes only at present. The second OK may be replaced with ERR if there is some problem.

In case of error for any of the following commands, the response will be "ERR error-description" instead of BIN or OK or NO.

Quit may be used at any point to drop the connection:

Ping may be used at any time to keep the connection alive:

The command-line options and remote base directory that apply to the remote end should be provided using the following command. Arguments are given in argv[] style, TAB-separated.

The next exchange of commands with the server should attempt to assign an ID for this session, either a new ID, or an existing ID from a previous session. Also, a date-time stamp is associated with the session. Most of the later commands won't work without an ID and date-time stamp.

Once ID and date-time are established, the commands below are available to fetch the remote tree listing.

Once the tree is received and changes calculated and approved by the user, the changes on the remote filesystem are performed by sending the action-commands above, as follows:

2.2.4.  Tree-diff format

This diff is generated between two trees to allow the second tree to be generated from the first on a remote machine. The normal tree lines (D,F,X,END) appear as normal, with the following extra lines:

2.2.5.  Structure of _SYMIRR_BASE data

Under _SYMIRR_BASE is a directory for each ID agreed with a remote machine. There is also a lock file @@@NYI:

Within each ID directory is a directory for each date-time stamp of a session. Normally only the full contents of last 2-3 session directories is kept, and the previous ones are automatically deleted or archived. Within the timestamp directory are the following files:

In the ID directory are also the following archive files:

2.2.6.  Lines in local-pc-changes

For updates, these are extended tree lines ('Du' or 'Fu'), but without the full tree navigation to/from the files. Instead a full base-relative pathname (and sequence number) are put on the front to allow easy sorting with any binary or ASCII 'sort' tool. This pathname must use TAB instead of '/' to separate path segments, so that the sort order is fully correct. A double-TAB terminates the path. The sequence number is used to ensure that the last change indeed appears last, in case more than one line appears for the same file/directory. For deletion, only 'DEL" appears after the path and sequence number.

2.3.  How clashes and combinations of changes are handled

When no flag information is available, a basic symmetrical sync is performed by default (option -B). Decisions are based purely on the visible difference between the two trees. This is really a very poor method compared to the full sync process with update flags (which can only occur after the first sync), and so it is advisable to check over the change-command list carefully before letting it run. Sometimes it is not possible to make a clearly 'best' choice, and in this case alternatives are offered by comments in the change-command file. Really this is not suitable for syncing machines with many changes on both sides, or even in some cases on just one side, given that deletions cannot be detected. This is only really good for the initial sync of directories recently copied over, or (with care) to pick up a sync again after loss of the Symirr sync information.

There are also primary (-M) and secondary (-S) variants of the basic sync, in which the sync operation is biased completely to one side, so that by default the secondary is forced to duplicate the contents of the primary, even if the secondary has newer files (although warnings are inserted into the actions list, which the user will hopefully read and check before confirming).

When flag information is available (an extended tree), the decision is more complex, and much more accurate. Note that some situations indicated as 'impossible' below are only impossible if a full sync occurred in the past. However, it is entirely possible that only a partial sync occurred if the user has been editing the action-command lists, so these default to another sensible option:

The action in case of a clash is to choose one file/dir and sync that on both sides, and to rename the other to the same name plus .CLASH-xxxxxxxx, and also sync that to both sides. Then, later on, the user can view the .CLASH file/dir on either machine, and either delete it, merge changes, or copy it over the original file. At this point doing a re-sync will duplicate these fixes on the other machine.

The choice of which file to consider as the clash (when there is a choice) could be made in some cases by comparing the mtime, but instead Symirr defaults to making all the clashes come from the remote machine, where this is possible. This may seem to break the symmetry of the operation slightly, but at least it means that when the user is merging changes, he/she will probably be dealing with all the changes coming from the same direction, which is potentially less confusing. The only exception to this rule is in the case where a file/dir update clashes with a deletion of the same file/dir, when the update always becomes the clash, and the deletion is mirrored.

Note that the .CLASH-xxxxxxxx names are generated from a hash of the current time including fractions of a second. They need to be unique only on a per-pathname basis. The best option would be to scan ahead in the tree to guarantee that a clash name is unique, but this is not feasible in practice because in extreme cases (e.g. clashing top-level directories) almost all of the tree may have to be scanned. So a hash is used instead. A collision should be extremely unlikely, and even then it would only happen if someone has been letting CLASH files accumulate.

The rule-set can be reduced down as follows for the implementation:

3.  BD: Brain-dump

[Note: These are my notes to myself, written to help clarify things in my head. Most of them apply to old problems which are now resolved and handled in some other way. All of this is probably only useful to me, to recall some of my trains of thought. --Jim]

---

The remote side does not know the local tree. This means that if we do the symirr operation from the other side, we can do an extended sync, but not exchange a tree-diff. Worth trying to fix this up? Or always insist on doing it that one way.

---

Need to get clearer about how to handle reconnecting later on. Since we're allowing the user to edit the sequence of actions that is going to be performed, there is no guarantee that we have a perfect sync, ever. So, we will always be doing a combination of the smcomp and more advanced options.

So, I'm thinking of adding "hint" information to the tree that indicates changes made since the last sync, e.g. files that have definitely been updated, directories/files deleted, etc. Then smcomp can judge by both sets of information.

The next problem is maintaining a "last sync" point on both ends. The first time they can exchange a full tree. At each end, the initial tree plus the changes made in that session is the starting point for detecting changes at the next session. I'm not counting on the session ending in any sensible manner, so maybe we are not going to get to a point of rewriting the tree with the changes made.

Let's say we establish a common ID for future exchanges. This becomes a directory under _SYMIRR_BASE. In this we store the tree.gz before changes were made. We also store the change commands one by one as they are performed. Next time we can work through the tree and change-list and generate the tree as it must have been after all the changes had been performed.

Handshake goes like this: "Connect me with one of these IDs", "Okay, with ID xxx".

Two ways I can do this:

or:

Probably easier the first way.

Problem -- what about "mv" commands. This was included to support creating .CLASH files. The problem is that the rename in this case will convert a current tree item into a tree item maybe several items past (in the case of multiple numbered .CLASH files). Maybe I will have to sort in any case.

---

Thinking about what happens when we do a partial sync, when we delete (effectively cancel) a certain set of commands. Next time around we want the change to show up again fresh -- or do we? I think we do. This means that the tree that we use as a basis for comparison next time around should be the old local-tree plus the local-changes, not the tree as it is now.

---

Now, what about the diff method? We are not storing the local-tree.gz at the remote end any more. Should we? This really emphasises the asymmetry of the present method: one end will have a local-tree.gz and the other a remote-tree.gz. When they sync the other way around, it will be reversed. However, we are only caching the remote-tree and local-tree in this case to save bandwidth when transferring the tree over the connection. Maybe there is some other way.

Probably some kind of a symmetric method of exchanging local-changes files might do the job of keeping sync'd copied of trees on both ends, to use as a basis of comparisons, and this could be symmetrical. But not right now.

---

Panic!! Okay, slight problem with the local-pc thing. I'm only updating local-pc-tree with local-pc-changes, which represents changes that have come over from the other side. But it should also include changes which the user has made on this side, but that have been accepted on the other side. One possible solution might be to exchange local-pc-changes files at the start of the session. Otherwise ...

Fix was to exchange ALL action-commands, and let the other side log those changes as well.

---

Need to think what to do about the sort of local-pc-changes. I either have to rely on an external tool such as 'sort' which might (potentially) not sort things in the right order, e.g. if it gets confused about locales or whatever. The other alternative is to write my own sort tool, but this means duplicating all the work done in 'sort'. The other alternative is to make sure that the local-pc-changes file is in the right order. However, whenever there is a clash, we are manipulating a filename which is slightly later in the sequence than the current file. So this means that other files or even directories may be between that file and here. Also, in general, the user might mess up the order of things in the "actions" file. There is no guarantee that they won't.

Much better to leave this for an external tool that specialises in this kind of thing, if we can trust it. Either that or copy/hack down the sort utility.

Use 'LC_ALL=C sort'

---

Wondering whether to append the machine name to clash file names. This way, it is obvious where the clash came from (except when syncing between directories on the same machine). Something like: xx.CLASH-abunda-XXXXXXXX.

---

Do we want to make backups the default, or an explicit option? -b for rm backups and -bb for rm+cp backups is one option. This means that by default no backups are performed. Is this a good default? Alternatively, we could default to all backups, and have the option of turning them off instead.

Same with *~ *.NO_SYMIRR exclusions -- have these as the standard, and have option to turn them off?

---

xtree-rsync

Thinking of implementing the rsync algorithm to get the remote tree, using local-pc-tree.gz as the basis. Assuming we have some kind of a usable sync previously (that at least some part of the tree is the same), this is a reasonable basis to work from. If the user is holding back a large chunk of the changes manually, it is not so good as the previous remote-tree.gz, but this is a much rarer case. This method also has the advantage that we're not storing too many trees -- only local-pc-changes.gz has to be stored. No more local-tree.gz and remote-tree.gz. We are going to stream all of this, which means that the rsync algorithm can't go dodging around in the file. It can only refer to blocks forward of the last block duplicated.

Process:

This means making a pass locally through local-pc-tree.gz to send the xtree-rsync command checksums, and then making another pass through it locally to pick up the referenced blocks as the data arrives. The data can be processed as it arrives. Since rsync uses MD4 for the blocks, if we use MD5 of the whole file as a final check, we have a final safeguard that everything was fine with the algorithm. Actually, as we can only check this at the end, we would probably have fallen over with some other error before that point.

Actually, we can probably optimise rsync to put the blocks always on \n boundaries, which reduces the number of checks we have to do as well. In other words, we could make a block 20 lines (~700 bytes of file-specs). The list coming back could use the "c<num>" and "s<num>" codes as for xtreediff.

Maybe before implementing this, I should implement zlib compression on the data transfers. This will be required in any case.

---

Thinking about xtree-rsync. It is exactly the same as xtreediff, except that in addition we are sending 20 bytes of checksum info for every 700 bytes of data. This means for 1MB file (which when compressed is 250K), sending 30K of checksums. Well, that's not too bad, I suppose, something like 10% of the full gzipped file transfer. Well, anyway, the point is that it is more data to transfer, so xtreediff is actually superior in transfer-cost.

---

Changing everything over to use just one rdlin/wrlin. This means using alarm() not poll. It also means probably making rdlin return a pointer to its internal buffer, rather than a StrDup'd string. Also, expanding that buffer to take in any size line, up to some fixed maximum, like 64K or something.

---

Testing connection problems. Create a 'tupipe' tool which starts up another command connecting STDIN->STDIN of command, and also connecting back command STDOUT->STDOUT of tupipe, i.e. a two-way pipe. This can simulate failures in various ways:

Options:

Signals are caught, echoed to STDERR and ignored.

---

Thinking about EOF markers. Thinking of using 1:*V as an end-marker, but this has problems because it generates a lot of special-case code to make sure it isn't freed. A more consistent way of handling it would be to have a special EOF string, which comes through in the normal way. We could use DOS-style ^Z, or something more weird like ^E^O^F. Either is fine in this application because we are handling only valid printable text lines. ^Z is easier to check.

---

Currently calling rem_wrlin() etc from local-side, and now switched to use rem_wrlin() on remote side as well. Maybe need a better name for it now. con_wrlin()? pipe_*()

---

Speeding up the exchange. If I start sending more than one command, I get into the risk that I'm sending too much and overflowing buffers on the way. For example, sending a file, I could end up having to process several return codes during that send. This really means having sending and receiving in different threads, or multiplexed somehow.

Maximum speed would mean piping commands and data as fast as possible to the destination, and handling the returns as they come back. A compromise would be allowing to get only a certain number of commands ahead.

A simple compromise would be to allow only a maximum of 4K, say, of commands to get ahead, which could be accomodated in the buffers without problem, without requiring two threads.

If I take the approach of piping as much data as possible, as fast as possible, to the remote end, and then waiting for data to come back, then I also have to think about handling error conditions and ^C. Maybe for ^C we finish the current command sending data, then wait for all the responses to come back, then terminate. In error conditions, there will be

Thinking of changing all commands to act only after receiving the OK return. This means that the remote end is controlling the rate that things happen. We can send everything in chunks of 2K or less, checking the input pipe every write, and acting whenever we have enough input data. This way I don't need threads.