1. Symirr 2 user notes

Symirr means "symmetrically mirror". It is a recursive file/directory synchronisation tool for mirroring changes made to duplicate copies of the same directory tree kept in two or more locations. This is done in a symmetrical fashion, i.e. there is no sense of a "master copy" -- changes may be made on either side (or both sides) and then sync'd to the other. This is ideal when it is necessary to keep duplicate synchronised copies of mail folders, project directories, etc, on several different machines.

Some history: Symirr 1 was written in Perl and performed an optimal, logically pure symmetrical mirroring operation between directories on two or more machines. However, it required a complete, perfect sync operation every time, which made it fragile in the face of any kind of failure. Also, it could not sync a pair of directories which had been transferred or synchronised using other tools.

This new version 2, written in C, is a more realistic symmetrical mirroring operation which is much more robust and can cope with all the situations above that the original could not. It allows the user to check over and, if necessary, modify all the actions it will take -- using the user's favourite editor. This permits a partial sync if the user wants to deal with some files by hand.

Fully-automated syncs are also safe after the first sync because if files are changed on both machines then they generate "clashes", meaning that extra files will appear in the directories with a .CLASH-* extension which the user can then merge (or otherwise deal with) before the next sync to resolve the situation on both machines. Any files deleted or overwritten in the sync can also be saved in per-session .tgz files as a final safeguard by using the -b option.

This version is also designed to work with a small memory footprint. Everything is streamed -- it is never necessary to store the whole tree in memory. This makes it very much more efficient than some other programs for mirroring small changes in immense trees: for example, a tree of MH-style mail folders, or a large set of project directories.

Advantages:

Handles immense trees without problem, streaming the data, never attempting to keep the whole tree in memory.
Shows exactly what will be done in each sync. Gives user full control to edit that action command-list, permitting partial or modified syncs if required.
Can safely run fully automated after the first sync.
Allows sync between many machines. All problems may be resolved by fixing files on the local machine and syncing again, even problems on far-remote machines (two or more links away).
Copes well with transfers/syncs of files/directories done outside of Symirr (e.g. with rsync or cp -a or tar), so long as modification-time/etc is also synchronised by these tools.

Basis for sync:

Symirr assumes that if the modification-time and size of a particular file are identical on the two machines, then the file contents are also identical. This is a reasonable assumption considering the way most people use their filesystems. The alternative would be to MD5 every file on every sync, but that would make the process prohibitively slow.
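As a rough illustration of this rule, the sketch below compares two local paths by size and modification-time using stat(). The names same_size_mtime() and presumed_identical() are hypothetical and are not taken from the Symirr source. Symirr itself works from tree listings exchanged between the two machines, so treat this only as a picture of the test being applied, not of how the tool is coded.

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper: true if two stat results agree on the two
 * fields that the sync basis relies on (size and modification-time). */
bool same_size_mtime(const struct stat *a, const struct stat *b)
{
    return a->st_size == b->st_size && a->st_mtime == b->st_mtime;
}

/* Hypothetical wrapper: stat both paths and compare them.
 * Returns 1 if the contents are presumed identical, 0 if they are
 * presumed different, or -1 if either path could not be examined. */
int presumed_identical(const char *path_a, const char *path_b)
{
    struct stat a, b;

    if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
        return -1;
    return same_size_mtime(&a, &b) ? 1 : 0;
}
```

Note that a file whose contents change without changing its size or modification-time would be missed by this test; that is the trade-off accepted in exchange for not hashing every file.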

Limitations:

Does not attempt to detect renamed files or directories; these appear to Symirr as a deletion and a new file or directory.
Does not attempt to detect multiply hard-linked files; these appear as independent files to Symirr. Symbolic links are handled, though.
Does not handle special files (fifos, sockets, devices, etc), nor files with names containing control characters. Symirr simply ignores these files.
At present always copies whole files. This means that Symirr is not the best choice for syncing large files of which only a small part changes in any update; however, you can use rsync manually in parallel with Symirr to handle these files if necessary.
At present requires a round-trip on the communications link for each action-command; this ensures good consistency between the two ends in case of failure, but is slower than ideal.

The current code of Symirr has been refactored many many times. I'm happier having a small codebase so that I can make sweeping global changes whenever I find a better approach. To me this is much better than having a huge codebase that I'm afraid to touch for fear of breaking something. This means always looking for the small/simple/efficient/optimal way of getting the job done, rather than going for the comprehensive/extensive-feature-set style, which perhaps explains to some extent why Symirr's design is the way it is.

1.1. Legal

Symirr: a symmetrical file-tree mirroring tool.  Copyright (c)
2005-2006 Jim Peters <http://uazu.net/>, all rights reserved.
Released under the GNU General Public Licence version 2.  See the
file COPYING for details, or visit <http://www.fsf.org>.

"This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
version 2 as published by the Free Software Foundation."

"This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details."

"You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston,
MA 02111-1307 USA"

1.2. Usage and options

Command-line usage:

symirr [options] local-base-dir remote-base-dir
symirr [options] local-base-dir remote-base-dir remote-connect-command

The remote connect command should connect to the remote machine and run symirr with no arguments. As an example: ssh -C jim@sunrise symirr.

If no connect command is provided, then a local instance of Symirr is run as the 'remote' command, using the same path as was used to start the present instance of Symirr. This allows mirroring from one directory to another on the local machine.

Command-line options:

-a

Automated; useful for scripts or cron jobs. Does not ask for confirmation via the editor, and permits only an extended sync, not a first-time sync. (However, -aa overrides this restriction.) Warnings of clashes are sent to STDOUT (which for cron jobs gets E-mailed to the user).

-v

Verbose; with the 'automated' option, all the normal informational messages are disabled because you don't want these triggering unnecessary emails when run as a cron-job. With the -v option, these are re-enabled.

-B

Force a basic, first-time (symmetrical) sync, i.e. ignore any history/state information with this host. See also -S and -M for asymmetrical forced basic syncs.

-b

Back up any files that are overwritten or deleted during the mirroring operation, at both ends. The files are saved to a TGZ file under the _SYMIRR_BASE directory. You can look for this file to recover previous versions if necessary. The TGZ is in standard POSIX "ustar" format, using the POSIX "pax interchange" extensions for anything that won't fit within that format.

-x pattern

Exclude files according to this globbing-style pattern. Only pathnames of files (not directories) are checked against this pattern, which must match the entire path relative to the base directory (with no leading slash). Wildcards are as follows:

?   matches any single character except '/'
*   matches any number of characters excluding '/'
**  matches any number of characters including '/'

Backslash may be used to escape any single special character, including itself.
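The wildcard rules above can be sketched as a small recursive matcher. This is an illustration written from the description only; glob_match() is a hypothetical name and the real matching code in Symirr may be organised quite differently.

```c
#include <stdbool.h>

/* Hypothetical matcher for the pattern rules described above:
 *   ?   any single character except '/'
 *   *   any run of characters not containing '/'
 *   **  any run of characters, '/' included
 *   \c  the character c taken literally
 * The pattern must match the whole of 'str'. */
bool glob_match(const char *pat, const char *str)
{
    for (;;) {
        char p = *pat;

        if (p == '\0')
            return *str == '\0';

        if (p == '*') {
            bool cross = (pat[1] == '*');        /* "**" may cross '/' */
            const char *rest = pat + (cross ? 2 : 1);

            /* Try every length for the wildcard, shortest first */
            for (const char *s = str; ; s++) {
                if (glob_match(rest, s))
                    return true;
                if (*s == '\0' || (!cross && *s == '/'))
                    return false;
            }
        }

        if (p == '?') {
            if (*str == '\0' || *str == '/')
                return false;
        } else {
            if (p == '\\' && pat[1] != '\0')
                p = *++pat;                      /* escaped: literal */
            if (*str != p)
                return false;
        }
        pat++;
        str++;
    }
}
```

With this sketch, the pattern "*.o" matches "main.o" but not "src/main.o", whereas "**.o" matches both. The simple backtracking used here can become slow on patterns with many wildcards, which is acceptable for an illustration but might not be for production use.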

-p pattern

Prune (i.e. exclude) whole directories according to this globbing-style pattern. Only pathnames of directories are checked against this pattern. The check is performed exactly as for '-x'. The directory pathnames to match do not have any trailing slash.

-f

Follow symbolic links within the top-most directory (-f), or for -ff, -fff etc, up to a depth corresponding to the number of 'f's. Normally symbolic links are synced as they are, without following them, but when they are followed, they appear to the other machine as the thing linked-to, not as a symbolic link.

It can be useful to create a base directory full of symbolic links to other directories scattered around the filesystem and then use -f to sync them all. This is one way of syncing a whole set of associated data together. It is also a way of keeping the _SYMIRR_BASE directory from cluttering up the directories being synced.

-m

Map current local user and group to current remote user and group. If I am logged in as user 'jim' in group 'jim', both of these will be mapped to user or group '_', which will in turn be mapped to the remote user and group on the remote server. This allows easy transfer of files between different login accounts accessible by the same person.

-e editor

Call this editor to confirm the action command-list, instead of using $VISUAL or $EDITOR, etc.

-S
-M

Force a basic, first-time, asymmetrical sync. Local is secondary with -S, or primary with -M. The remote side takes the opposite role. Note that normally the default -B is fine for a first sync, but the -S or -M options allow deletions on the primary to be detected if you are sure that no useful changes have been made on the secondary side. In the action-list, warnings are given about any newer files on the secondary, but these newer files will be overwritten with the copy from the primary unless you make manual changes to the action-command list. After using this option for the first sync, drop the option and let Symirr use its extended sync mode and history data to detect changes more precisely.

--all-files

@@@NYI. Do not omit *~ and *.NO_SYMIRR files from the sync.

The editor to use to confirm the action command-list is selected by the first of these found to work:

-e editor command-line option
$VISUAL environment variable
$EDITOR environment variable
editor if this can be found in $PATH
emacs if this can be found in $PATH
vi if this can be found in $PATH

Debugging/testing usage:

symirr [options] -t
symirr [options] -t node-name node-args

Enter a testing mode for the given node, passing in data from the files and arguments listed. Output appears on STDOUT. With no node name given, a usage message is displayed showing the node types available.

1.3. Special files

"_SYMIRR_BASE"

This directory is created at the base of the directory tree being mirrored. It stores status information and backups of previous versions of files and directories. It also contains unique ID strings to identify this directory in repeated exchanges with other machines, to allow deletions and clashes to be detected.

"*.CLASH-########"

Result of two file or directory changes clashing. These files should be examined and the changes merged by hand. The clash file may then be deleted. These manual changes will be carried over to the other machine(s) on the next sync.

"*~"

Emacs-style editor backup files are by default omitted from the mirroring operation.

"*.NO_SYMIRR"

This extension indicates a file or directory which should not be mirrored. It will be ignored completely by Symirr. For files/directories that should not be mirrored, but that need a more user-friendly name (i.e. without .NO_SYMIRR on the end), use a symbolic link to the .NO_SYMIRR file/directory. Alternatively, use the -x and -p options on the command-line to exclude these files or directories from the sync.

1.4. Sync operation

There are really two distinct sync operations done by Symirr: the first sync and subsequent syncs. (See the IMPLEMENTATION section below for full details of the algorithms.)

The first time Symirr does a sync, it does not know anything about the history of the files it finds there. Ideally the first sync you would do with Symirr would be between the original directory tree and an empty directory (i.e. letting Symirr copy the files across for you), or you would be syncing to an identical copy of the file tree that you have copied across yourself, either using cp -a or tar or some other tool. However, it is possible that you made a duplicate copy of the file tree at some point in the past and you have made changes on both machines and now you want to use Symirr. Since there is no history information in this situation, Symirr cannot detect deleted files, and neither can it detect clashes (files changed on both sides). For this reason, on the first sync it is important to check over the action command-list file carefully to see that Symirr is doing what you expect it to be doing, to avoid any possible problems. In some cases Symirr inserts comments into this file which provide alternatives.

However, after the first sync, Symirr can accurately detect updates and deletions, and recognise clearly when clashes occur. At this point it is safe to allow Symirr to run unattended (e.g. to run automatically from cron). If clashes are detected, i.e. if file changes have been made on both machines, then *.CLASH-######## files are generated which allow the situation to be examined and resolved later on with no loss of information. This is a much more complicated and accurate sync operation (see later for the algorithm).

In addition to the default symmetric basic first-time sync (-B), there are also two asymmetric basic first-time syncs, -S 'local-is-secondary' and -M 'local-is-primary'. These are designed to duplicate exactly the version of the tree on the primary, overwriting or deleting anything on the secondary that disagrees. Even a newer file on the secondary will be overwritten with an older version from the primary, although a warning is given in this case. These are intended for the first-time sync when it is certain that no useful changes have been made to the secondary. One example use is to sync an old copy of an MH-style mail tree with the current version, where there may be many deletions required to the old copy which would result in (incorrect) attempts to copy files the other way if the default -B were used. So, -S and -M provide an alternative way to start Symirr off, better in some situations. After this point the option should be dropped, allowing the extended sync mode to take care of all deletions and updates.

1.5. For hackers: picking up from _SYMIRR_BASE corruption

If for some reason the _SYMIRR_BASE history gets corrupted, then rather than resorting to a basic (-B/-S/-M) sync to pick things up again, you might prefer to fix things up manually in the history.

Under _SYMIRR_BASE you will find a directory for each machine this machine has synced with, i.e. for each remote machine with which it holds history. Find the directory that applies to the machine you are interested in. In this directory is a time-stamped directory for each session with that remote machine. In each directory that holds valid history data, there will be both a local-pc-tree.gz and a local-pc-changes file. If either of these files is missing, then the directory is invalid and will be ignored when Symirr is looking for history (e.g. maybe it is the result of a broken connection, or of a sync aborted with ^C). Symirr scans backwards through these directories looking for the first directory with valid history. Knowing this, you can stop Symirr reading corrupt history data by simply deleting it. However, you will have to do this on the remote machine as well otherwise Symirr will complain.

It doesn't matter if you have to take the history back several syncs, because files or directories updated more recently will appear as "new-but-identical", which will not cause a re-sync of these files or directories. You can check all this in the actions list anyway before allowing Symirr to make any updates.

2. IMPLEMENTATION

Internally most of Symirr's operations are performed by various 'operation node' objects which are connected up in different networks depending on the requirements of the activity at hand. These nodes communicate via text strings in various formats described later on, which also appear in the various files generated by Symirr: the trees and action list and so on. In some ways this structure imitates UNIX-style tools and pipes, i.e. small tool-objects taking small well-defined roles in the process and flow of data, but extended to a full network rather than a simple pipe-chain.

2.1. Operation nodes

Each of these operations is a node that consumes input strings and generates output strings. Input strings are passed via pointers to pointer variables which hold either a pointer to a StrDup'd string, or zero. When the input string is consumed, the pointer variable is set to zero. The output strings operate similarly; they are either 0 or a StrDup'd string. End-of-file is indicated by a StrDup'd string containing an end-of-file marker (see SMEOF and isEOF in the source).

Normally the node input string pointers are set to point to output strings of other nodes, allowing data to be passed from one node to another automatically. The node *_do() call tries to do more work, moving data from inputs to outputs. Due to nodes that consume more data than they produce, or produce more than they consume, at times there may be 0 values in these pointers, which indicate "no data available right now". So, if the output pointer variables still contain unconsumed strings, or if the inputs don't contain strings, then no operation is possible, and the *_do() call returns without doing anything.

The *_do() calls also return an integer to indicate if there has been activity during this call. One way of detecting that everything is done is by waiting until all nodes are reporting no activity. The other way is to wait for an EOF marker to come from the final output, although this assumes that there are no other implied outputs (file writes/etc) that might still be pending.
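The node convention described above can be sketched with three toy nodes wired into a chain and driven until none of them reports activity. Everything here is hypothetical: the Source, Upper and Sink types, the DEMO_EOF marker (standing in for SMEOF) and dup_str() (standing in for StrDup) are invented for the illustration and do not come from the Symirr source.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in end-of-file marker; the real one is SMEOF in the source. */
#define DEMO_EOF "\004EOF"

typedef struct { const char **lines; int n, pos; char *out; } Source;
typedef struct { char **in; char *out; } Upper;
typedef struct { char **in; char buf[256]; int done; } Sink;

/* Heap-allocated copy of s (plays the role of StrDup) */
static char *dup_str(const char *s)
{
    char *r = malloc(strlen(s) + 1);
    if (!r) abort();
    return strcpy(r, s);
}

/* Emit the next line, then the EOF marker, whenever 'out' is free */
int source_do(Source *s)
{
    if (s->out || s->pos > s->n) return 0;
    s->out = dup_str(s->pos < s->n ? s->lines[s->pos] : DEMO_EOF);
    s->pos++;
    return 1;
}

/* Move one string from input to output, upper-casing it on the way */
int upper_do(Upper *u)
{
    if (u->out || !*u->in) return 0;   /* output full or no input */
    u->out = *u->in;                   /* take ownership of the string */
    *u->in = NULL;                     /* mark the input as consumed */
    if (strcmp(u->out, DEMO_EOF) != 0)
        for (char *p = u->out; *p; p++)
            *p = (char)toupper((unsigned char)*p);
    return 1;
}

/* Consume strings, appending them to 'buf' until EOF arrives */
int sink_do(Sink *k)
{
    if (k->done || !*k->in) return 0;
    if (strcmp(*k->in, DEMO_EOF) == 0) {
        k->done = 1;
    } else {
        size_t used = strlen(k->buf);
        snprintf(k->buf + used, sizeof k->buf - used, "%s\n", *k->in);
    }
    free(*k->in);
    *k->in = NULL;
    return 1;
}

/* Keep calling every node until a full round produces no activity */
int run_until_idle(Source *s, Upper *u, Sink *k)
{
    int rounds = 0, active;
    do {
        active  = source_do(s);
        active |= upper_do(u);
        active |= sink_do(k);
        rounds++;
    } while (active);
    return rounds;
}
```

To wire the chain, point each node's 'in' at the previous node's 'out' variable, for example Upper u = { &s.out, NULL } and Sink k = { &u.out, "", 0 }. The driver uses the "no activity" test for completion; the sink's 'done' field shows the alternative of watching for the EOF marker at the final output.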

2.1.1. SMLoad: Load lines from a file

SMLoad *xx= smload(fnam);
SMLoad *xx= smload_sorted(fnam); // Sorted alternative
xx->out;
smload_do(xx);
smload_del(xx);

Loads lines one by one from a file, decompressing if the filename ends in .gz. The smload_sorted() function sets up a binary-sorted load of data, by piping the input data through UNIX 'sort'; no .gz files are permitted in this case.

2.1.2. SMSave: Save lines to a file

SMSave *xx= smsave(fnam, &yy->out);
smsave_do(xx);
smsave_del(xx);

Save lines to a file. If filename ends with .gz, then data is compressed.

2.1.3. SMTee: Pipe tee-joint

SMTee *xx= smtee(&yy->out);
xx->out1;
xx->out2;
smtee_do(xx);
smtee_del(xx);

Copies input to two outputs.

2.1.4. SMTsave: Tee off lines to a file

SMTsave *xx= smtsave(fnam, &yy->out, xtr2tr);
xx->out;
smtsave_do(xx);
smtsave_del(xx);

Tee off lines and save to a file. If filename ends with .gz, then data is compressed. If 'xtr2tr' is non-0, then the data is treated as an extended tree, and the extended flags are removed on output to the file, to make it a normal tree.

2.1.5. SMTree: Create tree listing

SMTree *xx= smtree(base);
xx->out;
smtree_do(xx);
smtree_del(xx);

Generate a tree listing. The file system is scanned from 'base'. Symbolic links are followed up to a depth of 'o_follow'. All other symbolic links are copied. Files or directories with .NO_SYMIRR extensions are omitted; *~ backup files are also omitted. Output is a series of lines containing TAB-separated fields, which take the form of a navigation through the directory structure. See later for the tree listing format.

2.1.6. SMComp: Compare two trees to generate a list of changes

# tree + tree -> command-list
SMComp *xx= smcomp(&yy->out, &zz->out);
xx->out; // Output strings
smcomp_do(xx);
smcomp_del(xx);

Compares input trees from two different machines to generate the output action list required to synchronise these two machines. This works on the basis that this is the first sync, i.e. that there is no extended flag information available in these trees. This is a much cruder sync than is possible with SMXcomp, which should be applied when flag information is available (extended trees). This does a basic symmetrical sync (o_basic=='B'), a basic sync with local as secondary ('S'), or a basic sync with local as primary ('M'). See later "action list" section for the format of the action list generated.

2.1.7. SMXcomp: Compare two extended trees to generate a list of changes

# tree + tree -> command-list
SMXcomp *xx= smxcomp(&yy->out, &zz->out);
xx->out; // Output strings
smxcomp_do(xx);
smxcomp_del(xx);

Compares two extended input trees from two different machines to generate the output action list required to synchronise these two machines. This is for the second and following syncs, using the extended flag information to accurately detect deletions and updates. The action list generated is as for SMComp.

2.1.8. SMAct: Carry out a list of actions

# command-list ->
SMAct *xx= smact(&yy->out, base, idts);
smact_do(xx);
smact_del(xx);

Works through the input command-list, performing actions on the local filesystem and remotely (via the already-open command-channel) on the remote machine. 'idts' is the pathname of the ID-timestamp directory under _SYMIRR_BASE to use for dumping the changes made.

2.1.9. SMMapch: Map the local-pc-changes file contents into a tree format

# local-pc-changes -> xtree
SMMapch *xx= smmapch(&changes->out);
xx->out;
smmapch_do(xx);
smmapch_del(xx);

This takes "local-pc-changes" lines, and maps them into an extended tree format, inserting 'D' and 'X' lines to handle navigation between directories. The input file should have been sorted before being passed to this node; this is checked as a precaution. The output is an extended tree. All lines that come from the input file have 'u' or 'd' flags; the generated lines do not. "DEL" lines in the input file generate a 'Fd' line, with nonsense mug/szt; note that this may in fact indicate deletion of a directory, not just a file. Note that where possible, generated 'D' lines will have a mode/user/group derived from the file they are leading to. This is a reasonable default if directories are missing, but will possibly disagree with what is in the real tree. Otherwise nonsense/default values are used. Tree lines output are as follows:

D mug nam
D 000777/0/0 nam
X
Du mug nam
Fu mug nam szt
Fd 000000/0/0 nam 0/00000000
END

2.1.10. SMMerge: Merge a local-pc-changes list into a tree

# tree + xtree -> tree
SMMerge *xx= smmerge(&tree->out, &mapch->out);
xx->out;
smmerge_do(xx);
smmerge_del(xx);

This is used to merge the old tree from last time (first arg) with the list of changes made last time (second arg) to give as output the pre-changes tree as it was after the last sync. The input on the second arg should be the output of SMMapch, i.e. the "local-pc-changes" file converted into a tree.

2.1.11. SMXtree: Generate an extended tree from old/new trees

# tree + tree -> xtree
SMXtree *xx= smxtree(&old->out, &new->out);
smxtree_do(xx);
smxtree_del(xx);

This is used to take the output of SMMerge and the current output of SMTree and create an 'extended tree' which includes flags which show what has changed since the last sync. See later for the format.
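The per-entry decision that such a node has to make can be sketched as follows, assuming each tree entry is reduced to its mode/user/group string and its size/mtime string. The Entry type and extended_flag() are hypothetical names for this illustration; the real node works on streamed tree lines and also has to keep the two trees in step, which is not shown here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reduced view of one tree entry.  For directories,
 * which carry no size/mtime field, 'szt' is the empty string. */
typedef struct {
    const char *mug;    /* "mode/user/group" */
    const char *szt;    /* "size/mtime", or "" for a directory */
} Entry;

/* Returns the extended flag for one name, given its entry in the
 * old (last-sync) tree and in the current tree.  A NULL pointer
 * means the name is absent from that tree.
 *   'n' new, 'd' deleted, 'u' updated, 0 unchanged */
char extended_flag(const Entry *old, const Entry *cur)
{
    if (old == NULL)
        return cur != NULL ? 'n' : 0;
    if (cur == NULL)
        return 'd';
    if (strcmp(old->mug, cur->mug) != 0 || strcmp(old->szt, cur->szt) != 0)
        return 'u';
    return 0;
}
```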

2.1.12. SMDiff: Generate a diff-list from two trees

# tree + tree -> diff-list
SMDiff *xx= smdiff(&old->out, &new->out);
xx->out;
smdiff_do(xx);
smdiff_del(xx);

The purpose of this is to reduce the information transferred from one machine to another when sending a new tree. Just the differences to one already-shared tree are sent. This node generates this difference list. See later for the format of this list.

2.1.13. SMPatch: Combine a diff-list with a tree

# tree + diff-list -> tree
SMPatch *xx= smpatch(&old->out, &diff->out);
xx->out;
smpatch_do(xx);
smpatch_del(xx);

The new tree (second argument to smdiff above) is reconstructed from the old tree and the diff-list.

2.2. Data formats

2.2.1. Normal and extended tree listings

These consist of a series of lines containing TAB-separated fields, which take the form of a navigation through the directory structure. Names are ordered in binary sort-order. All file and directory names are relative to the current directory -- no slashes are ever mentioned. Possible items in the list are:

Dflag mode/user/group dir-name

This specifies a sub-directory within the current directory, and enters that directory. Mode is specified in octal. Users and groups are specified using user or group names rather than UID/GID, if possible.

Fflag mode/user/group file-name size/mtime

This specifies a file in the current directory. 'mtime' is specified as 8 hex digits, 'size' in decimal. Symbolic links come through as a file with a mode of "SYMLNK", and the mtime being a 32-bit hex hash of the link-string prefixed with '#' (as the mtime is not useful for symlinks).

X

This exits the current directory (matching a previous D line), and moves back to the parent.

END [md5-hash]

This marks the end of the list. The md5-hash, if included, is a hash of all the data in the previous lines.

The normal tree (as output by SMTree) does not contain any extended flag after the "F" or "D". The extended tree (as output by SMXtree) does include these markers to indicate changes since the last sync. Extended flags are as follows: "n" (new) indicates that this file/directory was not present at the last sync. "u" (updated) indicates that something about the file/directory has changed since the last sync (mode/user/group/mtime/size). "d" (deleted) indicates that this file/directory has been deleted since the last sync; i.e. it no longer exists. No flag indicates that there has been no change to this file/directory since the last sync. These would therefore appear as "Fn", "Fu", "Fd" and "F" for files, or as "Dn", "Du", "Dd" and "D" for directories.
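A line in this format can be split into its fields with a few lines of C. The sketch below is written from the format description only: TreeLine, take() and parse_tree_line() are hypothetical names, the field sizes are arbitrary, and the mtime is kept as text so that the '#'-prefixed symlink hash needs no special handling.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one tree-listing line */
typedef struct {
    char type;                  /* 'D', 'F', 'X', or 'E' for END */
    char flag;                  /* 'n', 'u', 'd', or 0 for none */
    char mode[16], user[64], group[64];
    char name[256];
    unsigned long long size;    /* files only */
    char mtime[16];             /* files only: hex digits, or #hash */
} TreeLine;

/* Copy the next field (up to 'delim' or end of string) into dst.
 * Returns a pointer just past the field, or NULL if the field is
 * empty or does not fit. */
static const char *take(const char *p, char delim, char *dst, size_t cap)
{
    const char *e = strchr(p, delim);
    size_t len = e ? (size_t)(e - p) : strlen(p);

    if (len == 0 || len >= cap) return NULL;
    memcpy(dst, p, len);
    dst[len] = '\0';
    return e ? e + 1 : p + len;
}

/* Returns 0 on success, -1 if the line is not well-formed */
int parse_tree_line(const char *line, TreeLine *t)
{
    memset(t, 0, sizeof *t);
    if (strcmp(line, "X") == 0) { t->type = 'X'; return 0; }
    if (strncmp(line, "END", 3) == 0 &&
        (line[3] == '\0' || line[3] == '\t')) { t->type = 'E'; return 0; }
    if (line[0] != 'D' && line[0] != 'F') return -1;
    t->type = line[0];

    const char *p = line + 1;
    if (*p != '\t') {                       /* optional extended flag */
        if (*p == '\0' || !strchr("nud", *p)) return -1;
        t->flag = *p++;
    }
    if (*p++ != '\t') return -1;

    p = take(p, '/', t->mode, sizeof t->mode);
    if (p) p = take(p, '/', t->user, sizeof t->user);
    if (p) p = take(p, '\t', t->group, sizeof t->group);
    if (p) p = take(p, '\t', t->name, sizeof t->name);
    if (p && t->type == 'F') {
        char szt[40];
        p = take(p, '\t', szt, sizeof szt);
        if (p && sscanf(szt, "%llu/%15s", &t->size, t->mtime) != 2)
            p = NULL;
    }
    return p ? 0 : -1;
}
```

Splitting strictly on TAB (rather than on any whitespace) matters here, because file names may contain spaces.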

2.2.2. Action list format

This is the format of the output of SMComp and SMXcomp. This is also the data that is presented to the user to check over before the sync is completed. Note that the fields are TAB-separated (to permit spaces in filenames). Also note that commands all start with either << or >>, to indicate which tree of files or machine is being modified. There may also be comments in the output starting with '#' to indicate things that the user may wish to give attention to, such as warnings or clashes.

<<mkdir mode/user/group path

Create a directory on first machine

<<cp mode/user/group path size/mtime

Copy a file from second machine to first

<<chmug mode/user/group path

Change mode/user/group on first machine

<<rm path

Remove a file or entire directory on first machine

<<mv path path

Rename a file on first machine

>>...

As above, but affecting the second machine/tree

#...

Comments
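As an illustration of how a line of this list might be classified after the user has edited it, the sketch below picks out the direction marker and the command name and hands back the remaining TAB-separated arguments untouched. ActTarget and parse_action() are hypothetical names, and the decision to reject unknown commands is an assumption made for the example.

```c
#include <stddef.h>
#include <string.h>

typedef enum { ACT_INVALID, ACT_COMMENT, ACT_FIRST, ACT_SECOND } ActTarget;

/* Hypothetical classifier for one action-list line.
 * On ACT_FIRST / ACT_SECOND, 'cmd' receives the command name and
 * '*args' points at the TAB-separated arguments within 'line'. */
ActTarget parse_action(const char *line, char *cmd, size_t cap,
                       const char **args)
{
    static const char *const known[] = { "mkdir", "cp", "chmug", "rm", "mv" };
    ActTarget target;

    if (cap == 0) return ACT_INVALID;
    cmd[0] = '\0';
    *args = NULL;

    if (line[0] == '#') return ACT_COMMENT;
    if (strncmp(line, "<<", 2) == 0)      target = ACT_FIRST;
    else if (strncmp(line, ">>", 2) == 0) target = ACT_SECOND;
    else return ACT_INVALID;

    /* The command name runs from after the marker to the first TAB */
    const char *tab = strchr(line + 2, '\t');
    if (tab == NULL) return ACT_INVALID;
    size_t len = (size_t)(tab - (line + 2));
    if (len == 0 || len >= cap) return ACT_INVALID;
    memcpy(cmd, line + 2, len);
    cmd[len] = '\0';

    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (strcmp(cmd, known[i]) == 0) {
            *args = tab + 1;
            return target;
        }
    }
    return ACT_INVALID;
}
```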

2.2.3. Action-commands sent over connection

In general, the remote server may send informational messages before any given OK/ERR/etc response. These are formatted with a leading "#", like comments, and should be dumped to STDOUT for the user to see. Also note that where a space appears in the commands below, this indicates a TAB (to permit spaces in filenames). All commands and responses are terminated with a newline character.

On initial connection, the remote command should respond with:

-> OK symirr <symirr-version> on <machine-name>
OK <capability0> <capability1> <capability2> ...

At present no capabilities are defined, but the idea is that if new features are added, Symirr can recognise when the other end supports these features by reading its capability list. The version number and machine name (which may be '???' if not known) are used for display purposes only at present. The second OK may be replaced with ERR if there is some problem.

In case of error for any of the following commands, the response will be "ERR error-description" instead of BIN or OK or NO.

Quit may be used at any point to drop the connection:

quit
-> OK

Drop the connection

Ping may be used at any time to keep the connection alive:

ping
-> OK

Check that the connection is up

The command-line options and remote base directory that apply to the remote end should be provided using the following command. Arguments are given in argv[] style, TAB-separated.

opt <options-tsv> <remote-base-directory>
-> OK
-> ERR

Specify run-time options and base directory which apply to the remote end.

The next exchange of commands with the server should attempt to assign an ID for this session, either a new ID, or an existing ID from a previous session. Also, a date-time stamp is associated with the session. Most of the later commands won't work without an ID and date-time stamp.

conn datetime newid oldid1 oldid2 oldid3 ...
-> OK id datetime
-> OK id
-> NO

First set the date-time stamp that will be associated with this session. Next try to establish an ID to use for this session and future sessions. The remote machine should first scan the list of old IDs and see if it recognises any of them. If so, it should return OK with the ID and the date-time stamp of the last sync as a double-check. If not, it should see if the newid is acceptable, and reply OK if so. If the new ID is not acceptable (e.g. already in use), it replies NO.
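The remote side's choice of reply can be sketched as below. The KnownId table, find_id() and conn_reply() are hypothetical, as is the way the history is represented; the sketch only follows the order of checks described above (old IDs first, then the new ID, otherwise NO) and writes the reply with TAB separators.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record of an ID this machine already holds history
 * for, with the date-time stamp of the last sync under that ID. */
typedef struct {
    const char *id;
    const char *last_sync;
} KnownId;

static const KnownId *find_id(const KnownId *known, size_t n, const char *id)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(known[i].id, id) == 0)
            return &known[i];
    return NULL;
}

/* Build the reply to a "conn" command in 'out'.
 * Returns 1 if an old ID was recognised, 2 if the new ID was
 * accepted, or 0 if the reply is NO. */
int conn_reply(const char *newid,
               const char *const *oldids, size_t n_old,
               const KnownId *known, size_t n_known,
               char *out, size_t cap)
{
    /* First preference: any old ID that we recognise */
    for (size_t i = 0; i < n_old; i++) {
        const KnownId *k = find_id(known, n_known, oldids[i]);
        if (k != NULL) {
            snprintf(out, cap, "OK\t%s\t%s\n", k->id, k->last_sync);
            return 1;
        }
    }
    /* Otherwise accept the new ID unless it is already in use */
    if (find_id(known, n_known, newid) == NULL) {
        snprintf(out, cap, "OK\t%s\n", newid);
        return 2;
    }
    snprintf(out, cap, "NO\n");
    return 0;
}
```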

Once ID and date-time are established, the commands below are available to fetch the remote tree listing.

tree
-> ...
END

Request that the complete normal tree-listing of the remote machine be sent. This is the only option if this is a new ID.

xtree
-> ...
END

Request that the complete extended tree-listing of the remote machine be sent. This is possible only if we had a previous session with this ID.

xtreediff
-> ...
END

Request that a list of differences between the last tree sent and the current extended remote tree be sent. This is possible only if we had a previous session with this ID, and if we synced the same way around. If we synced the other way around, then this is the same as an 'xtree'.

Once the tree is received and changes calculated and approved by the user, the changes on the remote filesystem are performed by sending the action-commands above, as follows:

>>mkdir mode/user/group path
-> OK

Create remote directory

>>chmug mode/user/group path
-> OK

Change mode/user/group for a remote file or directory

>>rm path
-> OK

Remove a remote file or directory (like a rm -r)

>>mv path path
-> OK

Rename a remote file or directory

>>cp mode/user/group path size/mtime
raw-data
-> OK

Send a file or symlink to be saved on the remote machine; overwrites any existing file or symlink with that name. Note that the raw-data is terminated by a newline to verify that sync has not been lost.

<<mkdir mode/user/group path
<<chmug mode/user/group path
<<rm path
<<mv path path
-> OK

Inform the remote end of these changes on the local machine. This is used to acknowledge that changes made on the remote machine have been accepted on the local machine, so that the remote machine can update its local-pc-tree correctly.

<<cp mode/user/group path size/mtime
-> BIN
raw-data

Request that a remote file or symlink be sent (mug/size/mtime are checked before the file is sent). Note that the raw-data is terminated by a newline to verify that sync has not been lost.
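The newline check on raw-data can be sketched as follows, assuming that the number of raw bytes is the 'size' value from the cp command. That assumption, and the name read_raw_data(), are not confirmed by the text above; the point of the sketch is only that reading exactly 'size' bytes and then insisting on a newline detects a stream that has drifted out of step.

```c
#include <stdio.h>

/* Hypothetical reader for the raw-data part of a cp exchange.
 * Copies exactly 'size' bytes from 'in' to 'out' (which may be NULL
 * to discard them), then requires a terminating newline.
 * Returns 0 on success, -1 on short data, write failure, or a
 * missing newline (i.e. the two ends have lost sync). */
int read_raw_data(FILE *in, FILE *out, unsigned long long size)
{
    char buf[4096];

    while (size > 0) {
        size_t want = size < sizeof buf ? (size_t)size : sizeof buf;
        size_t got = fread(buf, 1, want, in);

        if (got == 0)
            return -1;                      /* EOF or read error */
        if (out != NULL && fwrite(buf, 1, got, out) != got)
            return -1;
        size -= got;
    }
    return fgetc(in) == '\n' ? 0 : -1;
}
```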

2.2.4. Tree-diff format

This diff is generated between two trees to allow the second tree to be generated from the first on a remote machine. The normal tree lines (D,F,X,END) appear as normal, with the following extra lines:

cnum

Copy a number of lines from the old tree

snum

Skip a number of lines from the old tree
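Applying such a diff to the old tree can be sketched as below. The sketch assumes that a copy or skip line is the letter 'c' or 's' followed directly by a decimal count (for example "c12"), which is a reading of "cnum" and "snum" rather than something stated in the text. It works on arrays of lines for simplicity, whereas the real node is streamed; apply_tree_diff() is a hypothetical name.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical array-based version of the patch step.
 * 'old' holds the lines of the already-shared tree, 'diff' the
 * diff-list.  Pointers to the resulting lines are stored in 'out'
 * (no strings are copied).  Returns the number of output lines,
 * or -1 if the diff is malformed or 'out' is too small. */
int apply_tree_diff(const char *const *old, size_t n_old,
                    const char *const *diff, size_t n_diff,
                    const char **out, size_t cap)
{
    size_t o = 0;   /* next unread line of the old tree */
    size_t w = 0;   /* next free slot in 'out' */

    for (size_t i = 0; i < n_diff; i++) {
        const char *d = diff[i];

        if (d[0] == 'c' || d[0] == 's') {
            char *end;
            unsigned long n;

            if (!isdigit((unsigned char)d[1])) return -1;
            n = strtoul(d + 1, &end, 10);
            if (*end != '\0' || n > n_old - o) return -1;
            if (d[0] == 'c') {              /* copy n old lines */
                if (n > cap - w) return -1;
                for (unsigned long k = 0; k < n; k++)
                    out[w++] = old[o + k];
            }
            o += n;                         /* copy and skip both advance */
        } else {                            /* ordinary tree line */
            if (w >= cap) return -1;
            out[w++] = d;
        }
    }
    return (int)w;
}
```

Since ordinary tree lines begin with 'D', 'F', 'X' or 'E', they cannot be confused with the copy and skip lines under this assumption.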

2.2.5. Structure of _SYMIRR_BASE data

Under _SYMIRR_BASE is a directory for each ID agreed with a remote machine. There is also a lock file @@@NYI:

LOCK-@@@@

Within each ID directory is a directory for each date-time stamp of a session. Normally the full contents of only the last 2-3 session directories are kept, and the previous ones are automatically deleted or archived. Within the timestamp directory are the following files:

local-pc-tree.gz

The local "pre-change" tree. This is the previous session's local-pc-tree.gz with local-pc-changes applied, i.e. the (theoretical) state of the filesystem before the user started making changes. If the user has made no changes and we did a full sync, then it should match the local filesystem exactly.

local-pc-changes

List of changes actioned locally this session. These are updates to local-pc-tree.gz to give the comparison basis point for the next session. Working in this way, if the user decides to not do a full sync, the changes left unsynced will continue to be recognised in future sync operations.

Note that this file (in general) has to be sorted before it is merged. Even if the commands are sent in the correct sort-order, "mv" commands can mess up the order of changes.

This file, if present, also acts as a flag file to indicate that the local-pc-tree.gz file is complete and valid, i.e. this session has reached the point where it can be used as history data for a future session, even if the session is aborted and no syncing is performed.

actions

List of actions to be performed. This file is edited by the user in normal (non-automatic) operation.

clashes

List of clashes detected in this session.

backups.tgz

Archived versions of all files overwritten or deleted when -b (backup) option is in operation.

remote-tree.gz
local-tree.gz

The remote tree or local tree at the point just after the "tree", "xtree" or "xtreediff" command of this timestamped session. The remote tree is cached at the local end, and the local tree cached at the remote end, to enable a tree-diff to be sent the next time around if the sync is in the same direction. If either of these files is missing, the whole tree must be sent instead.
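The flag-file role of local-pc-changes described above suggests a simple way to pick the history data for a new session: take the newest session directory that contains that file. The Python sketch below assumes that session directory names sort chronologically (as a yyyymmdd-hhmmss style stamp would); the function name and the example stamps are hypothetical.

```python
import os
import tempfile

def latest_valid_session(id_dir):
    """Return the newest session directory whose local-pc-tree.gz is usable."""
    for name in sorted(os.listdir(id_dir), reverse=True):
        session = os.path.join(id_dir, name)
        if not os.path.isdir(session):
            continue  # skip archive files such as actions_clashes.tgz
        # local-pc-changes doubles as the "tree is complete and valid" flag.
        if os.path.exists(os.path.join(session, "local-pc-changes")):
            return session
    return None

with tempfile.TemporaryDirectory() as id_dir:
    for stamp in ("20240101-120000", "20240102-120000"):
        os.mkdir(os.path.join(id_dir, stamp))
    # Only the older session got as far as writing its flag file.
    flag = os.path.join(id_dir, "20240101-120000", "local-pc-changes")
    open(flag, "w").close()
    chosen = os.path.basename(latest_valid_session(id_dir))
```

Here the newer session is passed over because it never reached the point of writing local-pc-changes.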

In the ID directory are also the following archive files:

actions_clashes.tgz

All the old actions and clashes files are archived here.

yyyymmdd-hhmmss-backups.tgz

Backup TGZ files from timestamped directories that have been cleaned up.

2.2.6. Lines in local-pc-changes

tab-sep-path  seq-num Du mode/user/group nam
tab-sep-path  seq-num Fu mode/user/group nam size/mtime
tab-sep-path  seq-num DEL

For updates, these are extended tree lines ('Du' or 'Fu'), but without the full tree navigation to/from the files. Instead a full base-relative pathname (and sequence number) are put on the front to allow easy sorting with any binary or ASCII 'sort' tool. This pathname must use TAB instead of '/' to separate path segments, so that the sort order is fully correct. A double-TAB terminates the path. The sequence number is used to ensure that the last change indeed appears last, in case more than one line appears for the same file/directory. For deletion, only 'DEL' appears after the path and sequence number.
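The effect of the TAB-separated key can be seen in a small Python sketch. Two details here are assumptions rather than facts about Symirr: the sequence number is zero-padded to a fixed width so that a plain bytewise sort orders it numerically, and a single space separates it from the rest of the line. The helper names are hypothetical.

```python
def change_line(path, seq, rest):
    """Build one local-pc-changes line from a '/'-separated path."""
    key = "\t".join(path.split("/")) + "\t\t"  # double TAB ends the path
    return f"{key}{seq:08d} {rest}"

def latest_changes(lines):
    """Sort bytewise, then keep only the last line for each path."""
    latest = {}
    for line in sorted(lines, key=lambda s: s.encode("utf-8")):
        path, _, _ = line.partition("\t\t")
        latest[path] = line  # a higher sequence number overwrites a lower one
    return list(latest.values())

lines = [
    change_line("a/x", 2, "DEL"),
    change_line("a-b", 1, "DEL"),
    change_line("a/x", 1, "Fu mode/user/group nam size/mtime"),
]
result = latest_changes(lines)
```

TAB sorts below every printable character, so the entry under directory "a" comes before the sibling "a-b". With '/' as the separator, "a-b" would sort ahead of "a/x", because '-' has a lower byte value than '/'.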

2.3. How clashes and combinations of changes are handled

When no flag information is available, a basic symmetrical sync is performed by default (option -B). Decisions are based purely on the visible difference between the two trees. This is a much poorer method than the full sync process with update flags (which can only occur after the first sync), so it is advisable to check over the change-command list carefully before letting it run. Sometimes it is not possible to make a clearly 'best' choice, and in this case alternatives are offered as comments in the change-command file. Because deletions cannot be detected, this mode is not suitable for syncing machines with many changes on both sides, or in some cases even on just one side. It is only really good for the initial sync of directories recently copied over, or (with care) for picking up a sync again after loss of the Symirr sync information.

There are also primary (-M) and secondary (-S) variants of the basic sync, in which the sync operation is biased completely to one side, so that by default the secondary is forced to duplicate the contents of the primary, even if the secondary has newer files (although warnings are inserted into the actions list, which the user will hopefully read and check before confirming).

Actions:
  FC: copy primary to secondary, with warning if secondary's file is newer
  FM: copy mug details primary->secondary
  FI: no action
  DM: copy mug details primary->secondary
  DI: no action
  FD: file or directory on primary overwrites file/dir on secondary
  F: if on primary, copy it over; if on secondary, delete it
  D: if on primary, copy it over; if on secondary, delete it

When flag information is available (an extended tree), the decision is more complex, and much more accurate. Note that some situations indicated as 'impossible' below are only impossible if a full sync occurred in the past. However, it is entirely possible that only a partial sync occurred if the user has been editing the action-command lists, so these default to another sensible option:

Info from tree:
  FC: same-name files/symlinks, different mode/user/group/size/mtime
  FM: same-name files/symlinks, different mode/user/group
  FI: same-name files/symlinks, identical (from mug/szt)
  DM: same-name directory, different mode/user/group
  DI: same-name directory, identical (from mug)
  FD: same-name file and directory
  F: file/symlink found on one side only
  D: directory found on one side only

Change flags (generated by comparing old tree with current tree):
  _  no change since last time
  n  new since last time
  u  updated since last time
  d  deleted since last time
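A change flag for a single entry could be derived as in the following Python sketch. It assumes each tree entry can be reduced to a comparable value (here a hypothetical (mug, size, mtime) tuple), with None standing for "not present in this tree".

```python
def change_flag(old, cur):
    """Compare an entry in the old tree with the same entry in the current tree."""
    if old is None and cur is None:
        raise ValueError("entry is absent from both trees")
    if old is None:
        return "n"  # new since last time
    if cur is None:
        return "d"  # deleted since last time
    return "_" if old == cur else "u"  # unchanged, or updated

# Hypothetical entries: (mode/user/group, size, mtime)
before = ("644/jim/users", 120, 1000)
after = ("644/jim/users", 150, 2000)
flags = [change_flag(before, before), change_flag(None, after),
         change_flag(before, after), change_flag(before, None)]
```

One flag is computed on each side, which gives the two-character suffixes such as _u or dn used in the combination codes.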

Actions for different combinations:
  FC__ impossible, treat as FCnn
  FCnn clash
  FCuu clash
  FCdd no action
  FC_n impossible, treat as FC_u
  FC_u copy bb->aa
  FC_d delete aa
  FCdn impossible, treat as FCdu
  FCdu file on bb becomes a clash file on both
  FCnu impossible, treat as FCnn

  FM__ impossible, treat as FMnn
  FMnn clash
  FMuu clash
  FMdd no action
  FM_n impossible, treat as FM_u
  FM_u chmug aa from bb's details
  FM_d delete aa
  FMdn impossible, treat as FMdu
  FMdu file on bb becomes a clash file on both
  FMnu impossible, treat as FMnn

  FI__ no action
  FInn no action (assuming sync'd identically by other means)
  FIuu no action (assuming sync'd identically by other means)
  FIdd no action
  FI_n impossible, no action
  FI_u impossible, no action
  FI_d delete aa
  FIdn impossible, treat as FIdu
  FIdu file on bb becomes a clash file on both
  FInu impossible, no action

  DM__ impossible, treat as DMnn
  DMnn clash
  DMuu clash
  DMdd no action
  DM_n impossible, treat as DM_u
  DM_u chmug aa from bb's details
  DM_d delete aa
  DMdn impossible, treat as DMdu
  DMdu directory on bb becomes a clash directory on both
  DMnu impossible, treat as DMnn

  DI__ no action
  DInn no action (assuming sync'd identically by other means)
  DIuu no action (assuming sync'd identically by other means)
  DIdd no action
  DI_n impossible, no action
  DI_u impossible, no action
  DI_d delete aa
  DIdn impossible, treat as DIdu
  DIdu directory on bb becomes a clash directory on both
  DInu impossible, no action

  FD__ impossible, treat as clash
  FDnn clash
  FDuu impossible, treat as clash
  FDdd no action
  FD_n copy bb->aa; new directory replaces file
  FD_u impossible, treat as FD_n
  FD_d impossible, treat as clash (copy aa->bb as clash file)
  FDdn copy bb->aa
  FDdu impossible, treat as FDdn
  FDnu impossible, treat as clash
  FDn_ copy aa->bb; new file replaces directory
  FDu_ impossible, treat as FDn_
  FDd_ impossible, treat as clash (copy bb->aa as clash directory)
  FDnd copy aa->bb
  FDud impossible, treat as FDnd
  FDun impossible, treat as clash

  F_ impossible, treat as Fn
  Fn copy aa->bb
  Fu impossible, treat as Fn
  Fd impossible, no action

  D_ impossible, treat as Dn
  Dn copy aa->bb
  Du impossible, treat as Dn
  Dd impossible, no action

The action in case of a clash is to choose one file/dir and sync that on both sides, and to rename the other to the same name plus .CLASH-xxxxxxxx, and also sync that to both sides. Then, later on, the user can view the .CLASH file/dir on either machine, and either delete it, merge changes, or copy it over the original file. At this point doing a re-sync will duplicate these fixes on the other machine.

The choice of which file to consider as the clash (when there is a choice) could be made in some cases by comparing the mtime, but instead Symirr defaults to making all the clashes come from the remote machine, where this is possible. This may seem to break the symmetry of the operation slightly, but at least it means that when the user is merging changes, he/she will probably be dealing with all the changes coming from the same direction, which is potentially less confusing. The only exception to this rule is in the case where a file/dir update clashes with a deletion of the same file/dir, when the update always becomes the clash, and the deletion is mirrored.

Note that the .CLASH-xxxxxxxx names are generated from a hash of the current time including fractions of a second. They need to be unique only on a per-pathname basis. The best option would be to scan ahead in the tree to guarantee that a clash name is unique, but this is not feasible in practice because in extreme cases (e.g. clashing top-level directories) almost all of the tree may have to be scanned. So a hash is used instead. A collision should be extremely unlikely, and even then it would only happen if someone has been letting CLASH files accumulate.
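A clash name of this shape might be generated as in the Python sketch below. The hash function used by Symirr is not stated, so MD5 is only a stand-in here, and treating xxxxxxxx as eight hexadecimal digits is likewise an assumption. The function name is hypothetical.

```python
import hashlib
import time

def clash_name(path, now=None):
    """Append a .CLASH-xxxxxxxx suffix derived from the current time."""
    if now is None:
        now = time.time()  # float, includes fractions of a second
    digest = hashlib.md5(repr(now).encode("ascii")).hexdigest()
    return f"{path}.CLASH-{digest[:8]}"

name = clash_name("notes.txt", now=1700000000.25)
suffix = name.rsplit("-", 1)[1]
```

With eight hex digits there are 2^32 possible suffixes, and a collision only matters between clash files that share the same original pathname.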

The rule-set can be reduced down as follows for the implementation:

General rules:
  ??__ becomes ??nn
  ??uu becomes ??nn
  ??dd no action
  ??_n becomes ??_u
  ??dn becomes ??du
  ??nu becomes ??nn
  ??n_ becomes ??u_
  ??nd becomes ??ud
  ??un becomes ??nn

Actions for remaining combinations after mapping above:
  FCnn clash
  FC_u copy bb->aa
  FC_d delete aa
  FCdu file on bb becomes a clash file on both

  FMnn clash
  FM_u chmug aa from bb's details
  FM_d delete aa
  FMdu file on bb becomes a clash file on both

  FInn no action
  FI_u impossible, no action
  FI_d delete aa
  FIdu file on bb becomes a clash file on both

  DMnn clash
  DM_u chmug aa from bb's details
  DM_d delete aa
  DMdu directory on bb becomes a clash directory on both

  DInn no action
  DI_u impossible, no action
  DI_d delete aa
  DIdu directory on bb becomes a clash directory on both

  FDnn clash
  FD_u copy bb->aa; new directory replaces file
  FD_d impossible, treat as clash (copy aa->bb as clash file)
  FDdu copy bb->aa
  FDu_ copy aa->bb; new file replaces directory
  FDd_ impossible, treat as clash (copy bb->aa as clash dir)
  FDud copy aa->bb

  F_ impossible, treat as Fn
  Fn copy aa->bb
  Fu impossible, treat as Fn
  Fd impossible, no action

  D_ impossible, treat as Dn
  Dn copy aa->bb
  Du impossible, treat as Dn
  Dd impossible, no action
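The reduced rule-set above maps naturally onto two lookup tables. The Python sketch below shows one possible shape and is not the actual C implementation. One assumption is labelled in the code: for every type except FD, a combination missing from the table (such as FCu_) is treated as the mirror image of a listed one, with aa and bb exchanged.

```python
GENERAL = {"__": "nn", "uu": "nn", "_n": "_u", "dn": "du", "nu": "nn",
           "n_": "u_", "nd": "ud", "un": "nn"}

FILE_CLASH = "file on bb becomes a clash file on both"
DIR_CLASH = "directory on bb becomes a clash directory on both"
CHMUG = "chmug aa from bb's details"

ACTIONS = {
    "FCnn": "clash", "FC_u": "copy bb->aa", "FC_d": "delete aa", "FCdu": FILE_CLASH,
    "FMnn": "clash", "FM_u": CHMUG, "FM_d": "delete aa", "FMdu": FILE_CLASH,
    "FInn": "no action", "FI_u": "impossible, no action",
    "FI_d": "delete aa", "FIdu": FILE_CLASH,
    "DMnn": "clash", "DM_u": CHMUG, "DM_d": "delete aa", "DMdu": DIR_CLASH,
    "DInn": "no action", "DI_u": "impossible, no action",
    "DI_d": "delete aa", "DIdu": DIR_CLASH,
    "FDnn": "clash",
    "FD_u": "copy bb->aa; new directory replaces file",
    "FD_d": "impossible, treat as clash (copy aa->bb as clash file)",
    "FDdu": "copy bb->aa",
    "FDu_": "copy aa->bb; new file replaces directory",
    "FDd_": "impossible, treat as clash (copy bb->aa as clash dir)",
    "FDud": "copy aa->bb",
}

def swap_sides(text):
    """Exchange 'aa' and 'bb' in an action description."""
    return text.replace("aa", "\0").replace("bb", "aa").replace("\0", "bb")

def decide(code):
    """Map a combination code such as 'FC_n' or 'Fu' to its action."""
    if len(code) == 2:  # one-sided entry: F or D plus a single flag
        return "impossible, no action" if code[1] == "d" else "copy aa->bb"
    kind, flags = code[:2], code[2:]
    if flags == "dd":
        return "no action"
    flags = GENERAL.get(flags, flags)  # apply the general rules once
    if kind + flags in ACTIONS:
        return ACTIONS[kind + flags]
    # ASSUMPTION: unlisted combinations are mirrors of listed ones.
    return swap_sides(ACTIONS[kind + flags[::-1]])
```

For example, decide("FC_n") first becomes FC_u under the general rules and then yields "copy bb->aa", while decide("FDun") reduces to FDnn and yields "clash".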

3. BD: Brain-dump

[Note: These are my notes to myself, written to help clarify things in my head. Most of them apply to old problems which are now resolved and handled in some other way. All of this is probably only useful to me, to recall some of my trains of thought. --Jim]

---

The remote side does not know the local tree. This means that if we do the symirr operation from the other side, we can do an extended sync, but not exchange a tree-diff. Worth trying to fix this up? Or always insist on doing it that one way.

---

Need to get clearer about how to handle reconnecting later on. Since we're allowing the user to edit the sequence of actions that is going to be performed, there is no guarantee that we have a perfect sync, ever. So, we will always be doing a combination of the smcomp and more advanced options.

So, I'm thinking of adding "hint" information to the tree that indicates changes made since the last sync, e.g. files that have definitely been updated, directories/files deleted, etc. Then smcomp can judge by both sets of information.

The next problem is maintaining a "last sync" point on both ends. The first time they can exchange a full tree. At each end, the initial tree plus the changes made in that session is the starting point for detecting changes at the next session. I'm not counting on the session ending in any sensible manner, so maybe we are not going to get to a point of rewriting the tree with the changes made.

Let's say we establish a common ID for future exchanges. This becomes a directory under _SYMIRR_BASE. In this we store the tree.gz before changes were made. We also store the change commands one by one as they are performed. Next time we can work through the tree and change-list and generate the tree as it must have been after all the changes had been performed.

Handshake goes like this: "Connect me with one of these IDs", "Okay, with ID xxx".

Two ways I can do this:

Keep copy of tree from last tree/treediff
Keep copy of all changes made in this session
Next session, merge changes with old tree, compare with current tree, then do diff with old tree to send result

or:

Update copy of tree with all changes in this session, as we make them; this way the tree reflects the actual state right now
Next session, compare with current tree, do diff with old tree to send result

Probably easier the first way.

Problem -- what about "mv" commands. This was included to support creating .CLASH files. The problem is that the rename in this case will convert a current tree item into a tree item maybe several items past (in the case of multiple numbered .CLASH files). Maybe I will have to sort in any case.

---

Thinking about what happens when we do a partial sync, when we delete (effectively cancel) a certain set of commands. Next time around we want the change to show up again fresh -- or do we? I think we do. This means that the tree that we use as a basis for comparison next time around should be the old local-tree plus the local-changes, not the tree as it is now.

---

Now, what about the diff method? We are not storing the local-tree.gz at the remote end any more. Should we? This really emphasises the asymmetry of the present method: one end will have a local-tree.gz and the other a remote-tree.gz. When they sync the other way around, it will be reversed. However, we are only caching the remote-tree and local-tree in this case to save bandwidth when transferring the tree over the connection. Maybe there is some other way.

Probably some kind of a symmetric method of exchanging local-changes files might do the job of keeping sync'd copies of trees on both ends, to use as a basis of comparisons, and this could be symmetrical. But not right now.

---

Panic!! Okay, slight problem with the local-pc thing. I'm only updating local-pc-tree with local-pc-changes, which represents changes that have come over from the other side. But it should also include changes which the user has made on this side, but that have been accepted on the other side. One possible solution might be to exchange local-pc-changes files at the start of the session. Otherwise ...

Fix was to exchange ALL action-commands, and let the other side log those changes as well.

---

Need to think what to do about the sort of local-pc-changes. One option is to rely on an external tool such as 'sort', which might (potentially) not sort things in the right order, e.g. if it gets confused about locales or whatever. Another is to write my own sort tool, but this means duplicating all the work done in 'sort'. A third is to make sure that the local-pc-changes file is written in the right order in the first place. However, whenever there is a clash, we are manipulating a filename which is slightly later in the sequence than the current file, so other files or even directories may lie between that file and here. Also, in general, the user might mess up the order of things in the "actions" file. There is no guarantee that they won't.

Much better to leave this for an external tool that specialises in this kind of thing, if we can trust it. Either that or copy/hack down the sort utility.

Use 'LC_ALL=C sort'

---

Wondering whether to append the machine name to clash file names. This way, it is obvious where the clash came from (except when syncing between directories on the same machine). Something like: xx.CLASH-abunda-XXXXXXXX.

---

Do we want to make backups the default, or an explicit option? -b for rm backups and -bb for rm+cp backups is one option. This means that by default no backups are performed. Is this a good default? Alternatively, we could default to all backups, and have the option of turning them off instead.

Same with *~ *.NO_SYMIRR exclusions -- have these as the standard, and have option to turn them off?

---

xtree-rsync

Thinking of implementing the rsync algorithm to get the remote tree, using local-pc-tree.gz as the basis. Assuming we have some kind of a usable sync previously (that at least some part of the tree is the same), this is a reasonable basis to work from. If the user is holding back a large chunk of the changes manually, it is not so good as the previous remote-tree.gz, but this is a much rarer case. This method also has the advantage that we're not storing too many trees -- only local-pc-changes.gz has to be stored. No more local-tree.gz and remote-tree.gz. We are going to stream all of this, which means that the rsync algorithm can't go dodging around in the file. It can only refer to blocks forward of the last block duplicated.

Process:

xtree-rsync\n<block-checksum-data>\n
-><gzipped-list-of-block-references-and-data><md5-checksum>

This means making a pass locally through local-pc-tree.gz to send the xtree-rsync command checksums, and then making another pass through it locally to pick up the referenced blocks as the data arrives. The data can be processed as it arrives. Since rsync uses MD4 for the blocks, if we use MD5 of the whole file as a final check, we have a final safeguard that everything was fine with the algorithm. Actually, as we can only check this at the end, we would probably have fallen over with some other error before that point.

Actually, we can probably optimise rsync to put the blocks always on \n boundaries, which reduces the number of checks we have to do as well. In other words, we could make a block 20 lines (~700 bytes of file-specs). The list coming back could use the "c<num>" and "s<num>" codes as for xtreediff.
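Checksumming on line boundaries could look like the Python sketch below. MD5 is used only as a stand-in for the per-block checksum, and the function name and placeholder tree lines are hypothetical. The block size of 20 lines comes from the note above.

```python
import hashlib

def block_checksums(lines, lines_per_block=20):
    """Return one digest per block of `lines_per_block` whole lines."""
    sums = []
    for start in range(0, len(lines), lines_per_block):
        chunk = lines[start:start + lines_per_block]
        block = "".join(line + "\n" for line in chunk)
        sums.append(hashlib.md5(block.encode("utf-8")).digest())
    return sums

# 45 placeholder tree lines give blocks of 20, 20 and 5 lines.
tree = [f"F file{i}" for i in range(45)]
sums = block_checksums(tree)

# Changing one line only disturbs the checksum of the block containing it.
changed = list(tree)
changed[25] = "F renamed"
sums_changed = block_checksums(changed)
```

Because blocks always start on a line boundary, the other end only needs to test for a match at the start of each line rather than at every byte offset.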

Maybe before implementing this, I should implement zlib compression on the data transfers. This will be required in any case.

---

Thinking about xtree-rsync. It is exactly the same as xtreediff, except that in addition we are sending 20 bytes of checksum info for every 700 bytes of data. This means that for a 1MB file (which is 250K when compressed) we send about 30K of checksums. Well, that's not too bad, I suppose, something like 10% of the full gzipped file transfer. Well, anyway, the point is that it is more data to transfer, so xtreediff is actually superior in transfer-cost.
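The estimate can be checked with a few lines of Python. The sketch takes 1MB as 1,000,000 bytes and 250K as 250,000 bytes, which is an assumption about the rounding intended above.

```python
def checksum_overhead(tree_bytes, block_bytes=700, checksum_bytes=20):
    """Bytes of checksum data needed to cover a tree of the given size."""
    blocks = -(-tree_bytes // block_bytes)  # ceiling division
    return blocks * checksum_bytes

extra = checksum_overhead(1_000_000)  # 1429 blocks * 20 bytes = 28580 bytes
ratio = extra / 250_000               # relative to the gzipped tree, about 0.114
```

So the checksums come to a little under 30K, or roughly 11% of the compressed transfer, in line with the rough figures quoted.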

---

Changing everything over to use just one rdlin/wrlin. This means using alarm() not poll. It also means probably making rdlin return a pointer to its internal buffer, rather than a StrDup'd string. Also, expanding that buffer to take in any size line, up to some fixed maximum, like 64K or something.

---

Testing connection problems. Create a 'tupipe' tool which starts up another command connecting STDIN->STDIN of command, and also connecting back command STDOUT->STDOUT of tupipe, i.e. a two-way pipe. This can simulate failures in various ways:

Drop all STDIN communication after xx lines
Drop all STDOUT communication after xx lines
Block STDIN, don't accept more data, after xx lines
Block STDOUT, don't accept more data, after xx lines
Die completely after xx STDIN lines
Die completely after xx STDOUT lines

Options:

-o <cnt>      # Trigger event after <cnt> output lines
-i <cnt>      # Trigger event after <cnt> input lines
-D            # Read all data, but don't pass it on; drop it
-B            # Block all data; refuse to read any more data, but stay alive
-X            # Exit, die completely

Signals are caught, echoed to STDERR and ignored.

---

Thinking about EOF markers. Thinking of using 1:*V as an end-marker, but this has problems because it generates a lot of special-case code to make sure it isn't freed. A more consistent way of handling it would be to have a special EOF string, which comes through in the normal way. We could use DOS-style ^Z, or something more weird like ^E^O^F. Either is fine in this application because we are handling only valid printable text lines. ^Z is easier to check.
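Treating the end-marker as an ordinary line keeps the reading loop simple, as this Python sketch suggests. It assumes the marker is a line consisting of a single ^Z character; the names are hypothetical.

```python
EOF_MARK = "\x1a"  # ^Z, never part of a valid printable text line

def read_until_eof(lines):
    """Yield lines up to, but not including, the EOF marker line."""
    for line in lines:
        if line == EOF_MARK:
            return
        yield line
    # Running out of input without seeing the marker means the stream was cut.
    raise EOFError("stream ended without the EOF marker")

received = list(read_until_eof(["D docs", "F a.txt", "\x1a", "ignored"]))
```

No special-case memory handling is needed, because the marker arrives and is released like any other line.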

---

Currently calling rem_wrlin() etc from local-side, and now switched to use rem_wrlin() on remote side as well. Maybe need a better name for it now. con_wrlin()? pipe_*()

---

Speeding up the exchange. If I start sending more than one command, I get into the risk that I'm sending too much and overflowing buffers on the way. For example, sending a file, I could end up having to process several return codes during that send. This really means having sending and receiving in different threads, or multiplexed somehow.

Maximum speed would mean piping commands and data as fast as possible to the destination, and handling the returns as they come back. A compromise would be to allow the sender to get only a certain number of commands ahead.

A simple compromise would be to allow only a maximum of 4K, say, of commands to get ahead, which could be accommodated in the buffers without problem, without requiring two threads.
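The bounded look-ahead can be modelled without any real I/O. The Python sketch below only works out the order of sends and reply waits. It assumes one reply per command and counts command bytes only (not file data), and the function name is hypothetical.

```python
from collections import deque

def schedule(commands, window=4096):
    """Return the order of 'send' and 'recv' events for a bounded look-ahead."""
    events = []
    in_flight = deque()  # sizes of commands sent but not yet answered
    pending = 0          # total bytes currently in flight
    for cmd in commands:
        size = len(cmd) + 1  # command text plus its newline
        # Collect replies until this command fits inside the window.
        while in_flight and pending + size > window:
            pending -= in_flight.popleft()
            events.append("recv")
        in_flight.append(size)
        pending += size
        events.append("send")
    events.extend("recv" for _ in in_flight)  # drain the remaining replies
    return events

# Four 1500-byte commands: at most two fit in a 4K window at once.
events = schedule(["x" * 1499] * 4)
```

The sender never has more than the window's worth of unanswered commands outstanding, so a single thread can alternate between writing and reading.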

If I take the approach of piping as much data as possible, as fast as possible, to the remote end, and then waiting for data to come back, then I also have to think about handling error conditions and ^C. Maybe for ^C we finish the current command sending data, then wait for all the responses to come back, then terminate. In error conditions, there will be

<<mkdir   act, log, send-recv-OK
<<chmug   act, log, send-recv-OK
<<rm      act, log, send-recv-OK
<<mv      act, log, send-recv-OK
<<cp      send-recv-data, act, log
>>mkdir   send-recv-OK, log
>>chmug   send-recv-OK, log
>>rm      send-recv-OK, log
>>mv      log, send-recv-OK
>>cp      send-data-recv-OK, log

Thinking of changing all commands to act only after receiving the OK return. This means that the remote end is controlling the rate that things happen. We can send everything in chunks of 2K or less, checking the input pipe every write, and acting whenever we have enough input data. This way I don't need threads.

<<mkdir   send-recv-OK, act, log
<<chmug   send-recv-OK, act, log
<<rm      send-recv-OK, act, log
<<mv      send-recv-OK, act, log
<<cp      send-recv-data, act, log
>>mkdir   send-recv-OK, log
>>chmug   send-recv-OK, log
>>rm      send-recv-OK, log
>>mv      send-recv-OK, log
>>cp      send-data-recv-OK, log