pg_comparator - Online in the Cloud

Run pg_comparator in OnWorks free hosting provider over Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator

This is the command pg_comparator that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator

Run in Ubuntu Run in Fedora Run in Windows Sim Run in MACOS Sim

PROGRAM:

NAME

pg_comparator - efficient table content comparison and synchronization

SYNOPSIS

pg_comparator [options as --help --option --man] conn1 conn2

DESCRIPTION

This script performs a network and time efficient comparison or synchronization of two
possibly large tables in PostgreSQL, MySQL or SQLite databases, so as to detect inserted,
updated or deleted tuples between these tables. The algorithm is efficient especially if
the expected differences are relatively small.

The implementation is quite generic: multi-column keys (but there must be a key!), no
assumption of data types other that they can be cast to text, subset of columns can be
used for the comparison, handling of NULL values...

This script focuses on the comparison algorithm, hence the many options. The fact that it
may do anything useful, such as checking that a replication tool does indeed replicates
your data, or such as synchronizing tables, is a mere side effect.

OPTIONS

Options allow requesting help or to adjust some internal parameters. Short one-letter
options are also available, usually with the first letter of the option name.

"--aggregate=(sum|xor)" or "-a (sum|xor)"
Aggregation function to be used for summaries, either xor or sum. It must operate on
the result of the checksum function. For PostgreSQL and SQLite, the xor aggregate
needs to be loaded. There is a signed/unsigned issue on the key hash when using xor
for comparing tables on MySQL or SQLite vs PostgreSQL. We provide a new "ISUM"
aggregate for SQLite because both "SUM" and "TOTAL" do some incompatible handling of
integer overflows.

Default is sum because it is available by default and works in mixed mode.

"--ask-pass"
Ask for passwords interactively. See also "--env-pass" option below.

Default is not to ask for passwords.

"--asynchronous" or "-A", "--no-asynchronous" or "-X"
Whether to run asynchronous queries. This provides some parallelism, however the two
connections are more or less synchronized per query.

Default is to use asynchronous queries to enable some parallelism.

"--checksum-computation=(create|insert)" or "--cc=..."
How to create the checksum table. Use create to use a "CREATE ... AS SELECT ..."
query, or insert to use a "CREATE ...; INSERT ... SELECT ..." query. The former will
require an additional counting to get the table size, so in the end there are two
queries anyway. There is a type size issue with the insert strategy on MySQL, the
cumulated key string length must be under 64 bytes.

Default is create because it always works for both databases.

"--checksum-function=fun" or "--cf=fun" or "-c fun"
Checksum function to use, either ck, fnv or md5. For PostgreSQL, MySQL and SQLite the
provided ck and fnv checksum functions must be loaded into the target databases.
Choosing md5 does not come free either: the provided cast functions must be loaded
into the target databases and the computation is more expensive.

Default is ck, which is fast, especially if the operation is cpu-bound and the
bandwidth is reasonably high.

"--checksum-size=n" or "--check-size=n" or "--cs=n" or "-z n"
Tuple checksum size, must be 2, 4 or 8 bytes. The key checksum size is always 4 bytes
long.

Default is 8, so that the false negative probability is very low. There should be no
reason to change that.

"--cleanup"
Drop checksum and summary tables beforehand. Useful after a run with "--no-temp" and
"--no-clear", typically used for debugging.

Default is not to drop because it is not needed.

"--clear"
Drop checksum and summary tables explicitly after the computation. Note that they are
dropped implicitly by default when the connection is closed as they are temporary, see
"-(-no)-temporary" option. This option is useful for debugging.

Default is not to clear explicitly the checksum and summary tables, as it is not
needed.

"--debug" or "-d"
Set debug mode. Repeat for higher debug levels. See also "--verbose". Beware that
some safe gards about option settings are skipped under debug so as to allow testing
under different conditions.

Default is not to run in debug mode.

"--env-pass='var'"
Take password from environment variables "var1", "var2" or "var" for connection one,
two, or both. This is tried before asking interactively if "--ask-pass" is also set.

Default is not to look for passwords from environment variables.

"--expect n" or "-e n"
Total number of differences to expect (updates, deletes and inserts). This option is
only used for non regression tests. See the TESTS section.

"--folding-factor=7" or "-f 7"
Folding factor: log2 of the number of rows grouped together at each stage, starting
from the leaves so that the first round always groups as many records as possible. The
power of two allows one to use masked computations. The minimum value of 1 builds a
binary tree.

Default folding factor log2 is 7, i.e. size 128 folds. This default value was chosen
after some basic tests on medium-size cases with medium or low bandwidth. Values from
4 to 8 should be a reasonable choice for most settings.

"--help" or "-h"
Show short help.

"--key-checksum='kcs'" or "--kcs=..."
Use key checksum attribute of this name, which must be already available in the tables
to compare. This option also requires option "--tuple-checksum". See also the
EXAMPLES section below for how to set a checksum trigger. Consider "--use-key"
instead if you already have a reasonably distributed integer primary key.

Default is to build both key and tuple checksums on the fly.

"--lock", "--no-lock"
Whether to lock tables. Setting the option explicitly overrides the default one way
or another. For PostgreSQL, this option requires "--transaction", which is enabled by
default.

Default depends on the current operation: the table is not locked for a comparison,
but it is locked for a synchronization.

"--long-read-len=0" or "-L 0"
Set max size for fetched binary large objects. Well, it seems to be ignored at least
by the PostgreSQL driver.

Default is to keep the default value set by the driver.

"--man" or "-m"
Show manual page interactively in the terminal.

"--max-ratio=0.1"
Maximum relative search effort. The search is stopped if the number of results is
above this threshold expressed relatively to the table size. Use 2.0 for no limit
(all tuples were deleted and new ones are inserted).

Default is 0.1, i.e. an overall 10% difference is allowed before giving up.

"--max-report=n"
Maximum absolute search effort. The search is stopped if the number of differences
goes beyond this threshold. If set, the previous "--max-ratio" option is ignored,
otherwise the effort is computed with the ratio once the table size is known.

Default is to compute the maximum number of reported differences based on the
"--max-ratio" option, with a mimimum of 100 differences allowed.

"--max-levels=0"
Maximum number of levels used. Allows one to cut-off folding. 0 means no cut-off.
Setting a value of 1 would only use the checksum table, without summaries. A value of
3 or 4 would be raisonable, as the last levels of the tree are nice for the
theoretical complexity formula, but do not improve performance in practice.

Default is 0.

"--null='text'"
How to handle NULL values. Either hash to hash all values, where NULL has one special
hash value, or text where NULL values are substituted by the "NULL" string.

Default is text because it is faster.

"--option" or "-o"
Show option summary.

"--pg-copy=128"
Experimental option to use PostgreSQL's COPY instead of INSERT/UPDATE when
synchronizing, by chunks of the specified size.

"--prefix='pgc_cmp'"
Name prefix, possibly schema qualified, used for generated comparison tables by
appending numbers to it. Consider changing the prefix if you expect several
comparisons to run concurrently against the same database.

Default is "pgc_cmp". Cheksum tables is named "pgc_cmp_1_0" and "pgc_cmp_2_0", and
summary tables are named by increasing the last number.

"--report", "--no-report"
Report differing keys to stdout as they are found.

Default is to report.

"--separator='|'" or "-s '|'"
Separator string or character used when concatenating key columns for computing
checksums.

Defaults to the pipe '|' character.

"--size=n"
Assume this value as the table size. It is sufficient for the algorithm to perform
well that this size is in the order of magnitude of the actual table size.

Default is to query the table sizes, which is skipped if this option is set.

"--source-1='DBI:...'", "--source-2='...'" or "-1 '...'", "-2 '...'"
Take full control of DBI data source specification and mostly ignore the comparison
authentication part of the source or target URLs. One can connect with
"DBI:Pg:service=backup", use an alternate driver, set any option allowed by the
driver... See "DBD::Pg" and "DBD:mysql" manuals for the various options that can be
set through the DBI data source specification. However, the database server specified
in the URL must be consistent with this source specification so that the queries'
syntax is the right one.

Default is to rely on the two URL arguments.

"--skip-inserts", "--skip-updates", "--skip-deletes"
When synchronizing, do not perform these operations.

Default under "--synchronize" is to do all operations.

"--stats=(txt|csv)"
Show various statistics about the comparison performed in this format. Also, option
"--stats-name" gives the test a name, useful to generate csv files that will be
processed automatically.

Default is not to show statistics, because it requires additional synchronizations and
is not necessarily interesting to the user.

"--synchronize" or "-S"
Actually perform operations to synchronize the second table wrt the first. Well, not
really, it is only a dry run. It is actually done if you add "--do-it" or "-D". Save
your data before attempting anything like that!

Default is not to synchronize.

"--temporary", "--no-temporary"
Whether to use temporary tables. If you don't, the tables are kept by default at the
end, so they will have to be deleted by hand. See "--clear" option to request a
cleanup. This option is useful for debugging.

Default is to use temporary tables that are automatically wiped out when the
connection is closed.

"--unlogged", "--no-unlogged"
Use unlogged tables for storing checksums. These tables are not transactional, so it
may speed up things a little. However, they are not automatically cleaned up at the
end. See "--clear" option to request a cleanup.

Default is not to use unlogged tables.

"--threads" or "-T", "--no-threads" or "-N"
Highly EXPERIMENTAL feature.

Try to use threads to perform computations in parallel, with some hocus-pocus because
perl thread model does not really work well with DBI. Perl threads are rather heavy
and slow, more like communicating processes than light weight threads, really.

This does NOT work at all with PostgreSQL. It works partially with MySQL, at the
price of turning off "--transaction".

Default is not to use threads, as it does not work for all databases.

"--timeout n"
Timeout comparison after "n" seconds.

Default is no timeout. Be patient.

"--transaction", "--no-transaction"
Whether to wrap the whole algorithm in a single transaction.

Default is to use a wrapping transaction, as it seems to be both faster and safer to
do so.

"--tuple-checksum='tcs'" or "--tcs=..."
Use tuple checksum attribute of this name, which must be already available in the
tables to compare. This option requires to set also either "--use-key" or
"--key-checksum=..." above. The provided checksum attributes must not appear in the
lists of key and value columns. See also the EXAMPLES section below for how to set a
checksum trigger.

Default is to build both key and tuple checksums on the fly.

"--use-key" or "-u"
Whether to directly use the value of the key to distribute tuples among branches. The
key must be simple, integer, not NULL, and evenly distributed. If you have a
reasonably spread integer primary key, consider using this option to avoid half of the
checksum table hash computations.

Default is to hash the key, so as to handle any type, composition and distribution.

"--use-null", "--no-use-null"
Whether to use the information that a column is declared NOT NULL to simplify
computations by avoiding calls to COALESCE to handle NULL values.

Default is to use this information, at the price of querying table metadata.

"--verbose" or "-v"
Be verbose about what is happening. The more you ask, the more verbose.

Default is to be quiet, so that possible warnings or errors stand out.

"--version" or "-V"
Show version information and exit.

"--where=..."
SQL boolean condition on table tuples for partial comparison. Useful to reduce the
load if you know that expected differences are in some parts of your data, say those
time-stamped today... The same condition is passed on both sides, so both tables must
be pretty similar so that it works. This is usually the case.

Default is to compare whole tables.

ARGUMENTS

The two arguments describe database connections with the following URL-like syntax, where
square brackets denote optional parts. Many parts are optional with a default. The minimum
syntactically correct specification is "/", but that does not necessary mean anything
useful.

[driver://][login[:pass]@][host][:port]/[base/[[schema.]table[?key[:cols]]]]

See the EXAMPLES section below, and also the "--source-*" options above.

Note that some default value used by DBI drivers may be changed with driver-specific
environment variables, and that DBI also provides its own defaults and overrides, so what
actually happens may not always be clear. Default values for the second URL are mostly
taken from the first URL.

driver
Database driver to use. Use pgsql for PostgreSQL, mysql for MySQL, sqlite for SQLite.
Heterogeneous databases may be compared and synchronized, however beware that subtle
typing, encoding and casting issues may prevent heterogeneous comparisons or
synchronizations to succeed. Default is pgsql for the first connection, and same as
first for second.

For SQLite, the authentication part of the URL (login, pass, host, port) is expected
to be empty, thus the full URL should look like:

sqlite:///base.db/table?key,col:other,columns

Moreover, setting the PGC_SQLITE_LOAD_EXTENSION environment variable with
":"-separated shared object files loads these into SQLite.

login
Login to use when connecting to database. Default is username for first connection,
and same as first connection for second.

pass
Password to use when connecting to database. Note that it is a bad idea to put a
password as a command argument. Default is none for the first connection, and the
same password as the first connection for the second if the connection targets the
same host, port and uses the same login. See also "--ask-pass" and "--env-pass"
options.

host
Hostname or IP to connect to. Default is the empty string, which means connecting to
the database on localhost with a UNIX socket.

port
TCP-IP port to connect to. Default is 5432 for PostgreSQL and 3306 for MySQL.

base
Database catalog to connect to. Default is username for first connection. Default is
same as first connection for second connection. For SQLite, provide the database file
name. The path is relative by default, but can be made absolute by prepending an
additional '/':

sqlite:////var/cache/sqlite/base.db/table?...

schema.table
The possibly schema-qualified table to use for comparison. No default for first
connection. Default is same as first connection for second connection.

Note that MySQL does not have schemas, but strangely enough their database concept is
just like a schema, so MySQL really does not have databases, although there is
something of that name. Am I clear?

keys
Comma-separated list of key columns. Default is table primary key for first
connection. Default is same as first connection for second connection. The key
cannot be empty. If you do not have a way of identifying your tuples, then there is no
point in looking for differences.

cols
Comma-separated list of columns to compare. May be empty. Default is all columns but
keys for first connection. Default is same as first connection for second connection.
Beware that "...?key:" means an empty cols, while "...?key" sets the default by
querying table metadata.

EXAMPLES

Compare tables calvin and hobbes in database family on localhost, with key id and columns
c1 and c2:

./pg_comparator /family/calvin?id:c1,c2 /family/hobbes

Compare tables calvin in default database on localhost and the same table in default
database on sablons, with key id and column data:

./pg_comparator localhost/family/calvin?id:data sablons/

Synchronize "user" table in database "wikipedia" from MySQL on "server1" to PostgreSQL on
"server2".

./pg_comparator -S -D --ask-pass \
mysql://calvin@server1/wikipedia/user pgsql://hobbes@server2/

For PostgreSQL, you may add trigger-maintained key and tuple checksums as:

-- TABLE Foo(id SERIAL PRIMARY KEY, data ... NOT NULL);
-- add a key and tuple checksum attributes
-- the key checksum can be skipped if you use --use-key,
-- for which the key must be a simple NOT NULL integer.
ALTER TABLE Foo
ADD COLUMN key_cs INT4 NOT NULL DEFAULT 0,
ADD COLUMN tup_cs INT8 NOT NULL DEFAULT 0;
-- function to update the tuple checksum
-- if some attributes may be NULL, they must be coalesced
CREATE FUNCTION foo_cs() RETURNS TRIGGER AS $$
BEGIN
-- compute key checksum
NEW.key_cs = cksum4(NEW.id);
-- compute tuple checksum
NEW.tup_cs = cksum8(NEW.id || '|' || NEW.data);
RETURN NEW;
END; $$ LANGUAGE plpgsql;
-- set trigger to call the checksum update function
CREATE TRIGGER foo_cs_trigger
BEFORE UPDATE OR INSERT ON Foo
FOR EACH ROW EXECUTE PROCEDURE foo_cs();
-- if table Foo is not initially empty,
-- update its contents to trigger checksum computations
UPDATE Foo SET id=id;

Then a fast comparison, which does not need to compute the initial checksum table, can be
requested with:

./pg_comparator --tcs=tup_cs --kcs=key_cs \
admin@server1/app/Foo?id:data hobbes@server2/

As the primary key is a simple integer, the key_cs could be left out and the comparison
could be launched with:

./pg_comparator --tcs=tup_cs --use-key \
admin@server1/app/Foo?id:data hobbes@server2/

OUTPUT

The output of the command consists of lines describing the differences found between the
two tables. They are expressed in term of insertions, updates or deletes and of tuple
keys.

UPDATE k
Key k tuple is updated from table 1 to table 2. It exists in both tables with
different values.

INSERT k
Key k tuple does not appear in table 2, but only in table 1. It must be inserted in
table 2 to synchronize it wrt table 1.

DELETE k
Key k tuple appears in table 2, but not in table 1. It must be deleted from 2 to
synchronize it wrt table 1.

In case of tuple checksum collisions, false negative results may occur. Changing the
checksum function would help in such cases. See the ANALYSIS sub-section.

DEPENDENCES

Three support functions are needed on the database:

1.
The "COALESCE" function takes care of NULL values in columns.

2.
A checksum function must be used to reduce and distribute key and columns values. It may
be changed with the "--checksum" option. Its size can be selected with the
"--checksize" option (currently 2, 4 or 8 bytes). The checksums also require casts to be
converted to integers of various sizes.

Suitable implementations are available for PostgreSQL and can be loaded into the server
by processing "share/contrib/pgc_checksum.sql" and "share/contrib/pgc_casts.sql". New
checksums and casts are also available for MySQL, see "mysql_*.sql". An loadable
implementation of suitable checksum functions is also available for SQLite, see
"sqlite_checksum.*".

The "ck" checksum is based on Jenkins hash <https://en.wikipedia.org/wiki/Jenkins_hash>,
which relies on simple add, shift and xor integer operations. The "fnv" checksum is
inspired by FNV hash <https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash> (64 bits 1a
version) which uses xor and mult integer operations, although I also added some shift
and add to help tweak high bits.

3.
An aggregate function is used to summarize checksums for a range of rows. It must
operate on the result of the checksum function. It may be changed with the
"--aggregate" option.

Suitable implementations of a exclusive-or "xor" aggregate are available for PostgreSQL
and can be loaded into the server by processing "share/contrib/xor_aggregate.sql".

The "sqlite_checksum.*" file also provides a "xor" and "sum" aggregates for SQLite that
are compatible with other databases.

Moreover several perl modules are useful to run this script:

· "Getopt::Long" for option management.

· "DBI", "DBD::Pg" to connect to PostgreSQL, "DBD::mysql" to connect to MySQL, and
"DBD::SQLite" to connect to SQLite.

· "Term::ReadPassword" for "--ask-pass" option.

· "Pod::Usage" for doc self-extraction ("--man" "--opt" "--help").

· "threads" for the experimental threaded version with option "--threads".

· "Digest::MD5" for md5 checksum with SQLite.

Modules are only loaded by the script if they are actually required.

ALGORITHM

The aim of the algorithm is to compare the content of two tables, possibly on different
remote servers, with minimum network traffic. It is performed in three phases.

1.
A checksum table is computed on each side for the target table.

2.
A fist level summary table is computed on each side by aggregating chunks of the
checksum table. Other levels of summary aggregations are then performed till there is
only one row in the last table, which then stores a global checksum for the whole
initial target tables.

3.
Starting from the upper summary tables, aggregated checksums are compared from both
sides to look for differences, down to the initial checksum table. Keys of differing
tuples are displayed.

CHECKSUM TABLE
The first phase computes the initial checksum table T(0) on each side. Assuming that key
is the table key columns, and cols is the table data columns that are to be checked for
differences, then it is performed by querying target table T as follow:

CREATE TABLE T(0) AS
SELECT key AS pk, -- primary key
checksum(key) AS kcs, -- key checksum
checksum(key || cols) AS tcs -- tuple checksum
FROM t;

The initial key is kept, as it will be used to show differing keys at the end. The
rational for the kcs column is to randomize the key-values distribution so as to balance
aggregates in the next phase. The key must appear in the checksum also, otherwise content
exchanged between two keys would not be detected in some cases.

SUMMARY TABLES
Now we compute a set of cascading summary tables by grouping f (folding factor) checksums
together at each stage. The grouping is based on a mask on the kcs column to take
advantage of the checksum randomization. Starting from p=0 we build:

CREATE TABLE T(p+1) AS
SELECT kcs & mask(p+1) AS kcs, -- key checksum subset
XOR(tcs) AS tcs -- tuple checksum summary
FROM T(p)
GROUP BY kcs & mask(p+1);

The mask(p) is defined so that it groups together on average f checksums together: mask(0)
= ceil2(size); mask(p) = mask(p-1)/f; This leads to a hierarchy of tables, each one being
a smaller summary of the previous one:

level 0
checksum table, size rows, i.e. as many rows as the target table.

level 1
first summary table, (size/f) rows.

level p
intermediate summary table, (size/f**p) rows.

level n-1
one before last summary table, less than f rows.

level n
last summary table, mask is 0, 1 row.

It is important that the very same masks are used on both sides so that aggregations are
the same, allowing to compare matching contents on both sides.

SEARCH FOR DIFFERENCES
After all these support tables are built on both sides comes the search for differences.
When checking the checksum summary of the last tables (level n) with only one row, it is
basically a comparison of the checksum of the whole table contents. If they match, then
both tables are equal, and we are done. Otherwise, if these checksums differ, some
investigation is needed to detect offending keys.

The investigation is performed by going down the table hierarchy and looking for all kcs
for which there was a difference in the checksum on the previous level. The same query is
performed on both side at each stage:

SELECT kcs, tcs
FROM T(p)
WHERE kcs & mask(p+1) IN (kcs-with-diff-checksums-from-level-p+1)
ORDER BY kcs [and on level 0: , id];

And the results from both sides are merged together. When doing the merge procedure, four
cases can arise:

1.
Both kcs and tcs match. Then there is no difference.

2.
Although kcs does match, tcs does not. Then this kcs is to be investigated at the next
level, as the checksum summary differs. If we are already at the last level, then the
offending key can be shown.

3.
No kcs match, one supplemental kcs in the first side. Then this kcs correspond to
key(s) that must be inserted for syncing the second table wrt the first.

4.
No kcs match, one supplemental kcs in the second side. Then this kcs correspond to
key(s) that must be deleted for syncing the second table wrt the first.

Cases 3 and 4 are simply symmetrical, and it is only an interpretation to decide whether
it is an insert or a delete, taking the first side as the reference.

ANALYSIS
Let n be the number of rows, r the row size, f the folding factor, k the number of
differences to be detected, c the checksum size in bits, then the costs to identify
differences and the error rate is:

network volume
is better than k*f*ceil(log(n)/log(f))*(c+log(n)). the contents of k blocks of size f
is transferred on the depth of the tree, and each block identifier is of size log(n) and
contains a checksum c. It is independent of r, and you want k<<n. The volume of the
SQL requests is about k*log(n)*ceil(log(n)/log(f)), as the list of non matching
checksums k*log(n) may be dragged on the tree depth.

number of requests (on each side, the algorithm is symmetric)
minimum is 6+ceil(log(n)/log(f)) for equal tables, maximum is 6+2*ceil(log(n)/log(f)).

disk I/O traffic
is about n*r+n*ln(n)*(f/(f-1)).

false negative probability
i.e. part of the tables are considered equal although they are different. With a
perfect checksum function, this is the probability of a checksum collision at any point
where they are computed and should have been different: about
k*ceil(log(n)/log(f))*2**-c. For a million row table, expecting 1000 changes with the
default algorithm parameter values, this is about 2**10 *3/2**64, that is about one
chance in 2**52 merge runs.

The lower the folding factor f the better for the network volume, but the higher the
better for the number of requests and disk I/Os: the choice of f is a tradeoff.

The lower the checksum size c, the better for the network volume, but the worse for the
false negative probability.

If the available bandwidth is reasonable, the comparison will most likely be cpu-bound:
the time is spent mainly on computing the initial checksum table. Thus if you are
planning to check for differences quite often, consider maintaining a tuple checksum with
a trigger, and possibly a key checksum as well, and invoke with "--tuple-checksum" and
either "--key-checksum" or "--use-key".

IMPLEMENTATION ISSUES
The checksum implementation gives integers, which are constant length and easy to
manipulate afterwards.

The xor aggregate is a good choice because there is no overflow issue with it, it takes
into account all bits of the input, and it can easily be defined on any binary data. The
sum aggregate is also okay, but it requires some kind of underlying integer type.

NULL values must be taken care appropriately.

The folding factor and all modules are taken as power of two so as to use a masks.

There is a special management of large chunks of deletes or inserts which is implemented
although not detailed in the algorithmic overview and complexity analysis.

There is some efforts to build a PostgreSQL/MySQL compatible implementation of the
algorithm, which added hacks to deal with type conversions and other stuff.

This script is reasonably tested, but due to its proof of concept nature there is a lot of
options the combination of which cannot all be tested.

NOTE
If the tables to compare are in the same database, a simple SQL query can extract the
differences. Assuming Tables T1 and T2 with primary key id and non null contents data,
then their differences, that is how T2 differs from the reference T1, is summarized by the
following query:

SELECT COALESCE(T1.id, T2.id) AS key,
CASE WHEN T1.id IS NULL THEN 'DELETE'
WHEN T2.id IS NULL THEN 'INSERT'
ELSE 'UPDATE'
END AS operation
FROM T1 FULL JOIN T2 USING (id)
WHERE T1.id IS NULL -- DELETE
OR T2.id IS NULL -- INSERT
OR T1.data <> T2.data -- UPDATE

REFERENCES
A paper was presented at a conference about this tool and its algorithm: Remote Comparison
of Database Tables by Fabien Coelho, In Third International Conference on Advances in
Databases, Knowledge, and Data Applications (DBKDA), pp 23-28, St Marteen, The Netherlands
Antilles, January 2011. ISBN: 978-1-61208-002-4. Copyright IARIA 2011. Online at Think
Mind <http://www.thinkmind.org/index.php?view=article&articleid=dbkda_2011_2_10_30021>.

The algorithm and script was inspired by Taming the Distributed Database Problem: A Case
Study Using MySQL by Giuseppe Maxia in Sys Admin vol 13 num 8, Aug 2004, pp 29-40. See
Perl Monks <http://www.perlmonks.org/index.pl?node_id=381053> for details. In this paper,
three algorithms are presented. The first one compares two tables with a checksum
technique. The second one finds UPDATE or INSERT differences based on a 2-level (checksum
and summary) table hierarchy. The algorithm is asymmetrical, as different queries are
performed on the two tables to compare. It seems that the network traffic volume is in
k*(f+(n/f)+r), that it has a probabilistically-buggy merge procedure, and that it makes
assumptions about the distribution of key values. The third algorithm looks for DELETE
differences based on counting, with the implicit assumption that there are only such
differences.

In contrast to this approach, our fully symmetrical algorithm implements all three tasks
at once, to find UPDATE, DELETE and INSERT between the two tables. The checksum and
summary hierarchical level idea is reused and generalized so as to reduce the algorithmic
complexity.

From the implementation standpoint, the script is as parametric as possible with many
options, and makes few assumptions about table structures, types and values.

Use pg_comparator online using onworks.net services