From 5aa2350426c4fdb3d04568b65aadac397012bbcb Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 29 Apr 2015 19:30:53 +0200
Subject: Introduce replication progress tracking infrastructure.

When implementing a replication solution on top of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
  e.g. to avoid loops in bi-directional replication setups

The solution to these problems, as implemented here, consists of
three parts:

1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
   replication origin, how far replay has progressed in an efficient and
   crash safe manner.
3) The ability to filter out changes performed at the behest of a
   replication origin during logical decoding; this allows complex
   replication topologies, e.g. by filtering out all replayed changes.

Most of this could also be implemented in "userspace", e.g. by inserting
additional rows containing origin information, but that ends up being much
less efficient and more complicated.  We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.

This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there are only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL.  Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.

For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.
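
As a rough illustration, an apply process might use the SQL-level
functions documented in this patch along these lines. The origin name
'mysolution_node_a', the LSN and the timestamp are made up for the
example:

    -- once, on the node that applies the changes
    SELECT pg_replication_origin_create('mysolution_node_a');

    -- in the session replaying remote transactions
    SELECT pg_replication_origin_session_setup('mysolution_node_a');

    BEGIN;
    -- remote commit LSN and commit timestamp of the replayed transaction
    SELECT pg_replication_origin_xact_setup('0/16B3748',
                                            '2015-04-29 19:30:53+02');
    -- ... apply the remote transaction's changes here ...
    COMMIT;

    -- e.g. after a crash, find out where to resume
    SELECT pg_replication_origin_progress('mysolution_node_a', true);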

Bumps both catversion and wal page magic.

Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
    20140923182422.GA15776@alap3.anarazel.de,
    20131114172632.GE7522@alap2.anarazel.de
---
 doc/src/sgml/catalogs.sgml            | 123 +++++++++++++++++++++
 doc/src/sgml/filelist.sgml            |   1 +
 doc/src/sgml/func.sgml                | 201 +++++++++++++++++++++++++++++++++-
 doc/src/sgml/logicaldecoding.sgml     |  35 +++++-
 doc/src/sgml/postgres.sgml            |   1 +
 doc/src/sgml/replication-origins.sgml |  93 ++++++++++++++++
 6 files changed, 448 insertions(+), 6 deletions(-)
 create mode 100644 doc/src/sgml/replication-origins.sgml

(limited to 'doc/src')
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 898865eea19..4b79958b357 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -238,6 +238,16 @@
       <entry>query rewrite rules</entry>
      </row>
 
+     <row>
+      <entry><link linkend="catalog-pg-replication-origin"><structname>pg_replication_origin</structname></link></entry>
+      <entry>registered replication origins</entry>
+     </row>
+
+     <row>
+      <entry><link linkend="catalog-pg-replication-origin-status"><structname>pg_replication_origin_status</structname></link></entry>
+      <entry>information about replication origins, including replication progress</entry>
+     </row>
+
      <row>
       <entry><link linkend="catalog-pg-replication-slots"><structname>pg_replication_slots</structname></link></entry>
       <entry>replication slot information</entry>
@@ -5337,6 +5347,119 @@
 
  </sect1>
 
+ <sect1 id="catalog-pg-replication-origin">
+  <title><structname>pg_replication_origin</structname></title>
+
+  <indexterm zone="catalog-pg-replication-origin">
+   <primary>pg_replication_origin</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_replication_origin</structname> catalog contains
+   all replication origins created.  For more on replication origins
+   see <xref linkend="replication-origins">.
+  </para>
+
+  <table>
+
+   <title><structname>pg_replication_origin</structname> Columns</title>
+
+   <tgroup cols="4">
+    <thead>
+     <row>
+      <entry>Name</entry>
+      <entry>Type</entry>
+      <entry>References</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry><structfield>roident</structfield></entry>
+      <entry><type>Oid</type></entry>
+      <entry></entry>
+      <entry>A unique, cluster-wide identifier for the replication
+      origin. Should never leave the system.</entry>
+     </row>
+
+     <row>
+      <entry><structfield>roname</structfield></entry>
+      <entry><type>text</type></entry>
+      <entry></entry>
+      <entry>The external, user defined, name of a replication
+      origin.</entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+ </sect1>
+
+  <sect1 id="catalog-pg-replication-origin-status">
+  <title><structname>pg_replication_origin_status</structname></title>
+
+  <indexterm zone="catalog-pg-replication-origin-status">
+   <primary>pg_replication_origin_status</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_replication_origin_status</structname> view
+   contains information about how far replay for a certain origin has
+   progressed.  For more on replication origins
+   see <xref linkend="replication-origins">.
+  </para>
+
+  <table>
+
+   <title><structname>pg_replication_origin_status</structname> Columns</title>
+
+   <tgroup cols="4">
+    <thead>
+     <row>
+      <entry>Name</entry>
+      <entry>Type</entry>
+      <entry>References</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry><structfield>local_id</structfield></entry>
+      <entry><type>Oid</type></entry>
+      <entry><literal><link linkend="catalog-pg-replication-origin"><structname>pg_replication_origin</structname></link>.roident</literal></entry>
+      <entry>internal node identifier</entry>
+     </row>
+
+     <row>
+      <entry><structfield>external_id</structfield></entry>
+      <entry><type>text</type></entry>
+      <entry><literal><link linkend="catalog-pg-replication-origin"><structname>pg_replication_origin</structname></link>.roname</literal></entry>
+      <entry>external node identifier</entry>
+     </row>
+
+     <row>
+      <entry><structfield>remote_lsn</structfield></entry>
+      <entry><type>pg_lsn</type></entry>
+      <entry></entry>
+      <entry>The origin node's LSN up to which data has been replicated.</entry>
+     </row>
+
+
+     <row>
+      <entry><structfield>local_lsn</structfield></entry>
+      <entry><type>pg_lsn</type></entry>
+      <entry></entry>
+      <entry>This node's LSN at
+      which <literal>remote_lsn</literal> has been replicated. Used to
+      flush commit records before persisting data to disk when using
+      asynchronous commits.</entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+ </sect1>
+
  <sect1 id="catalog-pg-replication-slots">
   <title><structname>pg_replication_slots</structname></title>
 
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 26aa7ee50ee..6268d5496bd 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -95,6 +95,7 @@
 <!ENTITY fdwhandler SYSTEM "fdwhandler.sgml">
 <!ENTITY custom-scan SYSTEM "custom-scan.sgml">
 <!ENTITY logicaldecoding SYSTEM "logicaldecoding.sgml">
+<!ENTITY replication-origins SYSTEM "replication-origins.sgml">
 <!ENTITY protocol   SYSTEM "protocol.sgml">
 <!ENTITY sources    SYSTEM "sources.sgml">
 <!ENTITY storage    SYSTEM "storage.sgml">
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0053d7d4101..dcade93e439 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -16879,11 +16879,13 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
    <title>Replication Functions</title>
 
    <para>
-    The functions shown in <xref linkend="functions-replication-table"> are
-    for controlling and interacting with replication features.
-    See <xref linkend="streaming-replication">
-    and <xref linkend="streaming-replication-slots"> for information about the
-    underlying features.  Use of these functions is restricted to superusers.
+    The functions shown
+    in <xref linkend="functions-replication-table"> are for
+    controlling and interacting with replication features.
+    See <xref linkend="streaming-replication">,
+    <xref linkend="streaming-replication-slots">, and <xref linkend="replication-origins">
+    for information about the underlying features.  Use of these
+    functions is restricted to superusers.
    </para>
 
    <para>
@@ -17040,6 +17042,195 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         on future calls.
        </entry>
       </row>
+
+      <row id="pg-replication-origin-create">
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_create</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_create(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+       </entry>
+       <entry>
+        <parameter>internal_id</parameter> <type>oid</type>
+       </entry>
+       <entry>
+        Create a replication origin with the passed in external
+        name, and create an internal id for it.
+       </entry>
+      </row>
+
+      <row id="pg-replication-origin-drop">
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_drop</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_drop(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+       </entry>
+       <entry>
+        void
+       </entry>
+       <entry>
+        Delete a previously created replication origin, including the
+        associated replay progress.
+       </entry>
+      </row>
+
+      <row>
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_oid</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_oid(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+       </entry>
+       <entry>
+        <parameter>internal_id</parameter> <type>oid</type>
+       </entry>
+       <entry>
+        Look up a replication origin by name and return its internal
+        oid. If no corresponding replication origin is found, an error
+        is thrown.
+       </entry>
+      </row>
+
+      <row id="pg-replication-origin-session-setup">
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_session_setup</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_session_setup(<parameter>node_name</parameter> <type>text</type>)</function></literal>
+       </entry>
+       <entry>
+        void
+       </entry>
+       <entry>
+        Configure the current session to be replaying from the passed in
+        origin, allowing replay progress to be tracked.  Use
+        <function>pg_replication_origin_session_reset</function> to revert.
+        Can only be used if no previous origin is configured.
+       </entry>
+      </row>
+
+      <row>
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_session_reset</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_session_reset()</function></literal>
+       </entry>
+       <entry>
+        void
+       </entry>
+       <entry>
+        Cancel the effects
+        of <function>pg_replication_origin_session_setup()</function>.
+       </entry>
+      </row>
+
+      <row>
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_session_is_setup</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_session_is_setup()</function></literal>
+       </entry>
+       <entry>
+        bool
+       </entry>
+       <entry>
+        Has a replication origin been configured in the current session?
+       </entry>
+      </row>
+
+      <row id="pg-replication-origin-session-progress">
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_session_progress</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_session_progress(<parameter>flush</parameter> <type>bool</type>)</function></literal>
+       </entry>
+       <entry>
+        pg_lsn
+       </entry>
+       <entry>
+        Return the replay position for the replication origin configured in
+        the current session. The parameter <parameter>flush</parameter>
+        determines whether the corresponding local transaction will be
+        guaranteed to have been flushed to disk or not.
+       </entry>
+      </row>
+
+      <row id="pg-replication-origin-xact-setup">
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_xact_setup</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_xact_setup(<parameter>origin_lsn</parameter> <type>pg_lsn</type>, <parameter>origin_timestamp</parameter> <type>timestamptz</type>)</function></literal>
+       </entry>
+       <entry>
+        void
+       </entry>
+       <entry>
+        Mark the current transaction to be replaying a transaction that has
+        committed at the passed in <acronym>LSN</acronym> and timestamp. Can
+        only be called when a replication origin has previously been
+        configured using
+        <function>pg_replication_origin_session_setup()</function>.
+       </entry>
+      </row>
+
+      <row id="pg-replication-origin-xact-reset">
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_xact_reset</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_xact_reset()</function></literal>
+       </entry>
+       <entry>
+        void
+       </entry>
+       <entry>
+        Cancel the effects of
+        <function>pg_replication_origin_xact_setup()</function>.
+       </entry>
+      </row>
+
+      <row>
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_advance</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_advance(<parameter>node_name</parameter> <type>text</type>, <parameter>pos</parameter> <type>pg_lsn</type>)</function></literal>
+       </entry>
+       <entry>
+        void
+       </entry>
+       <entry>
+        Set replication progress for the passed in node to the passed in
+        position. This primarily is useful for setting up the initial position
+        or a new position after configuration changes and similar. Be aware
+        that careless use of this function can lead to inconsistently
+        replicated data.
+       </entry>
+      </row>
+
+      <row id="pg-replication-origin-progress">
+       <entry>
+        <indexterm>
+         <primary>pg_replication_origin_progress</primary>
+        </indexterm>
+        <literal><function>pg_replication_origin_progress(<parameter>node_name</parameter> <type>text</type>, <parameter>flush</parameter> <type>bool</type>)</function></literal>
+       </entry>
+       <entry>
+        pg_lsn
+       </entry>
+       <entry>
+        Return the replay position for the passed in replication origin. The
+        parameter <parameter>flush</parameter> determines whether the
+        corresponding local transaction will be guaranteed to have been
+        flushed to disk or not.
+       </entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 0810a2d1f97..f817af3ea8a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -363,6 +363,7 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeBeginCB begin_cb;
     LogicalDecodeChangeCB change_cb;
     LogicalDecodeCommitCB commit_cb;
+    LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
 } OutputPluginCallbacks;
 
@@ -370,7 +371,8 @@ typedef void (*LogicalOutputPluginInit)(struct OutputPluginCallbacks *cb);
 </programlisting>
      The <function>begin_cb</function>, <function>change_cb</function>
      and <function>commit_cb</function> callbacks are required,
-     while <function>startup_cb</function>
+     while <function>startup_cb</function>,
+     <function>filter_by_origin_cb</function>
      and <function>shutdown_cb</function> are optional.
     </para>
    </sect2>
@@ -569,6 +571,37 @@ typedef void (*LogicalDecodeChangeCB) (
       </para>
      </note>
     </sect3>
+
+     <sect3 id="logicaldecoding-output-plugin-filter-by-origin">
+     <title>Origin Filter Callback</title>
+
+     <para>
+       The optional <function>filter_by_origin_cb</function> callback
+       is called to determine whether data that has been replayed
+       from <parameter>origin_id</parameter> is of interest to the
+       output plugin.
+<programlisting>
+typedef bool (*LogicalDecodeFilterByOriginCB) (
+    struct LogicalDecodingContext *ctx,
+    RepNodeId origin_id
+);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. No information but the origin is
+      available. To signal that changes originating on the passed in
+      node are irrelevant, return true, causing them to be filtered
+      away; false otherwise. The other callbacks will not be called
+      for transactions and changes that have been filtered away.
+     </para>
+     <para>
+       This is useful when implementing cascading or multi directional
+       replication solutions. Filtering by origin makes it possible to
+       avoid replicating the same changes back and forth in such
+       setups.  While transactions and changes also carry information
+       about the origin, filtering via this callback is noticeably
+       more efficient.
+     </para>
+     </sect3>
    </sect2>
 
    <sect2 id="logicaldecoding-output-plugin-output">
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index e378d6978d0..4a45138bf72 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -220,6 +220,7 @@
   &spi;
   &bgworker;
   &logicaldecoding;
+  &replication-origins;
 
  </part>
 
diff --git a/doc/src/sgml/replication-origins.sgml b/doc/src/sgml/replication-origins.sgml
new file mode 100644
index 00000000000..c5310229119
--- /dev/null
+++ b/doc/src/sgml/replication-origins.sgml
@@ -0,0 +1,93 @@
+<!-- doc/src/sgml/replication-origins.sgml -->
+<chapter id="replication-origins">
+ <title>Replication Progress Tracking</title>
+ <indexterm zone="replication-origins">
+  <primary>Replication Progress Tracking</primary>
+ </indexterm>
+ <indexterm zone="replication-origins">
+  <primary>Replication Origins</primary>
+ </indexterm>
+
+ <para>
+  Replication origins are intended to make it easier to implement
+  logical replication solutions on top
+  of <xref linkend="logicaldecoding">. They provide a solution to two
+  common problems:
+  <itemizedlist>
+   <listitem><para>How to safely keep track of replication progress</para></listitem>
+   <listitem><para>How to change replication behavior, based on the
+   origin of a row; e.g. to avoid loops in bi-directional replication
+   setups</para></listitem>
+  </itemizedlist>
+ </para>
+
+ <para>
+  Replication origins consist of a name and an oid. The name, which
+  is what should be used to refer to the origin across systems, is
+  free-form text. It should be used in a way that makes conflicts
+  between replication origins created by different replication
+  solutions unlikely; e.g. by prefixing the replication solution's
+  name to it.  The oid is used only to avoid having to store the long
+  version in situations where space efficiency is important. It should
+  never be shared between systems.
+ </para>
+
+ <para>
+  Replication origins can be created using
+  <link linkend="pg-replication-origin-create"><function>pg_replication_origin_create()</function></link>;
+  dropped using
+  <link linkend="pg-replication-origin-drop"><function>pg_replication_origin_drop()</function></link>;
+  and seen in the
+  <link linkend="catalog-pg-replication-origin"><structname>pg_replication_origin</structname></link>
+  catalog.
+ </para>
+
+ <para>
+  When replicating from one system to another (independent of the fact that
+  those two might be in the same cluster, or even same database) one
+  nontrivial part of building a replication solution is to keep track of
+  replay progress in a safe manner. When the applying process, or the whole
+  cluster, dies, it needs to be possible to find out up to where data has
+  successfully been replicated. Naive solutions to this, such as updating a row in
+  a table for every replayed transaction, have problems like runtime overhead
+  and bloat.
+ </para>
+
+ <para>
+  Using the replication origin infrastructure a session can be
+  marked as replaying from a remote node (using the
+  <link linkend="pg-replication-origin-session-setup"><function>pg_replication_origin_session_setup()</function></link>
+  function). Additionally the <acronym>LSN</acronym> and commit
+  timestamp of every source transaction can be configured on a per
+  transaction basis using
+  <link linkend="pg-replication-origin-xact-setup"><function>pg_replication_origin_xact_setup()</function></link>.
+  If that's done, replication progress will be persisted in a crash safe
+  manner. Replay progress for all replication origins can be seen in the
+  <link linkend="catalog-pg-replication-origin-status">
+   <structname>pg_replication_origin_status</structname>
+  </link> view. An individual origin's progress, e.g. when resuming
+  replication, can be acquired using
+  <link linkend="pg-replication-origin-progress"><function>pg_replication_origin_progress()</function></link>
+  for any origin or
+  <link linkend="pg-replication-origin-session-progress"><function>pg_replication_origin_session_progress()</function></link>
+  for the origin configured in the current session.
+ </para>
+
+ <para>
+  In more complex replication topologies than replication from exactly one
+  system to one other, another problem can be that it is hard to avoid
+  replicating replayed rows again. That can lead both to cycles in the
+  replication and inefficiencies. Replication origins provide an optional
+  mechanism to recognize and prevent that. When configured using the functions
+  referenced in the previous paragraph, every change and transaction passed to
+  output plugin callbacks (see <xref linkend="logicaldecoding-output-plugin">)
+  generated by the session is tagged with the replication origin of the
+  generating session.  This allows them to be treated differently in the output
+  plugin, e.g. ignoring all but locally originating rows.  Additionally
+  the <link linkend="logicaldecoding-output-plugin-filter-by-origin">
+  <function>filter_by_origin_cb</function></link> callback can be used
+  to filter the logical decoding change stream based on the
+  source. While less flexible, filtering via that callback is
+  considerably more efficient.
+ </para>
+</chapter>
-- 
cgit v1.2.3