Allow a streaming replication standby to follow a timeline switch.

Before this patch, streaming replication would refuse to start replicating if the timeline on the primary didn't exactly match the standby's. The mismatch arises when you have a master and two standbys, and you promote one of the standbys to become the new master. Promoting bumps up the timeline ID, and after that bump, the other standby would refuse to continue.

There's significantly more timeline-related logic in streaming replication now. First of all, when a standby connects to the primary, it will ask the primary for any timeline history files that are missing from the standby. The missing files are sent using a new replication command, TIMELINE_HISTORY, and stored in the standby's pg_xlog directory. Using the timeline history files, the standby can follow the latest timeline present in the primary (recovery_target_timeline='latest'), just as it can follow new timelines appearing in an archive directory.

START_REPLICATION now takes a TIMELINE parameter, to specify exactly which timeline to stream WAL from. This allows the standby to request the primary to send over WAL that precedes the promotion.

The replication protocol is changed slightly (in a backwards-compatible way, although there's little hope of streaming replication working across major versions anyway), to allow replication to stop when the end of a timeline is reached, putting the walsender back into accepting a replication command.

Many thanks to Amit Kapila for testing and reviewing various versions of this patch.
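
For illustration only, the two replication commands mentioned above could be formatted by a client roughly as follows. This is a minimal sketch, not code from this patch: the helper names format_timeline_history() and format_start_replication() are hypothetical, and the command text assumes the documented "TIMELINE_HISTORY tli" and "START_REPLICATION XXX/XXX TIMELINE tli" forms, with the WAL position printed as two 32-bit hexadecimal halves.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of PostgreSQL): build the command a standby
 * would send to fetch the history file of timeline "tli".
 * Returns the snprintf() result, i.e. the command length when it fits.
 */
static int
format_timeline_history(char *buf, size_t buflen, uint32_t tli)
{
    assert(buf != NULL);
    return snprintf(buf, buflen, "TIMELINE_HISTORY %u", (unsigned int) tli);
}

/*
 * Hypothetical helper (not part of PostgreSQL): build the command that asks
 * the primary to stream WAL from "start_lsn" on timeline "tli".
 * The position is written as the high and low 32 bits in hex, "hi/lo".
 */
static int
format_start_replication(char *buf, size_t buflen,
                         uint64_t start_lsn, uint32_t tli)
{
    assert(buf != NULL);
    return snprintf(buf, buflen, "START_REPLICATION %X/%X TIMELINE %u",
                    (unsigned int) (start_lsn >> 32),
                    (unsigned int) (start_lsn & 0xFFFFFFFFu),
                    (unsigned int) tli);
}
```

A standby that was on timeline 1 and learns that the primary is now on timeline 2 would, under these assumptions, first request the history of timeline 2 and then start streaming on timeline 1 from its current position, so that it receives the WAL preceding the promotion.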
author: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2012-12-13 19:00:00 +0200
committer: Heikki Linnakangas <heikki.linnakangas@iki.fi> 2012-12-13 19:17:32 +0200
commit: abfd192b1b5ba5216ac4b1f31dcd553106304b19 (patch)
tree: 9dc145a8f72c500e06ccc779a2d54784ff1681c1 /src/backend/replication/basebackup.c
parent: 527668717a660e67c2a6cfd4e85f7a513f99f6f2 (diff)
download: postgresql-abfd192b1b5ba5216ac4b1f31dcd553106304b19.tar.gz
postgresql-abfd192b1b5ba5216ac4b1f31dcd553106304b19.zip
1 files changed, 17 insertions, 4 deletions
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 04681f41962..65200c129aa 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -56,6 +56,9 @@ static void perform_base_backup(basebackup_options *opt, DIR *tblspcdir);
 static void parse_basebackup_options(List *options, basebackup_options *opt);
 static void SendXlogRecPtrResult(XLogRecPtr ptr);
 
+/* Was the backup currently in-progress initiated in recovery mode? */
+static bool backup_started_in_recovery = false;
+
 /*
  * Size of each block sent into the tar stream for larger files.
  *
@@ -94,6 +97,8 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 	XLogRecPtr	endptr;
 	char	   *labelfile;
 
+	backup_started_in_recovery = RecoveryInProgress();
+
 	startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, &labelfile);
 	SendXlogRecPtrResult(startptr);
 
@@ -261,7 +266,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 				 * http://lists.apple.com/archives/xcode-users/2003/Dec//msg000
 				 * 51.html
 				 */
-				XLogRead(buf, ptr, TAR_SEND_SIZE);
+				XLogRead(buf, ThisTimeLineID, ptr, TAR_SEND_SIZE);
 				if (pq_putmessage('d', buf, TAR_SEND_SIZE))
 					ereport(ERROR,
 							(errmsg("base backup could not send data, aborting backup")));
@@ -592,11 +597,19 @@ sendDir(char *path, int basepathlen, bool sizeonly)
 		/*
 		 * Check if the postmaster has signaled us to exit, and abort with an
 		 * error in that case. The error handler further up will call
-		 * do_pg_abort_backup() for us.
+		 * do_pg_abort_backup() for us. Also check that if the backup was
+		 * started while still in recovery, the server wasn't promoted.
+		 * do_pg_stop_backup() will check that too, but it's better to stop
+		 * the backup early than continue to the end and fail there.
 		 */
-		if (ProcDiePending || walsender_ready_to_stop)
+		CHECK_FOR_INTERRUPTS();
+		if (RecoveryInProgress() != backup_started_in_recovery)
 			ereport(ERROR,
-				(errmsg("shutdown requested, aborting active base backup")));
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("the standby was promoted during online backup"),
+					 errhint("This means that the backup being taken is corrupt "
+							 "and should not be used. "
+							 "Try taking another online backup.")));
 
 		snprintf(pathbuf, MAXPGPATH, "%s/%s", path, de->d_name);
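
The check added to sendDir() in the diff above follows a simple pattern: record the recovery state once when the backup starts, then compare it against the current state before each file is sent. The standalone sketch below simulates that pattern outside the server. All sim_* names are hypothetical stand-ins (sim_recovery_in_progress() plays the role of RecoveryInProgress()), and the loop merely counts files instead of sending them.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated server state; in the server this comes from RecoveryInProgress(). */
static bool sim_in_recovery = true;

/* Recorded once at backup start, mirroring backup_started_in_recovery. */
static bool sim_backup_started_in_recovery = false;

static bool
sim_recovery_in_progress(void)
{
    return sim_in_recovery;
}

/* Remember whether the backup began while the server was a standby. */
static void
sim_begin_backup(void)
{
    sim_backup_started_in_recovery = sim_recovery_in_progress();
}

/* True while the recovery state still matches the state at backup start. */
static bool
sim_backup_still_valid(void)
{
    return sim_recovery_in_progress() == sim_backup_started_in_recovery;
}

/*
 * "Send" nfiles files, simulating a promotion just before file index
 * promote_at (pass a negative value for no promotion). Returns how many
 * files were sent before the state check aborted the backup.
 */
static int
sim_send_files(int nfiles, int promote_at)
{
    int sent = 0;

    assert(nfiles >= 0);
    for (int i = 0; i < nfiles; i++)
    {
        if (i == promote_at)
            sim_in_recovery = false;    /* standby gets promoted */

        /* Same comparison as the new check in sendDir(). */
        if (!sim_backup_still_valid())
            break;                      /* the server raises an ERROR here */

        sent++;
    }
    return sent;
}
```

Checking per file rather than only at the end means a promoted standby stops producing an unusable backup as soon as the next directory entry is processed, which is the rationale given in the updated comment.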