When the startup process opens a WAL segment after replaying part of it, it
validates the first page of the WAL segment, even though the page it's
really interested in is later in the file. As part of the validation, it
checks that the TLI on the page header is >= the TLI it saw on the last
page it read. If the segment contains a timeline switch, and we have
already replayed it, and then re-open the WAL segment (because streaming
replication got disconnected and reconnected, for example), the TLI check
will fail when the first page is validated. Fix that by relaxing the TLI
check when re-opening a WAL segment.

Backpatch to 9.0. Earlier versions had the same code, but before standby
mode was introduced in 9.0, recovery never tried to re-read a segment after
partially replaying it.

Reported by Amit Kapila, while testing a new feature.
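
A minimal, self-contained illustration of the failure mode described above (toy code, not the actual xlog-reading logic; the exact condition used by the fix may differ): a strict "page TLI must be >= the last TLI seen" rule rejects the first page of a segment that is re-opened after a timeline switch inside it has already been replayed, while a rule relaxed for the re-open case accepts it.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TimeLineID;

    /* Old rule: every page validated must have a TLI >= the last one seen. */
    static bool
    tli_ok_strict(TimeLineID pageTLI, TimeLineID lastPageTLI)
    {
        return pageTLI >= lastPageTLI;
    }

    /*
     * Relaxed rule (sketch): when the first page is being validated only
     * because the segment was re-opened, an older TLI is expected if the
     * segment contains an already-replayed timeline switch, so don't
     * insist on monotonicity in that case.
     */
    static bool
    tli_ok_relaxed(TimeLineID pageTLI, TimeLineID lastPageTLI, bool reopening)
    {
        return reopening ? true : pageTLI >= lastPageTLI;
    }

    int
    main(void)
    {
        TimeLineID lastPageTLI = 2;  /* we already replayed past the TLI 1->2 switch */
        TimeLineID firstPageTLI = 1; /* first page of the segment is still on TLI 1 */

        printf("strict check:  %s\n", tli_ok_strict(firstPageTLI, lastPageTLI) ? "ok" : "FAIL");
        printf("relaxed check: %s\n", tli_ok_relaxed(firstPageTLI, lastPageTLI, true) ? "ok" : "FAIL");
        return 0;
    }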
Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states. This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.

The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page. This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates. Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another. Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.

To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time. This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images. We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods. A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specific functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.

In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images. In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are
still there in the header files in previous branches, but are no longer
used by the code.

In addition, fix some other bugs identified in the course of making these
changes:

spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already. This would result in permanent index corruption,
not just transient failure of concurrent queries.

Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image. A plain exclusive lock seems sufficient, so change to
that.

Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.

Back-patch to 9.0, where hot standby was introduced. Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records. Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
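
A hedged sketch of the redo-function shape this change describes, not code from the patch: the full-page image, when present, is restored at the exact point where the in-place update would otherwise happen, so the physical order of page updates is the same either way and buffer locks can be held across related updates. XLR_BKP_BLOCK() and RestoreBackupBlock() are named in the commit text, but the precise signatures shown here, and the xl_example record type, are illustrative assumptions.

    static void
    example_redo_one_page(XLogRecPtr lsn, XLogRecord *record, xl_example *xlrec)
    {
        Buffer      buffer;
        Page        page;

        if (record->xl_info & XLR_BKP_BLOCK(0))
        {
            /* Full-page image: restore it here, in physical update order. */
            buffer = RestoreBackupBlock(lsn, record, 0, false, true); /* assumed signature */
        }
        else
        {
            buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, false);
            if (BufferIsValid(buffer))
            {
                page = BufferGetPage(buffer);
                if (lsn > PageGetLSN(page))   /* change not applied yet */
                {
                    /* ... apply the logged change to the page here ... */
                    PageSetLSN(page, lsn);
                    MarkBufferDirty(buffer);
                }
            }
        }

        /*
         * Any further pages touched by the same record would be handled here,
         * while this buffer is still locked, if the record requires it.
         */

        if (BufferIsValid(buffer))
            UnlockReleaseBuffer(buffer);
    }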
If an SMgrRelation is not "owned" by a relcache entry, don't allow it to
live past transaction end. This design allows the same SMgrRelation to be
used for blind writes of multiple blocks during a transaction, but ensures
that we don't hold onto such an SMgrRelation indefinitely. Because an
SMgrRelation typically corresponds to open file descriptors at the fd.c
level, leaving it open when there's no corresponding relcache entry can
mean that we prevent the kernel from reclaiming deleted disk space.
(While CacheInvalidateSmgr messages usually fix that, there are cases
where they're not issued, such as DROP DATABASE. We might want to add
some more sinval messaging for that, but I'd be inclined to keep this
type of logic anyway, since allowing VFDs to accumulate indefinitely
for blind-written relations doesn't seem like a good idea.)
This code replaces a previous attempt towards the same goal that proved
to be unreliable. Back-patch to 9.1 where the previous patch was added.
This can result in buffers failing to be properly flushed at
checkpoint time, leading to data loss.
Report, diagnosis, and patch by Jeff Davis.
Give the correct name of the GUC parameter being complained of.
Also, emit a more suitable SQLSTATE (INVALID_PARAMETER_VALUE,
not the default INTERNAL_ERROR).
Gurjeet Singh, errcode adjustment by me
Dave Kerr, backpatched by Simon Riggs
This bug was introduced by commit 20d98ab6e4110087d1816cd105a40fcc8ce0a307,
so backpatch this to 9.0-9.2 like that one.
This fixes bug #6710, reported by Tarvi Pillessaar
WALSender now woken up after each background flush by WALwriter, avoiding
multi-second replication delay for an all-async commit workload.
Replication delay reduced from 7s with default settings to 200ms, allowing
significantly reduced data loss at failover.
Andres Freund and Simon Riggs
Per discussion, it does not seem like a good idea to change the behavior of
age(xid) in a minor release, even though the old definition causes the
function to fail on hot standby slaves. Therefore, revert commit
5829387381d2e4edf84652bb5a712f6185860670 and follow-on commits in the back
branches only.
AbortOutOfAnyTransaction failed to do anything if the state it saw on
entry corresponded to failing partway through StartTransaction. I fixed
AbortCurrentTransaction to cope with that case way back in commit
60b2444cc3ba037630c9b940c3c9ef01b954b87b, but evidently overlooked that
AbortOutOfAnyTransaction should do likewise.
Back-patch to all supported branches. It's not clear that this omission
has any more-than-cosmetic consequences, but it's also not clear that it
doesn't, so back-patching seems the least risky choice.
When using synchronous replication, we waited for the commit record to be
replicated, but if our transaction didn't write any other WAL records,
that's not required because we don't even flush the WAL locally to disk in
that case. This led to long waits when committing a transaction that only
modified a temporary table. Bug spotted by Thom Brown.
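
A hedged sketch of the resulting logic in the commit path (illustrative, not the actual RecordTransactionCommit code; "wrote_xlog" stands in for however the real code tracks this): if the transaction wrote no WAL there is nothing to flush locally, and therefore nothing for a synchronous standby to acknowledge.

    if (wrote_xlog)
    {
        /* Flush our commit record locally, then wait for the sync standby. */
        XLogFlush(XactLastRecEnd);
        SyncRepWaitForLSN(XactLastRecEnd);
    }
    /*
     * else: no WAL was written (e.g. only a temporary table was modified),
     * so we skip both the local flush and the synchronous-replication wait.
     */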
Initialise ckptXidEpoch from starting checkpoint and maintain the correct
value as we roll forwards. This allows GetNextXidAndEpoch() to return the
correct epoch when executed during recovery. Backpatch to 9.0, where the
problem first becomes observable by a user.
Bug report from Daniel Farina
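
A self-contained toy of the bookkeeping being described (illustrative only, ignoring the reserved low XIDs): start from the epoch recorded in the starting checkpoint and bump it whenever the replayed XID counter is seen to wrap around, so that an epoch+xid pair can be reported correctly during recovery.

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint32_t ckptXidEpoch = 5;          /* from the starting checkpoint */
        uint32_t lastXid = 4294000000u;     /* nextXid at that checkpoint */
        uint32_t replayed[] = {4294900000u, 1000u, 500000u};  /* wraps after the 1st */

        for (int i = 0; i < 3; i++)
        {
            if (replayed[i] < lastXid)
                ckptXidEpoch++;             /* xid counter wrapped around */
            lastXid = replayed[i];
            printf("nextXid %u -> epoch %u\n", replayed[i], ckptXidEpoch);
        }
        return 0;
    }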
Previously we used ReadRecPtr rather than EndRecPtr, which was
not a serious error but caused pg_stat_replication to report
incorrect replay_location until at least one WAL record is replayed.
Fujii Masao
Fix a longstanding thinko in replay of NEXTOID and checkpoint records: we
tried to advance nextOid only if it was behind the value in the WAL record,
but the comparison would draw the wrong conclusion if OID wraparound had
occurred since the previous value. Better to just unconditionally assign
the new value, since OID assignment shouldn't be happening during replay
anyway.
The consequences of a failure to update nextOid would be pretty minimal,
since we have long had the code set up to obtain another OID and try again
if the generated value is already in use. But in the worst case there
could be significant performance glitches while such loops iterate through
many already-used OIDs before finding a free one.
The odds of a wraparound happening during WAL replay would be small in a
crash-recovery scenario, and the length of any ensuing OID-assignment stall
quite limited anyway. But neither of these statements holds true for a
replication slave that follows a WAL stream for a long period; its behavior
upon going live could be almost unboundedly bad. Hence it seems worth
back-patching this fix into all supported branches.
Already fixed in HEAD in commit c6d76d7c82ebebb7210029f7382c0ebe2c558bca.
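
A self-contained toy of the thinko (illustrative, not the replay code itself): with 32-bit OID counters that wrap around, "advance only if the logged value is ahead" draws the wrong conclusion once wraparound has occurred, whereas unconditionally adopting the logged value cannot go wrong during replay.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t Oid;

    int
    main(void)
    {
        Oid nextOid = 4000000000u;  /* in-memory counter, pre-wraparound */
        Oid walOid = 20000u;        /* value carried by the NEXTOID record */

        /* Old coding: walOid looks "behind", so the update is skipped. */
        Oid oldResult = nextOid;
        if (walOid > oldResult)
            oldResult = walOid;

        /* Fixed coding: just take the value from the WAL record. */
        Oid newResult = walOid;

        printf("old coding: nextOid stays at %u\n", oldResult);
        printf("new coding: nextOid becomes %u\n", newResult);
        return 0;
    }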
RestoreBkpBlocks was in the habit of zeroing and refilling the target
buffer; which was perfectly safe when the code was written, but is unsafe
during Hot Standby operation. The reason is that we have coding rules
that allow backends to continue accessing a tuple in a heap relation while
holding only a pin on its buffer. Such a backend could see transiently
zeroed data, if WAL replay had occasion to change other data on the page.
This has been shown to be the cause of bug #6425 from Duncan Rance (who
deserves kudos for developing a sufficiently-reproducible test case) as
well as Bridget Frey's re-report of bug #6200. It most likely explains the
original report as well, though we don't yet have confirmation of that.

To fix, change the code so that only bytes that are supposed to change will
change, even transiently. This actually saves cycles in RestoreBkpBlocks,
since it's not writing the same bytes twice.

Also fix seq_redo, which has the same disease, though it has to work a bit
harder to meet the requirement.

So far as I can tell, no other WAL replay routines have this type of bug.
In particular, the index-related replay routines, which would certainly be
broken if they had to meet the same standard, are not at risk because we
do not have coding rules that allow access to an index page when not
holding a buffer lock on it.

Back-patch to 9.0 where Hot Standby was added.
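
A self-contained toy of the "no transient states" rule adopted here (the sizes and the hole representation are illustrative, not the actual BkpBlock layout): rebuild the page by copying the stored image around the unused "hole" and zeroing only the hole itself, instead of zero-filling the whole page and then refilling it.

    #include <string.h>

    #define PAGE_SIZE 8192

    /*
     * "image" holds the page as stored in WAL, i.e. with the hole squeezed
     * out, so it is (PAGE_SIZE - hole_length) bytes long.  Every byte of the
     * target page is written at most once, directly to its final value.
     */
    void
    restore_page_image(char *page, const char *image,
                       int hole_offset, int hole_length)
    {
        /* bytes before the hole */
        memcpy(page, image, hole_offset);
        /* the hole was not stored; it must end up all-zero */
        memset(page + hole_offset, 0, hole_length);
        /* bytes after the hole */
        memcpy(page + hole_offset + hole_length,
               image + hole_offset,
               PAGE_SIZE - (hole_offset + hole_length));
    }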
smgrdounlink takes care to not throw an ERROR if it fails to unlink
something, but that caution was rendered useless by commit
3396000684b41e7e9467d1abc67152b39e697035, which put an smgrexists call in
front of it; smgrexists *does* throw error if anything looks funny, such
as getting a permissions error from trying to open the file. If that
happens post-commit, you get a PANIC, and what's worse the same logic
appears in the WAL replay code, so the database even fails to restart.
Restore the intended behavior by removing the smgrexists call --- it isn't
accomplishing anything that we can't do better by adjusting mdunlink's
ideas of whether it ought to warn about ENOENT or not.
Per report from Joseph Shraibman of unrecoverable crash after trying to
drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions.
Backpatch to 8.4, where the bogus coding was introduced.
we don't reach consistency before replaying all of the WAL. Rename the
variable to reachedConsistency, to make its intention clearer.
In master, that was an active bug because of the recent patch to
immediately PANIC if a reference to a missing page is found in WAL after
reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0,
the only consequence was a misleading "consistent recovery state reached at
%X/%X" message in the log at the beginning of crash recovery (the database
is not consistent at that point yet). In 8.4, the log message was not
printed in crash recovery, even though there was a similar
reachedMinRecoveryPoint local variable that was also set early. So,
backpatch to 9.1 and 9.0.
There was a timing window between when oldestActiveXid was derived
and when it should have been derived that only shows itself under
heavy load. Move code around to ensure correct timing of derivation.
No change to StartupSUBTRANS() code, which is where this failed.
Bug report by Chris Redekop
Patch by me, bug report by Chris Redekop, analysis by Florian Pflug
streamed backup, throw an error and refuse to start up. The restore has not
finished correctly in that case and the data directory is possibly corrupt.
We already errored out in case of archive recovery, but could not during
crash recovery because we couldn't distinguish between the case that
pg_start_backup() was called and the database then crashed (must not error,
data is OK), and the case that we're restoring from a backup and not all
the needed WAL was replayed (data can be corrupt).
To distinguish those cases, add a line to backup_label to indicate
whether the backup was taken with pg_start/stop_backup(), or by streaming
(ie. pg_basebackup).
This is a different implementation than what I committed to 9.2 a week ago.
That implementation was not back-patchable because it required re-initdb.
Fujii Masao
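
For illustration, a backup_label written as described might look like the sketch below. The LSNs, timestamp, and label are made up, and the exact wording of the method line is an assumption based on the commit text (one value for pg_start_backup(), another for a streamed base backup).

    START WAL LOCATION: 0/6000028 (file 000000010000000000000006)
    CHECKPOINT LOCATION: 0/6000060
    BACKUP METHOD: streamed
    START TIME: 2012-01-25 12:00:00 UTC
    LABEL: pg_basebackup base backup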
The previous code tried to synchronize by unlinking the init file twice,
but that doesn't actually work: it leaves a window wherein a third process
could read the already-stale init file but miss the SI messages that would
tell it the data is stale. The result would be bizarre failures in catalog
accesses, typically "could not read block 0 in file ..." later during
startup.
Instead, hold RelCacheInitLock across both the unlink and the sending of
the SI messages. This is more straightforward, and might even be a bit
faster since only one unlink call is needed.
This has been wrong since it was put in (in 2002!), so back-patch to all
supported releases.
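
A hedged sketch of the ordering the fix establishes (the helper names and variables are illustrative, not the actual relcache.c code): both the unlink of the init file and the broadcast of the invalidation messages happen under a single hold of RelCacheInitLock, so no backend can read the stale pg_internal.init and still miss the messages saying it is stale.

    LWLockAcquire(RelCacheInitLock, LW_EXCLUSIVE);

    unlink_init_file(initfilename);                  /* assumed helper: remove pg_internal.init */
    SendSharedInvalidMessages(invalMessages, nmsgs);

    LWLockRelease(RelCacheInitLock);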
Fix a whole bunch of signal handlers that had been hacked to do things that
might change errno, without adding the necessary save/restore logic for
errno. Also make some minor fixes in unix_latch.c, and clean up bizarre
and unsafe scheme for disowning the process's latch. While at it, rename
the PGPROC latch field to procLatch for consistency with 9.2.
Issues noted while reviewing a patch by Peter Geoghegan.
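
A minimal, self-contained example of the save/restore pattern the fix applies (the handler body is illustrative, not backend code): anything a signal handler does that can clobber errno must be bracketed by saving and restoring it, or the interrupted code may act on a bogus errno.

    #include <errno.h>
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_signal = 0;

    static void
    handler(int signo)
    {
        int save_errno = errno;       /* save it for the interrupted code */

        got_signal = 1;
        /* work that may change errno, e.g. poking a self-pipe or latch */
        (void) write(STDERR_FILENO, "", 0);

        errno = save_errno;           /* restore it before returning */
    }

    int
    main(void)
    {
        signal(SIGUSR1, handler);
        pause();                      /* wait for a signal to arrive */
        return 0;
    }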
The original definition had the problem that timeouts exceeding about 2100
seconds couldn't be specified on 32-bit machines. Milliseconds seem like
sufficient resolution, and finer grain than that would be fantasy anyway
on many platforms.
Back-patch to 9.1 so that this aspect of the latch API won't change between
9.1 and later releases.
Peter Geoghegan
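
A self-contained illustration of the 32-bit limit behind this change: a timeout held as microseconds in a signed 32-bit value tops out at about 2147 seconds, while the same value interpreted as milliseconds allows roughly 24.8 days.

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        int32_t max = INT32_MAX;

        printf("largest timeout as microseconds: about %ld seconds\n",
               (long) (max / 1000000));   /* ~2147 s, the old ceiling */
        printf("largest timeout as milliseconds: about %ld seconds\n",
               (long) (max / 1000));      /* ~2147483 s, roughly 24.8 days */
        return 0;
    }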
We had previously (af26857a2775e7ceb0916155e931008c2116632f)
established the U.S. spellings as standard.
Kevin Grittner
renumbered the resource managers. This should fix the buildfarm.
ReadRecord's habit of using both direct references to tmpRecPtr and
references to *RecPtr (which is pointing at tmpRecPtr) triggers an
optimization bug in gcc 4.6.0, which apparently has forgotten about
aliasing rules. Avoid the compiler bug, and make the code more readable
to boot, by getting rid of the direct references. Improve the comments
while at it.
Back-patch to all supported versions, in case they get built with 4.6.0.
Tom Lane, with some cosmetic suggestions from Alex Hunsaker
just check that it's not running and PANIC if it was, but that can rightfully
happen if recovery stops at recovery target.
The SSI patch inserted a call of RegisterPredicateLockingXid into
GetNewTransactionId, which was a bad idea on a couple of grounds. First,
it's not necessary to hold XidGenLock while manipulating that shared
memory, and doing so is bad because XidGenLock is a high-contention lock
that should be held for as short a time as possible. (Not to mention that
it adds an entirely unnecessary deadlock hazard, since we must take
SerializableXactHashLock as well.) Second, the specific place where it was
put was between extending CLOG and advancing nextXid, which could result in
unpleasant behavior in case of a failure there. Pull the call out to
AssignTransactionId, which is much safer and arguably better from a
modularity standpoint too.
There is more work to do to clean up the failure-before-advancing-nextXid
issue, but that is a separate change that will need to be back-patched.
So for the moment I just want to make GetNewTransactionId look the same as
it did in prior versions.
Before commit c016ce728139be95bb0dc7c4e5640507334c2339, this wasn't
needed, but now that multiple resource manager IDs can percolate down
through here, we have to make sure we know which one we've got.
Otherwise, we can confuse (for example) an XLOG_XACT_COMMIT record
with an XLOG_CHECKPOINT_SHUTDOWN record.
Review by Jaime Casanova
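
A hedged sketch of the kind of test the fix implies (the surrounding context is illustrative): because info bits are only meaningful within one resource manager (the commit's example pair, XLOG_XACT_COMMIT and XLOG_CHECKPOINT_SHUTDOWN, can carry the same info value), a record has to be identified by its resource manager ID as well.

    /* Is this record a transaction commit record? */
    static bool
    is_commit_record(XLogRecord *record)
    {
        uint8 info = record->xl_info & ~XLR_INFO_MASK;

        return record->xl_rmid == RM_XACT_ID && info == XLOG_XACT_COMMIT;
    }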
crash recovery, and throw an error if not. hubert depesz lubaczewski pointed
out that that situation also happens in the crash recovery following a
system crash that happens during an online backup.
We might want to do something smarter in 9.1, like put the check back for
backups taken with pg_basebackup, but that's for another patch.
The previous functions of assign hooks are now split between check hooks
and assign hooks, where the former can fail but the latter shouldn't.
Aside from being conceptually clearer, this approach exposes the
"canonicalized" form of the variable value to guc.c without having to do
an actual assignment. And that lets us fix the problem recently noted by
Bernd Helmle that the auto-tune patch for wal_buffers resulted in bogus
log messages about "parameter "wal_buffers" cannot be changed without
restarting the server". There may be some speed advantage too, because
this design lets hook functions avoid re-parsing variable values when
restoring a previous state after a rollback (they can store a pre-parsed
representation of the value instead). This patch also resolves a
longstanding annoyance about custom error messages from variable assign
hooks: they should modify, not appear separately from, guc.c's own message
about "invalid parameter value".
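
A hedged sketch of the new hook pair for a hypothetical integer GUC ("my_setting" and its bounds are made up, and the signatures follow the 9.1-era guc.h to the best of recollection): the check hook may reject or canonicalize the proposed value, while the assign hook merely installs it and is not allowed to fail.

    static int  my_setting_cached;

    /* Check hook: validate (and optionally canonicalize) the new value. */
    static bool
    check_my_setting(int *newval, void **extra, GucSource source)
    {
        if (*newval < 0)
        {
            GUC_check_errdetail("my_setting must not be negative.");
            return false;
        }
        return true;
    }

    /* Assign hook: install the already-validated value; must not fail. */
    static void
    assign_my_setting(int newval, void *extra)
    {
        my_setting_cached = newval;
    }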
Also avoid hardcoding the current default state by giving it the name
"on"; instead use a meaningful name that reflects its behaviour.
Coding only, no change in behaviour.
This means one less thing to configure when setting up synchronous
replication, and also avoids some ambiguity around what the behavior
should be when the settings of these variables conflict.
Fujii Masao, with additional hacking by me.
archive recovery.
It's possible to restore an online backup without recovery.conf, by simply
copying all the necessary WAL files to pg_xlog. "pg_basebackup -x" does that
too. That's the use case where this cross-check is useful.
Backpatch to 9.0. We used to do this in earlier versions, but in 9.0 the code
was inadvertently changed so that the check is only performed after archive
recovery.
Fujii Masao.
Change the location LOG message so that it is emitted each time we pause,
not just for the final pause.

Ensure that we pause only if we are in Hot Standby and connections can be
made, so that the resume function can be run. This change supersedes the
code to override parameter recoveryPauseAtTarget to false if not
attempting to enter Hot Standby, which is now removed.
The startup process waited for a cleanup lock, but when hot_standby = off
its pid was not registered, so the bgwriter would not wake the waiting
process as intended.
ensure that they use different checkpoints as the starting point. We use
the checkpoint redo location as a unique identifier for the base backup in
the end-of-backup record, and in the backup history file name.
Bug spotted by Fujii Masao.
Without this, the startup process goes into a tight loop, consuming
100% of one CPU and failing to respond to interrupts.
Fujii Masao, but with the proposed behavior change reverted, and the
rest adjusted accordingly.
opposed to O_DSYNC.
Fujii Masao