aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
* Don't assume that PageIsEmpty() returns true on an all-zeros page.Heikki Linnakangas2015-07-27
| | | | | | | | It does currently, and I don't see us changing that any time soon, but we don't make that assumption anywhere else. Per Tom Lane's suggestion. Backpatch to 9.2, like the previous patch that added this assumption.
* Reuse all-zero pages in GIN.Heikki Linnakangas2015-07-27
| | | | | | | | | | In GIN, an all-zeros page would be leaked forever, and never reused. Just add them to the FSM in vacuum, and they will be reinitialized when grabbed from the FSM. On master and 9.5, attempting to access the page's opaque struct also caused an assertion failure, although that was otherwise harmless. Reported by Jeff Janes. Backpatch to all supported versions.
* Fix handling of all-zero pages in SP-GiST vacuum.Heikki Linnakangas2015-07-27
| | | | | | | | | | | | SP-GiST initialized an all-zeros page at vacuum, but that was not WAL-logged, which is not safe. You might get a torn page write, when it gets flushed to disk, and end-up with a half-initialized index page. To fix, leave it in the all-zeros state, and add it to the FSM. It will be initialized when reused. Also don't set the page-deleted flag when recycling an empty page. That was also not WAL-logged, and a torn write of that would cause the page to have an invalid checksum. Backpatch to 9.2, where SP-GiST indexes were added.
* Fix off-by-one error in calculating subtrans/multixact truncation point.Heikki Linnakangas2015-07-23
| | | | | | | | | | If there were no subtransactions (or multixacts) active, we would calculate the oldestxid == next xid. That's correct, but if next XID happens to be on the next pg_subtrans (pg_multixact) page, the page does not exist yet, and SimpleLruTruncate will produce an "apparent wraparound" warning. The warning is harmless in this case, but looks very alarming to users. Backpatch to all supported versions. Patch and analysis by Thomas Munro.
* Don't call PageGetSpecialPointer() on page until it's been initialized.Heikki Linnakangas2015-06-30
| | | | | | | | | | After calling XLogInitBufferForRedo(), the page might be all-zeros if it was not in page cache already. btree_xlog_unlink_page initialized the page correctly, but it called PageGetSpecialPointer before initializing it, which would lead to a corrupt page at WAL replay, if the unlinked page is not in page cache. Backpatch to 9.4, the bug came with the rewrite of B-tree page deletion.
* Revoke incorrectly applied patch versionSimon Riggs2015-06-27
|
* Avoid hot standby cancels from VAC FREEZESimon Riggs2015-06-27
| | | | | | | | | | | | VACUUM FREEZE generated false cancelations of standby queries on an otherwise idle master. Caused by an off-by-one error on cutoff_xid which goes back to original commit. Backpatch to all versions 9.0+ Analysis and report by Marco Nenciarini Bug fix by Simon Riggs
* Fix a couple of bugs with wal_log_hints.Heikki Linnakangas2015-06-26
| | | | | | | | | | | | | | 1. Replay of the WAL record for setting a bit in the visibility map contained an assertion that a full-page image of that record type can only occur with checksums enabled. But it can also happen with wal_log_hints, so remove the assertion. Unlike checksums, wal_log_hints can be changed on the fly, so it would be complicated to figure out if it was enabled at the time that the WAL record was generated. 2. wal_log_hints has the same effect on the locking needed to read the LSN of a page as data checksums. BufferGetLSNAtomic() didn't get the memo. Backpatch to 9.4, where wal_log_hints was added.
* Improve multixact emergency autovacuum logic.Andres Freund2015-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | Previously autovacuum was not necessarily triggered if space in the members slru got tight. The first problem was that the signalling was tied to values in the offsets slru, but members can advance much faster. Thats especially a problem if old sessions had been around that previously prevented the multixact horizon to increase. Secondly the skipping logic doesn't work if the database was restarted after autovacuum was triggered - that knowledge is not preserved across restart. This is especially a problem because it's a common panic-reaction to restart the database if it gets slow to anti-wraparound vacuums. Fix the first problem by separating the logic for members from offsets. Trigger autovacuum whenever a multixact crosses a segment boundary, as the current member offset increases in irregular values, so we can't use a simple modulo logic as for offsets. Add a stopgap for the second problem, by signalling autovacuum whenver ERRORing out because of boundaries. Discussion: 20150608163707.GD20772@alap3.anarazel.de Backpatch into 9.3, where it became more likely that multixacts wrap around.
* Add missing check for wal_debug GUC.Andres Freund2015-06-21
| | | | | | | | | | 9a20a9b2 added a new elog(), enabled when WAL_DEBUG is defined. The other WAL_DEBUG dependant messages check for the wal_debug GUC, but this one did not. While at it replace 'upto' with 'up to'. Discussion: 20150610110253.GF3832@alap3.anarazel.de Backpatch to 9.4, the first release containing 9a20a9b2.
* Fix corner case in autovacuum-forcing logic for multixact wraparound.Robert Haas2015-06-19
| | | | | | | | | | | | | | | Since find_multixact_start() relies on SimpleLruDoesPhysicalPageExist(), and that function looks only at the on-disk state, it's possible for it to fail to find a page that exists in the in-memory SLRU that has not been written yet. If that happens, SetOffsetVacuumLimit() will erroneously decide to force emergency autovacuuming immediately. We should probably fix find_multixact_start() to consider the data cached in memory as well as on the on-disk state, but that's no excuse for SetOffsetVacuumLimit() to be stupid about the case where it can no longer read the value after having previously succeeded in doing so. Report by Andres Freund.
* Allow HotStandbyActiveInReplay() to be called in single user mode.Andres Freund2015-06-08
| | | | | | | | | | | | HotStandbyActiveInReplay, introduced in 061b079f, only allowed WAL replay to happen in the startup process, missing the single user case. This buglet is fairly harmless as it only causes problems when single user mode in an assertion enabled build is used to replay a btree vacuum record. Backpatch to 9.2. 061b079f was backpatched further, but the assertion was not.
* Cope with possible failure of the oldest MultiXact to exist.Robert Haas2015-06-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent commits, mainly b69bf30b9bfacafc733a9ba77c9587cf54d06c0c and 53bb309d2d5a9432d2602c93ed18e58bd2924e15, introduced mechanisms to protect against wraparound of the MultiXact member space: the number of multixacts that can exist at one time is limited to 2^32, but the total number of members in those multixacts is also limited to 2^32, and older code did not take care to enforce the second limit, potentially allowing old data to be overwritten while it was still needed. Unfortunately, these new mechanisms failed to account for the fact that the code paths in which they run might be executed during recovery or while the cluster was in an inconsistent state. Also, they failed to account for the fact that users who used pg_upgrade to upgrade a PostgreSQL version between 9.3.0 and 9.3.4 might have might oldestMultiXid = 1 in the control file despite the true value being larger. To fix these problems, first, avoid unnecessarily examining the mmembers of MultiXacts when the cluster is not known to be consistent. TruncateMultiXact has done this for a long time, and this patch does not fix that. But the new calls used to prevent member wraparound are not needed until we reach normal running, so avoid calling them earlier. (SetMultiXactIdLimit is actually called before InRecovery is set, so we can't rely on that; we invent our own multixact-specific flag instead.) Second, make failure to look up the members of a MultiXact a non-fatal error. Instead, if we're unable to determine the member offset at which wraparound would occur, postpone arming the member wraparound defenses until we are able to do so. If we're unable to determine the member offset that should force autovacuum, force it continuously until we are able to do so. If we're unable to deterine the member offset at which we should truncate the members SLRU, log a message and skip truncation. An important consequence of these changes is that anyone who does have a bogus oldestMultiXid = 1 value in pg_control will experience immediate emergency autovacuuming when upgrading to a release that contains this fix. The release notes should highlight this fact. If a user has no pg_multixact/offsets/0000 file, but has oldestMultiXid = 1 in the control file, they may wish to vacuum any tables with relminmxid = 1 prior to upgrading in order to avoid an immediate emergency autovacuum after the upgrade. This must be done with a PostgreSQL version 9.3.5 or newer and with vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age set to 0. This patch also adds an additional log message at each database server startup, indicating either that protections against member wraparound have been engaged, or that they have not. In the latter case, once autovacuum has advanced oldestMultiXid to a sane value, the message indicating that the guards have been engaged will appear at the next checkpoint. A few additional messages have also been added at the DEBUG1 level so that the correct operation of this code can be properly audited. Along the way, this patch fixes another, related bug in TruncateMultiXact that has existed since PostgreSQL 9.3.0: when no MultiXacts exist at all, the truncation code looks up NextMultiXactId, which doesn't exist yet. This can lead to TruncateMultiXact removing every file in pg_multixact/offsets instead of keeping one around, as it should. This in turn will cause the database server to refuse to start afterwards. Patch by me. Review by Álvaro Herrera, Andres Freund, Noah Misch, and Thomas Munro.
* pgindent run on access/transam/multixact.cAlvaro Herrera2015-06-04
| | | | | | | | | This file has been patched over and over, and the differences to master caused by pgindent are annoying enough that it seems saner to make the older branches look the same. Backpatch to 9.3, which is as far back as backpatching of bugfixes is necessary.
* Fix fsync-at-startup code to not treat errors as fatal.Tom Lane2015-05-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 2ce439f3379aed857517c8ce207485655000fc8e introduced a rather serious regression, namely that if its scan of the data directory came across any un-fsync-able files, it would fail and thereby prevent database startup. Worse yet, symlinks to such files also caused the problem, which meant that crash restart was guaranteed to fail on certain common installations such as older Debian. After discussion, we agreed that (1) failure to start is worse than any consequence of not fsync'ing is likely to be, therefore treat all errors in this code as nonfatal; (2) we should not chase symlinks other than those that are expected to exist, namely pg_xlog/ and tablespace links under pg_tblspc/. The latter restriction avoids possibly fsync'ing a much larger part of the filesystem than intended, if the user has left random symlinks hanging about in the data directory. This commit takes care of that and also does some code beautification, mainly moving the relevant code into fd.c, which seems a much better place for it than xlog.c, and making sure that the conditional compilation for the pre_sync_fname pass has something to do with whether pg_flush_data works. I also relocated the call site in xlog.c down a few lines; it seems a bit silly to be doing this before ValidateXLOGDirectoryStructure(). The similar logic in initdb.c ought to be made to match this, but that change is noncritical and will be dealt with separately. Back-patch to all active branches, like the prior commit. Abhijit Menon-Sen and Tom Lane
* Update README.tuplockAlvaro Herrera2015-05-25
| | | | | | | | | | | Multixact truncation is now handled differently, and this file hadn't gotten the memo. Per note from Amit Langote. I didn't use his patch, though. Also update the description of infomask bits, which weren't completely up to date either. This commit also propagates b01a4f6838 back to 9.3 and 9.4, which apparently I failed to do back then.
* Fix spelling in commentSimon Riggs2015-05-19
|
* Fix whitespacePeter Eisentraut2015-05-16
|
* Increase threshold for multixact member emergency autovac to 50%.Robert Haas2015-05-11
| | | | | | | | | | | | | | | | | | | | | | | Analysis by Noah Misch shows that the 25% threshold set by commit 53bb309d2d5a9432d2602c93ed18e58bd2924e15 is lower than any other, similar autovac threshold. While we don't know exactly what value will be optimal for all users, it is better to err a little on the high side than on the low side. A higher value increases the risk that users might exhaust the available space and start seeing errors before autovacuum can clean things up sufficiently, but a user who hits that problem can compensate for it by reducing autovacuum_multixact_freeze_max_age to a value dependent on their average multixact size. On the flip side, if the emergency cap imposed by that patch kicks in too early, the user will experience excessive wraparound scanning and will be unable to mitigate that problem by configuration. The new value will hopefully reduce the risk of such bad experiences while still providing enough headroom to avoid multixact member exhaustion for most users. Along the way, adjust the documentation to reflect the effects of commit 04e6d3b877e060d8445eb653b7ea26b1ee5cec6b, which taught autovacuum to run for multixact wraparound even when autovacuum is configured off.
* Even when autovacuum=off, force it for members as we do in other cases.Robert Haas2015-05-11
| | | | Thomas Munro, with some adjustments by me.
* Advance the stop point for multixact offset creation only at checkpoint.Robert Haas2015-05-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c advanced the stop point at vacuum time, but this has subsequently been shown to be unsafe as a result of analysis by myself and Thomas Munro and testing by Thomas Munro. The crux of the problem is that the SLRU deletion logic may get confused about what to remove if, at exactly the right time during the checkpoint process, the head of the SLRU crosses what used to be the tail. This patch, by me, fixes the problem by advancing the stop point only following a checkpoint. This has the additional advantage of making the removal logic work during recovery more like the way it works during normal running, which is probably good. At least one of the calls to DetermineSafeOldestOffset which this patch removes was already dead, because MultiXactAdvanceOldest is called only during recovery and DetermineSafeOldestOffset was set up to do nothing during recovery. That, however, is inconsistent with the principle that recovery and normal running should work similarly, and was confusing to boot. Along the way, fix some comments that previous patches in this area neglected to update. It's not clear to me whether there's any concrete basis for the decision to use only half of the multixact ID space, but it's neither necessary nor sufficient to prevent multixact member wraparound, so the comments should not say otherwise.
* Fix DetermineSafeOldestOffset for the case where there are no mxacts.Robert Haas2015-05-10
| | | | | | | | Commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c failed to take into account the possibility that there might be no multixacts in existence at all. Report by Thomas Munro; patch by me.
* Teach autovacuum about multixact member wraparound.Robert Haas2015-05-08
| | | | | | | | | | | | | | | | | | | | | The logic introduced in commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c and repaired in commits 669c7d20e6374850593cb430d332e11a3992bbcf and 7be47c56af3d3013955c91c2877c08f2a0e3e6a2 helps to ensure that we don't overwrite old multixact member information while it is still needed, but a user who creates many large multixacts can still exhaust the member space (and thus start getting errors) while autovacuum stands idly by. To fix this, progressively ramp down the effective value (but not the actual contents) of autovacuum_multixact_freeze_max_age as member space utilization increases. This makes autovacuum more aggressive and also reduces the threshold for a manual VACUUM to perform a full-table scan. This patch leaves unsolved the problem of ensuring that emergency autovacuums are triggered even when autovacuum=off. We'll need to fix that via a separate patch. Thomas Munro and Robert Haas
* Fix incorrect math in DetermineSafeOldestOffset.Robert Haas2015-05-07
| | | | | | | | | | The old formula didn't have enough parentheses, so it would do the wrong thing, and it used / rather than % to find a remainder. The effect of these oversights is that the stop point chosen by the logic introduced in commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c might be rather meaningless. Thomas Munro, reviewed by Kevin Grittner, with a whitespace tweak by me.
* Recursively fsync() the data directory after a crash.Robert Haas2015-05-04
| | | | | | | | | | | Otherwise, if there's another crash, some writes from after the first crash might make it to disk while writes from before the crash fail to make it to disk. This could lead to data corruption. Back-patch to all supported versions. Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised by me.
* Fix pg_upgrade's multixact handling (again)Alvaro Herrera2015-04-30
| | | | | | | | | | | | | | | We need to create the pg_multixact/offsets file deleted by pg_upgrade much earlier than we originally were: it was in TrimMultiXact(), which runs after we exit recovery, but it actually needs to run earlier than the first call to SetMultiXactIdLimit (before recovery), because that routine already wants to read the first offset segment. Per pg_upgrade trouble report from Jeff Janes. While at it, silence a compiler warning about a pointless assert that an unsigned variable was being tested non-negative. This was a signed constant in Thomas Munro's patch which I changed to unsigned before commit. Pointed out by Andres Freund.
* Code review for multixact bugfixAlvaro Herrera2015-04-28
| | | | | | Reword messages, rename a confusingly named function. Per Robert Haas.
* Protect against multixact members wraparoundAlvaro Herrera2015-04-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Multixact member files are subject to early wraparound overflow and removal: if the average multixact size is above a certain threshold (see note below) the protections against offset overflow are not enough: during multixact truncation at checkpoint time, some pg_multixact/members files would be removed because the server considers them to be old and not needed anymore. This leads to loss of files that are critical to interpret existing tuples's Xmax values. To protect against this, since we don't have enough info in pg_control and we can't modify it in old branches, we maintain shared memory state about the oldest value that we need to keep; we use this during new multixact creation to abort if an old still-needed file would get overwritten. This value is kept up to date by checkpoints, which makes it not completely accurate but should be good enough. We start emitting warnings sometime earlier, so that the eventual multixact-shutdown doesn't take DBAs completely by surprise (more precisely: once 20 members SLRU segments are remaining before shutdown.) On troublesome average multixact size: The threshold size depends on the multixact freeze parameters. The oldest age is related to the greater of multixact_freeze_table_age and multixact_freeze_min_age: anything older than that should be removed promptly by autovacuum. If autovacuum is keeping up with multixact freezing, the troublesome multixact average size is (2^32-1) / Max(freeze table age, freeze min age) or around 28 members per multixact. Having an average multixact size larger than that will eventually cause new multixact data to overwrite the data area for older multixacts. (If autovacuum is not able to keep up, or there are errors in vacuuming, the actual maximum is multixact_freeeze_max_age instead, at which point multixact generation is stopped completely. The default value for this limit is 400 million, which means that the multixact size that would cause trouble is about 10 members). Initial bug report by Timothy Garnett, bug #12990 Backpatch to 9.3, where the problem was introduced. Authors: Álvaro Herrera, Thomas Munro Reviews: Thomas Munro, Amit Kapila, Robert Haas, Kevin Grittner
* Fix deadlock at startup, if max_prepared_transactions is too small.Heikki Linnakangas2015-04-23
| | | | | | | | | | | | | | | When the startup process recovers transactions by scanning pg_twophase directory, it should clear MyLockedGxact after it's done processing each transaction. Like we do during normal operation, at PREPARE TRANSACTION. Otherwise, if the startup process exits due to an error, it will try to clear the locking_backend field of the last recovered transaction. That's usually harmless, but if the error happens in MarkAsPreparing, while holding TwoPhaseStateLock, the shmem-exit hook will try to acquire TwoPhaseStateLock again, and deadlock with itself. This fixes bug #13128 reported by Grant McAlister. The bug was introduced by commit bb38fb0d, so backpatch to all supported versions like that commit.
* Fix typo in commentAlvaro Herrera2015-04-14
| | | | | | | | SLRU_SEGMENTS_PER_PAGE -> SLRU_PAGES_PER_SEGMENT I introduced this ancient typo in subtrans.c and later propagated it to multixact.c. I fixed the latter in f741300c, but only back to 9.3; backpatch to all supported branches for consistency.
* Don't archive bogus recycled or preallocated files after timeline switch.Heikki Linnakangas2015-04-13
| | | | | | | | | | | | | | | | | | | After a timeline switch, we would leave behind recycled WAL segments that are in the future, but on the old timeline. After promotion, and after they become old enough to be recycled again, we would notice that they don't have a .ready or .done file, create a .ready file for them, and archive them. That's bogus, because the files contain garbage, recycled from an older timeline (or prealloced as zeros). We shouldn't archive such files. This could happen when we're following a timeline switch during replay, or when we switch to new timeline at end-of-recovery. To fix, whenever we switch to a new timeline, scan the data directory for WAL segments on the old timeline, but with a higher segment number, and remove them. Those don't belong to our timeline history, and are most likely bogus recycled or preallocated files. They could also be valid files that we streamed from the primary ahead of time, but in any case, they're not needed to recover to the new timeline.
* Remove unnecessary variables in _hash_splitbucket().Tom Lane2015-04-03
| | | | | | | | | | | Commit ed9cc2b5df59fdbc50cce37399e26b03ab2c1686 made it unnecessary to pass start_nblkno to _hash_splitbucket(), and for that matter unnecessary to have the internal nblkno variable either. My compiler didn't complain about that, but some did. I also rearranged the use of oblkno a bit to make that case more parallel. Report and initial patch by Petr Jelinek, rearranged a bit by me. Back-patch to all branches, like the previous patch.
* Fix bogus concurrent use of _hash_getnewbuf() in bucket split code.Tom Lane2015-03-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | _hash_splitbucket() obtained the base page of the new bucket by calling _hash_getnewbuf(), but it held no exclusive lock that would prevent some other process from calling _hash_getnewbuf() at the same time. This is contrary to _hash_getnewbuf()'s API spec and could in fact cause failures. In practice, we must only call that function while holding write lock on the hash index's metapage. An additional problem was that we'd already modified the metapage's bucket mapping data, meaning that failure to extend the index would leave us with a corrupt index. Fix both issues by moving the _hash_getnewbuf() call to just before we modify the metapage in _hash_expandtable(). Unfortunately there's still a large problem here, which is that we could also incur ENOSPC while trying to get an overflow page for the new bucket. That would leave the index corrupt in a more subtle way, namely that some index tuples that should be in the new bucket might still be in the old one. Fixing that seems substantially more difficult; even preallocating as many pages as we could possibly need wouldn't entirely guarantee that the bucket split would complete successfully. So for today let's just deal with the base case. Per report from Antonin Houska. Back-patch to all active branches.
* Don't delay replication for less than recovery_min_apply_delay's resolution.Andres Freund2015-03-23
| | | | | | | | | | | | | | | | | Recovery delays are implemented by waiting on a latch, and latches take milliseconds as a parameter. The required amount of waiting was computed using microsecond resolution though and the wait loop's abort condition was checking the delay in microseconds as well. This could lead to short spurts of busy looping when the overall wait time was below a millisecond, but above 0 microseconds. Instead just formulate the wait loop's abort condition in millisecond granularity as well. Given that that's recovery_min_apply_delay resolution, it seems harmless to not wait for less than a millisecond. Backpatch to 9.4 where recovery_min_apply_delay was introduced. Discussion: 20150323141819.GH26995@alap3.anarazel.de
* Fix memory leaks in GIN index vacuum.Heikki Linnakangas2015-03-12
| | | | | Per bug #12850 by Walter Nordmann. Backpatch to 9.4 where the leak was introduced.
* Reconsider when to wait for WAL flushes/syncrep during commit.Andres Freund2015-02-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Up to now RecordTransactionCommit() waited for WAL to be flushed (if synchronous_commit != off) and to be synchronously replicated (if enabled), even if a transaction did not have a xid assigned. The primary reason for that is that sequence's nextval() did not assign a xid, but are worthwhile to wait for on commit. This can be problematic because sometimes read only transactions do write WAL, e.g. HOT page prune records. That then could lead to read only transactions having to wait during commit. Not something people expect in a read only transaction. This lead to such strange symptoms as backends being seemingly stuck during connection establishment when all synchronous replicas are down. Especially annoying when said stuck connection is the standby trying to reconnect to allow syncrep again... This behavior also is involved in a rather complicated <= 9.4 bug where the transaction started by catchup interrupt processing waited for syncrep using latches, but didn't get the wakeup because it was already running inside the same overloaded signal handler. Fix the issue here doesn't properly solve that issue, merely papers over the problems. In 9.5 catchup interrupts aren't processed out of signal handlers anymore. To fix all this, make nextval() acquire a top level xid, and only wait for transaction commit if a transaction both acquired a xid and emitted WAL records. If only a xid has been assigned we don't uselessly want to wait just because of writes to temporary/unlogged tables; if only WAL has been written we don't want to wait just because of HOT prunes. The xid assignment in nextval() is unlikely to cause overhead in real-world workloads. For one it only happens SEQ_LOG_VALS/32 values anyway, for another only usage of nextval() without using the result in an insert or similar is affected. Discussion: 20150223165359.GF30784@awork2.anarazel.de, 369698E947874884A77849D8FE3680C2@maumau, 5CF4ABBA67674088B3941894E22A0D25@maumau Per complaint from maumau and Thom Brown Backpatch all the way back; 9.0 doesn't have syncrep, but it seems better to be consistent behavior across all maintained branches.
* Minor cleanup/code review for "indirect toast" stuff.Tom Lane2015-02-09
| | | | | | | | | | | | | Fix some issues I noticed while fooling with an extension to allow an additional kind of toast pointer. Much of this is just comment improvement, but there are a couple of actual bugs, which might or might not be reachable today depending on what can happen during logical decoding. An example is that toast_flatten_tuple() failed to cover the possibility of an indirection pointer in its input. Back-patch to 9.4 just in case that is reachable now. In HEAD, also correct some really minor issues with recent compression reorganization, such as dangerously underparenthesized macros.
* Fix reference-after-free when waiting for another xact due to constraint.Heikki Linnakangas2015-02-04
| | | | | | | | | | | | | | | | | | | If an insertion or update had to wait for another transaction to finish, because there was another insertion with conflicting key in progress, we would pass a just-free'd item pointer to XactLockTableWait(). All calls to XactLockTableWait() and MultiXactIdWait() had similar issues. Some passed a pointer to a buffer in the buffer cache, after already releasing the lock. The call in EvalPlanQualFetch had already released the pin too. All but the call in execUtils.c would merely lead to reporting a bogus ctid, however (or an assertion failure, if enabled). All the callers that passed HeapTuple->t_data->t_ctid were slightly bogus anyway: if the tuple was updated (again) in the same transaction, its ctid field would point to the next tuple in the chain, not the tuple itself. Backpatch to 9.4, where the 'ctid' argument to XactLockTableWait was added (in commit f88d4cfc)
* Fix query-duration memory leak with GIN rescans.Heikki Linnakangas2015-01-30
| | | | | | | | | | | | | | | | | | | | | The requiredEntries / additionalEntries arrays were not freed in freeScanKeys() like other per-key stuff. It's not obvious, but startScanKey() was only ever called after the keys have been initialized with ginNewScanKey(). That's why it doesn't need to worry about freeing existing arrays. The ginIsNewKey() test in gingetbitmap was never true, because ginrescan free's the existing keys, and it's not OK to call gingetbitmap twice in a row without calling ginrescan in between. To make that clear, remove the unnecessary ginIsNewKey(). And just to be extra sure that nothing funny happens if there is an existing key after all, call freeScanKeys() to free it if it exists. This makes the code more straightforward. (I'm seeing other similar leaks in testing a query that rescans an GIN index scan, but that's a different issue. This just fixes the obvious leak with those two arrays.) Backpatch to 9.4, where GIN fast scan was added.
* Fix BuildIndexValueDescription for expressionsStephen Frost2015-01-29
| | | | | | | | | | | | | | | In 804b6b6db4dcfc590a468e7be390738f9f7755fb we modified BuildIndexValueDescription to pay attention to which columns are visible to the user, but unfortunatley that commit neglected to consider indexes which are built on expressions. Handle error-reporting of violations of constraint indexes based on expressions by not returning any detail when the user does not have table-level SELECT rights. Backpatch to 9.0, as the prior commit was. Pointed out by Tom.
* Fix bug where GIN scan keys were not initialized with gin_fuzzy_search_limit.Heikki Linnakangas2015-01-29
| | | | | | | | | | | | | | | | | | | | | | | | When gin_fuzzy_search_limit was used, we could jump out of startScan() without calling startScanKey(). That was harmless in 9.3 and below, because startScanKey()() didn't do anything interesting, but in 9.4 it initializes information needed for skipping entries (aka GIN fast scans), and you readily get a segfault if it's not done. Nevertheless, it was clearly wrong all along, so backpatch all the way to 9.1 where the early return was introduced. (AFAICS startScanKey() did nothing useful in 9.3 and below, because the fields it initialized were already initialized in ginFillScanKey(), but I don't dare to change that in a minor release. ginFillScanKey() is always called in gingetbitmap() even though there's a check there to see if the scan keys have already been initialized, because they never are; ginrescan() free's them.) In the passing, remove unnecessary if-check from the second inner loop in startScan(). We already check in the first loop that the condition is true for all entries. Reported by Olaf Gawenda, bug #12694, Backpatch to 9.1 and above, although AFAICS it causes a live bug only in 9.4.
* Fix column-privilege leak in error-message pathsStephen Frost2015-01-28
| | | | | | | | | | | | | | | | | | | | | | | While building error messages to return to the user, BuildIndexValueDescription, ExecBuildSlotValueDescription and ri_ReportViolation would happily include the entire key or entire row in the result returned to the user, even if the user didn't have access to view all of the columns being included. Instead, include only those columns which the user is providing or which the user has select rights on. If the user does not have any rights to view the table or any of the columns involved then no detail is provided and a NULL value is returned from BuildIndexValueDescription and ExecBuildSlotValueDescription. Note that, for key cases, the user must have access to all of the columns for the key to be shown; a partial key will not be returned. Back-patch all the way, as column-level privileges are now in all supported versions. This has been assigned CVE-2014-8161, but since the issue and the patch have already been publicized on pgsql-hackers, there's no point in trying to hide this commit.
* Fix thinko in re-setting wal_log_hints flag from a parameter-change record.Heikki Linnakangas2015-01-15
| | | | | | | | The flag is supposed to be copied from the record. Same issue with track_commit_timestamps, but that's master-only. Report and fix by Petr Jalinek. Backpatch to 9.4, where wal_log_hints was added.
* Fix thinko in lock mode enumAlvaro Herrera2015-01-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0e5680f4737a9c6aa94aa9e77543e5de60411322 contained a thinko mixing LOCKMODE with LockTupleMode. This caused misbehavior in the case where a tuple is marked with a multixact with at most a FOR SHARE lock, and another transaction tries to acquire a FOR NO KEY EXCLUSIVE lock; this case should block but doesn't. Include a new isolation tester spec file to explicitely try all the tuple lock combinations; without the fix it shows the problem: starting permutation: s1_begin s1_lcksvpt s1_tuplock2 s2_tuplock3 s1_commit step s1_begin: BEGIN; step s1_lcksvpt: SELECT * FROM multixact_conflict FOR KEY SHARE; SAVEPOINT foo; a 1 step s1_tuplock2: SELECT * FROM multixact_conflict FOR SHARE; a 1 step s2_tuplock3: SELECT * FROM multixact_conflict FOR NO KEY UPDATE; a 1 step s1_commit: COMMIT; With the fixed code, step s2_tuplock3 blocks until session 1 commits, which is the correct behavior. All other cases behave correctly. Backpatch to 9.3, like the commit that introduced the problem.
* Treat negative values of recovery_min_apply_delay as having no effect.Tom Lane2015-01-03
| | | | | | | | | | | | | | | | | | | | | | At one point in the development of this feature, it was claimed that allowing negative values would be useful to compensate for timezone differences between master and slave servers. That was based on a mistaken assumption that commit timestamps are recorded in local time; but of course they're in UTC. Nor is a negative apply delay likely to be a sane way of coping with server clock skew. However, the committed patch still treated negative delays as doing something, and the timezone misapprehension survived in the user documentation as well. If recovery_min_apply_delay were a proper GUC we'd just set the minimum allowed value to be zero; but for the moment it seems better to treat negative settings as if they were zero. In passing do some extra wordsmithing on the parameter's documentation, including correcting a second misstatement that the parameter affects processing of Restore Point records. Issue noted by Michael Paquier, who also provided the code patch; doc changes by me. Back-patch to 9.4 where the feature was introduced.
* Grab heavyweight tuple lock only before sleepingAlvaro Herrera2014-12-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | We were trying to acquire the lock even when we were subsequently not sleeping in some other transaction, which opens us up unnecessarily to deadlocks. In particular, this is troublesome if an update tries to lock an updated version of a tuple and finds itself doing EvalPlanQual update chain walking; more than two sessions doing this concurrently will find themselves sleeping on each other because the HW tuple lock acquisition in heap_lock_tuple called from EvalPlanQualFetch races with the same tuple lock being acquired in heap_update -- one of these sessions sleeps on the other one to finish while holding the tuple lock, and the other one sleeps on the tuple lock. Per trouble report from Andrew Sackville-West in http://www.postgresql.org/message-id/20140731233051.GN17765@andrew-ThinkPad-X230 His scenario can be simplified down to a relatively simple isolationtester spec file which I don't include in this commit; the reason is that the current isolationtester is not able to deal with more than one blocked session concurrently and it blocks instead of raising the expected deadlock. In the future, if we improve isolationtester, it would be good to include the spec file in the isolation schedule. I posted it in http://www.postgresql.org/message-id/20141212205254.GC1768@alvh.no-ip.org Hat tip to Mark Kirkwood, who helped diagnose the trouble.
* Fix timestamp in end-of-recovery WAL records.Heikki Linnakangas2014-12-19
| | | | | | | We used time(null) to set a TimestampTz field, which gave bogus results. Noticed while looking at pg_xlogdump output. Backpatch to 9.3 and above, where the fast promotion was introduced.
* Fix (re-)starting from a basebackup taken off a standby after a failure.Andres Freund2014-12-18
| | | | | | | | | | | | | | | | | | | | | | | | | | When starting up from a basebackup taken off a standby extra logic has to be applied to compute the point where the data directory is consistent. Normal base backups use a WAL record for that purpose, but that isn't possible on a standby. That logic had a error check ensuring that the cluster's control file indicates being in recovery. Unfortunately that check was too strict, disregarding the fact that the control file could also indicate that the cluster was shut down while in recovery. That's possible when the a cluster starting from a basebackup is shut down before the backup label has been removed. When everything goes well that's a short window, but when either restore_command or primary_conninfo isn't configured correctly the window can get much wider. That's because inbetween reading and unlinking the label we restore the last checkpoint from WAL which can need additional WAL. To fix simply also allow starting when the control file indicates "shutdown in recovery". There's nicer fixes imaginable, but they'd be more invasive. Backpatch to 9.2 where support for taking basebackups from standbys was added.
* Fix assorted confusion between Oid and int32.Tom Lane2014-12-11
| | | | | | | | | | | In passing, also make some debugging elog's in pgstat.c a bit more consistently worded. Back-patch as far as applicable (9.3 or 9.4; none of these mistakes are really old). Mark Dilger identified and patched the type violations; the message rewordings are mine.
* Print wal_log_hints in the rm_desc routing of a parameter-change record.Heikki Linnakangas2014-12-05
| | | | | | | | | It was an oversight in the original commit. Also note in the sample config file that changing wal_log_hints requires a restart. Michael Paquier. Backpatch to 9.4, where wal_log_hints was added.