path: root/src/backend/access
* Address ccvalid/ccnoinherit in TupleDesc support functions. (Noah Misch, 2014-03-23)
  equalTupleDescs() neglected both of these ConstrCheck fields, and CreateTupleDescCopyConstr() neglected ccnoinherit. At this time, the only known behavior defect resulting from these omissions is constraint exclusion disregarding a CHECK constraint validated by an ALTER TABLE VALIDATE CONSTRAINT statement issued earlier in the same transaction.
  Back-patch to 9.2, where these fields were introduced.
* In WAL replay, restore GIN metapage unconditionally to avoid torn page. (Heikki Linnakangas, 2014-03-12)
  We don't take a full-page image of the GIN metapage; instead, the WAL record contains all the information required to reconstruct it from scratch. But to avoid torn page hazards, we must re-initialize it from the WAL record every time, even if it already has a greater LSN, similar to how normal full page images are restored.
  This was highly unlikely to cause any problems in practice, because the GIN metapage is small. We rely on an update smaller than a 512 byte disk sector to be atomic elsewhere, at least in pg_control. But better safe than sorry, and this would be easy to overlook if more fields are added to the metapage so that it's no longer small.
  Reported by Noah Misch. Backpatch to all supported versions.
* Fix dangling smgr_owner pointer when a fake relcache entry is freed. (Heikki Linnakangas, 2014-03-07)
  A fake relcache entry can "own" a SmgrRelation object, like a regular relcache entry. But when it was free'd, the owner field in SmgrRelation was not cleared, so it was left pointing to free'd memory. Amazingly this apparently hasn't caused crashes in practice, or we would've heard about it earlier. Andres found this with Valgrind.
  Report and fix by Andres Freund, with minor modifications by me. Backpatch to all supported versions.
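  A generic sketch of the fix pattern (hypothetical Owner/Resource types, not the actual relcache/smgr code): whatever frees the owning object must clear the owned object's back-pointer first, or that pointer is left dangling.

    #include <stdlib.h>

    /* A resource that remembers the slot inside whoever owns it. */
    typedef struct Resource
    {
        struct Resource **owner_slot;   /* points into the owner, or NULL */
    } Resource;

    typedef struct Owner
    {
        Resource   *res;                /* the owned resource */
    } Owner;

    static void
    free_owner(Owner *owner)
    {
        /* Clear the back-pointer before freeing, so it cannot dangle. */
        if (owner->res && owner->res->owner_slot == &owner->res)
            owner->res->owner_slot = NULL;
        free(owner);
    }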
* Do wal_level and hot standby checks when doing crash-then-archive recovery. (Heikki Linnakangas, 2014-03-05)
  CheckRequiredParameterValues() should perform the checks if archive recovery was requested, even if we are going to perform crash recovery first.
  Reported by Kyotaro HORIGUCHI. Backpatch to 9.2, like the crash-then-archive recovery mode.
* Fix lastReplayedEndRecPtr calculation when starting from shutdown checkpoint. (Heikki Linnakangas, 2014-03-05)
  When entering crash recovery followed by archive recovery, and the latest checkpoint is a shutdown checkpoint, and there are no more WAL records to replay before transitioning from crash to archive recovery, we would not immediately allow read-only connections in hot standby mode even if we could.
  That's because when starting from a shutdown checkpoint, we set lastReplayedEndRecPtr incorrectly to the record before the checkpoint record, instead of the checkpoint record itself. We don't run the redo routine of the shutdown checkpoint record, but starting recovery from it goes through the same motions, so it should be considered as replayed.
  Reported by Kyotaro HORIGUCHI. All versions with hot standby are affected, so backpatch to 9.0.
* Remove bogus while-loop. (Heikki Linnakangas, 2014-02-28)
  Commit abf5c5c9a4f142b3343614746bb9e99a794f8e7b added a bogus while-statement after the for(;;)-loop. It went unnoticed in testing, because it was dead code.
  Report by KONDO Mitsumasa. Backpatch to 9.3. The commit that introduced this was also applied to 9.2, but not the bogus while-loop part, because the code in 9.2 looks quite different.
* Fix WAL replay of locking an updated tuple (Alvaro Herrera, 2014-02-27)
  We were resetting the tuple's HEAP_HOT_UPDATED flag as well as t_ctid on WAL replay of a tuple-lock operation, which is incorrect when the tuple is already updated.
  Back-patch to 9.3. The clearing of both header elements was there previously, but since no update could be present on a tuple that was being locked, it was harmless.
  Bug reported by Peter Geoghegan and Greg Stark in CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com and CAM-w4HPTOeMT4KP0OJK+mGgzgcTOtLRTvFZyvD0O4aH-7dxo3Q@mail.gmail.com respectively; diagnosis by Andres Freund.
* Add a GUC to report whether data page checksums are enabled. (Heikki Linnakangas, 2014-02-20)
  Backported from master. It was an oversight in the original data checksums patch to not have a GUC like this.
* Fix comment; checkpointer, not bgwriter, performs checkpoints since 9.2. (Heikki Linnakangas, 2014-02-18)
  Amit Langote
* Prevent potential overruns of fixed-size buffers. (Tom Lane, 2014-02-17)
  Coverity identified a number of places in which it couldn't prove that a string being copied into a fixed-size buffer would fit. We believe that most, perhaps all of these are in fact safe, or are copying data that is coming from a trusted source so that any overrun is not really a security issue. Nonetheless it seems prudent to forestall any risk by using strlcpy() and similar functions.
  Fixes by Peter Eisentraut and Jozef Mlich based on Coverity reports.
  In addition, fix a potential null-pointer-dereference crash in contrib/chkpass. The crypt(3) function is defined to return NULL on failure, but chkpass.c didn't check for that before using the result. The main practical case in which this could be an issue is if libc is configured to refuse to execute unapproved hashing algorithms (e.g., "FIPS mode"). This ideally should've been a separate commit, but since it touches code adjacent to one of the buffer overrun changes, I included it in this commit to avoid last-minute merge issues. This issue was reported by Honza Horak.
  Security: CVE-2014-0065 for buffer overruns, CVE-2014-0066 for crypt()
* Change the order that pg_xlog and WAL archive are polled for WAL segments. (Heikki Linnakangas, 2014-02-14)
  If there is a WAL segment with same ID but different TLI present in both the WAL archive and pg_xlog, prefer the one with higher TLI. Before this patch, the archive was polled first, for all expected TLIs, and only if no file was found was pg_xlog scanned. This was a change in behavior from 9.2, which first scanned archive and pg_xlog for the highest TLI, then archive and pg_xlog for the next highest TLI and so forth. This patch reverts the behavior back to what it was in 9.2.
  The reason for this is that if for example you try to do archive recovery to timeline 2, which branched off timeline 1, but the WAL for timeline 2 is not archived yet, we would replay past the timeline switch point on timeline 1 using the archived files, before even looking at timeline 2's files in pg_xlog.
  Report and patch by Kyotaro Horiguchi. Backpatch to 9.3 where the behavior was changed.
* Separate multixact freezing parameters from xid's (Alvaro Herrera, 2014-02-13)
  Previously we were piggybacking on transaction ID parameters to freeze multixacts; but since there isn't necessarily any relationship between rates of Xid and multixact consumption, this turns out not to be a good idea. Therefore, we now have multixact-specific freezing parameters:
  vacuum_multixact_freeze_min_age: when to remove multis as we come across them in vacuum (defaults to 5 million, i.e. early in comparison to Xid's default of 50 million)
  vacuum_multixact_freeze_table_age: when to force whole-table scans instead of scanning only the pages marked as not all visible in the visibility map (defaults to 150 million, same as for Xids). Whichever of the two reaches the 150 million mark earlier will cause a whole-table scan.
  autovacuum_multixact_freeze_max_age: when to force emergency, uninterruptible whole-table scans (defaults to 400 million, double that for Xids). This means there shouldn't be more frequent emergency vacuuming than previously, unless multixacts are being used very rapidly.
  Backpatch to 9.3 where multixacts were made to persist enough to require freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of fields in an unnatural place, and StdRdOptions is split in two so that the newly added fields can go at the end.
  Patch by me, reviewed by Robert Haas, with additional input from Andres Freund and Tom Lane.
* In XLogReadBufferExtended, don't assume P_NEW yields consecutive pages. (Tom Lane, 2014-02-12)
  In a database that's not yet reached consistency, it's possible that some segments of a relation are not full-size but are not the last ones either. Because of the way smgrnblocks() works, asking for a new page with P_NEW will fill in the last not-full-size segment --- and if that makes it full size, the apparent EOF of the relation will increase by more than one page, so that the next P_NEW request will yield a page past the next consecutive one. This breaks the relation-extension logic in XLogReadBufferExtended, possibly allowing a page update to be applied to some page far past where it was intended to go.
  This appears to be the explanation for reports of table bloat on replication slaves compared to their masters, and probably explains some corrupted-slave reports as well.
  Fix the loop to check the page number it actually got, rather than merely Assert()'ing that dead reckoning got it to the desired place. AFAICT, there are no other places that make assumptions about exactly which page they'll get from P_NEW.
  Problem identified by Greg Stark, though this is not the same as his proposed patch. It's been like this for a long time, so back-patch to all supported branches.
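  A hedged illustration of the fixed loop (names and simplifications are ours; the real code lives in xlogutils.c and differs in detail): check which block P_NEW actually returned instead of assuming each request advances the relation by exactly one page.

    /* Illustrative sketch only, not the actual XLogReadBufferExtended() code. */
    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    static Buffer
    extend_to_block(Relation rel, BlockNumber target_blkno)
    {
        for (;;)
        {
            Buffer  buf = ReadBufferExtended(rel, MAIN_FORKNUM, P_NEW,
                                             RBM_NORMAL, NULL);

            /* Don't assume P_NEW is consecutive: a short, non-final segment
             * may have been filled in, jumping the apparent EOF ahead. */
            if (BufferGetBlockNumber(buf) == target_blkno)
                return buf;
            ReleaseBuffer(buf);
        }
    }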
* Fix multiple bugs in index page locking during hot-standby WAL replay. (Tom Lane, 2014-01-14)
  In ordinary operation, VACUUM must be careful to take a cleanup lock on each leaf page of a btree index; this ensures that no indexscans could still be "in flight" to heap tuples due to be deleted. (Because of possible index-tuple motion due to concurrent page splits, it's not enough to lock only the pages we're deleting index tuples from.) In Hot Standby, the WAL replay process must likewise lock every leaf page. There were several bugs in the code for that:
  * The replay scan might come across unused, all-zero pages in the index. While btree_xlog_vacuum itself did the right thing (ie, nothing) with such pages, xlogutils.c supposed that such pages must be corrupt and would throw an error. This accounts for various reports of replication failures with "PANIC: WAL contains references to invalid pages". To fix, add a ReadBufferMode value that instructs XLogReadBufferExtended not to complain when we're doing this.
  * btree_xlog_vacuum performed the extra locking if standbyState == STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up for hot standby queries until the database has reached consistency, and we don't want to do the extra locking till then either, for fear of reading corrupted pages (which bufmgr.c would complain about). Fix by exporting a new function from xlog.c that will report whether we're actually in hot standby replay mode.
  * To ensure full coverage of the index in the replay scan, btvacuumscan would emit a dummy WAL record for the last page of the index, if no vacuuming work had been done on that page. However, if the last page of the index is all-zero, that would result in corruption of said page, since the functions called on it weren't prepared to handle that case. There's no need to lock any such pages, so change the logic to target the last normal leaf page instead.
  The first two of these bugs were diagnosed by Andres Freund, the other one by me. Fixes based on ideas from Heikki Linnakangas and myself.
  This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
* Accept pg_upgraded tuples during multixact freezing (Alvaro Herrera, 2014-01-10)
  The new MultiXact freezing routines introduced by commit 8e9a16ab8f7 neglected to consider tuples that came from a pg_upgrade'd database; a vacuum run that tried to freeze such tuples would die with an error such as
  ERROR: MultiXactId 11415437 does no longer exist -- apparent wraparound
  To fix, ensure that GetMultiXactIdMembers is allowed to return empty multis when the infomask bits are right, as is done in other callsites.
  Per trouble report from F-Secure.
  In passing, fix a copy&paste bug reported by Andrey Karpov from VIVA64, found with their PVS-Studio static checker: instead of setting relminmxid to Invalid, we were setting relfrozenxid twice. Not an important mistake because that code branch is about relations for which we don't use the frozenxid/minmxid values at all in the first place, but it seems to warrant a fix nonetheless.
* Fix pause_at_recovery_target + recovery_target_inclusive combination. (Heikki Linnakangas, 2014-01-08)
  If pause_at_recovery_target is set, recovery pauses *before* applying the target record, even if recovery_target_inclusive is set. If you then continue with pg_xlog_replay_resume(), it will apply the target record before ending recovery. In other words, if you log in while it's paused and verify that the database looks OK, ending recovery changes its state again, possibly destroying data that you were trying to salvage with PITR.
  Backpatch to 9.1, this has been broken since pause_at_recovery_target was added.
* Fix bug in determining when recovery has reached consistency. (Heikki Linnakangas, 2014-01-08)
  When starting WAL replay from an online checkpoint, the last replayed WAL record variable was initialized using the checkpoint record's location, even though the records between the REDO location and the checkpoint record had not been replayed yet. That was noted as "slightly confusing" but harmless in the comment, but in some cases, it fooled CheckRecoveryConsistency to incorrectly conclude that we had already reached a consistent state immediately at the beginning of WAL replay. That caused the system to accept read-only connections in hot standby mode too early, and also PANICs with message "WAL contains references to invalid pages".
  Fix by initializing the variables to the REDO location instead.
  In 9.2 and above, change CheckRecoveryConsistency() to use lastReplayedEndRecPtr variable when checking if backup end location has been reached. It was inconsistently using EndRecPtr for that check, but lastReplayedEndRecPtr when checking min recovery point. It made no difference before this patch, because in all the places where CheckRecoveryConsistency was called the two variables were the same, but it was always an accident waiting to happen, and would have been wrong after this patch anyway.
  Report and analysis by Tomonari Katsumata, bug #8686. Backpatch to 9.0, where hot standby was introduced.
* Move permissions check from do_pg_start_backup to pg_start_backup (Magnus Hagander, 2014-01-07)
  And the same for do_pg_stop_backup. The code in do_pg_* is not allowed to access the catalogs. For manual base backups, the permissions check can be handled in the calling function, and for streaming base backups only users with the required permissions can get past the authentication step in the first place.
  Reported by Antonin Houska, diagnosed by Andres Freund
* Handle 5-char filenames in SlruScanDirectory (Alvaro Herrera, 2014-01-02)
  Original users of slru.c were all producing 4-digit filenames, so that was all that code was prepared to handle. Changes to multixact.c in the course of commit 0ac5ad5134f made pg_multixact/members create 5-digit filenames once a certain threshold was reached, which SlruScanDirectory wasn't prepared to deal with; in particular, 5-digit-name files were not removed during truncation. Change that routine to make it aware of those files, and have it process them just like any others.
  Right now, some pg_multixact/members directories will contain a mixture of 4-char and 5-char filenames. A future commit is expected to fix things so that each slru.c user declares the correct maximum width for the files it produces, to avoid such unsightly mixtures.
  Noticed while investigating bug #8673 reported by Serge Negodyuck.
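  A standalone sketch of the filename handling described above, using a hypothetical parse_slru_segment_name() helper (the real slru.c code is structured differently): accept both 4- and 5-character hexadecimal segment names and reject anything else.

    #include <ctype.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Return true and set *segno if 'name' looks like an SLRU segment file:
     * exactly 4 or 5 hexadecimal digits and nothing else. */
    static bool
    parse_slru_segment_name(const char *name, unsigned long *segno)
    {
        size_t  len = strlen(name);

        if (len != 4 && len != 5)
            return false;
        for (size_t i = 0; i < len; i++)
        {
            if (!isxdigit((unsigned char) name[i]))
                return false;
        }
        *segno = strtoul(name, NULL, 16);
        return true;
    }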
* Wrap multixact/members correctly during extension (Alvaro Herrera, 2014-01-02)
  In the 9.2 code for extending multixact/members, the logic was very simple because the number of entries in a members page was a proper divisor of 2^32, and thus at 2^32 wraparound the logic for page switch was identical to that at any other page boundary. In commit 0ac5ad5134f I failed to realize this and introduced code that was not able to go over the 2^32 boundary. Fix that by ensuring that when we reach the last page of the last segment we correctly zero the initial page of the initial segment, using correct uint32-wraparound-safe arithmetic.
  Noticed while investigating bug #8673 reported by Serge Negodyuck, as diagnosed by Andres Freund.
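  A minimal sketch of the wraparound-safe boundary test, with an illustrative ENTRIES_PER_PAGE value rather than the real constant: a new page starts either at an exact multiple of the per-page count or when the 32-bit counter has just wrapped to 0, since the per-page count need not divide 2^32 evenly.

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES_PER_PAGE 1636   /* illustrative; not the real constant */

    /* Does 'next_offset' start a fresh page?  The page after the last
     * pre-wraparound page is page 0 of segment 0 again, even though that
     * last page may be only partially filled. */
    static bool
    offset_starts_new_page(uint32_t next_offset)
    {
        return next_offset == 0 || next_offset % ENTRIES_PER_PAGE == 0;
    }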
* Handle wraparound during truncation in multixact/members (Alvaro Herrera, 2014-01-02)
  In pg_multixact/members, relying on modulo-2^32 arithmetic for wraparound handling doesn't work all that well. Because we don't explicitly track wraparound of the allocation counter for members, it is possible that the "live" area exceeds 2^31 entries; trying to remove SLRU segments that are "old" according to the original logic might lead to removal of segments still in use. To fix, have the truncation routine use a tailored SlruScanDirectory callback that keeps track of the live area in actual use; that way, when the live range exceeds 2^31 entries, the oldest segments still live will not get removed untimely.
  This new SlruScanDir callback needs to take care not to remove segments that are "in the future": if new SLRU segments appear while the truncation is ongoing, make sure we don't remove them. This requires examination of shared memory state to recheck for false positives, but testing suggests that this doesn't cause a problem. The original coding didn't suffer from this pitfall because segments created when truncation is running are never considered to be removable.
  Per Andres Freund's investigation of bug #8673 reported by Serge Negodyuck.
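  The underlying idea can be illustrated with a wraparound-aware range test (a hypothetical helper, not the actual SlruScanDirectory callback): instead of judging "old" by modulo distance, decide whether a segment lies inside the explicitly tracked live range, which may itself wrap past the end of the 32-bit segment space.

    #include <stdbool.h>
    #include <stdint.h>

    /* Is 'seg' within the live range [oldest, next)?  The range may wrap
     * around the end of the 32-bit segment space; such segments must not
     * be removed by truncation. */
    static bool
    segment_is_live(uint32_t oldest, uint32_t next, uint32_t seg)
    {
        if (oldest <= next)
            return seg >= oldest && seg < next;    /* range does not wrap */
        return seg >= oldest || seg < next;        /* range wraps past 2^32 */
    }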
* Optimize updating a row that's locked by same xid (Alvaro Herrera, 2013-12-19)
  Updating or locking a row that was already locked by the same transaction under the same Xid caused a MultiXact to be created; but this is unnecessary, because there's no usefulness in being able to differentiate two locks by the same transaction. In particular, if a transaction executed SELECT FOR UPDATE followed by an UPDATE that didn't modify columns of the key, we would dutifully represent the resulting combination as a multixact -- even though a single key-update is sufficient.
  Optimize the case so that only the strongest of both locks/updates is represented in Xmax. This can save some Xmax's from becoming MultiXacts, which can be a significant optimization.
  This missed optimization opportunity was spotted by Andres Freund while investigating a bug reported by Oliver Seemann in message CANCipfpfzoYnOz5jj=UZ70_R=CwDHv36dqWSpwsi27vpm1z5sA@mail.gmail.com and also directly as a performance regression reported by Dong Ye in message d54b8387.000012d8.00000010@YED-DEVD1.vmware.com Reportedly, this patch fixes the performance regression.
  Since the missing optimization was reported as a significant performance regression from 9.2, backpatch to 9.3.
  Andres Freund, tweaked by Álvaro Herrera
* Don't ignore tuple locks propagated by our updates (Alvaro Herrera, 2013-12-18)
  If a tuple was locked by transaction A, and transaction B updated it, the new version of the tuple created by B would be locked by A, yet visible only to B; due to an oversight in HeapTupleSatisfiesUpdate, the lock held by A wouldn't get checked if transaction B later deleted (or key-updated) the new version of the tuple. This might cause referential integrity checks to give false positives (that is, allow deletes that should have been rejected).
  This is an easy oversight to have made, because prior to improved tuple locks in commit 0ac5ad5134f it wasn't possible to have tuples created by our own transaction that were also locked by remote transactions, and so locks weren't even considered in that code path.
  It is recommended that foreign keys be rechecked manually in bulk after installing this update, in case some referenced rows are missing with some referencing row remaining.
  Per bug reported by Daniel Wood in CAPweHKe5QQ1747X2c0tA=5zf4YnS2xcvGf13Opd-1Mq24rF1cQ@mail.gmail.com
* Rework tuple freezing protocol (Alvaro Herrera, 2013-12-16)
  Tuple freezing was broken in connection to MultiXactIds; commit 8e53ae025de9 tried to fix it, but didn't go far enough. As noted by Noah Misch, freezing a tuple whose Xmax is a multi containing an aborted update might cause locks in the multi to go ignored by later transactions. This is because the code depended on a multixact above their cutoff point not having any lock-only member older than the cutoff point for Xids, which is easily defeated in READ COMMITTED transactions.
  The fix for this involves creating a new MultiXactId when necessary. But this cannot be done during WAL replay, and moreover multixact examination requires using CLOG access routines which are not supposed to be used during WAL replay either; so tuple freezing cannot be done with the old freeze WAL record. Therefore, separate the freezing computation from its execution, and change the WAL record to carry all necessary information. At WAL replay time, it's easy to re-execute freezing because we don't need to re-compute the new infomask/Xmax values but just take them from the WAL record.
  While at it, restructure the coding to ensure all page changes occur in a single critical section without much room for failures. The previous coding wasn't using a critical section, without any explanation as to why this was acceptable.
  In replication scenarios using the 9.3 branch, standby servers must be upgraded before their master, so that they are prepared to deal with the new WAL record once the master is upgraded; failure to do so will cause WAL replay to die with a PANIC message. Later upgrade of the standby will allow the process to continue where it left off, so there's no disruption of the data in the standby in any case. Standbys know how to deal with the old WAL record, so it's okay to keep the master running the old code for a while.
  In master, the old freeze WAL record is gone, for cleanliness' sake; there's no compatibility concern there.
  Backpatch to 9.3, where the original bug was introduced and where the previous fix was backpatched.
  Álvaro Herrera and Andres Freund
* Fix typo (Alvaro Herrera, 2013-12-13)
* Rework MultiXactId cache code (Alvaro Herrera, 2013-12-13)
  The original performs too poorly; in some scenarios it shows up way too high in profiles. Try to make it a bit smarter to avoid excessive costs. In particular, make it have a maximum size, and have entries be sorted in LRU order; once the max size is reached, evict the oldest entry to keep the cache from growing too large.
  Per complaint from Andres Freund in connection with new tuple freezing code.
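  A generic sketch of a size-capped, LRU-ordered cache of the kind described (illustrative types and names, not the multixact.c structures): entries move to the front on every hit, and once the cap is exceeded the tail entry, the least recently used, is evicted.

    #include <stdlib.h>

    #define CACHE_MAX_ENTRIES 256       /* illustrative cap */

    typedef struct CacheEntry
    {
        struct CacheEntry *next;        /* singly linked, most recent first */
        int         key;
        /* ... cached payload would go here ... */
    } CacheEntry;

    static CacheEntry *cache_head = NULL;
    static int  cache_size = 0;

    /* Look up 'key'; on a hit, move the entry to the front of the list. */
    static CacheEntry *
    cache_lookup(int key)
    {
        CacheEntry **link = &cache_head;

        for (CacheEntry *e = cache_head; e != NULL; link = &e->next, e = e->next)
        {
            if (e->key == key)
            {
                *link = e->next;            /* unlink ... */
                e->next = cache_head;       /* ... and reinsert at the front */
                cache_head = e;
                return e;
            }
        }
        return NULL;
    }

    /* Insert a new entry at the front, evicting the tail when over the cap. */
    static void
    cache_insert(int key)
    {
        CacheEntry *e = malloc(sizeof(CacheEntry));

        e->key = key;
        e->next = cache_head;
        cache_head = e;
        if (++cache_size > CACHE_MAX_ENTRIES)
        {
            CacheEntry **link = &cache_head;

            while ((*link)->next != NULL)   /* walk to the tail */
                link = &(*link)->next;
            free(*link);                    /* evict least recently used */
            *link = NULL;
            cache_size--;
        }
    }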
* Fix ancient docs/comments thinko: XID comparison is mod 2^32, not 2^31. (Tom Lane, 2013-12-12)
  Pointed out by Gianni Ciolli.
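  The corrected rule is the usual circular comparison: two XIDs are ordered by their signed 32-bit difference, so each XID has about 2^31 logical predecessors and 2^31 successors. A minimal standalone illustration (the backend's real TransactionIdPrecedes() additionally special-cases permanent XIDs):

    #include <stdbool.h>
    #include <stdint.h>

    /* True if xid1 logically precedes xid2 on the 2^32-sized circle. */
    static bool
    xid_precedes(uint32_t xid1, uint32_t xid2)
    {
        int32_t diff = (int32_t) (xid1 - xid2);     /* modulo-2^32 difference */

        return diff < 0;
    }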
* Fix improper abort during update chain locking (Alvaro Herrera, 2013-12-05)
  In 247c76a98909, I added some code to do fine-grained checking of MultiXact status of locking/updating transactions when traversing an update chain. There was a thinko in that patch which would have the traversing abort, that is return HeapTupleUpdated, when the other transaction is a committed lock-only. In this case we should ignore it and return success instead. Of course, in the case where there is a committed update, HeapTupleUpdated is the correct return value.
  A user-visible symptom of this bug is that in REPEATABLE READ and SERIALIZABLE transaction isolation modes spurious serializability errors can occur:
  ERROR: could not serialize access due to concurrent update
  In order for this to happen, there needs to be a tuple that's key-share-locked and also updated, and the update must abort; a subsequent transaction trying to acquire a new lock on that tuple would abort with the above error. The reason is that the initial FOR KEY SHARE is seen as committed by the new locking transaction, which triggers this bug. (If the UPDATE commits, then the serialization error is correctly reported.)
  When running a query in READ COMMITTED mode, what happens is that the locking is aborted by the HeapTupleUpdated return value, then EvalPlanQual fetches the newest version of the tuple, which is then the only version that gets locked. (The second time the tuple is checked there is no misbehavior on the committed lock-only, because it's not checked by the code that traverses update chains; so no bug.) Only the newest version of the tuple is locked, not older ones, but this is harmless.
  The isolation test added by this commit illustrates the desired behavior, including the proper serialization errors that get thrown.
  Backpatch to 9.3.
* Fix full-page writes of internal GIN pages. (Heikki Linnakangas, 2013-12-03)
  Insertion to a non-leaf GIN page didn't make a full-page image of the page, which is wrong. The code used to do it correctly, but was changed (commit 853d1c3103fa961ae6219f0281885b345593d101) because the redo-routine didn't track incomplete splits correctly when the page was restored from a full page image. Of course, that was not the right way to fix it; the redo routine should've been fixed instead. The redo-routine was surreptitiously fixed in 2010 (commit 4016bdef8aded77b4903c457050622a5a1815c16), so all we need to do now is revert the code that creates the record to its original form.
  This doesn't change the format of the WAL record.
  Backpatch to all supported versions.
* Fix a couple of bugs in MultiXactId freezing (Alvaro Herrera, 2013-11-29)
  Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look into a multixact to check the members against cutoff_xid. This means that a very old Xid could survive hidden within a multi, possibly outliving its CLOG storage. In the distant future, this would cause clog lookup failures:
  ERROR: could not access status of transaction 3883960912
  DETAIL: Could not open file "pg_clog/0E78": No such file or directory.
  This mostly was problematic when the updating transaction aborted, since in that case the row wouldn't get pruned away earlier in vacuum and the multixact could possibly survive for a long time. In many cases, data that is inaccessible for this reason can be brought back heuristically.
  As a second bug, heap_freeze_tuple() didn't properly handle multixacts that need to be frozen according to cutoff_multi, but whose updater xid is still alive. Instead of preserving the update Xid, it just set Xmax invalid, which leads to both old and new tuple versions becoming visible. This is pretty rare in practice, but a real threat nonetheless. Existing corrupted rows, unfortunately, cannot be repaired in an automated fashion.
  Existing physical replicas might have already incorrectly frozen tuples because of different behavior than in master, which might only become apparent in the future once pg_multixact/ is truncated; it is recommended that all clones be rebuilt after upgrading.
  Following code analysis caused by bug report by J Smith in message CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com and privately by F-Secure.
  Backpatch to 9.3, where freezing of MultiXactIds was introduced.
  Analysis and patch by Andres Freund, with some tweaks by Álvaro.
* Don't TransactionIdDidAbort in HeapTupleGetUpdateXid (Alvaro Herrera, 2013-11-29)
  It is dangerous to do so, because some code expects to be able to see the true Xmax even if it is aborted (particularly while traversing HOT chains). So don't do it, and instead rely on the callers to verify for abortedness, if necessary.
  Several race conditions and bugs fixed in the process. One isolation test changes the expected output due to these.
  This also reverts commit c235a6a589b, which is no longer necessary.
  Backpatch to 9.3, where this function was introduced.
  Andres Freund
* Truncate pg_multixact/'s contents during crash recovery (Alvaro Herrera, 2013-11-29)
  Commit 9dc842f08 of 8.2 era prevented MultiXact truncation during crash recovery, because there was no guarantee that enough state had been set up, and because it wasn't deemed to be a good idea to remove data during crash recovery anyway. Since then, due to Hot Standby, streaming replication and PITR, the amount of time a cluster can spend doing crash recovery has increased significantly, to the point that a cluster may even never come out of it. This has made not truncating the contents of pg_multixact/ no longer defensible.
  To fix, take care to set up enough state for multixact truncation before crash recovery starts (easy since checkpoints contain the required information), and move the current end-of-recovery actions to a new TrimMultiXact() function, analogous to TrimCLOG().
  At some later point, this should probably be done similarly to the way clog.c is doing it, which is to just WAL-log truncations, but we can't do that for the back branches.
  Back-patch to 9.0. 8.4 also has the problem, but since there's no hot standby there, it's much less pressing. In 9.2 and earlier, this patch is simpler than in newer branches, because multixact access during recovery isn't required. Add appropriate checks to make sure that's not happening.
  Andres Freund
* Fix full-table-vacuum request mechanism for MultiXactIds (Alvaro Herrera, 2013-11-29)
  While autovacuum dutifully launched anti-multixact-wraparound vacuums when the multixact "age" was reached, the vacuum code was not aware that it needed to make them be full table vacuums. As the resulting partial-table vacuums aren't capable of actually increasing relminmxid, autovacuum continued to launch anti-wraparound vacuums that didn't have the intended effect, until age of relfrozenxid caused the vacuum to finally be a full table one via vacuum_freeze_table_age.
  To fix, introduce logic for multixacts similar to that for plain TransactionIds, using the same GUCs.
  Backpatch to 9.3, where permanent MultiXactIds were introduced.
  Andres Freund, some cleanup by Álvaro
* Replace hardcoded 200000000 with autovacuum_freeze_max_age (Alvaro Herrera, 2013-11-29)
  Parts of the code used autovacuum_freeze_max_age to determine whether anti-multixact-wraparound vacuums are necessary, while others used a hardcoded 200000000 value. This leads to problems when autovacuum_freeze_max_age is set to a non-default value. Use the GUC everywhere.
  Backpatch to 9.3, where vacuuming of multixacts was introduced.
  Andres Freund
* Fix assorted race conditions in the new timeout infrastructure. (Tom Lane, 2013-11-29)
  Prevent handle_sig_alarm from losing control partway through due to a query cancel (either an asynchronous SIGINT, or a cancel triggered by one of the timeout handler functions). That would at least result in failure to schedule any required future interrupt, and might result in actual corruption of timeout.c's data structures, if the interrupt happened while we were updating those.
  We could still lose control if an asynchronous SIGINT arrives just as the function is entered. This wouldn't break any data structures, but it would have the same effect as if the SIGALRM interrupt had been silently lost: we'd not fire any currently-due handlers, nor schedule any new interrupt. To forestall that scenario, forcibly reschedule any pending timer interrupt during AbortTransaction and AbortSubTransaction. We can avoid any extra kernel call in most cases by not doing that until we've allowed LockErrorCleanup to kill the DEADLOCK_TIMEOUT and LOCK_TIMEOUT events.
  Another hazard is that some platforms (at least Linux and *BSD) block a signal before calling its handler and then unblock it on return. When we longjmp out of the handler, the unblock doesn't happen, and the signal is left blocked indefinitely. Again, we can fix that by forcibly unblocking signals during AbortTransaction and AbortSubTransaction.
  These latter two problems do not manifest when the longjmp reaches postgres.c, because the error recovery code there kills all pending timeout events anyway, and it uses sigsetjmp(..., 1) so that the appropriate signal mask is restored. So errors thrown outside any transaction should be OK already, and cleaning up in AbortTransaction and AbortSubTransaction should be enough to fix these issues. (We're assuming that any code that catches a query cancel error and doesn't re-throw it will do at least a subtransaction abort to clean up; but that was pretty much required already by other subsystems.)
  Lastly, ProcSleep should not clear the LOCK_TIMEOUT indicator flag when disabling that event: if a lock timeout interrupt happened after the lock was granted, the ensuing query cancel is still going to happen at the next CHECK_FOR_INTERRUPTS, and we want to report it as a lock timeout not a user cancel.
  Per reports from Dan Wood. Back-patch to 9.3 where the new timeout handling infrastructure was introduced. We may at some point decide to back-patch the signal unblocking changes further, but I'll desist from that until we hear actual field complaints about it.
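  The signal-unblocking part relies on standard POSIX calls; a minimal sketch of what abort-time cleanup needs to do when a handler was exited via longjmp, so the kernel-blocked signal never got unblocked (the function name is illustrative, not the backend's):

    #include <signal.h>

    /* Make sure SIGALRM (and SIGINT) are not left blocked after longjmp'ing
     * out of a signal handler; intended to be called from transaction-abort
     * cleanup. */
    static void
    unblock_timeout_signals(void)
    {
        sigset_t    set;

        sigemptyset(&set);
        sigaddset(&set, SIGALRM);
        sigaddset(&set, SIGINT);
        sigprocmask(SIG_UNBLOCK, &set, NULL);
    }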
* Use a more granular approach to follow update chains (Alvaro Herrera, 2013-11-28)
  Instead of simply checking the KEYS_UPDATED bit, we need to check whether each lock held on the future version of the tuple conflicts with the lock we're trying to acquire.
  Per bug report #8434 by Tomonari Katsumata
* Compare Xmin to previous Xmax when locking an update chain (Alvaro Herrera, 2013-11-28)
  Not doing so causes us to traverse an update chain that has been broken by concurrent page pruning. All other code that traverses update chains uses this check as one of the cases in which to stop iterating, so replicate it here too. Failure to do so leads to erroneous CLOG, subtrans or multixact lookups.
  Per discussion following the bug report by J Smith in CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com as diagnosed by Andres Freund.
* Don't try to set InvalidXid as page pruning hint (Alvaro Herrera, 2013-11-28)
  If a transaction updates/deletes a tuple just before aborting, and a concurrent transaction tries to prune the page concurrently, the pruner may see HeapTupleSatisfiesVacuum return HEAPTUPLE_DELETE_IN_PROGRESS, but a later call to HeapTupleGetUpdateXid() return InvalidXid. This would cause an assertion failure in development builds, but would be otherwise Mostly Harmless.
  Fix by checking whether the updater Xid is valid before trying to apply it as page prune point.
  Reported by Andres in 20131124000203.GA4403@alap2.anarazel.de
* Cope with heap_fetch failure while locking an update chain (Alvaro Herrera, 2013-11-28)
  The reason for the fetch failure is that the tuple was removed because it was dead; so the failure is innocuous and can be ignored. Moreover, there's no need for further work and we can return success to the caller immediately. EvalPlanQualFetch is doing something very similar to this already.
  Report and test case from Andres Freund in 20131124000203.GA4403@alap2.anarazel.de
* Fix Hot-Standby initialization of clog and subtrans. (Heikki Linnakangas, 2013-11-22)
  These bugs can cause data loss on standbys started with hot_standby=on at the moment they start to accept read-only queries, by marking committed transactions as uncommitted. The likelihood of such corruptions is small unless the primary has a high transaction rate.
  5a031a5556ff83b8a9646892715d7fef415b83c3 fixed bugs in HS's startup logic by maintaining less state until at least STANDBY_SNAPSHOT_PENDING state was reached, missing the fact that both clog and subtrans are written to before that. This only failed to fail in common cases because the usage of ExtendCLOG in procarray.c was superfluous since clog extensions are actually WAL logged.
  f44eedc3f0f347a856eea8590730769125964597/I then tried to fix the missing extensions of pg_subtrans due to the former commit's changes - which are not WAL logged - by performing the extensions when switching to a state > STANDBY_INITIALIZED and not performing xid assignments before that - again missing the fact that ExtendCLOG is unnecessary - but screwed up twice: once because latestObservedXid wasn't updated anymore in that state due to the earlier commit, and once by having an off-by-one error in the loop performing extensions. This means that whenever a CLOG_XACTS_PER_PAGE (32768 with default settings) boundary was crossed between the start of the checkpoint recovery started from and the first xl_running_xact record, old transactions' commit bits in pg_clog could be overwritten if they started and committed in that window.
  Fix this mess by not performing ExtendCLOG() in HS at all anymore since it's unneeded and evidently dangerous, and by performing subtrans extensions even before reaching STANDBY_SNAPSHOT_PENDING.
  Analysis and patch by Andres Freund. Reported by Christophe Pettus. Backpatch down to 9.0, like the previous commit that caused this.
* Fix race condition in GIN posting tree page deletion. (Heikki Linnakangas, 2013-11-08)
  If a page is deleted, and reused for something else, just as a search is following a rightlink to it from its left sibling, the search would continue scanning whatever the new contents of the page are. That could lead to incorrect query results, or even something more curious if the page is reused for a different kind of a page.
  To fix, modify the search algorithm to lock the next page before releasing the previous one, and refrain from deleting pages from the leftmost branch of the tree.
  Add a new Concurrency section to the README, explaining why this works. There is a lot more one could say about concurrency in GIN, but that's for another patch.
  Backpatch to all supported versions.
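  The locking rule here is ordinary hand-over-hand (lock-coupling) traversal. A generic pthread illustration of the ordering on a linked structure, plainly not GIN's buffer-locking code: acquire the next node's lock before releasing the current one, so the next node cannot be deleted or recycled in between.

    #include <pthread.h>
    #include <stddef.h>

    typedef struct Node
    {
        pthread_mutex_t lock;
        struct Node *right;         /* right sibling, or NULL */
        int         key;
    } Node;

    /* Walk right looking for 'key'; returns the node still locked (caller
     * unlocks), or NULL if not found. 'start' must be non-NULL. */
    static Node *
    search_right(Node *start, int key)
    {
        Node   *cur = start;

        pthread_mutex_lock(&cur->lock);
        while (cur->key != key)
        {
            Node   *next = cur->right;

            if (next == NULL)
            {
                pthread_mutex_unlock(&cur->lock);
                return NULL;
            }
            pthread_mutex_lock(&next->lock);    /* take next first ... */
            pthread_mutex_unlock(&cur->lock);   /* ... then release current */
            cur = next;
        }
        return cur;
    }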
* Prevent memory leaks from accumulating across printtup() calls. (Tom Lane, 2013-11-03)
  Historically, printtup() has assumed that it could prevent memory leakage by pfree'ing the string result of each output function and manually managing detoasting of toasted values. This amounts to assuming that datatype output functions never leak any memory internally; an assumption we've already decided to be bogus elsewhere, for example in COPY OUT. range_out in particular is known to leak multiple kilobytes per call, as noted in bug #8573 from Godfried Vanluffelen. While we could go in and fix that leak, it wouldn't be very notationally convenient, and in any case there have been and undoubtedly will again be other leaks in other output functions. So what seems like the best solution is to run the output functions in a temporary memory context that can be reset after each row, as we're doing in COPY OUT. Some quick experimentation suggests this is actually a tad faster than the retail pfree's anyway.
  This patch fixes all the variants of printtup, except for debugtup() which is used in standalone mode. It doesn't seem worth worrying about query-lifespan leaks in standalone mode, and fixing that case would be a bit tedious since debugtup() doesn't currently have any startup or shutdown functions.
  While at it, remove manual detoast management from several other output-function call sites that had copied it from printtup(). This doesn't make a lot of difference right now, but in view of recent discussions about supporting "non-flattened" Datums, we're going to want that code gone eventually anyway.
  Back-patch to 9.2 where range_out was introduced. We might eventually decide to back-patch this further, but in the absence of known major leaks in older output functions, I'll refrain for now.
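  A hedged sketch of the per-row reset pattern using the backend's memory-context API (signatures roughly as in the 9.x branches; the function and the placeholder body are ours, not the actual printtup.c code):

    #include "postgres.h"
    #include "utils/memutils.h"

    /* Created once at startup, e.g. as a child of the current context. */
    static MemoryContext tmpcontext = NULL;

    static void
    setup_tmpcontext(void)
    {
        tmpcontext = AllocSetContextCreate(CurrentMemoryContext,
                                           "printtup temporary context",
                                           ALLOCSET_DEFAULT_MINSIZE,
                                           ALLOCSET_DEFAULT_INITSIZE,
                                           ALLOCSET_DEFAULT_MAXSIZE);
    }

    /* Run one row's output functions in the short-lived context, then reset
     * it, so anything they leak cannot accumulate across rows. */
    static void
    emit_one_row_example(void)
    {
        MemoryContext oldcontext = MemoryContextSwitchTo(tmpcontext);

        /* ... call datatype output functions and send the row here ... */

        MemoryContextSwitchTo(oldcontext);
        MemoryContextReset(tmpcontext);     /* discard per-row leakage */
    }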
* Retry after buffer locking failure during SPGiST index creation. (Tom Lane, 2013-11-02)
  The original coding thought this case was impossible, but it can happen if the bgwriter or checkpointer processes decide to write out an index page while creation is still proceeding, leading to a bogus "unexpected spgdoinsert() failure" error. Problem reported by Jonathan S. Katz.
  Teodor Sigaev
* Prevent using strncpy with src == dest in TupleDescInitEntry. (Tom Lane, 2013-10-28)
  The C and POSIX standards state that strncpy's behavior is undefined when source and destination areas overlap. While it remains dubious whether any implementations really misbehave when the pointers are exactly equal, some platforms are now starting to force the issue by complaining when an undefined call occurs. (In particular OS X 10.9 has been seen to dump core here, though the exact set of circumstances needed to trigger that remain elusive. Similar behavior can be expected to be optional on Linux and other platforms in the near future.) So tweak the code to explicitly do nothing when nothing need be done.
  Back-patch to all active branches. In HEAD, this also lets us get rid of an exception in valgrind.supp.
  Per discussion of a report from Matthias Schmitt.
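  The "do nothing when nothing need be done" guard amounts to skipping the call when the pointers are identical (the helper name below is ours, purely illustrative):

    #include <string.h>

    /* Copy at most n bytes, but skip the call entirely when src and dst are
     * the same pointer, since overlapping strncpy() is undefined behavior. */
    static void
    copy_name(char *dst, const char *src, size_t n)
    {
        if (dst != src)
            strncpy(dst, src, n);
    }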
* Fix typos in comments. (Heikki Linnakangas, 2013-10-24)
* Fix bugs in SSI tuple locking. (Heikki Linnakangas, 2013-10-08)
  1. In heap_hot_search_buffer(), the PredicateLockTuple() call is passed the wrong offset number. heapTuple->t_self is set to the tid of the first tuple in the chain that's visited, not the one actually being read.
  2. CheckForSerializableConflictIn() uses the tuple's t_ctid field instead of t_self to check for existing predicate locks on the tuple. If the tuple was updated, but the updater rolled back, t_ctid points to the aborted dead tuple.
  Reported by Hannu Krosing. Backpatch to 9.1.
* Fix pgindent comment breakage (Alvaro Herrera, 2013-09-24)
* Rename various "freeze multixact" variablesAlvaro Herrera2013-09-16
| | | | | | | | | It seems to make more sense to use "cutoff multixact" terminology throughout the backend code; "freeze" is associated with replacing of an Xid with FrozenTransactionId, which is not what we do for MultiXactIds. Andres Freund Some adjustments by Álvaro Herrera
* Rename the "fast_promote" file to just "promote".Heikki Linnakangas2013-08-19
| | | | | | | | | | This keeps the usual trigger file name unchanged from 9.2, avoiding nasty issues if you use a pre-9.3 pg_ctl binary with a 9.3 server or vice versa. The fallback behavior of creating a full checkpoint before starting up is now triggered by a file called "fallback_promote". That can be useful for debugging purposes, but we don't expect any users to have to resort to that and we might want to remove that in the future, which is why the fallback mechanism is undocumented.
* Fix pg_upgrade failure from servers older than 9.3 (Alvaro Herrera, 2013-08-19)
  When upgrading from servers of versions 9.2 and older, and MultiXactIds have been used in the old server beyond the first page (that is, 2048 multis or more in the default 8kB-page build), pg_upgrade would set the next multixact offset to use beyond what has been allocated in the new cluster. This would cause a failure the first time the new cluster needs to use this value, because the pg_multixact/offsets/ file wouldn't exist or wouldn't be large enough. To fix, ensure that the transient server instances launched by pg_upgrade extend the file as necessary.
  Per report from Jesse Denardo in CANiVXAj4c88YqipsyFQPboqMudnjcNTdB3pqe8ReXqAFQ=HXyA@mail.gmail.com