aboutsummaryrefslogtreecommitdiff
path: root/src/backend
Commit message (Collapse)AuthorAge
...
* Tweak parse location assignment for CURRENT_DATE and related constructs.Tom Lane2014-01-21
| | | | | | | | | | | | | | | | | | | | | | | | | All these constructs generate parse trees consisting of a Const and a run-time type coercion (perhaps a FuncExpr or a CoerceViaIO). Modify the raw parse output so that we end up with the original token's location attached to the type coercion node while the Const has location -1; before, it was the other way around. This makes no difference in terms of what exprLocation() will say about the parse tree as a whole, so it should not have any user-visible impact. The point of changing it is that we do not want contrib/pg_stat_statements to treat these constructs as replaceable constants. It will do the right thing if the Const has location -1 rather than a valid location. This is a pretty ugly hack, but then this code is ugly already; we should someday replace this translation with special-purpose parse node(s) that would allow ruleutils.c to reconstruct the original query text. (See also commit 5d3fcc4c2e137417ef470d604fee5e452b22f6a7, which also hacked location assignment rules for the benefit of pg_stat_statements.) Back-patch to 9.2 where pg_stat_statements grew the ability to recognize replaceable constants. Kyotaro Horiguchi
* Fix inadvertent semantics change in last patch to plug memory leaks.Robert Haas2014-01-21
| | | | | | | | | Commit a5bca4ef034f71175d46462963af2329d22068c2 accidentally changed the semantics when the "skipping missing configuration file" is emitted, because it forced OK to true instead of leaving the value untouched. Spotted by Tom Lane.
* Plug more memory leaks when reloading config file.Robert Haas2014-01-21
| | | | | | | | Commit 138184adc5f7c60c184972e4d23f8cdb32aed77d plugged some but not all of the leaks from commit 2a0c81a12c7e6c5ac1557b0f1f4a581f23fd4ca7. This tightens things up some more. Amit Kapila, per an observation by Tom Lane
* Allow SET TABLESPACE to database defaultStephen Frost2014-01-18
| | | | | | | | | | | | | We've always allowed CREATE TABLE to create tables in the database's default tablespace without checking for CREATE permissions on that tablespace. Unfortunately, the original implementation of ALTER TABLE ... SET TABLESPACE didn't pick up on that exception. This changes ALTER TABLE ... SET TABLESPACE to allow the database's default tablespace without checking for CREATE rights on that tablespace, just as CREATE TABLE works today. Users could always do this through a series of commands (CREATE TABLE ... AS SELECT * FROM ...; DROP TABLE ...; etc), so let's fix the oversight in SET TABLESPACE's original implementation.
* Fix Hot Standby feedback sending when streaming busily.Heikki Linnakangas2014-01-16
| | | | | | | | | | | | Commit 6f60fdd7015b032bf49273c99f80913d57eac284 accidentally removed a call to XLogWalRcvSendHSFeedback() after flushing received WAL to disk. The consequence is that when walsender is busy streaming WAL, it doesn't send HS feedback messages. One is sent if nothing is received from the master for 100ms, but if there's a steady stream of WAL, it never happens. Backpatch to 9.3. Andres Freund and Amit Kapila
* Fix multiple bugs in index page locking during hot-standby WAL replay.Tom Lane2014-01-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In ordinary operation, VACUUM must be careful to take a cleanup lock on each leaf page of a btree index; this ensures that no indexscans could still be "in flight" to heap tuples due to be deleted. (Because of possible index-tuple motion due to concurrent page splits, it's not enough to lock only the pages we're deleting index tuples from.) In Hot Standby, the WAL replay process must likewise lock every leaf page. There were several bugs in the code for that: * The replay scan might come across unused, all-zero pages in the index. While btree_xlog_vacuum itself did the right thing (ie, nothing) with such pages, xlogutils.c supposed that such pages must be corrupt and would throw an error. This accounts for various reports of replication failures with "PANIC: WAL contains references to invalid pages". To fix, add a ReadBufferMode value that instructs XLogReadBufferExtended not to complain when we're doing this. * btree_xlog_vacuum performed the extra locking if standbyState == STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up for hot standby queries until the database has reached consistency, and we don't want to do the extra locking till then either, for fear of reading corrupted pages (which bufmgr.c would complain about). Fix by exporting a new function from xlog.c that will report whether we're actually in hot standby replay mode. * To ensure full coverage of the index in the replay scan, btvacuumscan would emit a dummy WAL record for the last page of the index, if no vacuuming work had been done on that page. However, if the last page of the index is all-zero, that would result in corruption of said page, since the functions called on it weren't prepared to handle that case. There's no need to lock any such pages, so change the logic to target the last normal leaf page instead. The first two of these bugs were diagnosed by Andres Freund, the other one by me. Fixes based on ideas from Heikki Linnakangas and myself. This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
* Disallow LATERAL references to the target table of an UPDATE/DELETE.Tom Lane2014-01-11
| | | | | | | | | | | | | | On second thought, commit 0c051c90082da0b7e5bcaf9aabcbd4f361137cdc was over-hasty: rather than allowing this case, we ought to reject it for now. That leaves the field clear for a future feature that allows the target table to be re-specified in the FROM (or USING) clause, which will enable left-joining the target table to something else. We can then also allow LATERAL references to such an explicitly re-specified target table. But allowing them right now will create ambiguities or worse for such a feature, and it isn't something we documented 9.3 as supporting. While at it, add a convenience subroutine to avoid having several copies of the ereport for disalllowed-LATERAL-reference cases.
* Fix possible crashes due to using elog/ereport too early in startup.Tom Lane2014-01-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per reports from Andres Freund and Luke Campbell, a server failure during set_pglocale_pgservice results in a segfault rather than a useful error message, because the infrastructure needed to use ereport hasn't been initialized; specifically, MemoryContextInit hasn't been called. One known cause of this is starting the server in a directory it doesn't have permission to read. We could try to prevent set_pglocale_pgservice from using anything that depends on palloc or elog, but that would be messy, and the odds of future breakage seem high. Moreover there are other things being called in main.c that look likely to use palloc or elog too --- perhaps those things shouldn't be there, but they are there today. The best solution seems to be to move the call of MemoryContextInit to very early in the backend's real main() function. I've verified that an elog or ereport occurring immediately after that is now capable of sending something useful to stderr. I also added code to elog.c to print something intelligible rather than just crashing if MemoryContextInit hasn't created the ErrorContext. This could happen if MemoryContextInit itself fails (due to malloc failure), and provides some future-proofing against someone trying to sneak in new code even earlier in server startup. Back-patch to all supported branches. Since we've only heard reports of this type of failure recently, it may be that some recent change has made it more likely to see a crash of this kind; but it sure looks like it's broken all the way back.
* Fix compute_scalar_stats() for case that all values exceed WIDTH_THRESHOLD.Tom Lane2014-01-11
| | | | | | | | | | | | | | | | | | The standard typanalyze functions skip over values whose detoasted size exceeds WIDTH_THRESHOLD (1024 bytes), so as to limit memory bloat during ANALYZE. However, we (I think I, actually :-() failed to consider the possibility that *every* non-null value in a column is too wide. While compute_minimal_stats() seems to behave reasonably anyway in such a case, compute_scalar_stats() just fell through and generated no pg_statistic entry at all. That's unnecessarily pessimistic: we can still produce valid stanullfrac and stawidth values in such cases, since we do include too-wide values in the average-width calculation. Furthermore, since the general assumption in this code is that too-wide values are probably all distinct from each other, it seems reasonable to set stadistinct to -1 ("all distinct"). Per complaint from Kadri Raudsepp. This has been like this since roughly neolithic times, so back-patch to all supported branches.
* Accept pg_upgraded tuples during multixact freezingAlvaro Herrera2014-01-10
| | | | | | | | | | | | | | | | | | | | The new MultiXact freezing routines introduced by commit 8e9a16ab8f7 neglected to consider tuples that came from a pg_upgrade'd database; a vacuum run that tried to freeze such tuples would die with an error such as ERROR: MultiXactId 11415437 does no longer exist -- apparent wraparound To fix, ensure that GetMultiXactIdMembers is allowed to return empty multis when the infomask bits are right, as is done in other callsites. Per trouble report from F-Secure. In passing, fix a copy&paste bug reported by Andrey Karpov from VIVA64 from their PVS-Studio static checked, that instead of setting relminmxid to Invalid, we were setting relfrozenxid twice. Not an important mistake because that code branch is about relations for which we don't use the frozenxid/minmxid values at all in the first place, but seems to warrants a fix nonetheless.
* Fix "cannot accept a set" error when only some arms of a CASE return a set.Tom Lane2014-01-08
| | | | | | | | | | | | | | | In commit c1352052ef1d4eeb2eb1d822a207ddc2d106cb13, I implemented an optimization that assumed that a function's argument expressions would either always return a set (ie multiple rows), or always not. This is wrong however: we allow CASE expressions in which some arms return a set of some type and others just return a scalar of that type. There may be other examples as well. To fix, replace the run-time test of whether an argument returned a set with a static precheck (expression_returns_set). This adds a little bit of query startup overhead, but it seems barely measurable. Per bug #8228 from David Johnston. This has been broken since 8.0, so patch all supported branches.
* Fix pause_at_recovery_target + recovery_target_inclusive combination.Heikki Linnakangas2014-01-08
| | | | | | | | | | | | If pause_at_recovery_target is set, recovery pauses *before* applying the target record, even if recovery_target_inclusive is set. If you then continue with pg_xlog_replay_resume(), it will apply the target record before ending recovery. In other words, if you log in while it's paused and verify that the database looks OK, ending recovery changes its state again, possibly destroying data that you were tring to salvage with PITR. Backpatch to 9.1, this has been broken since pause_at_recovery_target was added.
* Fix bug in determining when recovery has reached consistency.Heikki Linnakangas2014-01-08
| | | | | | | | | | | | | | | | | | | | | | | | | | When starting WAL replay from an online checkpoint, the last replayed WAL record variable was initialized using the checkpoint record's location, even though the records between the REDO location and the checkpoint record had not been replayed yet. That was noted as "slightly confusing" but harmless in the comment, but in some cases, it fooled CheckRecoveryConsistency to incorrectly conclude that we had already reached a consistent state immediately at the beginning of WAL replay. That caused the system to accept read-only connections in hot standby mode too early, and also PANICs with message "WAL contains references to invalid pages". Fix by initializing the variables to the REDO location instead. In 9.2 and above, change CheckRecoveryConsistency() to use lastReplayedEndRecPtr variable when checking if backup end location has been reached. It was inconsistently using EndRecPtr for that check, but lastReplayedEndRecPtr when checking min recovery point. It made no difference before this patch, because in all the places where CheckRecoveryConsistency was called the two variables were the same, but it was always an accident waiting to happen, and would have been wrong after this patch anyway. Report and analysis by Tomonari Katsumata, bug #8686. Backpatch to 9.0, where hot standby was introduced.
* Fix LATERAL references to target table of UPDATE/DELETE.Tom Lane2014-01-07
| | | | | | | | | I failed to think much about UPDATE/DELETE when implementing LATERAL :-(. The implemented behavior ended up being that subqueries in the FROM or USING clause (respectively) could access the update/delete target table as though it were a lateral reference; which seems fine if they said LATERAL, but certainly ought to draw an error if they didn't. Fix it so you get a suitable error when you omit LATERAL. Per report from Emre Hasegeli.
* Move permissions check from do_pg_start_backup to pg_start_backupMagnus Hagander2014-01-07
| | | | | | | | | | And the same for do_pg_stop_backup. The code in do_pg_* is not allowed to access the catalogs. For manual base backups, the permissions check can be handled in the calling function, and for streaming base backups only users with the required permissions can get past the authentication step in the first place. Reported by Antonin Houska, diagnosed by Andres Freund
* Avoid including tablespaces inside PGDATA twice in base backupsMagnus Hagander2014-01-07
| | | | | | | | | If a tablespace was crated inside PGDATA it was backed up both as part of the PGDATA backup and as the backup of the tablespace. Avoid this by skipping any directory inside PGDATA that contains one of the active tablespaces. Dimitri Fontaine and Magnus Hagander
* Handle 5-char filenames in SlruScanDirectoryAlvaro Herrera2014-01-02
| | | | | | | | | | | | | | | | | | Original users of slru.c were all producing 4-digit filenames, so that was all that that code was prepared to handle. Changes to multixact.c in the course of commit 0ac5ad5134f made pg_multixact/members create 5-digit filenames once a certain threshold was reached, which SlruScanDirectory wasn't prepared to deal with; in particular, 5-digit-name files were not removed during truncation. Change that routine to make it aware of those files, and have it process them just like any others. Right now, some pg_multixact/members directories will contain a mixture of 4-char and 5-char filenames. A future commit is expected fix things so that each slru.c user declares the correct maximum width for the files it produces, to avoid such unsightly mixtures. Noticed while investigating bug #8673 reported by Serge Negodyuck.
* Wrap multixact/members correctly during extensionAlvaro Herrera2014-01-02
| | | | | | | | | | | | | | In the 9.2 code for extending multixact/members, the logic was very simple because the number of entries in a members page was a proper divisor of 2^32, and thus at 2^32 wraparound the logic for page switch was identical than at any other page boundary. In commit 0ac5ad5134f I failed to realize this and introduced code that was not able to go over the 2^32 boundary. Fix that by ensuring that when we reach the last page of the last segment we correctly zero the initial page of the initial segment, using correct uint32-wraparound-safe arithmetic. Noticed while investigating bug #8673 reported by Serge Negodyuck, as diagnosed by Andres Freund.
* Handle wraparound during truncation in multixact/membersAlvaro Herrera2014-01-02
| | | | | | | | | | | | | | | | | | | | | | | In pg_multixact/members, relying on modulo-2^32 arithmetic for wraparound handling doesn't work all that well. Because we don't explicitely track wraparound of the allocation counter for members, it is possible that the "live" area exceeds 2^31 entries; trying to remove SLRU segments that are "old" according to the original logic might lead to removal of segments still in use. To fix, have the truncation routine use a tailored SlruScanDirectory callback that keeps track of the live area in actual use; that way, when the live range exceeds 2^31 entries, the oldest segments still live will not get removed untimely. This new SlruScanDir callback needs to take care not to remove segments that are "in the future": if new SLRU segments appear while the truncation is ongoing, make sure we don't remove them. This requires examination of shared memory state to recheck for false positives, but testing suggests that this doesn't cause a problem. The original coding didn't suffer from this pitfall because segments created when truncation is running are never considered to be removable. Per Andres Freund's investigation of bug #8673 reported by Serge Negodyuck.
* Fix broken support for event triggers as extension members.Tom Lane2013-12-30
| | | | | | | | | CREATE EVENT TRIGGER forgot to mark the event trigger as a member of its extension, and pg_dump didn't pay any attention anyway when deciding whether to dump the event trigger. Per report from Moshe Jacobson. Given the obvious lack of testing here, it's rather astonishing that ALTER EXTENSION ADD/DROP EVENT TRIGGER work, but they seem to.
* Properly detect invalid JSON numbers when generating JSON.Andrew Dunstan2013-12-27
| | | | | | | | | | | Instead of looking for characters that aren't valid in JSON numbers, we simply pass the output string through the JSON number parser, and if it fails the string is quoted. This means among other things that money and domains over money will be quoted correctly and generate valid JSON. Fixes bug #8676 reported by Anderson Cristian da Silva. Backpatched to 9.2 where JSON generation was introduced.
* Fix misplaced right paren bugs in pgstatfuncs.c.Kevin Grittner2013-12-27
| | | | | | | | | | | | | | The bug would only show up if the C sockaddr structure contained zero in the first byte for a valid address; otherwise it would fail to fail, which is probably why it went unnoticed for so long. Patch submitted by Joel Jacobson after seeing an article by Andrey Karpov in which he reports finding this through static code analysis using PVS-Studio. While I was at it I moved a definition of a local variable referenced in the buggy code to a more local context. Backpatch to all supported branches.
* Fix ANALYZE failure on a column that's a domain over a range.Tom Lane2013-12-23
| | | | | | Most other range operations seem to work all right on domains, but this one not so much, at least not since commit 918eee0c. Per bug #8684 from Brett Neumeier.
* Avoid useless palloc during transaction commitAlvaro Herrera2013-12-20
| | | | | | | | | | | | | | | | We can allocate the initial relations-to-drop array when first needed, instead of at function entry; this avoids allocating it when the function is not going to do anything, which is most of the time. Backpatch to 9.3, where this behavior was introduced by commit 279628a0a7cf5. There's more that could be done here, such as possible reworking of the code to avoid having to palloc anything, but that doesn't sound as backpatchable as this relatively minor change. Per complaint from Noah Misch in 20131031145234.GA621493@tornado.leadboat.com
* Optimize updating a row that's locked by same xidAlvaro Herrera2013-12-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Updating or locking a row that was already locked by the same transaction under the same Xid caused a MultiXact to be created; but this is unnecessary, because there's no usefulness in being able to differentiate two locks by the same transaction. In particular, if a transaction executed SELECT FOR UPDATE followed by an UPDATE that didn't modify columns of the key, we would dutifully represent the resulting combination as a multixact -- even though a single key-update is sufficient. Optimize the case so that only the strongest of both locks/updates is represented in Xmax. This can save some Xmax's from becoming MultiXacts, which can be a significant optimization. This missed optimization opportunity was spotted by Andres Freund while investigating a bug reported by Oliver Seemann in message CANCipfpfzoYnOz5jj=UZ70_R=CwDHv36dqWSpwsi27vpm1z5sA@mail.gmail.com and also directly as a performance regression reported by Dong Ye in message d54b8387.000012d8.00000010@YED-DEVD1.vmware.com Reportedly, this patch fixes the performance regression. Since the missing optimization was reported as a significant performance regression from 9.2, backpatch to 9.3. Andres Freund, tweaked by Álvaro Herrera
* Don't ignore tuple locks propagated by our updatesAlvaro Herrera2013-12-18
| | | | | | | | | | | | | | | | | | | | | | If a tuple was locked by transaction A, and transaction B updated it, the new version of the tuple created by B would be locked by A, yet visible only to B; due to an oversight in HeapTupleSatisfiesUpdate, the lock held by A wouldn't get checked if transaction B later deleted (or key-updated) the new version of the tuple. This might cause referential integrity checks to give false positives (that is, allow deletes that should have been rejected). This is an easy oversight to have made, because prior to improved tuple locks in commit 0ac5ad5134f it wasn't possible to have tuples created by our own transaction that were also locked by remote transactions, and so locks weren't even considered in that code path. It is recommended that foreign keys be rechecked manually in bulk after installing this update, in case some referenced rows are missing with some referencing row remaining. Per bug reported by Daniel Wood in CAPweHKe5QQ1747X2c0tA=5zf4YnS2xcvGf13Opd-1Mq24rF1cQ@mail.gmail.com
* Rework tuple freezing protocolAlvaro Herrera2013-12-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tuple freezing was broken in connection to MultiXactIds; commit 8e53ae025de9 tried to fix it, but didn't go far enough. As noted by Noah Misch, freezing a tuple whose Xmax is a multi containing an aborted update might cause locks in the multi to go ignored by later transactions. This is because the code depended on a multixact above their cutoff point not having any lock-only member older than the cutoff point for Xids, which is easily defeated in READ COMMITTED transactions. The fix for this involves creating a new MultiXactId when necessary. But this cannot be done during WAL replay, and moreover multixact examination requires using CLOG access routines which are not supposed to be used during WAL replay either; so tuple freezing cannot be done with the old freeze WAL record. Therefore, separate the freezing computation from its execution, and change the WAL record to carry all necessary information. At WAL replay time, it's easy to re-execute freezing because we don't need to re-compute the new infomask/Xmax values but just take them from the WAL record. While at it, restructure the coding to ensure all page changes occur in a single critical section without much room for failures. The previous coding wasn't using a critical section, without any explanation as to why this was acceptable. In replication scenarios using the 9.3 branch, standby servers must be upgraded before their master, so that they are prepared to deal with the new WAL record once the master is upgraded; failure to do so will cause WAL replay to die with a PANIC message. Later upgrade of the standby will allow the process to continue where it left off, so there's no disruption of the data in the standby in any case. Standbys know how to deal with the old WAL record, so it's okay to keep the master running the old code for a while. In master, the old freeze WAL record is gone, for cleanliness' sake; there's no compatibility concern there. Backpatch to 9.3, where the original bug was introduced and where the previous fix was backpatched. Álvaro Herrera and Andres Freund
* Fix inherited UPDATE/DELETE with UNION ALL subqueries.Tom Lane2013-12-14
| | | | | | | | | | | | | | | | | | | | | Fix an oversight in commit b3aaf9081a1a95c245fd605dcf02c91b3a5c3a29: we do indeed need to process the planner's append_rel_list when copying RTE subqueries, because if any of them were flattenable UNION ALL subqueries, the append_rel_list shows which subquery RTEs were pulled up out of which other ones. Without this, UNION ALL subqueries aren't correctly inserted into the update plans for inheritance child tables after the first one, typically resulting in no update happening for those child table(s). Per report from Victor Yegorov. Experimentation with this case also exposed a fault in commit a7b965382cf0cb30aeacb112572718045e6d4be7: if an inherited UPDATE/DELETE was proven totally dummy by constraint exclusion, we might arrive at add_rtes_to_flat_rtable with root->simple_rel_array being NULL. This should be interpreted as not having any RelOptInfos. I chose to code the guard as a check against simple_rel_array_size, so as to also provide some protection against indexing off the end of the array. Back-patch to 9.2 where the faulty code was added.
* Fix typoAlvaro Herrera2013-12-13
|
* Rework MultiXactId cache codeAlvaro Herrera2013-12-13
| | | | | | | | | | | The original performs too poorly; in some scenarios it shows way too high while profiling. Try to make it a bit smarter to avoid excessive cosst. In particular, make it have a maximum size, and have entries be sorted in LRU order; once the max size is reached, evict the oldest entry to avoid it from growing too large. Per complaint from Andres Freund in connection with new tuple freezing code.
* Add HOLD/RESUME_INTERRUPTS in HandleCatchupInterrupt/HandleNotifyInterrupt.Tom Lane2013-12-13
| | | | | | | | | | | | | | | | | | This prevents a possible longjmp out of the signal handler if a timeout or SIGINT occurs while something within the handler has transiently set ImmediateInterruptOK. For safety we must hold off the timeout or cancel error until we're back in mainline, or at least till we reach the end of the signal handler when ImmediateInterruptOK was true at entry. This syncs these functions with the logic now present in handle_sig_alarm. AFAICT there is no live bug here in 9.0 and up, because I don't think we currently can wait for any heavyweight lock inside these functions, and there is no other code (except read-from-client) that will turn on ImmediateInterruptOK. However, that was not true pre-9.0: in older branches ProcessIncomingNotify might block trying to lock pg_listener, and then a SIGINT could lead to undesirable control flow. It might be all right anyway given the relatively narrow code ranges in which NOTIFY interrupts are enabled, but for safety's sake I'm back-patching this.
* Don't let timeout interrupts happen unless ImmediateInterruptOK is set.Tom Lane2013-12-13
| | | | | | | | | | | | | Serious oversight in commit 16e1b7a1b7f7ffd8a18713e83c8cd72c9ce48e07: we should not allow an interrupt to take control away from mainline code except when ImmediateInterruptOK is set. Just to be safe, let's adopt the same save-clear-restore dance that's been used for many years in HandleCatchupInterrupt and HandleNotifyInterrupt, so that nothing bad happens if a timeout handler invokes code that tests or even manipulates ImmediateInterruptOK. Per report of "stuck spinlock" failures from Christophe Pettus, though many other symptoms are possible. Diagnosis by Andres Freund.
* Fix WAL-logging of setting the visibility map bit.Heikki Linnakangas2013-12-13
| | | | | | | | | The operation that removes the remaining dead tuples from the page must be WAL-logged before the setting of the VM bit. Otherwise, if you replay the WAL to between those two records, you end up with the VM bit set, but the dead tuples are still there. Backpatch to 9.3, where this bug was introduced.
* Fix ancient docs/comments thinko: XID comparison is mod 2^32, not 2^31.Tom Lane2013-12-12
| | | | Pointed out by Gianni Ciolli.
* Fix possible crash with nested SubLinks.Tom Lane2013-12-10
| | | | | | | | | | | | | An expression such as WHERE (... x IN (SELECT ...) ...) IN (SELECT ...) could produce an invalid plan that results in a crash at execution time, if the planner attempts to flatten the outer IN into a semi-join. This happens because convert_testexpr() was not expecting any nested SubLinks and would wrongly replace any PARAM_SUBLINK Params belonging to the inner SubLink. (I think the comment denying that this case could happen was wrong when written; it's certainly been wrong for quite a long time, since very early versions of the semijoin flattening logic.) Per report from Teodor Sigaev. Back-patch to all supported branches.
* Fix improper abort during update chain lockingAlvaro Herrera2013-12-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In 247c76a98909, I added some code to do fine-grained checking of MultiXact status of locking/updating transactions when traversing an update chain. There was a thinko in that patch which would have the traversing abort, that is return HeapTupleUpdated, when the other transaction is a committed lock-only. In this case we should ignore it and return success instead. Of course, in the case where there is a committed update, HeapTupleUpdated is the correct return value. A user-visible symptom of this bug is that in REPEATABLE READ and SERIALIZABLE transaction isolation modes spurious serializability errors can occur: ERROR: could not serialize access due to concurrent update In order for this to happen, there needs to be a tuple that's key-share- locked and also updated, and the update must abort; a subsequent transaction trying to acquire a new lock on that tuple would abort with the above error. The reason is that the initial FOR KEY SHARE is seen as committed by the new locking transaction, which triggers this bug. (If the UPDATE commits, then the serialization error is correctly reported.) When running a query in READ COMMITTED mode, what happens is that the locking is aborted by the HeapTupleUpdated return value, then EvalPlanQual fetches the newest version of the tuple, which is then the only version that gets locked. (The second time the tuple is checked there is no misbehavior on the committed lock-only, because it's not checked by the code that traverses update chains; so no bug.) Only the newest version of the tuple is locked, not older ones, but this is harmless. The isolation test added by this commit illustrates the desired behavior, including the proper serialization errors that get thrown. Backpatch to 9.3.
* Clear retry flags properly in replacement OpenSSL sock_write function.Tom Lane2013-12-05
| | | | | | | | | | | | Current OpenSSL code includes a BIO_clear_retry_flags() step in the sock_write() function. Either we failed to copy the code correctly, or they added this since we copied it. In any case, lack of the clear step appears to be the cause of the server lockup after connection loss reported in bug #8647 from Valentine Gogichashvili. Assume that this is correct coding for all OpenSSL versions, and hence back-patch to all supported branches. Diagnosis and patch by Alexander Kukushkin.
* Avoid resetting Xmax when it's a multi with an aborted updateAlvaro Herrera2013-12-05
| | | | | | | | | | | | | | | | | | HeapTupleSatisfiesUpdate can very easily "forget" tuple locks while checking the contents of a multixact and finding it contains an aborted update, by setting the HEAP_XMAX_INVALID bit. This would lead to concurrent transactions not noticing any previous locks held by transactions that might still be running, and thus being able to acquire subsequent locks they wouldn't be normally able to acquire. This bug was introduced in commit 1ce150b7bb; backpatch this fix to 9.3, like that commit. This change reverts the change to the delete-abort-savept isolation test in 1ce150b7bb, because that behavior change was caused by this bug. Noticed by Andres Freund while investigating a different issue reported by Noah Misch.
* Fix full-page writes of internal GIN pages.Heikki Linnakangas2013-12-03
| | | | | | | | | | | | | | | Insertion to a non-leaf GIN page didn't make a full-page image of the page, which is wrong. The code used to do it correctly, but was changed (commit 853d1c3103fa961ae6219f0281885b345593d101) because the redo-routine didn't track incomplete splits correctly when the page was restored from a full page image. Of course, that was not right way to fix it, the redo routine should've been fixed instead. The redo-routine was surreptitiously fixed in 2010 (commit 4016bdef8aded77b4903c457050622a5a1815c16), so all we need to do now is revert the code that creates the record to its original form. This doesn't change the format of the WAL record. Backpatch to all supported versions.
* Fix crash in assign_collations_walker for EXISTS with empty SELECT list.Tom Lane2013-12-02
| | | | | We (I think I, actually) forgot about this corner case while coding collation resolution. Per bug #8648 from Arjen Nienhuis.
* Translation updatesPeter Eisentraut2013-12-02
|
* Fix a couple of bugs in MultiXactId freezingAlvaro Herrera2013-11-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look into a multixact to check the members against cutoff_xid. This means that a very old Xid could survive hidden within a multi, possibly outliving its CLOG storage. In the distant future, this would cause clog lookup failures: ERROR: could not access status of transaction 3883960912 DETAIL: Could not open file "pg_clog/0E78": No such file or directory. This mostly was problematic when the updating transaction aborted, since in that case the row wouldn't get pruned away earlier in vacuum and the multixact could possibly survive for a long time. In many cases, data that is inaccessible for this reason way can be brought back heuristically. As a second bug, heap_freeze_tuple() didn't properly handle multixacts that need to be frozen according to cutoff_multi, but whose updater xid is still alive. Instead of preserving the update Xid, it just set Xmax invalid, which leads to both old and new tuple versions becoming visible. This is pretty rare in practice, but a real threat nonetheless. Existing corrupted rows, unfortunately, cannot be repaired in an automated fashion. Existing physical replicas might have already incorrectly frozen tuples because of different behavior than in master, which might only become apparent in the future once pg_multixact/ is truncated; it is recommended that all clones be rebuilt after upgrading. Following code analysis caused by bug report by J Smith in message CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com and privately by F-Secure. Backpatch to 9.3, where freezing of MultiXactIds was introduced. Analysis and patch by Andres Freund, with some tweaks by Álvaro.
* Don't TransactionIdDidAbort in HeapTupleGetUpdateXidAlvaro Herrera2013-11-29
| | | | | | | | | | | | | | | | It is dangerous to do so, because some code expects to be able to see what's the true Xmax even if it is aborted (particularly while traversing HOT chains). So don't do it, and instead rely on the callers to verify for abortedness, if necessary. Several race conditions and bugs fixed in the process. One isolation test changes the expected output due to these. This also reverts commit c235a6a589b, which is no longer necessary. Backpatch to 9.3, where this function was introduced. Andres Freund
* Truncate pg_multixact/'s contents during crash recoveryAlvaro Herrera2013-11-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 9dc842f08 of 8.2 era prevented MultiXact truncation during crash recovery, because there was no guarantee that enough state had been setup, and because it wasn't deemed to be a good idea to remove data during crash recovery anyway. Since then, due to Hot-Standby, streaming replication and PITR, the amount of time a cluster can spend doing crash recovery has increased significantly, to the point that a cluster may even never come out of it. This has made not truncating the content of pg_multixact/ not defensible anymore. To fix, take care to setup enough state for multixact truncation before crash recovery starts (easy since checkpoints contain the required information), and move the current end-of-recovery actions to a new TrimMultiXact() function, analogous to TrimCLOG(). At some later point, this should probably done similarly to the way clog.c is doing it, which is to just WAL log truncations, but we can't do that for the back branches. Back-patch to 9.0. 8.4 also has the problem, but since there's no hot standby there, it's much less pressing. In 9.2 and earlier, this patch is simpler than in newer branches, because multixact access during recovery isn't required. Add appropriate checks to make sure that's not happening. Andres Freund
* Fix full-table-vacuum request mechanism for MultiXactIdsAlvaro Herrera2013-11-29
| | | | | | | | | | | | | | | | | While autovacuum dutifully launched anti-multixact-wraparound vacuums when the multixact "age" was reached, the vacuum code was not aware that it needed to make them be full table vacuums. As the resulting partial-table vacuums aren't capable of actually increasing relminmxid, autovacuum continued to launch anti-wraparound vacuums that didn't have the intended effect, until age of relfrozenxid caused the vacuum to finally be a full table one via vacuum_freeze_table_age. To fix, introduce logic for multixacts similar to that for plain TransactionIds, using the same GUCs. Backpatch to 9.3, where permanent MultiXactIds were introduced. Andres Freund, some cleanup by Álvaro
* Replace hardcoded 200000000 with autovacuum_freeze_max_ageAlvaro Herrera2013-11-29
| | | | | | | | | | | | Parts of the code used autovacuum_freeze_max_age to determine whether anti-multixact-wraparound vacuums are necessary, while others used a hardcoded 200000000 value. This leads to problems when autovacuum_freeze_max_age is set to a non-default value. Use the latter everywhere. Backpatch to 9.3, where vacuuming of multixacts was introduced. Andres Freund
* Be sure to release proc->backendLock after SetupLockInTable() failure.Tom Lane2013-11-29
| | | | | | | | | | | | | | The various places that transferred fast-path locks to the main lock table neglected to release the PGPROC's backendLock if SetupLockInTable failed due to being out of shared memory. In most cases this is no big deal since ensuing error cleanup would release all held LWLocks anyway. But there are some hot-standby functions that don't consider failure of FastPathTransferRelationLocks to be a hard error, and in those cases this oversight could lead to system lockup. For consistency, make all of these places look the same as FastPathTransferRelationLocks. Noted while looking for the cause of Dan Wood's bugs --- this wasn't it, but it's a bug anyway.
* Fix assorted race conditions in the new timeout infrastructure.Tom Lane2013-11-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prevent handle_sig_alarm from losing control partway through due to a query cancel (either an asynchronous SIGINT, or a cancel triggered by one of the timeout handler functions). That would at least result in failure to schedule any required future interrupt, and might result in actual corruption of timeout.c's data structures, if the interrupt happened while we were updating those. We could still lose control if an asynchronous SIGINT arrives just as the function is entered. This wouldn't break any data structures, but it would have the same effect as if the SIGALRM interrupt had been silently lost: we'd not fire any currently-due handlers, nor schedule any new interrupt. To forestall that scenario, forcibly reschedule any pending timer interrupt during AbortTransaction and AbortSubTransaction. We can avoid any extra kernel call in most cases by not doing that until we've allowed LockErrorCleanup to kill the DEADLOCK_TIMEOUT and LOCK_TIMEOUT events. Another hazard is that some platforms (at least Linux and *BSD) block a signal before calling its handler and then unblock it on return. When we longjmp out of the handler, the unblock doesn't happen, and the signal is left blocked indefinitely. Again, we can fix that by forcibly unblocking signals during AbortTransaction and AbortSubTransaction. These latter two problems do not manifest when the longjmp reaches postgres.c, because the error recovery code there kills all pending timeout events anyway, and it uses sigsetjmp(..., 1) so that the appropriate signal mask is restored. So errors thrown outside any transaction should be OK already, and cleaning up in AbortTransaction and AbortSubTransaction should be enough to fix these issues. (We're assuming that any code that catches a query cancel error and doesn't re-throw it will do at least a subtransaction abort to clean up; but that was pretty much required already by other subsystems.) Lastly, ProcSleep should not clear the LOCK_TIMEOUT indicator flag when disabling that event: if a lock timeout interrupt happened after the lock was granted, the ensuing query cancel is still going to happen at the next CHECK_FOR_INTERRUPTS, and we want to report it as a lock timeout not a user cancel. Per reports from Dan Wood. Back-patch to 9.3 where the new timeout handling infrastructure was introduced. We may at some point decide to back-patch the signal unblocking changes further, but I'll desist from that until we hear actual field complaints about it.
* Fix latent(?) race condition in LockReleaseAll.Tom Lane2013-11-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We have for a long time checked the head pointer of each of the backend's proclock lists and skipped acquiring the corresponding locktable partition lock if the head pointer was NULL. This was safe enough in the days when proclock lists were changed only by the owning backend, but it is pretty questionable now that the fast-path patch added cases where backends add entries to other backends' proclock lists. However, we don't really wish to revert to locking each partition lock every time, because in simple transactions that would add a lot of useless lock/unlock cycles on already-heavily-contended LWLocks. Fortunately, the only way that another backend could be modifying our proclock list at this point would be if it was promoting a formerly fast-path lock of ours; and any such lock must be one that we'd decided not to delete in the previous loop over the locallock table. So it's okay if we miss seeing it in this loop; we'd just decide not to delete it again. However, once we've detected a non-empty list, we'd better re-fetch the list head pointer after acquiring the partition lock. This guards against possibly fetching a corrupt-but-non-null pointer if pointer fetch/store isn't atomic. It's not clear if any practical architectures are like that, but we've never assumed that before and don't wish to start here. In any case, the situation certainly deserves a code comment. While at it, refactor the partition traversal loop to use a for() construct instead of a while() loop with goto's. Back-patch, just in case the risk is real and not hypothetical.
* Use a more granular approach to follow update chainsAlvaro Herrera2013-11-28
| | | | | | | | Instead of simply checking the KEYS_UPDATED bit, we need to check whether each lock held on the future version of the tuple conflicts with the lock we're trying to acquire. Per bug report #8434 by Tomonari Katsumata