path: root/src/backend/access
Commit message (Author, Date)
...
* Silence warning about unused variable, when building without assertions. (Heikki Linnakangas, 2012-03-08)
* Typo fix. (Robert Haas, 2012-03-06)
  Fujii Masao
* Make the comments more clear on the fact that UpdateFullPageWrites() is not safe to call concurrently from multiple processes. (Heikki Linnakangas, 2012-03-06)
* Remove extra copies of LogwrtResult. (Heikki Linnakangas, 2012-03-06)
  This simplifies the code a little bit. The new rule is that to update XLogCtl->LogwrtResult, you must hold both WALWriteLock and info_lck, whereas before we had two copies, one that was protected by WALWriteLock and another protected by info_lck. The code that updates them was already holding both locks, so merging the two is trivial.
  The third copy, XLogCtl->Insert.LogwrtResult, was not totally redundant; it was used in AdvanceXLInsertBuffer to update the backend-local copy, before acquiring the info_lck to read the up-to-date value. But the value of that seems dubious; at best it's saving one spinlock acquisition per completed WAL page, which is not significant compared to all the other work involved. And in practice, it's probably not saving even that much.
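  For illustration, the new update rule might be sketched like this (an illustrative fragment only, assuming the surrounding xlog.c context; not the committed code):

      /* Sketch: the single shared copy may only be updated while holding both locks. */
      Assert(LWLockHeldByMe(WALWriteLock));
      SpinLockAcquire(&XLogCtl->info_lck);
      XLogCtl->LogwrtResult = LogwrtResult;   /* publish this backend's up-to-date local copy */
      SpinLockRelease(&XLogCtl->info_lck);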
* Simplify the way changes to full_page_writes are logged. (Heikki Linnakangas, 2012-03-06)
  It's harmless to do full page writes even when not strictly necessary, so when turning full_page_writes on, we can set the global flag first, and then call XLogInsert. Likewise, when turning it off, we can write the WAL record first, and then clear the flag. This way XLogInsert doesn't need any special handling of the XLOG_FPW_CHANGE record type. XLogInsert is complicated enough already, so anything we can keep away from there is a good thing.
  Actually I don't think the atomicity of the shared memory flag matters, anyway, because we only write the XLOG_FPW_CHANGE at the end of recovery, when there are no concurrent WAL insertions going on. But might as well make it safe, in case we allow changing full_page_writes on the fly in the future.
* More carefully validate xlog location string inputs (Magnus Hagander, 2012-03-04)
  Now that we have validate_xlog_location, call it from the previously existing functions taking xlog locations as a string input.
  Suggested by Fujii Masao
* Add function pg_xlog_location_diff to help comparisons (Magnus Hagander, 2012-03-04)
  Comparing two xlog locations is useful, for example, when calculating replication lag.
  Euler Taveira de Oliveira, reviewed by Fujii Masao, and some cleanups from me
* When a GiST page is split during index build, it might not have a buffer. (Heikki Linnakangas, 2012-03-02)
  Previously it was thought that it's impossible as the code stands, because insertions create buffers as tuples are cascaded downwards, and index split also creates buffers eagerly for all halves. But the example from Jay Levitt demonstrates that it can happen, when the root page is split.
  It's in fact OK if the buffer doesn't exist, so we just need to remove the sanity check. In fact, we've been discussing the possibility of destroying empty buffers to conserve memory, which would render the sanity check completely useless anyway.
  Fix by Alexander Korotkov
* Add const qualifiers where they are accidentally cast away (Peter Eisentraut, 2012-02-28)
  This only produces warnings under -Wcast-qual, but it's more correct and consistent in any case.
* Fix some more bugs in GIN's WAL replay logic. (Tom Lane, 2012-02-26)
  In commit 4016bdef8aded77b4903c457050622a5a1815c16 I fixed a bunch of ginxlog.c bugs having to do with not handling XLogReadBuffer failures correctly. However, in ginRedoUpdateMetapage and ginRedoDeleteListPages, I unaccountably thought that failure to read the metapage would be impossible and just put in an elog(PANIC) call. This is of course wrong: failure is exactly what will happen if the index got dropped (or rebuilt) between creation of the WAL record and the crash we're trying to recover from. I believe this explains Nicholas Wilson's recent report of these errors getting reached.
  Also, fix memory leak in forgetIncompleteSplit. This wasn't of much concern when the code was written, but in a long-running standby server page split records could be expected to accumulate indefinitely.
  Back-patch to 8.4 --- before that, GIN didn't have a metapage.
* Add some enumeration commas, for consistency (Peter Eisentraut, 2012-02-24)
* Don't clear btpo_cycleid during _bt_vacuum_one_page. (Tom Lane, 2012-02-21)
  When "vacuuming" a single btree page by removing LP_DEAD tuples, we are not actually within a vacuum operation, but rather in an ordinary insertion process that could well be running concurrently with a vacuum. So clearing the cycleid is incorrect, and could cause the concurrent vacuum to miss removing tuples that it needs to remove.
  This is a longstanding bug introduced by commit e6284649b9e30372b3990107a082bc7520325676 of 2006-07-25. I believe it explains Maxim Boguk's recent report of index corruption, and probably some other previously unexplained reports.
  In 9.0 and up this is a one-line fix; before that we need to introduce a flag to tell _bt_delitems what to do.
* Cosmetic cleanup for commit a760893dbda9934e287789d54bbd3c4ca3914ce0. (Tom Lane, 2012-02-21)
  Mostly, fixing overlooked comments.
* Fix heap_multi_insert to set t_self field in the caller's tuples. (Heikki Linnakangas, 2012-02-13)
  If tuples were toasted, heap_multi_insert didn't update the ctid on the original tuples. This caused a failure if there was an after trigger (including a foreign key) on the table and a tuple got toasted.
  Per off-list report and test case from Ted Phelps
* Throw error sooner for unlogged GiST indexes. (Tom Lane, 2012-02-08)
  Throwing an error only after we've built the main index fork is pretty unfriendly when the table already contains data. Per gripe from Jay Levitt.
* Rename LWLockWaitUntilFree to LWLockAcquireOrWait. (Heikki Linnakangas, 2012-02-08)
  LWLockAcquireOrWait makes it more clear that the lock is acquired if it's free.
* Add locking around WAL-replay modification of shared-memory variables. (Tom Lane, 2012-02-06)
  Originally, most of this code assumed that no Postgres backends could be running concurrently with it, and so no locking could be needed. That assumption fails in Hot Standby. While it's still true that Hot Standby backends should never change values like nextXid, they can examine them, and consistency is important in some cases such as when computing a snapshot. Therefore, prudence requires that WAL replay code obtain the relevant locks when modifying such variables, even though it can examine them without taking a lock. We were following that coding rule in some places but not all. This commit applies the coding rule uniformly to all updates of ShmemVariableCache and MultiXactState fields; a search of the replay routines did not find any other cases that seemed to be at risk.
  In addition, this commit fixes a longstanding thinko in replay of NEXTOID and checkpoint records: we tried to advance nextOid only if it was behind the value in the WAL record, but the comparison would draw the wrong conclusion if OID wraparound had occurred since the previous value. Better to just unconditionally assign the new value, since OID assignment shouldn't be happening during replay anyway.
  The additional locking seems to be more in the nature of future-proofing than fixing any live bug, so I am not going to back-patch it. The NEXTOID fix will be back-patched separately.
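  The coding rule can be pictured roughly like this (an illustrative sketch of the pattern in a redo routine, assuming the surrounding xlog.c context; not a quotation of the commit):

      /* Replay of a checkpoint record: take the same lock a live backend would. */
      LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
      ShmemVariableCache->nextXid = checkPoint.nextXid;
      LWLockRelease(XidGenLock);

      /* NEXTOID replay: adopt the logged value unconditionally, wraparound or not. */
      LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
      ShmemVariableCache->nextOid = nextOid;
      ShmemVariableCache->oidCount = 0;
      LWLockRelease(OidGenLock);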
* Fix transient clobbering of shared buffers during WAL replay. (Tom Lane, 2012-02-05)
  RestoreBkpBlocks was in the habit of zeroing and refilling the target buffer; which was perfectly safe when the code was written, but is unsafe during Hot Standby operation. The reason is that we have coding rules that allow backends to continue accessing a tuple in a heap relation while holding only a pin on its buffer. Such a backend could see transiently zeroed data, if WAL replay had occasion to change other data on the page. This has been shown to be the cause of bug #6425 from Duncan Rance (who deserves kudos for developing a sufficiently-reproducible test case) as well as Bridget Frey's re-report of bug #6200. It most likely explains the original report as well, though we don't yet have confirmation of that.
  To fix, change the code so that only bytes that are supposed to change will change, even transiently. This actually saves cycles in RestoreBkpBlocks, since it's not writing the same bytes twice.
  Also fix seq_redo, which has the same disease, though it has to work a bit harder to meet the requirement.
  So far as I can tell, no other WAL replay routines have this type of bug. In particular, the index-related replay routines, which would certainly be broken if they had to meet the same standard, are not at risk because we do not have coding rules that allow access to an index page when not holding a buffer lock on it.
  Back-patch to 9.0 where Hot Standby was added.
* Improve comment. (Tom Lane, 2012-02-04)
* Avoid re-checking for visibility map extension too frequently. (Robert Haas, 2012-02-01)
  When testing bits (but not when setting or clearing them), we now won't check whether the map has been extended. This significantly improves performance in the case where the visibility map doesn't exist yet, by avoiding an extra system call per tuple. To make sure backends notice eventually, send an smgr inval on VM extension.
  Dean Rasheed, with minor modifications by me.
* Make group commit more effective. (Heikki Linnakangas, 2012-01-30)
  When a backend needs to flush the WAL, and someone else is already flushing the WAL, wait until it releases the WALInsertLock and check if we still need to do the flush or if the other backend already did the work for us, before acquiring WALInsertLock. This helps group commit, because when the WAL flush finishes, all the backends that were waiting for it can be woken up in one go, and they can all concurrently observe that they're done, rather than waking them up one by one in a cascading fashion.
  This is based on a new LWLock function, LWLockWaitUntilFree(), which has peculiar semantics. If the lock is immediately free, it grabs the lock and returns true. If it's not free, it waits until it is released, but then returns false without grabbing the lock. This is used in XLogFlush(), so that when the lock is acquired, the backend flushes the WAL, but if it's not, the backend first checks the current flush location before retrying.
  Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although this patch as committed ended up being very different from that.
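  The resulting control flow in the flushing path can be sketched roughly as follows (an illustrative, simplified fragment under assumed xlog.c context; not the committed XLogFlush code):

      /* Flush WAL up to 'record', or piggyback on a flush already in progress. */
      for (;;)
      {
          if (XLByteLE(record, LogwrtResult.Flush))
              break;              /* someone already flushed past our record */

          /* Introduced here as LWLockWaitUntilFree; renamed to LWLockAcquireOrWait above. */
          if (LWLockAcquireOrWait(WALWriteLock, LW_EXCLUSIVE))
          {
              /* We got the lock: do the write/flush ourselves. */
              XLogWrite(WriteRqst, false, false);
              LWLockRelease(WALWriteLock);
              break;
          }
          /* Lock was busy: the holder flushed for us; loop and re-check. */
      }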
* Assorted comment fixes, mostly just typos, but some obsolete statements. (Tom Lane, 2012-01-29)
  YAMAMOTO Takashi
* Allow pg_basebackup from standby node with safety checking. (Simon Riggs, 2012-01-25)
  Base backup follows recommended procedure, plus goes to great lengths to ensure that partial page writes are avoided.
  Jun Ishizuka and Fujii Masao, with minor modifications
* fastgetattr is in access/htup.h, not access/heapam.h (Robert Haas, 2012-01-16)
  Noted by Peter Geoghegan
* Correctly initialise shared recoveryLastRecPtr in recovery. (Simon Riggs, 2012-01-13)
  Previously we used ReadRecPtr rather than EndRecPtr, which was not a serious error but caused pg_stat_replication to report incorrect replay_location until at least one WAL record is replayed.
  Fujii Masao
* Fix CLUSTER/VACUUM FULL for toast values owned by recently-updated rows. (Tom Lane, 2012-01-12)
  In commit 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05, I made CLUSTER and VACUUM FULL try to preserve toast value OIDs from the original toast table to the new one. However, if we have to copy both live and recently-dead versions of a row that has a toasted column, those versions may well reference the same toast value with the same OID. The patch then led to duplicate-key failures as we tried to insert the toast value twice with the same OID. (The previous behavior was not very desirable either, since it would have silently inserted the same value twice with different OIDs. That wastes space, but what's worse is that the toast values inserted for already-dead heap rows would not be reclaimed by subsequent ordinary VACUUMs, since they go into the new toast table marked live not deleted.)
  To fix, check if the copied OID already exists in the new toast table, and if so, assume that it stores the desired value. This is reasonably safe since the only case where we will copy an OID from a previous toast pointer is when toast_insert_or_update was given that toast pointer and so we just pulled the data from the old table; if we got two different values that way then we have big problems anyway. We do have to assume that no other backend is inserting items into the new toast table concurrently, but that's surely safe for CLUSTER and VACUUM FULL.
  Per bug #6393 from Maxim Boguk. Back-patch to 9.0, same as the previous patch.
* Remove useless 'needlock' argument from GetXLogInsertRecPtr. It was always passed as 'true'. (Heikki Linnakangas, 2012-01-11)
* Refactor XLogInsert a bit. The rdata entries for backup blocks are now constructed before acquiring WALInsertLock, which slightly reduces the time the lock is held. Although I could not measure any benefit in benchmarks, the code is more readable this way. (Heikki Linnakangas, 2012-01-11)
* Make the number of CLOG buffers adaptive, based on shared_buffers. (Robert Haas, 2012-01-06)
  Previously, this was hardcoded: we always had 8. Performance testing shows that isn't enough, especially on big SMP systems, so we allow it to scale up as high as 32 when there's adequate memory. On the flip side, when shared_buffers is very small, drop the number of CLOG buffers down to as little as 4, so that we can start the postmaster even when very little shared memory is available.
  Per extensive discussion with Simon Riggs, Tom Lane, and others on pgsql-hackers.
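  The sizing rule amounts to clamping a value derived from shared_buffers into the range [4, 32]; a sketch of such a function is below (the divisor shown is an assumption for illustration, not necessarily the committed value):

      /* Sketch: scale CLOG buffers with shared_buffers (NBuffers), clamped to [4, 32]. */
      Size
      CLOGShmemBuffers(void)
      {
          return Min(32, Max(4, NBuffers / 512));
      }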
* Update copyright notices for year 2012. (Bruce Momjian, 2012-01-01)
* Send new protocol keepalive messages to standby servers. (Simon Riggs, 2011-12-31)
  Allows streaming replication users to calculate transfer latency and apply delay via internal functions. No external functions yet.
* Rethink representation of index clauses' mapping to index columns. (Tom Lane, 2011-12-24)
  In commit e2c2c2e8b1df7dfdb01e7e6f6191a569ce3c3195 I made use of nested list structures to show which clauses went with which index columns, but on reflection that's a data structure that only an old-line Lisp hacker could love. Worse, it adds unnecessary complication to the many places that don't much care which clauses go with which index columns. Revert to the previous arrangement of flat lists of clauses, and instead add a parallel integer list of column numbers. The places that care about the pairing can chase both lists with forboth(), while the places that don't care just examine one list the same as before.
  The only real downside to this is that there are now two more lists that need to be passed to amcostestimate functions in case they care about column matching (which btcostestimate does, so not passing the info is not an option). Rather than deal with 11-argument amcostestimate functions, pass just the IndexPath and expect the functions to extract fields from it. That gets us down to 7 arguments which is better than 11, and it seems more future-proof against likely additions to the information we keep about an index path.
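  A parallel-list walk of that kind looks roughly like this (an illustrative fragment; the field names indexquals and indexqualcols follow the description above but are assumed here):

      ListCell   *lcc;
      ListCell   *lci;

      /* Walk the clause list and the parallel column-number list in lockstep. */
      forboth(lcc, path->indexquals, lci, path->indexqualcols)
      {
          RestrictInfo *rinfo = (RestrictInfo *) lfirst(lcc);
          int          indexcol = lfirst_int(lci);

          /* ... match rinfo against index column 'indexcol' ... */
      }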
* Add a security_barrier option for views. (Robert Haas, 2011-12-22)
  When a view is marked as a security barrier, it will not be pulled up into the containing query, and no quals will be pushed down into it, so that no function or operator chosen by the user can be applied to rows not exposed by the view. Views not configured with this option cannot provide robust row-level security, but will perform far better.
  Patch by KaiGai Kohei; original problem report by Heikki Linnakangas (in October 2009!). Review (in earlier versions) by Noah Misch and others. Design advice by Tom Lane and myself. Further review and cleanup by me.
* Avoid crashing when we have problems unlinking files post-commit. (Tom Lane, 2011-12-20)
  smgrdounlink takes care to not throw an ERROR if it fails to unlink something, but that caution was rendered useless by commit 3396000684b41e7e9467d1abc67152b39e697035, which put an smgrexists call in front of it; smgrexists *does* throw error if anything looks funny, such as getting a permissions error from trying to open the file. If that happens post-commit, you get a PANIC, and what's worse the same logic appears in the WAL replay code, so the database even fails to restart.
  Restore the intended behavior by removing the smgrexists call --- it isn't accomplishing anything that we can't do better by adjusting mdunlink's ideas of whether it ought to warn about ENOENT or not.
  Per report from Joseph Shraibman of unrecoverable crash after trying to drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions. Backpatch to 8.4, where the bogus coding was introduced.
* Add support for privileges on types (Peter Eisentraut, 2011-12-20)
  This adds support for the more or less SQL-conforming USAGE privilege on types and domains. The intent is to be able to restrict which users can create dependencies on types, which restricts the way in which owners can alter types.
  reviewed by Yeb Havinga
* Rename updateNodeLink to spgUpdateNodeLink. (Tom Lane, 2011-12-19)
  On reflection, the original name seems way too generic for a global symbol. A quick check shows this is the only exported function name in SP-GiST that doesn't begin with "spg" or contain "SpGist", so the rest of them seem all right.
* Teach SP-GiST to do index-only scans. (Tom Lane, 2011-12-19)
  Operator classes can specify whether or not they support this; this preserves the flexibility to use lossy representations within an index.
  In passing, move constant data about a given index into the rd_amcache cache area, instead of doing fresh lookups each time we start an index operation. This is mainly to try to make sure that spgcanreturn() has insignificant cost; I still don't have any proof that it matters for actual index accesses. Also, get rid of useless copying of FmgrInfo pointers; we can perfectly well use the relcache's versions in-place.
* Replace simple constant pg_am.amcanreturn with an AM support function. (Tom Lane, 2011-12-18)
  The need for this was debated when we put in the index-only-scan feature, but at the time we had no near-term expectation of having AMs that could support such scans for only some indexes; so we kept it simple. However, the SP-GiST AM forces the issue, so let's fix it.
  This patch only installs the new API; no behavior actually changes.
* Defend against null scankeys in spgist searches. (Tom Lane, 2011-12-17)
  Should've thought of that one earlier.
* Fix some long-obsolete references to XLogOpenRelation. (Tom Lane, 2011-12-17)
  These were missed in commit a213f1ee6c5a1bbe1f074ca201975e76ad2ed50c, which removed that function.
* Fix compiler warning seen on 64-bit machine. (Tom Lane, 2011-12-17)
* Add SP-GiST (space-partitioned GiST) index access method. (Tom Lane, 2011-12-17)
  SP-GiST is comparable to GiST in flexibility, but supports non-balanced partitioned search structures rather than balanced trees. As described at PGCon 2011, this new indexing structure can beat GiST in both index build time and query speed for search problems that it is well matched to. There are a number of areas that could still use improvement, but at this point the code seems committable.
  Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane
* Move BKP_REMOVABLE bit from individual WAL records to WAL page headers. (Tom Lane, 2011-12-12)
  Removing this bit from xl_info allows us to restore the old limit of four (not three) separate pages touched by a WAL record, which is needed for the upcoming SP-GiST feature, and will likely be useful elsewhere in future.
  When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that because no special WAL-visible action was taken when starting a backup. However, now we force a segment switch when starting a backup, so a compressing WAL archiver (such as pglesslog) that uses the state shown in the current page header will not be fooled as to removability of backup blocks. The only downside is that the archiver will not return to compressing mode for up to one WAL page after the backup is over, which is a small price to pay for getting back the extra xl_info bit. In any case the archiver could look for XLOG_BACKUP_END records if it thought it was worth the trouble to do so.
  Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.
* Don't set reachedMinRecoveryPoint during crash recovery. (Heikki Linnakangas, 2011-12-09)
  In crash recovery, we don't reach consistency before replaying all of the WAL. Rename the variable to reachedConsistency, to make its intention clearer.
  In master, that was an active bug because of the recent patch to immediately PANIC if a reference to a missing page is found in WAL after reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0, the only consequence was a misleading "consistent recovery state reached at %X/%X" message in the log at the beginning of crash recovery (the database is not consistent at that point yet). In 8.4, the log message was not printed in crash recovery, even though there was a similar reachedMinRecoveryPoint local variable that was also set early.
  So, backpatch to 9.1 and 9.0.
* Create a "sort support" interface API for faster sorting. (Tom Lane, 2011-12-07)
  This patch creates an API whereby a btree index opclass can optionally provide non-SQL-callable support functions for sorting. In the initial patch, we only use this to provide a directly-callable comparator function, which can be invoked with a bit less overhead than the traditional SQL-callable comparator. While that should be of value in itself, the real reason for doing this is to provide a datatype-extensible framework for more aggressive optimizations, as in Peter Geoghegan's recent work.
  Robert Haas and Tom Lane
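  For a simple fixed-width type, such a support function might look roughly like the sketch below (modeled on the integer opclasses; an illustrative example, not a quotation of the committed code):

      #include "postgres.h"
      #include "fmgr.h"
      #include "utils/sortsupport.h"

      /* Directly-callable comparator: no FunctionCallInfo setup per comparison. */
      static int
      btint4fastcmp(Datum x, Datum y, SortSupport ssup)
      {
          int32       a = DatumGetInt32(x);
          int32       b = DatumGetInt32(y);

          if (a > b)
              return 1;
          else if (a == b)
              return 0;
          else
              return -1;
      }

      /* SQL-callable sort support function: just installs the raw comparator. */
      Datum
      btint4sortsupport(PG_FUNCTION_ARGS)
      {
          SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);

          ssup->comparator = btint4fastcmp;
          PG_RETURN_VOID();
      }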
* During recovery, if we reach consistent state and still have entries in the invalid-page hash table, PANIC immediately. (Heikki Linnakangas, 2011-12-02)
  Immediate PANIC is much better than waiting for end-of-recovery, which is what we did before, because the end-of-recovery might not come until months later if this is a standby server.
  Also refrain from creating a restartpoint if there are invalid-page entries in the hash table. Restarting recovery from such a restartpoint would not see the invalid references, and wouldn't be able to cross-check them when consistency is reached. That wouldn't matter when things are going smoothly, but the more sanity checks you have the better.
  Fujii Masao
* Improve table locking behavior in the face of concurrent DDL. (Robert Haas, 2011-11-30)
  In the previous coding, callers were faced with an awkward choice: look up the name, do permissions checks, and then lock the table; or look up the name, lock the table, and then do permissions checks. The first choice was wrong because the results of the name lookup and permissions checks might be out-of-date by the time the table lock was acquired, while the second allowed a user with no privileges to interfere with access to a table by users who do have privileges (e.g. if a malicious backend queues up for an AccessExclusiveLock on a table on which AccessShareLock is already held, further attempts to access the table will be blocked until the AccessExclusiveLock is obtained and the malicious backend's transaction rolls back).
  To fix, allow callers of RangeVarGetRelid() to pass a callback which gets executed after performing the name lookup but before acquiring the relation lock. If the name lookup is retried (because invalidation messages are received), the callback will be re-executed as well, so we get the best of both worlds. RangeVarGetRelid() is renamed to RangeVarGetRelidExtended(); callers not wishing to supply a callback can continue to invoke it as RangeVarGetRelid(), which is now a macro. Since the only one caller that uses nowait = true now passes a callback anyway, the RangeVarGetRelid() macro defaults nowait as well. The callback can also be used for supplemental locking - for example, REINDEX INDEX needs to acquire the table lock before the index lock to reduce deadlock possibilities.
  There's a lot more work to be done here to fix all the cases where this can be a problem, but this commit provides the general infrastructure and fixes the following specific cases: REINDEX INDEX, REINDEX TABLE, LOCK TABLE, and DROP TABLE/INDEX/SEQUENCE/VIEW/FOREIGN TABLE.
  Per discussion with Noah Misch and Alvaro Herrera.
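  A hypothetical caller of the new interface might look roughly like this (an illustrative sketch; the callback and function signatures shown are assumptions based on the description above):

      /* Callback: runs after each name lookup, before the relation lock is taken,
       * and again whenever the lookup is retried, so the checks stay current. */
      static void
      my_lookup_callback(const RangeVar *relation, Oid relId, Oid oldRelId,
                         void *arg)
      {
          if (!OidIsValid(relId))
              return;             /* relation not found; let the caller handle it */
          if (!pg_class_ownercheck(relId, GetUserId()))
              aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_CLASS,
                             relation->relname);
      }

      /* Look up, (re)check, then lock -- with the check repeated on lookup retries. */
      relid = RangeVarGetRelidExtended(relation, AccessExclusiveLock,
                                       false,   /* missing_ok */
                                       false,   /* nowait */
                                       my_lookup_callback, NULL);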
* Use IEEE infinity, not 1e10, for null-and-not-null case in gistpenalty(). (Tom Lane, 2011-11-27)
  Use of a randomly chosen large value was never exactly graceful, and now that there are penalty functions that are intentionally using infinity, it doesn't seem like a good idea for null-vs-not-null to be using something less.
* Take fillfactor into account in the new COPY bulk heap insert code. (Heikki Linnakangas, 2011-11-26)
  Jeff Janes
* Fix erroneous replay of GIN_UPDATE_META_PAGE WAL records. (Tom Lane, 2011-11-25)
  A simple thinko in ginRedoUpdateMetapage, namely failing to increment a loop counter, led to inserting records into the last pending-list page in the wrong order (the opposite of that intended). So far as I can tell, this would not upset the code that eventually flushes pending items into the main part of the GIN index. But it did break the code that searched the pending list for matches, resulting in transient failure to find matching entries during index lookups, as illustrated in bug #6307 from Maksym Boguk.
  Back-patch to 8.4 where the incorrect code was introduced.