aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
* Make the number of CLOG buffers adaptive, based on shared_buffers.Robert Haas2012-01-06
| | | | | | | | | | | | Previously, this was hardcoded: we always had 8. Performance testing shows that isn't enough, especially on big SMP systems, so we allow it to scale up as high as 32 when there's adequate memory. On the flip side, when shared_buffers is very small, drop the number of CLOG buffers down to as little as 4, so that we can start the postmaster even when very little shared memory is available. Per extensive discussion with Simon Riggs, Tom Lane, and others on pgsql-hackers.
* Update copyright notices for year 2012.Bruce Momjian2012-01-01
|
* Send new protocol keepalive messages to standby servers.Simon Riggs2011-12-31
| | | | | Allows streaming replication users to calculate transfer latency and apply delay via internal functions. No external functions yet.
* Rethink representation of index clauses' mapping to index columns.Tom Lane2011-12-24
| | | | | | | | | | | | | | | | | | | | | In commit e2c2c2e8b1df7dfdb01e7e6f6191a569ce3c3195 I made use of nested list structures to show which clauses went with which index columns, but on reflection that's a data structure that only an old-line Lisp hacker could love. Worse, it adds unnecessary complication to the many places that don't much care which clauses go with which index columns. Revert to the previous arrangement of flat lists of clauses, and instead add a parallel integer list of column numbers. The places that care about the pairing can chase both lists with forboth(), while the places that don't care just examine one list the same as before. The only real downside to this is that there are now two more lists that need to be passed to amcostestimate functions in case they care about column matching (which btcostestimate does, so not passing the info is not an option). Rather than deal with 11-argument amcostestimate functions, pass just the IndexPath and expect the functions to extract fields from it. That gets us down to 7 arguments which is better than 11, and it seems more future-proof against likely additions to the information we keep about an index path.
* Add a security_barrier option for views.Robert Haas2011-12-22
| | | | | | | | | | | | | | When a view is marked as a security barrier, it will not be pulled up into the containing query, and no quals will be pushed down into it, so that no function or operator chosen by the user can be applied to rows not exposed by the view. Views not configured with this option cannot provide robust row-level security, but will perform far better. Patch by KaiGai Kohei; original problem report by Heikki Linnakangas (in October 2009!). Review (in earlier versions) by Noah Misch and others. Design advice by Tom Lane and myself. Further review and cleanup by me.
* Avoid crashing when we have problems unlinking files post-commit.Tom Lane2011-12-20
| | | | | | | | | | | | | | | | | | smgrdounlink takes care to not throw an ERROR if it fails to unlink something, but that caution was rendered useless by commit 3396000684b41e7e9467d1abc67152b39e697035, which put an smgrexists call in front of it; smgrexists *does* throw error if anything looks funny, such as getting a permissions error from trying to open the file. If that happens post-commit, you get a PANIC, and what's worse the same logic appears in the WAL replay code, so the database even fails to restart. Restore the intended behavior by removing the smgrexists call --- it isn't accomplishing anything that we can't do better by adjusting mdunlink's ideas of whether it ought to warn about ENOENT or not. Per report from Joseph Shraibman of unrecoverable crash after trying to drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions. Backpatch to 8.4, where the bogus coding was introduced.
* Add support for privileges on typesPeter Eisentraut2011-12-20
| | | | | | | | | This adds support for the more or less SQL-conforming USAGE privilege on types and domains. The intent is to be able restrict which users can create dependencies on types, which restricts the way in which owners can alter types. reviewed by Yeb Havinga
* Rename updateNodeLink to spgUpdateNodeLink.Tom Lane2011-12-19
| | | | | | | On reflection, the original name seems way too generic for a global symbol. A quick check shows this is the only exported function name in SP-GiST that doesn't begin with "spg" or contain "SpGist", so the rest of them seem all right.
* Teach SP-GiST to do index-only scans.Tom Lane2011-12-19
| | | | | | | | | | | | Operator classes can specify whether or not they support this; this preserves the flexibility to use lossy representations within an index. In passing, move constant data about a given index into the rd_amcache cache area, instead of doing fresh lookups each time we start an index operation. This is mainly to try to make sure that spgcanreturn() has insignificant cost; I still don't have any proof that it matters for actual index accesses. Also, get rid of useless copying of FmgrInfo pointers; we can perfectly well use the relcache's versions in-place.
* Replace simple constant pg_am.amcanreturn with an AM support function.Tom Lane2011-12-18
| | | | | | | | | The need for this was debated when we put in the index-only-scan feature, but at the time we had no near-term expectation of having AMs that could support such scans for only some indexes; so we kept it simple. However, the SP-GiST AM forces the issue, so let's fix it. This patch only installs the new API; no behavior actually changes.
* Defend against null scankeys in spgist searches.Tom Lane2011-12-17
| | | | Should've thought of that one earlier.
* Fix some long-obsolete references to XLogOpenRelation.Tom Lane2011-12-17
| | | | | These were missed in commit a213f1ee6c5a1bbe1f074ca201975e76ad2ed50c, which removed that function.
* Fix compiler warning seen on 64-bit machine.Tom Lane2011-12-17
|
* Add SP-GiST (space-partitioned GiST) index access method.Tom Lane2011-12-17
| | | | | | | | | | | | SP-GiST is comparable to GiST in flexibility, but supports non-balanced partitioned search structures rather than balanced trees. As described at PGCon 2011, this new indexing structure can beat GiST in both index build time and query speed for search problems that it is well matched to. There are a number of areas that could still use improvement, but at this point the code seems committable. Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane
* Move BKP_REMOVABLE bit from individual WAL records to WAL page headers.Tom Lane2011-12-12
| | | | | | | | | | | | | | | | | | | Removing this bit from xl_info allows us to restore the old limit of four (not three) separate pages touched by a WAL record, which is needed for the upcoming SP-GiST feature, and will likely be useful elsewhere in future. When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that because no special WAL-visible action was taken when starting a backup. However, now we force a segment switch when starting a backup, so a compressing WAL archiver (such as pglesslog) that uses the state shown in the current page header will not be fooled as to removability of backup blocks. The only downside is that the archiver will not return to compressing mode for up to one WAL page after the backup is over, which is a small price to pay for getting back the extra xl_info bit. In any case the archiver could look for XLOG_BACKUP_END records if it thought it was worth the trouble to do so. Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.
* Don't set reachedMinRecoveryPoint during crash recovery. In crash recovery,Heikki Linnakangas2011-12-09
| | | | | | | | | | | | | | | we don't reach consistency before replaying all of the WAL. Rename the variable to reachedConsistency, to make its intention clearer. In master, that was an active bug because of the recent patch to immediately PANIC if a reference to a missing page is found in WAL after reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0, the only consequence was a misleading "consistent recovery state reached at %X/%X" message in the log at the beginning of crash recovery (the database is not consistent at that point yet). In 8.4, the log message was not printed in crash recovery, even though there was a similar reachedMinRecoveryPoint local variable that was also set early. So, backpatch to 9.1 and 9.0.
* Create a "sort support" interface API for faster sorting.Tom Lane2011-12-07
| | | | | | | | | | | | This patch creates an API whereby a btree index opclass can optionally provide non-SQL-callable support functions for sorting. In the initial patch, we only use this to provide a directly-callable comparator function, which can be invoked with a bit less overhead than the traditional SQL-callable comparator. While that should be of value in itself, the real reason for doing this is to provide a datatype-extensible framework for more aggressive optimizations, as in Peter Geoghegan's recent work. Robert Haas and Tom Lane
* During recovery, if we reach consistent state and still have entries in theHeikki Linnakangas2011-12-02
| | | | | | | | | | | | | | | invalid-page hash table, PANIC immediately. Immediate PANIC is much better than waiting for end-of-recovery, which is what we did before, because the end-of-recovery might not come until months later if this is a standby server. Also refrain from creating a restartpoint if there are invalid-page entries in the hash table. Restarting recovery from such a restartpoint would not see the invalid references, and wouldn't be able to cross-check them when consistency is reached. That wouldn't matter when things are going smoothly, but the more sanity checks you have the better. Fujii Masao
* Improve table locking behavior in the face of current DDL.Robert Haas2011-11-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the previous coding, callers were faced with an awkward choice: look up the name, do permissions checks, and then lock the table; or look up the name, lock the table, and then do permissions checks. The first choice was wrong because the results of the name lookup and permissions checks might be out-of-date by the time the table lock was acquired, while the second allowed a user with no privileges to interfere with access to a table by users who do have privileges (e.g. if a malicious backend queues up for an AccessExclusiveLock on a table on which AccessShareLock is already held, further attempts to access the table will be blocked until the AccessExclusiveLock is obtained and the malicious backend's transaction rolls back). To fix, allow callers of RangeVarGetRelid() to pass a callback which gets executed after performing the name lookup but before acquiring the relation lock. If the name lookup is retried (because invalidation messages are received), the callback will be re-executed as well, so we get the best of both worlds. RangeVarGetRelid() is renamed to RangeVarGetRelidExtended(); callers not wishing to supply a callback can continue to invoke it as RangeVarGetRelid(), which is now a macro. Since the only one caller that uses nowait = true now passes a callback anyway, the RangeVarGetRelid() macro defaults nowait as well. The callback can also be used for supplemental locking - for example, REINDEX INDEX needs to acquire the table lock before the index lock to reduce deadlock possibilities. There's a lot more work to be done here to fix all the cases where this can be a problem, but this commit provides the general infrastructure and fixes the following specific cases: REINDEX INDEX, REINDEX TABLE, LOCK TABLE, and and DROP TABLE/INDEX/SEQUENCE/VIEW/FOREIGN TABLE. Per discussion with Noah Misch and Alvaro Herrera.
* Use IEEE infinity, not 1e10, for null-and-not-null case in gistpenalty().Tom Lane2011-11-27
| | | | | | | Use of a randomly chosen large value was never exactly graceful, and now that there are penalty functions that are intentionally using infinity, it doesn't seem like a good idea for null-vs-not-null to be using something less.
* Take fillfactor into account in the new COPY bulk heap insert code.Heikki Linnakangas2011-11-26
| | | | Jeff Janes
* Fix erroneous replay of GIN_UPDATE_META_PAGE WAL records.Tom Lane2011-11-25
| | | | | | | | | | | | | A simple thinko in ginRedoUpdateMetapage, namely failing to increment a loop counter, led to inserting records into the last pending-list page in the wrong order (the opposite of that intended). So far as I can tell, this would not upset the code that eventually flushes pending items into the main part of the GIN index. But it did break the code that searched the pending list for matches, resulting in transient failure to find matching entries during index lookups, as illustrated in bug #6307 from Maksym Boguk. Back-patch to 8.4 where the incorrect code was introduced.
* Move "hot" members of PGPROC into a separate PGXACT array.Robert Haas2011-11-25
| | | | | | | | | | | | This speeds up snapshot-taking and reduces ProcArrayLock contention. Also, the PGPROC (and PGXACT) structures used by two-phase commit are now allocated as part of the main array, rather than in a separate array, and we keep ProcArray sorted in pointer order. These changes are intended to minimize the number of cache lines that must be pulled in to take a snapshot, and testing shows a substantial increase in performance on both read and write workloads at high concurrencies. Pavan Deolasee, Heikki Linnakangas, Robert Haas
* Continue to allow VACUUM to mark last block of index dirtySimon Riggs2011-11-22
| | | | | even when there is no work to do. Further analysis required. Revert of patch c1458cc495ff800cd176a1c2e56d8b62680d9b71
* Avoid marking buffer dirty when VACUUM has no work to do.Simon Riggs2011-11-18
| | | | | | | When wal_level = 'hot_standby' we touched the last page of the relation during a VACUUM, even if nothing else had happened. That would alter the LSN of the last block and set the mtime of the relation file unnecessarily. Noted by Thom Brown.
* Wakeup WALWriter as needed for asynchronous commit performance.Simon Riggs2011-11-13
| | | | | | | | Previously we waited for wal_writer_delay before flushing WAL. Now we also wake WALWriter as soon as a WAL buffer page has filled. Significant effect observed on performance of asynchronous commits by Robert Haas, attributed to the ability to set hint bits on tuples earlier and so reducing contention caused by clog lookups.
* Fix another bug in the redo of COPY batches.Heikki Linnakangas2011-11-10
| | | | | I got alignment wrong in the redo routine. Spotted by redoing the log genereated by copy regression test.
* Fix bugs in the COPY heap-insert batching patch.Heikki Linnakangas2011-11-09
| | | | | | | | | | Forgot to call RestoreBkpBlocks() in the redo-function, as pointed out by Simon Riggs. In redo of a regular heap insert, it's taken care of in heap_redo(), but this new record type uses the heap2 RM, and heap2_redo() does not take care of that for you. Also, failed to reset the vmbuffer and all_visibile_cleared local variables after switching to a new buffer.
* In COPY, insert tuples to the heap in batches.Heikki Linnakangas2011-11-09
| | | | | | | This greatly reduces the WAL volume, especially when the table is narrow. The overhead of locking the heap page is also reduced. Reduced WAL traffic also makes it scale a lot better, if you run multiple COPY processes at the same time.
* Make VACUUM avoid waiting for a cleanup lock, where possible.Robert Haas2011-11-07
| | | | | | | | | | | In a regular VACUUM, it's OK to skip pages for which a cleanup lock isn't immediately available; the next VACUUM will deal with them. If we're scanning the entire relation to advance relfrozenxid, we might need to wait, but only if there are tuples on the page that actually require freezing. These changes should greatly reduce the incidence of of vacuum processes getting "stuck". Simon Riggs and Robert Haas
* Don't assume that a tuple's header size is unchanged during toasting.Tom Lane2011-11-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This assumption can be wrong when the toaster is passed a raw on-disk tuple, because the tuple might pre-date an ALTER TABLE ADD COLUMN operation that added columns without rewriting the table. In such a case the tuple's natts value is smaller than what we expect from the tuple descriptor, and so its t_hoff value could be smaller too. In fact, the tuple might not have a null bitmap at all, and yet our current opinion of it is that it contains some trailing nulls. In such a situation, toast_insert_or_update did the wrong thing, because to save a few lines of code it would use the old t_hoff value as the offset where heap_fill_tuple should start filling data. This did not leave enough room for the new nulls bitmap, with the result that the first few bytes of data could be overwritten with null flag bits, as in a recent report from Hubert Depesz Lubaczewski. The particular case reported requires ALTER TABLE ADD COLUMN followed by CREATE TABLE AS SELECT * FROM ... or INSERT ... SELECT * FROM ..., and further requires that there be some out-of-line toasted fields in one of the tuples to be copied; else we'll not reach the troublesome code. The problem can only manifest in this form in 8.4 and later, because before commit a77eaa6a95009a3441e0d475d1980259d45da072, CREATE TABLE AS or INSERT/SELECT wouldn't result in raw disk tuples getting passed directly to heap_insert --- there would always have been at least a junkfilter in between, and that would reconstitute the tuple header with an up-to-date t_natts and hence t_hoff. But I'm backpatching the tuptoaster change all the way anyway, because I'm not convinced there are no older code paths that present a similar risk.
* Move user functions related to WAL into xlogfuncs.cSimon Riggs2011-11-04
|
* Avoid scanning nulls at the beginning of a btree index scan.Tom Lane2011-11-02
| | | | | | | | | | | If we have an inequality key that constrains the other end of the index, it doesn't directly help us in doing the initial positioning ... but it does imply a NOT NULL constraint on the index column. If the index stores nulls at this end, we can use the implied NOT NULL condition for initial positioning, just as if it had been stated explicitly. This avoids wasting time when there are a lot of nulls in the column. This is the reverse of the examples given in bugs #6278 and #6283, which were about failing to stop early when we encounter nulls at the end of the indexscan.
* Fix btree stop-at-nulls logic properly.Tom Lane2011-11-02
| | | | | | | | | | | | As pointed out by Naoya Anzai, my previous try at this was a few bricks shy of a load, because I had forgotten that the initial-positioning logic might not try to skip over nulls at the end of the index the scan will start from. We ought to fix that, because it represents an unnecessary inefficiency, but first let's get the scan-stop logic back to a safe state. With this patch, we preserve the performance benefit requested in bug #6278 for the case of scanning forward into NULLs (in a NULLS LAST index), but the reverse case of scanning backward across NULLs when there's no suitable initial-positioning qual is still inefficient.
* Update more comments about checkpoints being done by bgwriterSimon Riggs2011-11-02
|
* Reduce checkpoints and WAL traffic on low activity database serverSimon Riggs2011-11-02
| | | | | | | | | | | | Previously, we skipped a checkpoint if no WAL had been written since last checkpoint, though this does not appear in user documentation. As of now, we skip a checkpoint until we have written at least one enough WAL to switch the next WAL file. This greatly reduces the level of activity and number of WAL messages generated by a very low activity server. This is safe because the purpose of a checkpoint is to act as a starting place for a recovery, in case of crash. This patch maintains minimal WAL volume for replay in case of crash, thus maintaining very low crash recovery time.
* Refactor xlog.c to create src/backend/postmaster/startup.cSimon Riggs2011-11-02
| | | | | Startup process now has its own dedicated file, just like all other special/background processes. Reduces role and size of xlog.c
* Derive oldestActiveXid at correct time for Hot Standby.Simon Riggs2011-11-02
| | | | | | | | | There was a timing window between when oldestActiveXid was derived and when it should have been derived that only shows itself under heavy load. Move code around to ensure correct timing of derivation. No change to StartupSUBTRANS() code, which is where this failed. Bug report by Chris Redekop
* Fix timing of Startup CLOG and MultiXact during Hot StandbySimon Riggs2011-11-02
| | | | Patch by me, bug report by Chris Redekop, analysis by Florian Pflug
* Fix race condition with toast table access from a stale syscache entry.Tom Lane2011-11-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a tuple in a syscache contains an out-of-line toasted field, and we try to fetch that field shortly after some other transaction has committed an update or deletion of the tuple, there is a race condition: vacuum could come along and remove the toast tuples before we can fetch them. This leads to transient failures like "missing chunk number 0 for toast value NNNNN in pg_toast_2619", as seen in recent reports from Andrew Hammond and Tim Uckun. The design idea of syscache is that access to stale syscache entries should be prevented by relation-level locks, but that fails for at least two cases where toasted fields are possible: ANALYZE updates pg_statistic rows without locking out sessions that might want to plan queries on the same table, and CREATE OR REPLACE FUNCTION updates pg_proc rows without any meaningful lock at all. The least risky fix seems to be an idea that Heikki suggested when we were dealing with a related problem back in August: forcibly detoast any out-of-line fields before putting a tuple into syscache in the first place. This avoids the problem because at the time we fetch the parent tuple from the catalog, we should be holding an MVCC snapshot that will prevent removal of the toast tuples, even if the parent tuple is outdated immediately after we fetch it. (Note: I'm not convinced that this statement holds true at every instant where we could be fetching a syscache entry at all, but it does appear to hold true at the times where we could fetch an entry that could have a toasted field. We will need to be a bit wary of adding toast tables to low-level catalogs that don't have them already.) An additional benefit is that subsequent uses of the syscache entry should be faster, since they won't have to detoast the field. Back-patch to all supported versions. The problem is significantly harder to reproduce in pre-9.0 releases, because of their willingness to flush every entry in a syscache whenever the underlying catalog is vacuumed (cf CatalogCacheFlushRelation); but there is still a window for trouble.
* Comment changes to show bgwriter no longer performs checkpoints.Simon Riggs2011-11-01
|
* Stop btree indexscans upon reaching nulls in either direction.Tom Lane2011-10-31
| | | | | | | | | | | The existing scan-direction-sensitive tests were overly complex, and failed to stop the scan in cases where it's perfectly legitimate to do so. Per bug #6278 from Maksym Boguk. Back-patch to 8.3, which is as far back as the patch applies easily. Doesn't seem worth sweating over a relatively minor performance issue in 8.2 at this late date. (But note that this was a performance regression from 8.1 and before, so 8.2 is being left as an outlier.)
* Update visibilitymap.c header comments.Robert Haas2011-10-29
| | | | Recent work on index-only scans left this somewhat out of date.
* Support synchronization of snapshots through an export/import procedure.Tom Lane2011-10-22
| | | | | | | | | | | | | | A transaction can export a snapshot with pg_export_snapshot(), and then others can import it with SET TRANSACTION SNAPSHOT. The data does not leave the server so there are not security issues. A snapshot can only be imported while the exporting transaction is still running, and there are some other restrictions. I'm not totally convinced that we've covered all the bases for SSI (true serializable) mode, but it works fine for lesser isolation modes. Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified by Tom Lane
* Suppress -Wunused-result warnings about write() and fwrite().Tom Lane2011-10-18
| | | | | | | This is merely an exercise in satisfying pedants, not a bug fix, because in every case we were checking for failure later with ferror(), or else there was nothing useful to be done about a failure anyway. Document the latter cases.
* Avoid assuming that index-only scan data matches the index's rowtype.Tom Lane2011-10-16
| | | | | | | | | | | | | | | | | | | | | | In general the data returned by an index-only scan should have the datatypes originally computed by FormIndexDatum. If the index opclasses use "storage" datatypes different from their input datatypes, the scan tuple will not have the same rowtype attributed to the index; but we had a hard-wired assumption that that was true in nodeIndexonlyscan.c. We'd already hacked around the issue for the one case where the types are different in btree indexes (btree name_ops), but this would definitely come back to bite us if we ever implement index-only scans in GiST. To fix, require the index AM to explicitly provide the tupdesc for the tuple it is returning. btree can just pass back the index's tupdesc, but GiST will have to work harder when and if it supports index-only scans. I had previously proposed fixing this by allowing the index AM to fill the scan tuple slot directly; but on reflection that seemed like a module layering violation, since TupleTableSlots are creatures of the executor. At least in the btree case, it would also be less efficient, since the tuple deconstruction work would occur even for rows later found to be invisible to the scan's snapshot.
* Teach btree to handle ScalarArrayOpExpr quals natively.Tom Lane2011-10-16
| | | | | This allows "indexedcol op ANY(ARRAY[...])" conditions to be used in plain indexscans, and particularly in index-only scans.
* Measure the number of all-visible pages for use in index-only scan costing.Tom Lane2011-10-14
| | | | | | | | | | | | | | | | | Add a column pg_class.relallvisible to remember the number of pages that were all-visible according to the visibility map as of the last VACUUM (or ANALYZE, or some other operations that update pg_class.relpages). Use relallvisible/relpages, instead of an arbitrary constant, to estimate how many heap page fetches can be avoided during an index-only scan. This is pretty primitive and will no doubt see refinements once we've acquired more field experience with the index-only scan mechanism, but it's way better than using a constant. Note: I had to adjust an underspecified query in the window.sql regression test, because it was changing answers when the plan changed to use an index-only scan. Some of the adjacent tests perhaps should be adjusted as well, but I didn't do that here.
* Modify RelationGetBufferForTuple() to use a typedef, rather than aBruce Momjian2011-10-12
| | | | struct, to help pgindent.
* Add comment on why pulling data from a "name" index column can't crash.Tom Lane2011-10-11
| | | | | | | | | | | It's been bothering me for several days that pretending that the cstring data stored in a btree name_ops column is really a "name" Datum could lead to reading past the end of memory. However, given the current memory layout used for index-only scans in the btree code, a crash is in fact not possible. Document that so we don't break it. I have not thought of any other solutions that aren't fairly ugly too, and most of them lose the functionality of index-only scans on name columns altogether, so this seems like the way to go.