path: root/src/backend/access/heap/heapam.c
...
* Unite the ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer functions into one ReadBufferExtended function (Heikki Linnakangas, 2008-10-31)
  ReadBufferExtended takes the strategy and mode as arguments. There are three modes: RBM_NORMAL, the default used by plain ReadBuffer(); RBM_ZERO, which replaces ZeroOrReadBuffer; and a new mode, RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages without throwing an error. The FSM needs the new mode to recover from corrupt pages, which could happen if we crash after extending an FSM file and the new page is "torn". Also add the fork number to some error messages in bufmgr.c that still lacked it.
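  A minimal sketch of a caller, assuming the ReadBufferExtended(rel, forknum, blkno, mode, strategy) shape described in this commit; the helper name is hypothetical, and FSM_FORKNUM refers to the FSM fork introduced by the FSM-rewrite commit below:

      #include "postgres.h"
      #include "storage/bufmgr.h"
      #include "storage/bufpage.h"
      #include "utils/rel.h"

      /*
       * Hypothetical helper: read an FSM page, tolerating a torn or corrupt
       * page by letting the buffer manager hand back a zeroed page instead
       * of raising an error.
       */
      static Buffer
      read_fsm_page_forgiving(Relation rel, BlockNumber blkno)
      {
          /* NULL strategy means no special buffer ring */
          Buffer  buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno,
                                           RBM_ZERO_ON_ERROR, NULL);

          if (PageIsNew(BufferGetPage(buf)))
          {
              /* page was corrupt (or genuinely new); caller reinitializes it */
          }
          return buf;
      }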
* No need for extra code to log freezing zero tuples (Alvaro Herrera, 2008-10-27)
  Callers already check that they are freezing a nonzero amount anyway.
* Modify the parser's error reporting to include a specific hint for the case of referencing a WITH item that's not yet in scope according to the SQL spec's semantics (Tom Lane, 2008-10-08)
  This seems to be an easy error to make, and the bare "relation doesn't exist" message doesn't lead one's mind in the correct direction to fix it.
* Rewrite the FSM (Heikki Linnakangas, 2008-09-30)
  Instead of relying on a fixed-size shared memory segment, the free space information is stored in a dedicated FSM relation fork for each relation (except for hash indexes; they don't use the FSM). This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any trace of them from the backend, initdb, and documentation.

  Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also introduce a new variant of the get_raw_page(regclass, int4, int4) function in contrib/pageinspect that lets you return pages from any relation fork, and a new fsm_page_contents() function to inspect the new FSM pages.
* Initialize the minimum frozen Xid in vac_update_datfrozenxid using GetOldestXmin() instead of RecentGlobalXmin (Alvaro Herrera, 2008-09-11)
  This is safer because we do not depend on the latter being correctly set elsewhere, and while it is more expensive, this code path is not performance-critical. This is a real risk for autovacuum, because it can execute whole cycles without doing a single vacuum, which would mean that RecentGlobalXmin would stay at its initialization value, FirstNormalTransactionId, causing a bogus value to be inserted in pg_database. This bug could explain some recent reports of failure to truncate pg_clog.

  At the same time, change the initialization of RecentGlobalXmin to InvalidTransactionId, and ensure that it's set to something else whenever it's going to be used. Using it as FirstNormalTransactionId in HOT page pruning could result in data loss. InitPostgres takes care of setting it to a valid value, but the extra checks are there to prevent "special" backends from behaving in unusual ways.

  Per Tom Lane's detailed problem dissection in 29544.1221061979@sss.pgh.pa.us.
* Introduce the concept of relation forks (Heikki Linnakangas, 2008-08-11)
  An smgr relation can now consist of multiple forks, and each fork can be created and grown separately. The bulk of this patch is about changing the smgr API to include an extra ForkNumber argument in every smgr function. Also, smgrscheduleunlink and smgrdounlink no longer implicitly call smgrclose, because other forks might still exist after unlinking one. The callers of those functions have been modified to call smgrclose instead.

  This patch in itself doesn't have any user-visible effect, but provides the infrastructure needed for upcoming patches. The additional forks envisioned are a rewritten FSM implementation that doesn't rely on a fixed-size shared memory block, and a visibility map to allow skipping portions of a table in VACUUM that have no dead tuples.
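  To make the fork-per-call shape of the new smgr API concrete, a small sketch of a caller; the helper name is made up, the signatures are my reading of the 2008-era tree, and FSM_FORKNUM only exists once the later FSM-rewrite commit above is in:

      #include "postgres.h"
      #include "storage/smgr.h"

      /* Hypothetical helper: report the size of two forks of one relation. */
      static void
      report_fork_sizes(RelFileNode rnode)
      {
          SMgrRelation reln = smgropen(rnode);
          BlockNumber  main_blocks;
          BlockNumber  fsm_blocks;

          /* every smgr call now names the fork it operates on */
          main_blocks = smgrnblocks(reln, MAIN_FORKNUM);
          fsm_blocks = smgrnblocks(reln, FSM_FORKNUM);

          elog(DEBUG1, "main fork: %u blocks, FSM fork: %u blocks",
               main_blocks, fsm_blocks);
      }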
* Clean up the use of some page-header-access macros (Tom Lane, 2008-07-13)
  Principally, use SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that makes the code clearer, and avoid casting between Page and PageHeader where possible. Zdenek Kotala, with some additional cleanup by Heikki Linnakangas.

  I did not apply the parts of the proposed patch that would have resulted in slightly changing the on-disk format of hash indexes; it seems to me that's not a win as long as there's any chance of having in-place upgrade for 8.4.
* Improve our #include situation by moving pointer types away from the corresponding struct definitions (Alvaro Herrera, 2008-06-19)
  This allows other headers to avoid including certain highly loaded headers such as rel.h and relscan.h, instead using just relcache.h, heapam.h, or genam.h, which are more lightweight and thus cause fewer unnecessary dependencies.
* Refactor XLogOpenRelation() and XLogReadBuffer() in preparation for relation forks (Heikki Linnakangas, 2008-06-12)
  XLogOpenRelation() and the associated lightweight relation cache in xlogutils.c are gone, and XLogReadBuffer() now takes a RelFileNode as argument instead of a Relation. For functions that still need a Relation struct during WAL replay, there's a new function called CreateFakeRelcacheEntry() that returns a fake entry like XLogOpenRelation() used to.
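  A sketch of what a redo routine looks like under the refactored API; the function name is illustrative, and the XLogReadBuffer(rnode, blkno, init) signature is my reading of the commit:

      #include "postgres.h"
      #include "access/xlogutils.h"
      #include "storage/bufmgr.h"
      #include "utils/rel.h"

      /* Hypothetical redo helper: no relcache lookup, just a RelFileNode. */
      static void
      example_redo(RelFileNode rnode, BlockNumber blkno)
      {
          Buffer  buf = XLogReadBuffer(rnode, blkno, false);

          if (BufferIsValid(buf))
          {
              /* ... apply the logged change to BufferGetPage(buf) ... */
              UnlockReleaseBuffer(buf);
          }

          /* If some shared routine still wants a Relation, fake one up. */
          {
              Relation  rel = CreateFakeRelcacheEntry(rnode);

              /* ... pass rel to code that only needs rel->rd_node ... */
              FreeFakeRelcacheEntry(rel);
          }
      }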
* Move BufferGetPageSize and BufferGetPage from bufpage.h to bufmgr.h (Alvaro Herrera, 2008-06-08)
  It is more logical that way, and also it reduces the amount of unnecessary includes in bufpage.h, which is widely used. Zdenek Kotala. My previous patch to bufpage.h should also have credited him as author, but I forgot (sorry about that).
* Put back bufmgr.h in bufpage.h -- it is needed by some macros (Alvaro Herrera, 2008-05-12)
  Remove #include bufmgr.h from (most?) source files which already include bufpage.h.
* Restructure some header files a bit, in particular heapam.h, by removing some unnecessary #include lines in it (Alvaro Herrera, 2008-05-12)
  Also, move some tuple routine prototypes and macros to htup.h, which allows removal of heapam.h inclusion from some .c files. For this to work, a new header file access/sysattr.h needed to be created, initially containing attribute numbers of system columns, for pg_dump usage. While at it, make contrib ltree, intarray and hstore header files more consistent with our header style.
* Remove heap_release_fetch, which is no longer used anywhere (Tom Lane, 2008-04-03)
  This simplifies heap_fetch a little.
* Move the HTSU_Result enum definition into snapshot.h, to avoid including tqual.h into heapam.h (Alvaro Herrera, 2008-03-26)
  This makes all inclusion of tqual.h explicit. I also sorted alphabetically the includes on some source files.
* Rename snapmgmt.c/h to snapmgr.c/h, for consistency with other files (Alvaro Herrera, 2008-03-26)
  Per complaint from Tom Lane.
* Separate snapshot management code from tuple visibility code, creating a snapmgmt.c file for the former (Alvaro Herrera, 2008-03-26)
  The header files have also been reorganized in three parts: the most basic snapshot definitions are now in a new file snapshot.h, the also-new snapmgmt.h keeps the definitions for snapmgmt.c, and tqual.h has been reduced to the bare minimum.

  This patch is just a first step towards managing live snapshots within a transaction; there is no functionality change.

  Per my proposal to pgsql-patches in 20080318191940.GB27458@alvh.no-ip.org and subsequent discussion.
* Refactor heap_page_prune so that instead of changing item states on the fly, it accumulates the set of changes to be made and then applies them (Tom Lane, 2008-03-08)
  It had to accumulate the set of changes anyway to prepare a WAL record for the pruning action, so this isn't an enormous change; the only new complexity is to not doubly mark tuples that are visited twice in the scan. The main advantage is that we can substantially reduce the scope of the critical section in which the changes are applied, thus avoiding PANIC in foreseeable cases like running out of memory in inval.c. A nice secondary advantage is that it is now far clearer that WAL replay will actually do the same thing that the original pruning did.

  This commit doesn't do anything about the open problem that CacheInvalidateHeapTuple doesn't have the right semantics for a CTID change caused by collapsing out a redirect pointer. But whatever we do about that, it'll be a good idea to not do it inside a critical section.
* Fix PREPARE TRANSACTION to reject the case where the transaction has dropped a temporary table (Tom Lane, 2008-03-04)
  We can't support that because there's no way to clean up the source backend's internal state if the eventual COMMIT PREPARED is done by another backend. This was checked correctly in 8.1 but I broke it in 8.2 :-(. Patch by Heikki Linnakangas, original trouble report by John Smith.
* Add a GUC variable "synchronize_seqscans" to allow clients to disable the new synchronized-scanning behavior, and make pg_dump disable sync scans so that it will reliably preserve row ordering (Tom Lane, 2008-01-30)
  Per recent discussions.
* Fix CREATE INDEX CONCURRENTLY so that it won't use synchronized scan for its second pass over the table (Tom Lane, 2008-01-14)
  It has to start at block zero, else the "merge join" logic for detecting which TIDs are already in the index doesn't work. Hence, extend heapam.c's API so that callers can enable or disable syncscan. (I put in an option to disable buffer access strategy, too, just in case somebody needs it.) Per report from Hannes Dorbath.
* Update copyrights in source tree to 2008 (Bruce Momjian, 2008-01-01)
* Avoid incrementing the CommandCounter when CommandCounterIncrement is called but no database changes have been made since the last CommandCounterIncrement (Tom Lane, 2007-11-30)
  This should result in a significant improvement in the number of "commands" that can typically be performed within a transaction before hitting the 2^32 CommandId size limit. In particular this buys back (and more) the possible adverse consequences of my previous patch to fix plan caching behavior.

  The implementation requires tracking whether the current CommandCounter value has been "used" to mark any tuples. CommandCounter values stored into snapshots are presumed not to be used for this purpose. This requires some small executor changes, since the executor used to conflate the curcid of the snapshot it was using with the command ID to mark output tuples with. Separating these concepts allows some small simplifications in executor APIs.

  Something for the TODO list: look into having CommandCounterIncrement not do AcceptInvalidationMessages. It seems fairly bogus to be doing it there, but exactly where to do it instead isn't clear, and I'm disinclined to mess with asynchronous behavior during late beta.
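  A minimal sketch of the "used" tracking idea, assuming the GetCurrentCommandId(bool used) form of the call that this change implies; the helper name is hypothetical:

      #include "postgres.h"
      #include "access/htup.h"
      #include "access/xact.h"

      /* Hypothetical helper: stamp a new tuple with the current command id. */
      static void
      stamp_tuple_cmin(HeapTuple tup)
      {
          /*
           * Passing true records that this CommandId actually marked a tuple,
           * so the next CommandCounterIncrement() must really advance the
           * counter.  Read-only callers pass false, and the increment can
           * then be skipped entirely.
           */
          CommandId  cid = GetCurrentCommandId(true);

          HeapTupleHeaderSetCmin(tup->t_data, cid);
      }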
* pgindent run for 8.3 (Bruce Momjian, 2007-11-15)
* Use "alternative" instead of "alternate" where it is clearer (Peter Eisentraut, 2007-11-07)
* Tweak toast-related logic in heapam.c so that the toaster is only invoked when relkind = RELKIND_RELATION (Tom Lane, 2007-10-16)
  This syncs these tests with the Asserts in tuptoaster.c, and ensures that we won't ever try to, for example, compress a sequence's tuple. Problem found by Greg Stark while stress-testing with much-smaller-than-normal page sizes.
* Improve handling of prune/no-prune decisions by storing a page's oldest unpruned XMAX in its header (Tom Lane, 2007-09-21)
  At the cost of 4 bytes per page, this keeps us from performing heap_page_prune when there's no chance of pruning anything. Seems to be necessary per Heikki's preliminary performance testing.
* HOT updates (Tom Lane, 2007-09-20)
  When we update a tuple without changing any of its indexed columns, and the new version can be stored on the same heap page, we no longer generate extra index entries for the new version. Instead, index searches follow the HOT-chain links to ensure they find the correct tuple version.

  In addition, this patch introduces the ability to "prune" dead tuples on a per-page basis, without having to do a complete VACUUM pass to recover space. VACUUM is still needed to clean up dead index entries, however.

  Pavan Deolasee, with help from a bunch of other people.
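  As a rough orientation to the tuple-header state HOT relies on, a sketch based on my reading of the infomask flags, not code from this commit; the helper is illustrative only:

      #include "postgres.h"
      #include "access/htup.h"

      /* Hypothetical helper: classify a tuple's role in a HOT chain. */
      static const char *
      describe_hot_state(HeapTupleHeader htup)
      {
          if (HeapTupleHeaderIsHeapOnly(htup))
              return "heap-only tuple: reachable only through the HOT chain, "
                     "has no index entries of its own";
          if (HeapTupleHeaderIsHotUpdated(htup))
              return "HOT-updated: its successor is on the same page; "
                     "follow t_ctid within the page to reach it";
          return "ordinary tuple";
      }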
* Redefine the lp_flags field of item pointers as having four states, rather than two independent bits (Tom Lane, 2007-09-12)
  One of those bits was never used in heap pages anyway, or at least hadn't been in a very long time. This gives us flexibility to add the HOT notions of redirected and dead item pointers without requiring anything so klugy as magic values of lp_off and lp_len. The state values are chosen so that for the states currently in use (pre-HOT) there is no change in the physical representation.
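  A sketch of the four states as seen through the ItemId accessor macros (my reading of storage/itemid.h; the helper itself is made up):

      #include "postgres.h"
      #include "storage/itemid.h"

      /* Hypothetical helper: name the state of one line pointer. */
      static const char *
      describe_line_pointer(ItemId itemid)
      {
          if (!ItemIdIsUsed(itemid))
              return "LP_UNUSED: free slot, may be reused";
          if (ItemIdIsRedirected(itemid))
              return "LP_REDIRECT: points at another line pointer, no storage";
          if (ItemIdIsDead(itemid))
              return "LP_DEAD: dead tuple, storage may already be reclaimed";
          return "LP_NORMAL: live tuple with storage";
      }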
* Don't take ProcArrayLock while exiting a transaction that has no XID (Tom Lane, 2007-09-07)
  There is no need for serialization against snapshot-taking because the xact doesn't affect anyone else's snapshot anyway. Per discussion.

  Also, move various info about the interlocking of transactions and snapshots out of code comments and into a hopefully-more-cohesive discussion in access/transam/README. Also, remove a couple of now-obsolete comments about having to force some WAL to be written to persuade RecordTransactionCommit to do its thing.
* Implement lazy XID allocation: transactions that do not modify any database rows will normally never obtain an XID at all (Tom Lane, 2007-09-05)
  We already did things this way for subtransactions, but this patch extends the concept to top-level transactions. In applications where there are lots of short read-only transactions, this should improve performance noticeably; not so much from removal of the actual XID assignments, as from reduction of overhead that's driven by the rate of XID consumption.

  We add a concept of a "virtual transaction ID" so that active transactions can be uniquely identified even if they don't have a regular XID. This is a much lighter-weight concept: uniqueness of VXIDs is only guaranteed over the short term, and no on-disk record is made about them.

  Florian Pflug, with some editorialization by Tom.
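  To make the "virtual transaction ID" idea concrete, an illustrative struct; the name ExampleVxid is made up, but the pairing of a backend id with a backend-local counter is what the commit message describes:

      #include "postgres.h"
      #include "storage/backendid.h"

      /*
       * Illustrative only: a virtual transaction ID is a pair of values that
       * uniquely identifies an active transaction even when it never acquires
       * a permanent XID.  Uniqueness is only guaranteed over the short term,
       * and no on-disk record is made about it.
       */
      typedef struct ExampleVxid
      {
          BackendId   backendId;            /* which backend */
          uint32      localTransactionId;   /* backend-local counter */
      } ExampleVxid;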
* Fix oversight in async-commit patch: there were some places in heapam.c that still thought they could set HEAP_XMAX_COMMITTED immediately after seeing the other transaction commit (Tom Lane, 2007-08-14)
  Make them use the same logic as tqual.c does to determine if the hint bit can be set yet.
* Teach heapam code to know the difference between a real seqscan and the pseudo HeapScanDesc created for a bitmap heap scan (Tom Lane, 2007-06-09)
  This avoids some useless overhead during bitmap scan startup, in particular invoking the syncscan code. (We might someday want to do that, but right now it's merely useless contention for shared memory, to say nothing of possibly pushing useful entries out of syncscan's small LRU list.) This also allows elimination of the ugly pgstat_discount_heap_scan() kluge.
* Arrange for large sequential scans to synchronize with each other, so that when multiple backends are scanning the same relation concurrently, each page is (ideally) read only once (Tom Lane, 2007-06-08)
  Jeff Davis, with review by Heikki and Tom.
* Make large sequential scans and VACUUMs work in a limited-size "ring" of buffers, rather than blowing out the whole shared-buffer arena (Tom Lane, 2007-05-30)
  Aside from avoiding cache spoliation, this fixes the problem that VACUUM formerly tended to cause a WAL flush for every page it modified, because we had it hacked to use only a single buffer. Those flushes will now occur only once per ring-ful. The exact ring size, and the threshold for seqscans to switch into the ring usage pattern, remain under debate; but the infrastructure seems done.

  The key bit of infrastructure is a new optional BufferAccessStrategy object that can be passed to ReadBuffer operations; this replaces the former StrategyHintVacuum API. This patch also changes the buffer usage-count methodology a bit: we now advance usage_count when first pinning a buffer, rather than when last unpinning it. To preserve the behavior that a buffer's lifetime starts to decrease when it's released, the clock sweep code is modified to not decrement usage_count of pinned buffers.

  Work not done in this commit: teach GiST and GIN indexes to use the vacuum BufferAccessStrategy for vacuum-driven fetches.

  Original patch by Simon, reworked by Heikki and again by Tom.
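  A sketch of how a bulk scan might use the new strategy object; it uses the ReadBufferWithStrategy() call named in the ReadBufferExtended commit near the top of this log (which later folds it into ReadBufferExtended), and the helper name is made up:

      #include "postgres.h"
      #include "storage/bufmgr.h"
      #include "utils/rel.h"

      /* Hypothetical helper: scan a relation inside a small buffer ring. */
      static void
      scan_with_ring(Relation rel, BlockNumber nblocks)
      {
          /* a bulk-read strategy confines the scan to a small ring of buffers */
          BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
          BlockNumber blkno;

          for (blkno = 0; blkno < nblocks; blkno++)
          {
              Buffer   buf = ReadBufferWithStrategy(rel, blkno, strategy);

              /* ... inspect the page ... */
              ReleaseBuffer(buf);
          }
          FreeAccessStrategy(strategy);
      }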
* Fix up pgstats counting of live and dead tuples to recognize that committed and aborted transactions have different effects (Tom Lane, 2007-05-27)
  Also teach it not to assume that prepared transactions are always committed. Along the way, simplify the pgstats API by tying counting directly to Relations; I cannot detect any redeeming social value in having stats pointers in HeapScanDesc and IndexScanDesc structures. And fix a few corner cases in which counts might be missed because the relation's pgstat_info pointer hadn't been set.
* Make CLUSTER MVCC-safe. Heikki Linnakangas (Tom Lane, 2007-04-08)
* Decouple the values of TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE (Tom Lane, 2007-04-03)
  Add the latter to the values checked in pg_control, since it can't be changed without invalidating toast table content. This commit in itself shouldn't change any behavior, but it lays some necessary groundwork for experimentation with these toast-control numbers.

  Note: while TOAST_TUPLE_THRESHOLD can now be changed without initdb, some thought still needs to be given to needs_toast_table() in toasting.c before unleashing random changes.
* Teach CLUSTER to skip writing WAL if not needed (ie, not using archiving) --- Simon (Tom Lane, 2007-03-29)
  Also, code review and cleanup for the previous COPY-no-WAL patches --- Tom.
* Clean up the representation of special snapshots by including a "method pointer" in every Snapshot struct (Tom Lane, 2007-03-25)
  This allows removal of the case-by-case tests in HeapTupleSatisfiesVisibility, which should make it a bit faster (I didn't try any performance tests though). More importantly, we are no longer violating portable C practices by assuming that small integers are distinct from all pointer values, and HeapTupleSatisfiesDirty no longer has a non-reentrant API involving side-effects on a global variable.

  There were a couple of places calling HeapTupleSatisfiesXXX routines directly rather than through the HeapTupleSatisfiesVisibility macro. Since these places had to be changed anyway, I chose to make them go through the macro for uniformity.

  Along the way I renamed HeapTupleSatisfiesSnapshot to HeapTupleSatisfiesMVCC to emphasize that it's only used with MVCC-type snapshots. I was sorely tempted to rename HeapTupleSatisfiesVisibility to HeapTupleSatisfiesSnapshot, but forebore for the moment to avoid confusion and reduce the likelihood that this patch breaks some of the pending patches. Might want to reconsider doing that later.
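  The dispatch this describes reduces, as I read it, to one indirect call through the snapshot; a small hedged sketch (the wrapper is hypothetical, and the expansion shown in the comment is a paraphrase, not copied from the patch):

      #include "postgres.h"
      #include "access/htup.h"
      #include "storage/bufmgr.h"
      #include "utils/tqual.h"

      /*
       * Each Snapshot carries a "satisfies" function pointer -- e.g.
       * HeapTupleSatisfiesMVCC for ordinary MVCC snapshots, or
       * HeapTupleSatisfiesDirty for dirty snapshots -- and visibility tests
       * go through one indirection, roughly
       * (*snapshot->satisfies)(tuple->t_data, snapshot, buffer).
       */
      static bool
      tuple_is_visible(HeapTuple tuple, Snapshot snapshot, Buffer buffer)
      {
          return HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);
      }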
* Combine cmin and cmax fields of HeapTupleHeaders into a single field, by keeping private state in each backend that has inserted and deleted the same tuple during its current top-level transaction (Tom Lane, 2007-02-09)
  This is sufficient since there is no need to be able to determine the cmin/cmax from any other transaction. This gets us back down to 23-byte headers, removing a penalty paid in 8.0 to support subtransactions.

  Patch by Heikki Linnakangas, with minor revisions by moi, following a design hashed out awhile back on the pghackers list.
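  Purely as an illustration of the backend-private mapping the message describes (this is not the actual implementation; every name below is made up):

      #include "postgres.h"

      /*
       * When one backend both inserts and deletes a tuple in the same
       * top-level transaction, the single on-disk command-id field stores a
       * "combo" id that maps back to the real (cmin, cmax) pair kept in
       * backend-local memory.  Other transactions never need the pair, so
       * local state suffices.
       */
      typedef struct ComboPair
      {
          CommandId   cmin;       /* inserting command */
          CommandId   cmax;       /* deleting command */
      } ComboPair;

      static ComboPair combo_table[1024];    /* toy fixed-size table */
      static int       combo_count = 0;

      static CommandId
      lookup_or_make_combo(CommandId cmin, CommandId cmax)
      {
          int     i;

          for (i = 0; i < combo_count; i++)
          {
              if (combo_table[i].cmin == cmin && combo_table[i].cmax == cmax)
                  return (CommandId) i;       /* reuse existing combo id */
          }
          /* toy code: a real table would grow instead of overflowing */
          Assert(combo_count < 1024);
          combo_table[combo_count].cmin = cmin;
          combo_table[combo_count].cmax = cmax;
          return (CommandId) combo_count++;
      }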
* Rename MaxTupleSize to MaxHeapTupleSize to clarify that it's not meant to describe the maximum size of index tuples (which is typically AM-dependent anyway), and consequently remove the bogus deduction for "special space" that was built into it (Tom Lane, 2007-02-05)
  Adjust TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE to avoid wasting two bytes per toast chunk, and to ensure that the calculation correctly tracks any future changes in page header size. The computation had been inaccurate in a way that didn't cause any harm except space wastage, but future changes could have broken it more drastically.

  Fix the calculation of BTMaxItemSize, which was formerly computed as 1 byte more than it could safely be. This didn't cause any harm in practice because it's only compared against maxalign'd lengths, but future changes in the size of page headers or btree special space could have exposed the problem.

  initdb forced because of the change in TOAST_MAX_CHUNK_SIZE, which alters the storage of toast tables.
* Don't MAXALIGN in the checks to decide whether a tuple is over TOAST's threshold for tuple length (Tom Lane, 2007-02-04)
  On 4-byte-MAXALIGN machines, the toast code creates tuples that have t_len exactly TOAST_TUPLE_THRESHOLD ... but this number is not itself maxaligned, so if heap_insert maxaligns t_len before comparing to TOAST_TUPLE_THRESHOLD, it'll uselessly recurse back to tuptoaster.c, wasting cycles. (It turns out that this does not happen on 8-byte-MAXALIGN machines, because for them the outer MAXALIGN in the TOAST_MAX_CHUNK_SIZE macro reduces TOAST_MAX_CHUNK_SIZE so that toast tuples will be less than TOAST_TUPLE_THRESHOLD in size. That MAXALIGN is really incorrect, but we can't remove it now; see below.)

  There isn't any particular value in maxaligning before comparing to the thresholds, so just don't do that, which saves a small number of cycles in itself. These numbers should be rejiggered to minimize wasted space on toast-relation pages, but we can't do that in the back branches because changing TOAST_MAX_CHUNK_SIZE would force an initdb (by changing the contents of toast tables). We can move the toast decision thresholds a bit, though, which is what this patch effectively does.

  Thanks to Pavan Deolasee for discovering the unintended recursion. Back-patch into 8.2, but not further, pending more testing. (HEAD is about to get a further patch modifying the thresholds, so it won't help much for testing this form of the patch.)
* Prevent WAL logging when COPY is done in the same transaction that created the table (Bruce Momjian, 2007-01-25)
  Simon Riggs
* Enable another five tuple status bits by using the high bits of the nattr field, and rename the field (Bruce Momjian, 2007-01-09)
  Heikki Linnakangas
* Update CVS HEAD for 2007 copyright (Bruce Momjian, 2007-01-05)
  Back branches are typically not back-stamped for this.
* Repair two related errors in heap_lock_tuple: it was failing to recognize cases where we already hold the desired lock "indirectly", either via membership in a MultiXact or because the lock was originally taken by a different subtransaction of the current transaction (Tom Lane, 2006-11-17)
  These cases must be accounted for to avoid needless deadlocks and/or inappropriate replacement of an exclusive lock with a shared lock. Per report from Clarence Gardner and subsequent investigation.
* Fix recently-understood problems with handling of XID freezing, particularly in PITR scenarios (Tom Lane, 2006-11-05)
  We now WAL-log the replacement of old XIDs with FrozenTransactionId, so that such replacement is guaranteed to propagate to PITR slave databases. Also, rather than relying on hint-bit updates to be preserved, pg_clog is not truncated until all instances of an XID are known to have been replaced by FrozenTransactionId.

  Add new GUC variables and pg_autovacuum columns to allow management of the freezing policy, so that users can trade off the size of pg_clog against the amount of freezing work done. Revise the already-existing code that forces autovacuum of tables approaching the wraparound point to make it more bulletproof; also, revise the autovacuum logic so that anti-wraparound vacuuming is done per-table rather than per-database.

  initdb forced because of changes in pg_class, pg_database, and pg_autovacuum catalogs.

  Heikki Linnakangas, Simon Riggs, and Tom Lane.
* pgindent run for 8.2 (Bruce Momjian, 2006-10-04)
* Now that we've rearranged relation open to get a lock before touching the rel, it's easy to get rid of the narrow race-condition window that used to exist in VACUUM and CLUSTER (Tom Lane, 2006-08-18)
  Did some minor code-beautification work in the same area, too.
* Change the relation_open protocol so that we obtain lock on a relation (table or index) before trying to open its relcache entry (Tom Lane, 2006-07-31)
  This fixes race conditions in which someone else commits a change to the relation's catalog entries while we are in process of doing relcache load. Problems of that ilk have been reported sporadically for years, but it was not really practical to fix until recently --- for instance, the recent addition of WAL-log support for in-place updates helped.

  Along the way, remove pg_am.amconcurrent: all AMs are now expected to support concurrent update.