aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* tableam: Move heap specific logic from estimate_rel_size below tableam.Andres Freund2019-03-30
| | | | | | | | | | | | | This just moves the table/matview[/toast] determination of relation size to a callback, and uses a copy of the existing logic to implement that callback for heap. It probably would make sense to also move the index specific logic into a callback, so the metapage handling (and probably more) can be index specific. But that's a separate task. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: VACUUM and ANALYZE support.Andres Freund2019-03-30
| | | | | | | | | | | | This is a relatively straightforward move of the current implementation to sit below tableam. As the current analyze sampling implementation is pretty inherently block based, the tableam analyze interface is as well. It might make sense to generalize that at some point, but that seems like a larger project that shouldn't be undertaken at the same time as the introduction of tableam. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Generated columnsPeter Eisentraut2019-03-30
| | | | | | | | | | | | | | This is an SQL-standard feature that allows creating columns that are computed from expressions rather than assigned, similar to a view or materialized view but on a column basis. This implements one kind of generated column: stored (computed on write). Another kind, virtual (computed on read), is planned for the future, and some room is left for it. Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/b151f851-4019-bdb1-699e-ebab07d2f40a@2ndquadrant.com
* Tweak some nbtree-related code comments.Peter Geoghegan2019-03-29
|
* tableam: Comment fixes.Andres Freund2019-03-29
| | | | | Author: Haribabu Kommi Discussion: CAJrrPGeeYOqP3hkZyohDx_8dot4zvPuPMDBmhJ=iC85cTBNeYw@mail.gmail.com
* Warn more strongly about the dangers of exclusive backup mode.Robert Haas2019-03-29
| | | | | | | | | | | Especially, warn about the hazards of mishandling the backup_label file. Adjust a couple of server messages to be more clear about the hazards associated with removing backup_label files, too. David Steele and Robert Haas, reviewed by Laurenz Albe, Martín Marqués, Peter Eisentraut, and Magnus Hagander. Discussion: http://postgr.es/m/7d85c387-000e-16f0-e00b-50bf83c22127@pgmasters.net
* tableam: relation creation, VACUUM FULL/CLUSTER, SET TABLESPACE.Andres Freund2019-03-28
| | | | | | | | | | | | | | | | This moves the responsibility for: - creating the storage necessary for a relation, including creating a new relfilenode for a relation with existing storage - non-transactional truncation of a relation - VACUUM FULL / CLUSTER's rewrite of a table below tableam. This is fairly straight forward, with a bit of complexity smattered in to move the computation of xid / multixid horizons below the AM, as they don't make sense for every table AM. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Fix typo.Thomas Munro2019-03-29
| | | | Author: Masahiko Sawada
* Use FullTransactionId for the transaction stack.Thomas Munro2019-03-28
| | | | | | | | | | | | | | | | Provide GetTopFullTransactionId() and GetCurrentFullTransactionId(). The intended users of these interfaces are access methods that use xids for visibility checks but don't want to have to go back and "freeze" existing references some time later before the 32 bit xid counter wraps around. Use a new struct to serialize the transaction state for parallel query, because FullTransactionId doesn't fit into the previous serialization scheme very well. Author: Thomas Munro Reviewed-by: Heikki Linnakangas Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com
* Add basic infrastructure for 64 bit transaction IDs.Thomas Munro2019-03-28
| | | | | | | | | | | | | | | | | | | | | | | | Instead of inferring epoch progress from xids and checkpoints, introduce a 64 bit FullTransactionId type and use it to track xid generation. This fixes an unlikely bug where the epoch is reported incorrectly if the range of active xids wraps around more than once between checkpoints. The only user-visible effect of this commit is to correct the epoch used by txid_current() and txid_status(), also visible with pg_controldata, in those rare circumstances. It also creates some basic infrastructure so that later patches can use 64 bit transaction IDs in more places. The new type is a struct that we pass by value, as a form of strong typedef. This prevents the sort of accidental confusion between TransactionId and FullTransactionId that would be possible if we were to use a plain old uint64. Author: Thomas Munro Reported-by: Amit Kapila Reviewed-by: Andres Freund, Tom Lane, Heikki Linnakangas Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com
* tableam: Support for an index build's initial table scan(s).Andres Freund2019-03-27
| | | | | | | | | | | | | To support building indexes over tables of different AMs, the scans to do so need to be routed through the table AM. While moving a fair amount of code, nearly all the changes are just moving code to below a callback. Currently the range based interface wouldn't make much sense for non block based table AMs. But that seems aceptable for now. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Suppress uninitialized-variable warning.Tom Lane2019-03-27
| | | | | Apparently Andres' compiler is smart enough to see that hpage must be initialized before use ... but mine isn't.
* Switch some palloc/memset calls to palloc0Michael Paquier2019-03-27
| | | | | | | | | | | Some code paths have been doing some allocations followed by an immediate memset() to initialize the allocated area with zeros, this is a bit overkill as there are already interfaces to do both things in one call. Author: Daniel Gustafsson Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/vN0OodBPkKs7g2Z1uyk3CUEmhdtspHgYCImhlmSxv1Xn6nY1ZnaaGHL8EWUIQ-NEv36tyc4G5-uA3UXUF2l4sFXtK_EQgLN1hcgunlFVKhA=@yesql.se
* Compute XID horizon for page level index vacuum on primary.Andres Freund2019-03-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously the xid horizon was only computed during WAL replay. That had two major problems: 1) It relied on knowing what the table pointed to looks like. That was easy enough before the introducing of tableam (we knew it had to be heap, although some trickery around logging the heap relfilenodes was required). But to properly handle table AMs we need per-database catalog access to look up the AM handler, which recovery doesn't allow. 2) Not knowing the xid horizon also makes it hard to support logical decoding on standbys. When on a catalog table, we need to be able to conflict with slots that have an xid horizon that's too old. But computing the horizon by visiting the heap only works once consistency is reached, but we always need to be able to detect conflicts. There's also a secondary problem, in that the current method performs redundant work on every standby. But that's counterbalanced by potentially computing the value when not necessary (either because there's no standby, or because there's no connected backends). Solve 1) and 2) by moving computation of the xid horizon to the primary and by involving tableam in the computation of the horizon. To address the potentially increased overhead, increase the efficiency of the xid horizon computation for heap by sorting the tids, and eliminating redundant buffer accesses. When prefetching is available, additionally perform prefetching of buffers. As this is more of a maintenance task, rather than something routinely done in every read only query, we add an arbitrary 10 to the effective concurrency - thereby using IO concurrency, when not globally enabled. That's possibly not the perfect formula, but seems good enough for now. Bumps WAL format, as latestRemovedXid is now part of the records, and the heap's relfilenode isn't anymore. Author: Andres Freund, Amit Khandekar, Robert Haas Reviewed-By: Robert Haas Discussion: https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Fix oversight in data-type change for autovacuum_vacuum_cost_delay.Tom Lane2019-03-26
| | | | | | | | | | | | | Commit caf626b2c missed that the relevant reloptions entry needs to be moved from the intRelOpts[] array to realRelOpts[]. Somewhat surprisingly, it seems to work anyway, perhaps because the desired default and limit values are all integers. We ought to have either a simpler data structure or better cross-checking here, but that's for another patch. Nikolay Shaplov Discussion: https://postgr.es/m/4861742.12LTaSB3sv@x200m
* Remove heap_hot_search().Andres Freund2019-03-25
| | | | | | | | | | | | | After 71bdc99d0d7, "tableam: Add helper for indexes to check if a corresponding table tuples exist." there's no in-core user left. As there's unlikely to be an external user, and such an external user could easily be adjusted to use table_index_fetch_tuple_check(), remove heap_hot_search(). Per complaint from Peter Geoghegan Author: Andres Freund Discussion: https://postgr.es/m/CAH2-Wzn0Oq4ftJrTqRAsWy2WGjv0QrJcwoZ+yqWsF_Z5vjUBFw@mail.gmail.com
* tableam: Add table_get_latest_tid, to wrap heap_get_latest_tid.Andres Freund2019-03-25
| | | | | | | | | | This primarily is to allow WHERE CURRENT OF to continue to work as it currently does. It's not clear to me that these semantics make sense for every AM, but it works for the in-core heap, and the out of core zheap. We can refine it further at a later point if necessary. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: Add helper for indexes to check if a corresponding table tuples exist.Andres Freund2019-03-25
| | | | | | | | This is, likely exclusively, useful to verify that conflicts detected in a unique index are with live tuples, rather than dead ones. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Add "split after new tuple" nbtree optimization.Peter Geoghegan2019-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add additional heuristics to the algorithm for locating an optimal split location. New logic identifies localized monotonically increasing values in indexes with multiple columns. When this insertion pattern is detected, page splits split just after the new item that provoked a page split (or apply leaf fillfactor in the style of a rightmost page split). This optimization is a variation of the long established leaf fillfactor optimization used during rightmost page splits. 50/50 page splits are only appropriate with a pattern of truly random insertions, where the average space utilization ends up at 65% - 70%. Without this patch, affected cases have leaf pages that are no more than about 50% full on average. Future insertions can never make use of the free space left behind. With this patch, affected cases have leaf pages that are about 90% full on average (assuming a fillfactor of 90). Localized monotonically increasing insertion patterns are presumed to be fairly common in real-world applications. There is a fair amount of anecdotal evidence for this. Both pg_depend system catalog indexes (pg_depend_depender_index and pg_depend_reference_index) are at least 20% smaller after the regression tests are run when the optimization is available. Furthermore, many of the indexes created by a fair use implementation of TPC-C for Postgres are consistently about 40% smaller when the optimization is available. Note that even pg_upgrade'd v3 indexes make use of this optimization. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas Discussion: https://postgr.es/m/CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com
* tableam: Add and use table_fetch_row_version().Andres Freund2019-03-25
| | | | | | | | | | | | | | | | | | This is essentially the tableam version of heapam_fetch(), i.e. fetching a tuple identified by a tid, performing visibility checks. Note that this different from table_index_fetch_tuple(), which is for index lookups. It therefore has to handle a tid pointing to an earlier version of a tuple if the AM uses an optimization like heap's HOT. Add comments to that end. This commit removes the stats_relation argument from heap_fetch, as it's been unused for a long time. Author: Andres Freund Reviewed-By: Haribabu Kommi Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Remove dead code from nbtsplitloc.c.Peter Geoghegan2019-03-24
| | | | | | | | | | It doesn't make sense to consider the possibility that there will only be one candidate split point when choosing among split points to find the split with the lowest penalty. This is a vestige of an earlier version of the patch that became commit fab25024. Issue spotted while rereviewing coverage of the nbtree patch series using gcov.
* Transaction chainingPeter Eisentraut2019-03-24
| | | | | | | | | | | | | Add command variants COMMIT AND CHAIN and ROLLBACK AND CHAIN, which start new transactions with the same transaction characteristics as the just finished one, per SQL standard. Support for transaction chaining in PL/pgSQL is also added. This functionality is especially useful when running COMMIT in a loop in PL/pgSQL. Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr> Discussion: https://www.postgresql.org/message-id/flat/28536681-324b-10dc-ade8-ab46f7645a5a@2ndquadrant.com
* tableam: Add tuple_{insert, delete, update, lock} and use.Andres Freund2019-03-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds new, required, table AM callbacks for insert/delete/update and lock_tuple. To be able to reasonably use those, the EvalPlanQual mechanism had to be adapted, moving more logic into the AM. Previously both delete/update/lock call-sites and the EPQ mechanism had to have awareness of the specific tuple format to be able to fetch the latest version of a tuple. Obviously that needs to be abstracted away. To do so, move the logic that find the latest row version into the AM. lock_tuple has a new flag argument, TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last version, rather than the current one. It'd have been possible to do so via a separate callback as well, but finding the last version usually also necessitates locking the newest version, making it sensible to combine the two. This replaces the previous use of EvalPlanQualFetch(). Additionally HeapTupleUpdated, which previously signaled either a concurrent update or delete, is now split into two, to avoid callers needing AM specific knowledge to differentiate. The move of finding the latest row version into tuple_lock means that encountering a row concurrently moved into another partition will now raise an error about "tuple to be locked" rather than "tuple to be updated/deleted" - which is accurate, as that always happens when locking rows. While possible slightly less helpful for users, it seems like an acceptable trade-off. As part of this commit HTSU_Result has been renamed to TM_Result, and its members been expanded to differentiated between updating and deleting. HeapUpdateFailureData has been renamed to TM_FailureData. The interface to speculative insertion is changed so nodeModifyTable.c does not have to set the speculative token itself anymore. Instead there's a version of tuple_insert, tuple_insert_speculative, that performs the speculative insertion (without requiring a flag to signal that fact), and the speculative insertion is either made permanent with table_complete_speculative(succeeded = true) or aborted with succeeded = false). Note that multi_insert is not yet routed through tableam, nor is COPY. Changing multi_insert requires changes to copy.c that are large enough to better be done separately. Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED VIEW are also only going to be adjusted in a later commit. Author: Andres Freund and Haribabu Kommi Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
* Add nbtree high key "continuescan" optimization.Peter Geoghegan2019-03-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Teach nbtree forward index scans to check the high key before moving to the right sibling page in the hope of finding that it isn't actually necessary to do so. The new check may indicate that the scan definitely cannot find matching tuples to the right, ending the scan immediately. We already opportunistically force a similar "continuescan orientated" key check of the final non-pivot tuple when it's clear that it cannot be returned to the scan due to being dead-to-all. The new high key check is complementary. The new approach for forward scans is more effective than checking the final non-pivot tuple, especially with composite indexes and non-unique indexes. The improvements to the logic for picking a split point added by commit fab25024 make it likely that relatively dissimilar high keys will appear on a page. A distinguishing key value that can only appear on non-pivot tuples on the right sibling page will often be present in leaf page high keys. Since forcing the final item to be key checked no longer makes any difference in the case of forward scans, the existing extra key check is now only used for backwards scans. Backward scans continue to opportunistically check the final non-pivot tuple, which is actually the first non-pivot tuple on the page (not the last). Note that even pg_upgrade'd v3 indexes make use of this optimization. Author: Peter Geoghegan, Heikki Linnakangas Reviewed-By: Heikki Linnakangas Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com
* Fix bug in the GiST vacuum's 2nd stage.Heikki Linnakangas2019-03-22
| | | | | | | We mustn't assume that the IndexVacuumInfo pointer passed to bulkdelete() stage is still valid in the vacuumcleanup() stage. Per very pink buildfarm.
* Delete empty pages during GiST VACUUM.Heikki Linnakangas2019-03-22
| | | | | | | | | | | | | | | To do this, we scan GiST two times. In the first pass we make note of empty leaf pages and internal pages. At second pass we scan through internal pages, looking for downlinks to the empty pages. Deleting internal pages is still not supported, like in nbtree, the last child of an internal page is never deleted. That means that if you have a workload where new keys are always inserted to different area than where old keys are removed, the index will still grow without bound. But the rate of growth will be an order of magnitude slower than before. Author: Andrey Borodin Discussion: https://www.postgresql.org/message-id/B1E4DF12-6CD3-4706-BDBD-BF3283328F60@yandex-team.ru
* Collations with nondeterministic comparisonPeter Eisentraut2019-03-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a flag "deterministic" to collations. If that is false, such a collation disables various optimizations that assume that strings are equal only if they are byte-wise equal. That then allows use cases such as case-insensitive or accent-insensitive comparisons or handling of strings with different Unicode normal forms. This functionality is only supported with the ICU provider. At least glibc doesn't appear to have any locales that work in a nondeterministic way, so it's not worth supporting this for the libc provider. The term "deterministic comparison" in this context is from Unicode Technical Standard #10 (https://unicode.org/reports/tr10/#Deterministic_Comparison). This patch makes changes in three areas: - CREATE COLLATION DDL changes and system catalog changes to support this new flag. - Many executor nodes and auxiliary code are extended to track collations. Previously, this code would just throw away collation information, because the eventually-called user-defined functions didn't use it since they only cared about equality, which didn't need collation information. - String data type functions that do equality comparisons and hashing are changed to take the (non-)deterministic flag into account. For comparison, this just means skipping various shortcuts and tie breakers that use byte-wise comparison. For hashing, we first need to convert the input string to a canonical "sort key" using the ICU analogue of strxfrm(). Reviewed-by: Daniel Verite <daniel@manitou-mail.org> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
* Fix spurious compiler warning in nbtxlog.c.Peter Geoghegan2019-03-20
| | | | | | Cleanup from commit dd299df8. Per complaint from Tom Lane.
* Consider secondary factors during nbtree splits.Peter Geoghegan2019-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Teach nbtree to give some consideration to how "distinguishing" candidate leaf page split points are. This should not noticeably affect the balance of free space within each half of the split, while still making suffix truncation truncate away significantly more attributes on average. The logic for choosing a leaf split point now uses a fallback mode in the case where the page is full of duplicates and it isn't possible to find even a minimally distinguishing split point. When the page is full of duplicates, the split should pack the left half very tightly, while leaving the right half mostly empty. Our assumption is that logical duplicates will almost always be inserted in ascending heap TID order with v4 indexes. This strategy leaves most of the free space on the half of the split that will likely be where future logical duplicates of the same value need to be placed. The number of cycles added is not very noticeable. This is important because deciding on a split point takes place while at least one exclusive buffer lock is held. We avoid using authoritative insertion scankey comparisons to save cycles, unlike suffix truncation proper. We use a faster binary comparison instead. Note that even pg_upgrade'd v3 indexes make use of these optimizations. Benchmarking has shown that even v3 indexes benefit, despite the fact that suffix truncation will only truncate non-key attributes in INCLUDE indexes. Grouping relatively similar tuples together is beneficial in and of itself, since it reduces the number of leaf pages that must be accessed by subsequent index scans. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
* Make heap TID a tiebreaker nbtree index column.Peter Geoghegan2019-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make nbtree treat all index tuples as having a heap TID attribute. Index searches can distinguish duplicates by heap TID, since heap TID is always guaranteed to be unique. This general approach has numerous benefits for performance, and is prerequisite to teaching VACUUM to perform "retail index tuple deletion". Naively adding a new attribute to every pivot tuple has unacceptable overhead (it bloats internal pages), so suffix truncation of pivot tuples is added. This will usually truncate away the "extra" heap TID attribute from pivot tuples during a leaf page split, and may also truncate away additional user attributes. This can increase fan-out, especially in a multi-column index. Truncation can only occur at the attribute granularity, which isn't particularly effective, but works well enough for now. A future patch may add support for truncating "within" text attributes by generating truncated key values using new opclass infrastructure. Only new indexes (BTREE_VERSION 4 indexes) will have insertions that treat heap TID as a tiebreaker attribute, or will have pivot tuples undergo suffix truncation during a leaf page split (on-disk compatibility with versions 2 and 3 is preserved). Upgrades to version 4 cannot be performed on-the-fly, unlike upgrades from version 2 to version 3. contrib/amcheck continues to work with version 2 and 3 indexes, while also enforcing stricter invariants when verifying version 4 indexes. These stricter invariants are the same invariants described by "3.1.12 Sequencing" from the Lehman and Yao paper. A later patch will enhance the logic used by nbtree to pick a split point. This patch is likely to negatively impact performance without smarter choices around the precise point to split leaf pages at. Making these two mostly-distinct sets of enhancements into distinct commits seems like it might clarify their design, even though neither commit is particularly useful on its own. The maximum allowed size of new tuples is reduced by an amount equal to the space required to store an extra MAXALIGN()'d TID in a new high key during leaf page splits. The user-facing definition of the "1/3 of a page" restriction is already imprecise, and so does not need to be revised. However, there should be a compatibility note in the v12 release notes. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas, Alexander Korotkov Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
* Refactor nbtree insertion scankeys.Peter Geoghegan2019-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use dedicated struct to represent nbtree insertion scan keys. Having a dedicated struct makes the difference between search type scankeys and insertion scankeys a lot clearer, and simplifies the signature of several related functions. This is based on a suggestion by Andrey Lepikhov. Streamline how unique index insertions cache binary search progress. Cache the state of in-progress binary searches within _bt_check_unique() for later instead of having callers avoid repeating the binary search in an ad-hoc manner. This makes it easy to add a new optimization: _bt_check_unique() now falls out of its loop immediately in the common case where it's already clear that there couldn't possibly be a duplicate. The new _bt_check_unique() scheme makes it a lot easier to manage cached binary search effort afterwards, from within _bt_findinsertloc(). This is needed for the upcoming patch to make nbtree tuples unique by treating heap TID as a final tiebreaker column. Unique key binary searches need to restore lower and upper bounds. They cannot simply continue to use the >= lower bound as the offset to insert at, because the heap TID tiebreaker column must be used in comparisons for the restored binary search (unlike the original _bt_check_unique() binary search, where scankey's heap TID column must be omitted). Author: Peter Geoghegan, Heikki Linnakangas Reviewed-By: Heikki Linnakangas, Andrey Lepikhov Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
* Tweak nbtsearch.c function prototype order.Peter Geoghegan2019-03-19
| | | | | nbtsearch.c's static function prototypes were slightly out of order. Make the order consistent with static function definition order.
* Fix memory leak in printtup.c.Tom Lane2019-03-18
| | | | | | | | | | | | | | | | Commit f2dec34e1 changed things so that printtup's output stringinfo buffer was allocated outside the per-row temporary context, not inside it. This creates a need to free that buffer explicitly when the temp context is freed, but that was overlooked. In most cases, this is all happening inside a portal or executor context that will go away shortly anyhow, but that's not always true. Notably, the stringinfo ends up getting leaked when JDBC uses row-at-a-time fetches. For a query that returns wide rows, that adds up after awhile. Per bug #15700 from Matthias Otterbach. Back-patch to v11 where the faulty code was added. Discussion: https://postgr.es/m/15700-8c408321a87d56bb@postgresql.org
* Fold vacuum's 'int options' parameter into VacuumParams.Robert Haas2019-03-18
| | | | | | | | | | | | | Many places need both, so this allows a few functions to take one fewer parameter. More importantly, as soon as we add a VACUUM option that takes a non-Boolean parameter, we need to replace 'int options' with a struct, and it seems better to think of adding more fields to VacuumParams rather than passing around both VacuumParams and a separate struct as well. Patch by me, reviewed by Masahiko Sawada Discussion: http://postgr.es/m/CA+Tgmob6g6-s50fyv8E8he7APfwCYYJ4z0wbZC2yZeSz=26CYQ@mail.gmail.com
* Refactor more code logic to update the control fileMichael Paquier2019-03-18
| | | | | | | | | | | | | | | | | | | | | | ce6afc6 has begun the refactoring work by plugging pg_rewind into a central routine to update the control file, and left around two extra copies, with one in xlog.c for the backend and one in pg_resetwal.c. By adding an extra option to the central routine in controldata_utils.c to control if a flush of the control file needs to be done, it is proving to be straight-forward to make xlog.c and pg_resetwal.c use the central code path at the condition of moving the wait event tracking there. Hence, this allows to have only one central code path to update the control file, shaving the code from the duplicates. This refactoring actually fixes a problem in pg_resetwal. Previously, the control file was first removed before being recreated. So if a crash happened between the moment the file was removed and the moment the file was created, then it would have been possible to not have a control file anymore in the database folder. Author: Fabien Coelho Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/alpine.DEB.2.21.1903170935210.2506@lancre
* Avoid casting away a constPeter Eisentraut2019-03-16
|
* Enable parallel query with SERIALIZABLE isolation.Thomas Munro2019-03-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the SERIALIZABLE isolation level prevented parallel query from being used. Allow the two features to be used together by sharing the leader's SERIALIZABLEXACT with parallel workers. An extra per-SERIALIZABLEXACT LWLock is introduced to make it safe to share, and new logic is introduced to coordinate the early release of the SERIALIZABLEXACT required for the SXACT_FLAG_RO_SAFE optimization, as follows: The first backend to observe the SXACT_FLAG_RO_SAFE flag (set by some other transaction) will 'partially release' the SERIALIZABLEXACT, meaning that the conflicts and locks it holds are released, but the SERIALIZABLEXACT itself will remain active because other backends might still have a pointer to it. Whenever any backend notices the SXACT_FLAG_RO_SAFE flag, it clears its own MySerializableXact variable and frees local resources so that it can skip SSI checks for the rest of the transaction. In the special case of the leader process, it transfers the SERIALIZABLEXACT to a new variable SavedSerializableXact, so that it can be completely released at the end of the transaction after all workers have exited. Remove the serializable_okay flag added to CreateParallelContext() by commit 9da0cc35, because it's now redundant. Author: Thomas Munro Reviewed-by: Haribabu Kommi, Robert Haas, Masahiko Sawada, Kevin Grittner Discussion: https://postgr.es/m/CAEepm=0gXGYhtrVDWOTHS8SQQy_=S9xo+8oCxGLWZAOoeJ=yzQ@mail.gmail.com
* Correct obsolete nbtree page split comment.Peter Geoghegan2019-03-12
| | | | | | | | | | | | | Commit 40dae7ec537, which made the nbtree page split algorithm more robust, made _bt_insert_parent() only unlock the right child of the parent page before inserting a new downlink into the parent. Update a comment from the Berkeley days claiming that both left and right child pages are unlocked before the new downlink actually gets inserted. The claim that it is okay to release both locks early based on Lehman and Yao's say-so never made much sense. Lehman and Yao must sometimes "couple" buffer locks across a pair of internal pages when relocating a downlink, unlike the corresponding code within _bt_getstack().
* Ensure sufficient alignment for ParallelTableScanDescData in BTShared.Andres Freund2019-03-11
| | | | | | | | | | | | | | | | | | | | | | | Previously ParallelTableScanDescData was just a member in BTShared, but after c2fe139c2 that doesn't guarantee sufficient alignment as specific AMs might (are likely to) need atomic variables in the struct. One might think that MAXALIGNing would be sufficient, but as a comment in shm_toc_allocate() explains, that's not enough. For now, copy the hack described there. For parallel sequential scans no such change is needed, as its allocations go through shm_toc_allocate(). An alternative approach would have been to allocate the parallel scan descriptor in a separate TOC entry, but there seems little benefit in doing so. Per buildfarm member dromedary. Author: Andres Freund Discussion: https://postgr.es/m/20190311203126.ty5gbfz42gjbm6i6@alap3.anarazel.de
* tableam: Add and use scan APIs.Andres Freund2019-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Too allow table accesses to be not directly dependent on heap, several new abstractions are needed. Specifically: 1) Heap scans need to be generalized into table scans. Do this by introducing TableScanDesc, which will be the "base class" for individual AMs. This contains the AM independent fields from HeapScanDesc. The previous heap_{beginscan,rescan,endscan} et al. have been replaced with a table_ version. There's no direct replacement for heap_getnext(), as that returned a HeapTuple, which is undesirable for a other AMs. Instead there's table_scan_getnextslot(). But note that heap_getnext() lives on, it's still used widely to access catalog tables. This is achieved by new scan_begin, scan_end, scan_rescan, scan_getnextslot callbacks. 2) The portion of parallel scans that's shared between backends need to be able to do so without the user doing per-AM work. To achieve that new parallelscan_{estimate, initialize, reinitialize} callbacks are introduced, which operate on a new ParallelTableScanDesc, which again can be subclassed by AMs. As it is likely that several AMs are going to be block oriented, block oriented callbacks that can be shared between such AMs are provided and used by heap. table_block_parallelscan_{estimate, intiialize, reinitialize} as callbacks, and table_block_parallelscan_{nextpage, init} for use in AMs. These operate on a ParallelBlockTableScanDesc. 3) Index scans need to be able to access tables to return a tuple, and there needs to be state across individual accesses to the heap to store state like buffers. That's now handled by introducing a sort-of-scan IndexFetchTable, which again is intended to be subclassed by individual AMs (for heap IndexFetchHeap). The relevant callbacks for an AM are index_fetch_{end, begin, reset} to create the necessary state, and index_fetch_tuple to retrieve an indexed tuple. Note that index_fetch_tuple implementations need to be smarter than just blindly fetching the tuples for AMs that have optimizations similar to heap's HOT - the currently alive tuple in the update chain needs to be fetched if appropriate. Similar to table_scan_getnextslot(), it's undesirable to continue to return HeapTuples. Thus index_fetch_heap (might want to rename that later) now accepts a slot as an argument. Core code doesn't have a lot of call sites performing index scans without going through the systable_* API (in contrast to loads of heap_getnext calls and working directly with HeapTuples). Index scans now store the result of a search in IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the target is not generally a HeapTuple anymore that seems cleaner. To be able to sensible adapt code to use the above, two further callbacks have been introduced: a) slot_callbacks returns a TupleTableSlotOps* suitable for creating slots capable of holding a tuple of the AMs type. table_slot_callbacks() and table_slot_create() are based upon that, but have additional logic to deal with views, foreign tables, etc. While this change could have been done separately, nearly all the call sites that needed to be adapted for the rest of this commit also would have been needed to be adapted for table_slot_callbacks(), making separation not worthwhile. b) tuple_satisfies_snapshot checks whether the tuple in a slot is currently visible according to a snapshot. That's required as a few places now don't have a buffer + HeapTuple around, but a slot (which in heap's case internally has that information). Additionally a few infrastructure changes were needed: I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now internally uses a slot to keep track of tuples. While systable_getnext() still returns HeapTuples, and will so for the foreseeable future, the index API (see 1) above) now only deals with slots. The remainder, and largest part, of this commit is then adjusting all scans in postgres to use the new APIs. Author: Andres Freund, Haribabu Kommi, Alvaro Herrera Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
* Fix typos in commit 8586bf7ed8.Amit Kapila2019-03-11
| | | | | Author: Amit Kapila Discussion: https://postgr.es/m/CAA4eK1KNv1Mg2krf4E9ssWFnE=8A9mZ1VbVywXBZTFSzb+wP2g@mail.gmail.com
* Move hash_any prototype from access/hash.h to utils/hashutils.hAlvaro Herrera2019-03-11
| | | | | | | | | | | | | | | | | | | | | ... as well as its implementation from backend/access/hash/hashfunc.c to backend/utils/hash/hashfn.c. access/hash is the place for the hash index AM, not really appropriate for generic facilities, which is what hash_any is; having things the old way meant that anything using hash_any had to include the AM's include file, pointlessly polluting its namespace with unrelated, unnecessary cruft. Also move the HTEqual strategy number to access/stratnum.h from access/hash.h. To avoid breaking third-party extension code, add an #include "utils/hashutils.h" to access/hash.h. (An easily removed line by committers who enjoy their asbestos suits to protect them from angry extension authors.) Discussion: https://postgr.es/m/201901251935.ser5e4h6djt2@alvherre.pgsql
* Adjust error message for partial writes in WAL segmentsMichael Paquier2019-03-11
| | | | | | | | | | | | 93473c6 has removed openLogOff, changing on the way the error message which is used to report partial writes to WAL segments. The newly-introduced error message used the offset up to which the write has happened, keeping always the same total length to write. This changes the error message so as the number of bytes left to write are reported. Reported-by: Michael Paquier Author: Robert Haas Discussion: https://postgr.es/m/20190306235251.GA17293@paquier.xyz
* Convert [autovacuum_]vacuum_cost_delay into floating-point GUCs.Tom Lane2019-03-10
| | | | | | | | | | | | | | | | | | | | | This change makes it possible to specify sub-millisecond delays, which work well on most modern platforms, though that was not true when the cost-delay feature was designed. To support this without breaking existing configuration entries, improve guc.c to allow floating-point GUCs to have units. Also, allow "us" (microseconds) as an input/output unit for time-unit GUCs. (It's not allowed as a base unit, at least not yet.) Likewise change the autovacuum_vacuum_cost_delay reloption to be floating-point; this forces a catversion bump because the layout of StdRdOptions changes. This patch doesn't in itself change the default values or allowed ranges for these parameters, and it should not affect the behavior for any already-allowed setting for them. Discussion: https://postgr.es/m/1798.1552165479@sss.pgh.pa.us
* Support for INCLUDE attributes in GiST indexesAlexander Korotkov2019-03-10
| | | | | | | | | | | | | | | | | Similarly to B-tree, GiST index access method gets support of INCLUDE attributes. These attributes aren't used for tree navigation and aren't present in non-leaf pages. But they are present in leaf pages and can be fetched during index-only scan. The point of having INCLUDE attributes in GiST indexes is slightly different from the point of having them in B-tree. The main point of INCLUDE attributes in B-tree is to define UNIQUE constraint over part of attributes enabled for index-only scan. In GiST the main point of INCLUDE attributes is to use index-only scan for attributes, whose data types don't have GiST opclasses. Discussion: https://postgr.es/m/73A1A452-AD5F-40D4-BD61-978622FF75C1%40yandex-team.ru Author: Andrey Borodin, with small changes by me Reviewed-by: Andreas Karlsson
* Tighten use of OpenTransientFile and CloseTransientFileMichael Paquier2019-03-09
| | | | | | | | | | | | | | | | | | | | | | | | | | This fixes two sets of issues related to the use of transient files in the backend: 1) OpenTransientFile() has been used in some code paths with read-write flags while read-only is sufficient, so switch those calls to be read-only where necessary. These have been reported by Joe Conway. 2) When opening transient files, it is up to the caller to close the file descriptors opened. In error code paths, CloseTransientFile() gets called to clean up things before issuing an error. However in normal exit paths, a lot of callers of CloseTransientFile() never actually reported errors, which could leave a file descriptor open without knowing about it. This is an issue I complained about a couple of times, but never had the courage to write and submit a patch, so here we go. Note that one frontend code path is impacted by this commit so as an error is issued when fetching control file data, making backend and frontend to be treated consistently. Reported-by: Joe Conway, Michael Paquier Author: Michael Paquier Reviewed-by: Álvaro Herrera, Georgios Kokolatos, Joe Conway Discussion: https://postgr.es/m/20190301023338.GD1348@paquier.xyz Discussion: https://postgr.es/m/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com
* tableam: introduce table AM infrastructure.Andres Freund2019-03-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This introduces the concept of table access methods, i.e. CREATE ACCESS METHOD ... TYPE TABLE and CREATE TABLE ... USING (storage-engine). No table access functionality is delegated to table AMs as of this commit, that'll be done in following commits. Subsequent commits will incrementally abstract table access functionality to be routed through table access methods. That change is too large to be reviewed & committed at once, so it'll be done incrementally. Docs will be updated at the end, as adding them incrementally would likely make them less coherent, and definitely is a lot more work, without a lot of benefit. Table access methods are specified similar to index access methods, i.e. pg_am.amhandler returns, as INTERNAL, a pointer to a struct with callbacks. In contrast to index AMs that struct needs to live as long as a backend, typically that's achieved by just returning a pointer to a constant struct. Psql's \d+ now displays a table's access method. That can be disabled with HIDE_TABLEAM=true, which is mainly useful so regression tests can be run against different AMs. It's quite possible that this behaviour still needs to be fine tuned. For now it's not allowed to set a table AM for a partitioned table, as we've not resolved how partitions would inherit that. Disallowing allows us to introduce, if we decide that's the way forward, such a behaviour without a compatibility break. Catversion bumped, to add the heap table AM and references to it. Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql https://postgr.es/m/20190107235616.6lur25ph22u5u5av@alap3.anarazel.de https://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
* Removed unused variable, openLogOff.Robert Haas2019-03-06
| | | | | | Antonin Houska Discussion: http://postgr.es/m/30413.1551870730@localhost
* Scan GiST indexes in physical order during VACUUM.Heikki Linnakangas2019-03-05
| | | | | | | | | | | | Scanning an index in physical order is faster than walking it in logical order, because sequential I/O is faster than random I/O. The idea and code structure is borrowed from B-tree vacuum code. Patch by Andrey Borodin, with changes by me. Based on early work by Konstantin Kuznetsov, although the patch has been rewritten multiple times since his original version. Discussion: https://www.postgresql.org/message-id/1B9FAC6F-FA19-4A24-8C1B-F4F574844892%40yandex-team.ru
* Note case where nbtree VACUUM finishes splits.Peter Geoghegan2019-03-04
| | | | | | | | The nbtree README claims that VACUUM can never finish interrupted page splits by design. That isn't entirely accurate, though. Note an exception to the general rule. Discussion: https://postgr.es/m/CAH2-Wz=_Xvv8byzK_LvY4ci76OgsHCQzoKF7We8yG9waO7j6rA@mail.gmail.com