aboutsummaryrefslogtreecommitdiff
path: root/contrib
Commit message (Collapse)AuthorAge
* postgres_fdw: Fix typo in comment.Etsuro Fujita2019-05-13
|
* postgres_fdw: Fix cost estimation for aggregate pushdown.Etsuro Fujita2019-05-09
| | | | | | | | | | | | | | | | | In commit 7012b132d0, which added support for aggregate pushdown in postgres_fdw, the expense of evaluating the final scan/join target computed by make_group_input_target() was not accounted for at all in costing aggregate pushdown paths with local statistics. The right fix for this would be to have a separate upper stage to adjust the final scan/join relation (see comments for apply_scanjoin_target_to_paths()); but for now, fix by adding the tlist eval cost when costing aggregate pushdown paths with local statistics. Apply this to HEAD only to avoid destabilizing existing plan choices. Author: Etsuro Fujita Reviewed-By: Antonin Houska Discussion: https://postgr.es/m/5C66A056.60007%40lab.ntt.co.jp
* Clean up the behavior and API of catalog.c's is-catalog-relation tests.Tom Lane2019-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The right way for IsCatalogRelation/Class to behave is to return true for OIDs less than FirstBootstrapObjectId (not FirstNormalObjectId), without any of the ad-hoc fooling around with schema membership. The previous code was wrong because (1) it claimed that information_schema tables were not catalog relations but their toast tables were, which is silly; and (2) if you dropped and recreated information_schema, which is a supported operation, the behavior changed. That's even sillier. With this definition, "catalog relations" are exactly the ones traceable to the postgres.bki data, which seems like what we want. With this simplification, we don't actually need access to the pg_class tuple to identify a catalog relation; we only need its OID. Hence, replace IsCatalogClass with "IsCatalogRelationOid(oid)". But keep IsCatalogRelation as a convenience function. This allows fixing some arguably-wrong semantics in contrib/sepgsql and ReindexRelationConcurrently, which were using an IsSystemNamespace test where what they really should be using is IsCatalogRelationOid. The previous coding failed to protect toast tables of system catalogs, and also was not on board with the general principle that user-created tables do not become catalogs just by virtue of being renamed into pg_catalog. We can also get rid of a messy hack in ReindexMultipleTables. While we're at it, also rename IsSystemNamespace to IsCatalogNamespace, because the previous name invited confusion with the more expansive semantics used by IsSystemRelation/Class. Also improve the comments in catalog.c. There are a few remaining places in replication-related code that are special-casing OIDs below FirstNormalObjectId. I'm inclined to think those are wrong too, and if there should be any special case it should just extend to FirstBootstrapObjectId. But first we need to debate whether a FOR ALL TABLES publication should include information_schema. Discussion: https://postgr.es/m/21697.1557092753@sss.pgh.pa.us Discussion: https://postgr.es/m/15150.1557257111@sss.pgh.pa.us
* Revert "Avoid the creation of the free space map for small heap relations".Amit Kapila2019-05-07
| | | | | | | | | | | | | | | | | | | | | This feature was using a process local map to track the first few blocks in the relation. The map was reset each time we get the block with enough freespace. It was discussed that it would be better to track this map on a per-relation basis in relcache and then invalidate the same whenever vacuum frees up some space in the page or when FSM is created. The new design would be better both in terms of API design and performance. List of commits reverted, in reverse chronological order: 06c8a5090e Improve code comments in b0eaa4c51b. 13e8643bfc During pg_upgrade, conditionally skip transfer of FSMs. 6f918159a9 Add more tests for FSM. 9c32e4c350 Clear the local map when not used. 29d108cdec Update the documentation for FSM behavior.. 08ecdfe7e5 Make FSM test portable. b0eaa4c51b Avoid creation of the free space map for small heap relations. Discussion: https://postgr.es/m/20190416180452.3pm6uegx54iitbt5@alap3.anarazel.de
* Suppress compiler warning in non-SSL, non-assert builds.Noah Misch2019-05-03
| | | | | | Jeff Janes, reviewed by Michael Paquier. Discussion: https://postgr.es/m/CAMkU=1x8taZfsbPkv_MsWbTtzibW_yQHXoMhF_DTtm=z2hVHDg@mail.gmail.com
* Fix potential assertion failure when reindexing a pg_class index.Andres Freund2019-04-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reindexing individual indexes on pg_class it was possible to either trigger an assertion failure: TRAP: FailedAssertion("!(!ReindexIsProcessingIndex(((index)->rd_id))) That's because reindex_index() called SetReindexProcessing() - which enables an asserts ensuring no index insertions happen into the index - before calling RelationSetNewRelfilenode(). That not correct for indexes on pg_class, because RelationSetNewRelfilenode() updates the relevant pg_class row, which needs to update the indexes. The are two reasons this wasn't noticed earlier. Firstly the bug doesn't trigger when reindexing all of pg_class, as reindex_relation has code "hiding" all yet-to-be-reindexed indexes. Secondly, the bug only triggers when the the update to pg_class doesn't turn out to be a HOT update - otherwise there's no index insertion to trigger the bug. Most of the time there's enough space, making this bug hard to trigger. To fix, move RelationSetNewRelfilenode() to before the SetReindexProcessing() (and, together with some other code, to outside of the PG_TRY()). To make sure the error checking intended by SetReindexProcessing() is more robust, modify CatalogIndexInsert() to check ReindexIsProcessingIndex() even when the update is a HOT update. Also add a few regression tests for REINDEXing of system catalogs. The last two improvements would have prevented some of the issues fixed in 5c1560606dc4c from being introduced in the first place. Reported-By: Michael Paquier Diagnosed-By: Tom Lane and Andres Freund Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20190418011430.GA19133@paquier.xyz Backpatch: 9.4-, the bug is present in all branches
* Avoid postgres_fdw crash for a targetlist entry that's just a Param.Tom Lane2019-04-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | foreign_grouping_ok() is willing to put fairly arbitrary expressions into the targetlist of a remote SELECT that's doing grouping or aggregation on the remote side, including expressions that have no foreign component to them at all. This is possibly a bit dubious from an efficiency standpoint; but it rises to the level of a crash-causing bug if the expression is just a Param or non-foreign Var. In that case, the expression will necessarily also appear in the fdw_exprs list of values we need to send to the remote server, and then setrefs.c's set_foreignscan_references will mistakenly replace the fdw_exprs entry with a Var referencing the targetlist result. The root cause of this problem is bad design in commit e7cb7ee14: it put logic into set_foreignscan_references that IMV is postgres_fdw-specific, and yet this bug shows that it isn't postgres_fdw-specific enough. The transformation being done on fdw_exprs assumes that fdw_exprs is to be evaluated with the fdw_scan_tlist as input, which is not how postgres_fdw uses it; yet it could be the right thing for some other FDW. (In the bigger picture, setrefs.c has no business assuming this for the other expression fields of a ForeignScan either.) The right fix therefore would be to expand the FDW API so that the FDW could inform setrefs.c how it intends to evaluate these various expressions. We can't change that in the back branches though, and we also can't just summarily change setrefs.c's behavior there, or we're likely to break external FDWs. As a stopgap, therefore, hack up postgres_fdw so that it won't attempt to send targetlist entries that look exactly like the fdw_exprs entries they'd produce. In most cases this actually produces a superior plan, IMO, with less data needing to be transmitted and returned; so we probably ought to think harder about whether we should ship tlist expressions at all when they don't contain any foreign Vars or Aggs. But that's an optimization not a bug fix so I left it for later. One case where this produces an inferior plan is where the expression in question is actually a GROUP BY expression: then the restriction prevents us from using remote grouping. It might be possible to work around that (since that would reduce to group-by-a-constant on the remote side); but it seems like a pretty unlikely corner case, so I'm not sure it's worth expending effort solely to improve that. In any case the right long-term answer is to fix the API as sketched above, and then revert this hack. Per bug #15781 from Sean Johnston. Back-patch to v10 where the problem was introduced. Discussion: https://postgr.es/m/15781-2601b1002bad087c@postgresql.org
* Sanitize line pointers within contrib/amcheck.Peter Geoghegan2019-04-25
| | | | | | | | | | | | | | | | | Adopt a more defensive approach to accessing index tuples in contrib/amcheck: verify that each line pointer looks sane before accessing associated tuple using pointer arithmetic based on line pointer's offset. This avoids undefined behavior and assertion failures in cases where line pointers are corrupt. Issue spotted following a complaint about an assertion failure by Grigory Smolkin, which involved a test harness that deliberately corrupts indexes. This is arguably a bugfix, but no backpatch given the lack of field reports beyond Grigory's. Discussion: https://postgr.es/m/CAH2-WzmkurhCqnyLHxk0VkOZqd49+ZZsp1xAJOg7j2x7dmp_XQ@mail.gmail.com
* postgres_fdw: Fix incorrect handling of row movement for remote partitions.Etsuro Fujita2019-04-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 3d956d9562 added support for update row movement in postgres_fdw. This patch fixes the following issues introduced by that commit: * When a remote partition chosen to insert routed rows into was also an UPDATE subplan target rel that would be updated later, the UPDATE that used a direct modification plan modified those routed rows incorrectly because those routed rows were visible to the later UPDATE command. The right fix for this would be to have some way in postgres_fdw in which the later UPDATE command ignores those routed rows, but it seems hard to do so with the current infrastructure. For now throw an error in that case. * When a remote partition chosen to insert routed rows into was also an UPDATE subplan target rel, fmstate created for the UPDATE that used a non-direct modification plan was mistakenly overridden by another fmstate created for inserting those routed rows into the partition. This caused 1) server crash when the partition would be updated later, and 2) resource leak when the partition had been already updated. To avoid that, adjust the treatment of the fmstate for the inserting. As for #1, since we would also have the incorrectness issue as mentioned above, error out in that case as well. Update the docs to mention that postgres_fdw currently does not handle the case where a remote partition chosen to insert a routed row into is also an UPDATE subplan target rel that will be updated later. Author: Amit Langote and Etsuro Fujita Reviewed-by: Amit Langote Backpatch-through: 11 where row movement in postgres_fdw was added Discussion: https://postgr.es/m/21e7eaa4-0d4d-20c2-a1f7-c7e96f4ce440@lab.ntt.co.jp
* Fix two memory leaks around force-storing tuples in slots.Andres Freund2019-04-19
| | | | | | | | | | | | | | | | | | | | As reported by Tom, when ExecStoreMinimalTuple() had to perform a conversion to store the minimal tuple in the slot, it forgot to respect the shouldFree flag, and leaked the tuple into the current memory context if true. Fix that by freeing the tuple in that case. Looking at the relevant code made me (Andres) realize that not having the shouldFree parameter to ExecForceStoreHeapTuple() was a bad idea. Some callers had to locally implement the necessary logic, and in one case it was missing, creating a potential per-group leak in non-hashed aggregation. The choice to not free the tuple in ExecComputeStoredGenerated() is not pretty, but not introduced by this commit - I'll start a separate discussion about it. Reported-By: Tom Lane Discussion: https://postgr.es/m/366.1555382816@sss.pgh.pa.us
* Fix more strcmp() calls using boolean-like comparisons for result checksMichael Paquier2019-04-12
| | | | | | | | | | Such calls can confuse the reader as strcmp() uses an integer as result. The places patched here have been spotted by Thomas Munro, David Rowley and myself. Author: Michael Paquier Reviewed-by: David Rowley Discussion: https://postgr.es/m/20190411021946.GG2728@paquier.xyz
* Avoid Python memory leaks in hstore_plpython and jsonb_plpython.Tom Lane2019-04-06
| | | | | | | | | | | | Fix some places where we might fail to do Py_DECREF() on a Python object (thereby leaking it for the rest of the session). Almost all of the risks were in error-recovery paths, which we don't really expect to hit anyway. Hence, while this is definitely a bug fix, it doesn't quite seem worth back-patching. Nikita Glukhov, Michael Paquier, Tom Lane Discussion: https://postgr.es/m/28053a7d-10d8-fc23-b05c-b4749c873f63@postgrespro.ru
* Add support TCP user timeout in libpq and the backend serverMichael Paquier2019-04-06
| | | | | | | | | | | | | | | | | Similarly to the set of parameters for keepalive, a connection parameter for libpq is added as well as a backend GUC, called tcp_user_timeout. Increasing the TCP user timeout is useful to allow a connection to survive extended periods without end-to-end connection, and decreasing it allows application to fail faster. By default, the parameter is 0, which makes the connection use the system default, and follows a logic close to the keepalive parameters in its handling. When connecting through a Unix-socket domain, the parameters have no effect. Author: Ryohei Nagaura Reviewed-by: Fabien Coelho, Robert Haas, Kyotaro Horiguchi, Kirk Jamison, Mikalai Keida, Takayuki Tsunakawa, Andrei Yahorau Discussion: https://postgr.es/m/EDA4195584F5064680D8130B1CA91C45367328@G01JPEXMBYT04
* Add facility to copy replication slotsAlvaro Herrera2019-04-05
| | | | | | | | | | | | | | This allows the user to create duplicates of existing replication slots, either logical or physical, and even changing properties such as whether they are temporary or the output plugin used. There are multiple uses for this, such as initializing multiple replicas using the slot for one base backup; when doing investigation of logical replication issues; and to select a different output plugins. Author: Masahiko Sawada Reviewed-by: Michael Paquier, Andres Freund, Petr Jelinek Discussion: https://postgr.es/m/CAD21AoAm7XX8y_tOPP6j4Nzzch12FvA1wPqiO690RCk+uYVstg@mail.gmail.com
* Add test coverage for rootdescend verification.Peter Geoghegan2019-04-04
| | | | | | | | | | | Commit c1afd175, which added support for rootdescend verification to amcheck, added only minimal regression test coverage. Address this by making sure that rootdescend verification is run on a multi-level index. In passing, simplify some of the regression tests that exercise multi-level nbtree page deletion. Both issues spotted while rereviewing coverage of the nbtree patch series using gcov.
* file_fdw: Fix for generated columnsPeter Eisentraut2019-04-04
| | | | | | | | | Since file_fdw uses COPY internally, but COPY doesn't allow listing generated columns in its column list, we need to make sure that we don't add generated columns to the column lists internally generated by file_fdw. Reported-by: Erik Rijkers <er@xs4all.nl>
* Add SETTINGS option to EXPLAIN, to print modified settings.Tomas Vondra2019-04-04
| | | | | | | | | | | | | | | Query planning is affected by a number of configuration options, and it may be crucial to know which of those options were set to non-default values. With this patch you can say EXPLAIN (SETTINGS ON) to include that information in the query plan. Only options affecting planning, with values different from the built-in default are printed. This patch also adds auto_explain.log_settings option, providing the same capability in auto_explain module. Author: Tomas Vondra Reviewed-by: Rafia Sabih, John Naylor Discussion: https://postgr.es/m/e1791b4c-df9c-be02-edc5-7c8874944be0@2ndquadrant.com
* Report progress of CREATE INDEX operationsAlvaro Herrera2019-04-02
| | | | | | | | | | | | | | | | | | | | This uses the progress reporting infrastructure added by c16dc1aca5e0, adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY. There are two pieces to this: one is index-AM-agnostic, and the other is AM-specific. The latter is fairly elaborate for btrees, including reportage for parallel index builds and the separate phases that btree index creation uses; other index AMs, which are much simpler in their building procedures, have simplistic reporting only, but that seems sufficient, at least for non-concurrent builds. The index-AM-agnostic part is fairly complete, providing insight into the CONCURRENTLY wait phases as well as block-based progress during the index validation table scan. (The index validation index scan requires patching each AM, which has not been included here.) Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
* postgres_fdw: Perform the (FINAL, NULL) upperrel operations remotely.Etsuro Fujita2019-04-02
| | | | | | | | | | | | | | | | The upper-planner pathification allows FDWs to arrange to push down different types of upper-stage operations to the remote side. This commit teaches postgres_fdw to do it for the (FINAL, NULL) upperrel, which is responsible for doing LockRows, LIMIT, and/or ModifyTable. This provides the ability for postgres_fdw to handle SELECT commands so that it 1) skips the LockRows step (if any) (note that this is safe since it performs early locking) and 2) pushes down the LIMIT and/or OFFSET restrictions (if any) to the remote side. This doesn't handle the INSERT/UPDATE/DELETE cases. Author: Etsuro Fujita Reviewed-By: Antonin Houska and Jeff Janes Discussion: https://postgr.es/m/87pnz1aby9.fsf@news-spur.riddles.org.uk
* postgres_fdw: Modify regression tests for EPQ-related planning problems.Etsuro Fujita2019-04-02
| | | | | | | | | This prevents the tests added by commit 4bbf6edfbd and adjusted by commit 99f6a17dd6 from being useless by plan changes created by an upcoming commit. Author: Etsuro Fujita Discussion: https://postgr.es/m/87pnz1aby9.fsf@news-spur.riddles.org.uk
* postgres_fdw: Perform the (ORDERED, NULL) upperrel operations remotely.Etsuro Fujita2019-04-02
| | | | | | | | | | | | | | The upper-planner pathification allows FDWs to arrange to push down different types of upper-stage operations to the remote side. This commit teaches postgres_fdw to do it for the (ORDERED, NULL) upperrel, which is responsible for evaluating the query's ORDER BY ordering. Since postgres_fdw is already able to evaluate that ordering remotely for foreign baserels and foreign joinrels (see commit aa09cd242f et al.), this adds support for that for foreign grouping relations. Author: Etsuro Fujita Reviewed-By: Antonin Houska and Jeff Janes Discussion: https://postgr.es/m/87pnz1aby9.fsf@news-spur.riddles.org.uk
* Only allow heap in a number of contrib modules.Andres Freund2019-04-01
| | | | | | | | | | Contrib modules pgrowlocks, pgstattuple and some functionality in pageinspect currently only supports the heap table AM. As they are all concerned with low-level details that aren't reasonably exposed via tableam, error out if invoked on a non heap relation. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: sample scan.Andres Freund2019-03-31
| | | | | | | | | | | | | | | | | | | | This moves sample scan support to below tableam. It's not optional as there is, in contrast to e.g. bitmap heap scans, no alternative way to perform tablesample queries. If an AM can't deal with the block based API, it will have to throw an ERROR. The tableam callbacks for this are block based, but given the current TsmRoutine interface, that seems to be required. The new interface doesn't require TsmRoutines to perform visibility checks anymore - that requires the TsmRoutine to know details about the AM, which we want to avoid. To continue to allow taking the returned number of tuples account SampleScanState now has a donetuples field (which previously e.g. existed in SystemRowsSamplerData), which is only incremented after the visibility check succeeds. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Speed up planning when partitions can be pruned at plan time.Tom Lane2019-03-30
| | | | | | | | | | | | | | | | | | | | | | Previously, the planner created RangeTblEntry and RelOptInfo structs for every partition of a partitioned table, even though many of them might later be deemed uninteresting thanks to partition pruning logic. This incurred significant overhead when there are many partitions. Arrange to postpone creation of these data structures until after we've processed the query enough to identify restriction quals for the partitioned table, and then apply partition pruning before not after creation of each partition's data structures. In this way we need not open the partition relations at all for partitions that the planner has no real interest in. For queries that can be proven at plan time to access only a small number of partitions, this patch improves the practical maximum number of partitions from under 100 to perhaps a few thousand. Amit Langote, reviewed at various times by Dilip Kumar, Jesper Pedersen, Yoshikazu Imai, and David Rowley Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
* Generated columnsPeter Eisentraut2019-03-30
| | | | | | | | | | | | | | This is an SQL-standard feature that allows creating columns that are computed from expressions rather than assigned, similar to a view or materialized view but on a column basis. This implements one kind of generated column: stored (computed on write). Another kind, virtual (computed on read), is planned for the future, and some room is left for it. Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/b151f851-4019-bdb1-699e-ebab07d2f40a@2ndquadrant.com
* Tweak some nbtree-related code comments.Peter Geoghegan2019-03-29
|
* tableam: Support for an index build's initial table scan(s).Andres Freund2019-03-27
| | | | | | | | | | | | | To support building indexes over tables of different AMs, the scans to do so need to be routed through the table AM. While moving a fair amount of code, nearly all the changes are just moving code to below a callback. Currently the range based interface wouldn't make much sense for non block based table AMs. But that seems aceptable for now. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Switch some palloc/memset calls to palloc0Michael Paquier2019-03-27
| | | | | | | | | | | Some code paths have been doing some allocations followed by an immediate memset() to initialize the allocated area with zeros, this is a bit overkill as there are already interfaces to do both things in one call. Author: Daniel Gustafsson Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/vN0OodBPkKs7g2Z1uyk3CUEmhdtspHgYCImhlmSxv1Xn6nY1ZnaaGHL8EWUIQ-NEv36tyc4G5-uA3UXUF2l4sFXtK_EQgLN1hcgunlFVKhA=@yesql.se
* Get rid of duplicate child RTE for a partitioned table.Tom Lane2019-03-26
| | | | | | | | | | | | | | | | | | We've been creating duplicate RTEs for partitioned tables just because we do so for regular inheritance parent tables. But unlike regular-inheritance parents which are themselves regular tables and thus need to be scanned, partitioned tables don't need the extra RTE. This makes the conditions for building a child RTE the same as those for building an AppendRelInfo, allowing minor simplification in expand_single_inheritance_child. Since the planner's actual processing is driven off the AppendRelInfo list, nothing much changes beyond that, we just have one fewer useless RTE entry. Amit Langote, reviewed and hacked a bit by me Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
* Suppress Append and MergeAppend plan nodes that have a single child.Tom Lane2019-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | If there's only one child relation, the Append or MergeAppend isn't doing anything useful, and can be elided. It does have a purpose during planning though, which is to serve as a buffer between parent and child Var numbering. Therefore we keep it all the way through to setrefs.c, and get rid of it only after fixing references in the plan level(s) above it. This works largely the same as setrefs.c's ancient hack to get rid of no-op SubqueryScan nodes, and can even share some code with that. Note the change to make setrefs.c use apply_tlist_labeling rather than ad-hoc code. This has the effect of propagating the child's resjunk and ressortgroupref labels, which formerly weren't propagated when removing a SubqueryScan. Doing that is demonstrably necessary for the [Merge]Append cases, and seems harmless for SubqueryScan, if only because trivial_subqueryscan is afraid to collapse cases where the resjunk marking differs. (I suspect that restriction could now be removed, though it's unclear that it'd make any new matches possible, since the outer query can't have references to a child resjunk column.) David Rowley, reviewed by Alvaro Herrera and Tomas Vondra Discussion: https://postgr.es/m/CAKJS1f_7u8ATyJ1JGTMHFoKDvZdeF-iEBhs+sM_SXowOr9cArg@mail.gmail.com
* Initialize structure at declarationPeter Eisentraut2019-03-25
| | | | | | Avoids extra memset call and cast. Discussion: https://www.postgresql.org/message-id/flat/7a5cbea7-b8df-e910-0f10-04014bcad701%402ndquadrant.com
* Avoid double-free in vacuumlo error path.Tom Lane2019-03-24
| | | | | | | | | | | The code would do "PQclear(res)" twice if lo_unlink failed, evidently due to careless thinking about how far out a "break" would break. Remove the extra PQclear and adjust the loop logic so that we'll fall out of both levels of loop after an error, as was clearly the intent. Spotted by Coverity. I have no idea why it took this long to notice, since the bug has been there since commit 67ccbb080. Accordingly, back-patch to all supported branches.
* tableam: Add tuple_{insert, delete, update, lock} and use.Andres Freund2019-03-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds new, required, table AM callbacks for insert/delete/update and lock_tuple. To be able to reasonably use those, the EvalPlanQual mechanism had to be adapted, moving more logic into the AM. Previously both delete/update/lock call-sites and the EPQ mechanism had to have awareness of the specific tuple format to be able to fetch the latest version of a tuple. Obviously that needs to be abstracted away. To do so, move the logic that find the latest row version into the AM. lock_tuple has a new flag argument, TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last version, rather than the current one. It'd have been possible to do so via a separate callback as well, but finding the last version usually also necessitates locking the newest version, making it sensible to combine the two. This replaces the previous use of EvalPlanQualFetch(). Additionally HeapTupleUpdated, which previously signaled either a concurrent update or delete, is now split into two, to avoid callers needing AM specific knowledge to differentiate. The move of finding the latest row version into tuple_lock means that encountering a row concurrently moved into another partition will now raise an error about "tuple to be locked" rather than "tuple to be updated/deleted" - which is accurate, as that always happens when locking rows. While possible slightly less helpful for users, it seems like an acceptable trade-off. As part of this commit HTSU_Result has been renamed to TM_Result, and its members been expanded to differentiated between updating and deleting. HeapUpdateFailureData has been renamed to TM_FailureData. The interface to speculative insertion is changed so nodeModifyTable.c does not have to set the speculative token itself anymore. Instead there's a version of tuple_insert, tuple_insert_speculative, that performs the speculative insertion (without requiring a flag to signal that fact), and the speculative insertion is either made permanent with table_complete_speculative(succeeded = true) or aborted with succeeded = false). Note that multi_insert is not yet routed through tableam, nor is COPY. Changing multi_insert requires changes to copy.c that are large enough to better be done separately. Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED VIEW are also only going to be adjusted in a later commit. Author: Andres Freund and Haribabu Kommi Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
* Collations with nondeterministic comparisonPeter Eisentraut2019-03-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a flag "deterministic" to collations. If that is false, such a collation disables various optimizations that assume that strings are equal only if they are byte-wise equal. That then allows use cases such as case-insensitive or accent-insensitive comparisons or handling of strings with different Unicode normal forms. This functionality is only supported with the ICU provider. At least glibc doesn't appear to have any locales that work in a nondeterministic way, so it's not worth supporting this for the libc provider. The term "deterministic comparison" in this context is from Unicode Technical Standard #10 (https://unicode.org/reports/tr10/#Deterministic_Comparison). This patch makes changes in three areas: - CREATE COLLATION DDL changes and system catalog changes to support this new flag. - Many executor nodes and auxiliary code are extended to track collations. Previously, this code would just throw away collation information, because the eventually-called user-defined functions didn't use it since they only cared about equality, which didn't need collation information. - String data type functions that do equality comparisons and hashing are changed to take the (non-)deterministic flag into account. For comparison, this just means skipping various shortcuts and tie breakers that use byte-wise comparison. For hashing, we first need to convert the input string to a canonical "sort key" using the ICU analogue of strxfrm(). Reviewed-by: Daniel Verite <daniel@manitou-mail.org> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
* Allow amcheck to re-find tuples using new search.Peter Geoghegan2019-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Teach contrib/amcheck's bt_index_parent_check() function to take advantage of the uniqueness property of heapkeyspace indexes in support of a new verification option: non-pivot tuples (non-highkey tuples on the leaf level) can optionally be re-found using a new search for each, that starts from the root page. If a tuple cannot be re-found, report that the index is corrupt. The new "rootdescend" verification option is exhaustive, and can therefore make a call to bt_index_parent_check() take a lot longer. Re-finding tuples during verification is mostly intended as an option for backend developers, since the corruption scenarios that it alone is uniquely capable of detecting seem fairly far-fetched. For example, "rootdescend" verification is much more likely to detect corruption of the least significant byte of a key from a pivot tuple in the root page of a B-Tree that already has at least three levels. Typically, only a few tuples on a cousin leaf page are at risk of "getting overlooked" by index scans in this scenario. The corrupt key in the root page is only slightly corrupt: corrupt enough to give wrong answers to some queries, and yet not corrupt enough to allow the problem to be detected without verifying agreement between the leaf page and the root page, skipping at least one internal page level. The existing bt_index_parent_check() checks never cross more than a single level. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas Discussion: https://postgr.es/m/CAH2-Wz=yTWnVu+HeHGKb2AGiADL9eprn-cKYAto4MkKOuiGtRQ@mail.gmail.com
* Make heap TID a tiebreaker nbtree index column.Peter Geoghegan2019-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make nbtree treat all index tuples as having a heap TID attribute. Index searches can distinguish duplicates by heap TID, since heap TID is always guaranteed to be unique. This general approach has numerous benefits for performance, and is prerequisite to teaching VACUUM to perform "retail index tuple deletion". Naively adding a new attribute to every pivot tuple has unacceptable overhead (it bloats internal pages), so suffix truncation of pivot tuples is added. This will usually truncate away the "extra" heap TID attribute from pivot tuples during a leaf page split, and may also truncate away additional user attributes. This can increase fan-out, especially in a multi-column index. Truncation can only occur at the attribute granularity, which isn't particularly effective, but works well enough for now. A future patch may add support for truncating "within" text attributes by generating truncated key values using new opclass infrastructure. Only new indexes (BTREE_VERSION 4 indexes) will have insertions that treat heap TID as a tiebreaker attribute, or will have pivot tuples undergo suffix truncation during a leaf page split (on-disk compatibility with versions 2 and 3 is preserved). Upgrades to version 4 cannot be performed on-the-fly, unlike upgrades from version 2 to version 3. contrib/amcheck continues to work with version 2 and 3 indexes, while also enforcing stricter invariants when verifying version 4 indexes. These stricter invariants are the same invariants described by "3.1.12 Sequencing" from the Lehman and Yao paper. A later patch will enhance the logic used by nbtree to pick a split point. This patch is likely to negatively impact performance without smarter choices around the precise point to split leaf pages at. Making these two mostly-distinct sets of enhancements into distinct commits seems like it might clarify their design, even though neither commit is particularly useful on its own. The maximum allowed size of new tuples is reduced by an amount equal to the space required to store an extra MAXALIGN()'d TID in a new high key during leaf page splits. The user-facing definition of the "1/3 of a page" restriction is already imprecise, and so does not need to be revised. However, there should be a compatibility note in the v12 release notes. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas, Alexander Korotkov Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
* Refactor nbtree insertion scankeys.Peter Geoghegan2019-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use dedicated struct to represent nbtree insertion scan keys. Having a dedicated struct makes the difference between search type scankeys and insertion scankeys a lot clearer, and simplifies the signature of several related functions. This is based on a suggestion by Andrey Lepikhov. Streamline how unique index insertions cache binary search progress. Cache the state of in-progress binary searches within _bt_check_unique() for later instead of having callers avoid repeating the binary search in an ad-hoc manner. This makes it easy to add a new optimization: _bt_check_unique() now falls out of its loop immediately in the common case where it's already clear that there couldn't possibly be a duplicate. The new _bt_check_unique() scheme makes it a lot easier to manage cached binary search effort afterwards, from within _bt_findinsertloc(). This is needed for the upcoming patch to make nbtree tuples unique by treating heap TID as a final tiebreaker column. Unique key binary searches need to restore lower and upper bounds. They cannot simply continue to use the >= lower bound as the offset to insert at, because the heap TID tiebreaker column must be used in comparisons for the restored binary search (unlike the original _bt_check_unique() binary search, where scankey's heap TID column must be omitted). Author: Peter Geoghegan, Heikki Linnakangas Reviewed-By: Heikki Linnakangas, Andrey Lepikhov Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
* Remove leftover reference to oid column.Andres Freund2019-03-18
| | | | | | | I (Andres) missed this in 578b229718e8. Author: John Naylor Discussion: https://postgr.es/m/CACPNZCtd+ckUgibRFs9KewK4Yr5rj3Oipefquupw+XJZebFhrA@mail.gmail.com
* Don't auto-restart per-database autoprewarm workers.Robert Haas2019-03-18
| | | | | | | | | | | | | We should try to prewarm each database only once. Otherwise, if prewarming fails for some reason, it will just keep retrying in an infnite loop. This can happen if, for example, the database has been dropped. The existing code was intended to implement the try-once behavior, but failed to do so because it neglected to set worker.bgw_restart_time to BGW_NEVER_RESTART. Mithun Cy, per a report from Hans Buschmann Discussion: http://postgr.es/m/CA+hUKGKpQJCWcgyy3QTC9vdn6uKAR_8r__A-MMm2GYfj45caag@mail.gmail.com
* Fix volatile vs. pointer confusionPeter Eisentraut2019-03-14
| | | | | | | | | | | | | | Variables used after a longjmp() need to be declared volatile. In case of a pointer, it's the pointer itself that needs to be declared volatile, not the pointed-to value. So we need PyObject *volatile items; instead of volatile PyObject *items; /* wrong */ Discussion: https://www.postgresql.org/message-id/flat/f747368d-9e1a-c46a-ac76-3c27da32e8e4%402ndquadrant.com
* tableam: Add and use scan APIs.Andres Freund2019-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Too allow table accesses to be not directly dependent on heap, several new abstractions are needed. Specifically: 1) Heap scans need to be generalized into table scans. Do this by introducing TableScanDesc, which will be the "base class" for individual AMs. This contains the AM independent fields from HeapScanDesc. The previous heap_{beginscan,rescan,endscan} et al. have been replaced with a table_ version. There's no direct replacement for heap_getnext(), as that returned a HeapTuple, which is undesirable for a other AMs. Instead there's table_scan_getnextslot(). But note that heap_getnext() lives on, it's still used widely to access catalog tables. This is achieved by new scan_begin, scan_end, scan_rescan, scan_getnextslot callbacks. 2) The portion of parallel scans that's shared between backends need to be able to do so without the user doing per-AM work. To achieve that new parallelscan_{estimate, initialize, reinitialize} callbacks are introduced, which operate on a new ParallelTableScanDesc, which again can be subclassed by AMs. As it is likely that several AMs are going to be block oriented, block oriented callbacks that can be shared between such AMs are provided and used by heap. table_block_parallelscan_{estimate, intiialize, reinitialize} as callbacks, and table_block_parallelscan_{nextpage, init} for use in AMs. These operate on a ParallelBlockTableScanDesc. 3) Index scans need to be able to access tables to return a tuple, and there needs to be state across individual accesses to the heap to store state like buffers. That's now handled by introducing a sort-of-scan IndexFetchTable, which again is intended to be subclassed by individual AMs (for heap IndexFetchHeap). The relevant callbacks for an AM are index_fetch_{end, begin, reset} to create the necessary state, and index_fetch_tuple to retrieve an indexed tuple. Note that index_fetch_tuple implementations need to be smarter than just blindly fetching the tuples for AMs that have optimizations similar to heap's HOT - the currently alive tuple in the update chain needs to be fetched if appropriate. Similar to table_scan_getnextslot(), it's undesirable to continue to return HeapTuples. Thus index_fetch_heap (might want to rename that later) now accepts a slot as an argument. Core code doesn't have a lot of call sites performing index scans without going through the systable_* API (in contrast to loads of heap_getnext calls and working directly with HeapTuples). Index scans now store the result of a search in IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the target is not generally a HeapTuple anymore that seems cleaner. To be able to sensible adapt code to use the above, two further callbacks have been introduced: a) slot_callbacks returns a TupleTableSlotOps* suitable for creating slots capable of holding a tuple of the AMs type. table_slot_callbacks() and table_slot_create() are based upon that, but have additional logic to deal with views, foreign tables, etc. While this change could have been done separately, nearly all the call sites that needed to be adapted for the rest of this commit also would have been needed to be adapted for table_slot_callbacks(), making separation not worthwhile. b) tuple_satisfies_snapshot checks whether the tuple in a slot is currently visible according to a snapshot. That's required as a few places now don't have a buffer + HeapTuple around, but a slot (which in heap's case internally has that information). Additionally a few infrastructure changes were needed: I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now internally uses a slot to keep track of tuples. While systable_getnext() still returns HeapTuples, and will so for the foreseeable future, the index API (see 1) above) now only deals with slots. The remainder, and largest part, of this commit is then adjusting all scans in postgres to use the new APIs. Author: Andres Freund, Haribabu Kommi, Alvaro Herrera Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
* Move hash_any prototype from access/hash.h to utils/hashutils.hAlvaro Herrera2019-03-11
| | | | | | | | | | | | | | | | | | | | | ... as well as its implementation from backend/access/hash/hashfunc.c to backend/utils/hash/hashfn.c. access/hash is the place for the hash index AM, not really appropriate for generic facilities, which is what hash_any is; having things the old way meant that anything using hash_any had to include the AM's include file, pointlessly polluting its namespace with unrelated, unnecessary cruft. Also move the HTEqual strategy number to access/stratnum.h from access/hash.h. To avoid breaking third-party extension code, add an #include "utils/hashutils.h" to access/hash.h. (An easily removed line by committers who enjoy their asbestos suits to protect them from angry extension authors.) Discussion: https://postgr.es/m/201901251935.ser5e4h6djt2@alvherre.pgsql
* Tighten use of OpenTransientFile and CloseTransientFileMichael Paquier2019-03-09
| | | | | | | | | | | | | | | | | | | | | | | | | | This fixes two sets of issues related to the use of transient files in the backend: 1) OpenTransientFile() has been used in some code paths with read-write flags while read-only is sufficient, so switch those calls to be read-only where necessary. These have been reported by Joe Conway. 2) When opening transient files, it is up to the caller to close the file descriptors opened. In error code paths, CloseTransientFile() gets called to clean up things before issuing an error. However in normal exit paths, a lot of callers of CloseTransientFile() never actually reported errors, which could leave a file descriptor open without knowing about it. This is an issue I complained about a couple of times, but never had the courage to write and submit a patch, so here we go. Note that one frontend code path is impacted by this commit so as an error is issued when fetching control file data, making backend and frontend to be treated consistently. Reported-by: Joe Conway, Michael Paquier Author: Michael Paquier Reviewed-by: Álvaro Herrera, Georgios Kokolatos, Joe Conway Discussion: https://postgr.es/m/20190301023338.GD1348@paquier.xyz Discussion: https://postgr.es/m/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com
* Remove unused macroPeter Eisentraut2019-02-28
| | | | It has never been used as long as hstore has been in the tree.
* Use slots in trigger infrastructure, except for the actual invocation.Andres Freund2019-02-26
| | | | | | | | | | | | | | | | | | | | | In preparation for abstracting table storage, convert trigger.c to track tuples in slots. Which also happens to make code calling triggers simpler. As the calling interface for triggers themselves is not changed in this patch, HeapTuples still are extracted from the slot at that time. But that's handled solely inside trigger.c, not visible to callers. It's quite likely that we'll want to revise the external trigger interface, but that's a separate large project. As part of this work the slots used for old/new/return tuples are moved from EState into ResultRelInfo, as different updated tables might need different slots. The slots are now also now created on-demand, which is good both from an efficiency POV, but also makes the modifying code simpler. Author: Andres Freund, Amit Khandekar and Ashutosh Bapat Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Doc: Update the documentation for FSM behavior for small tables.Amit Kapila2019-02-20
| | | | | | | | | | | | | | In commit b0eaa4c51b, we have avoided the creation of FSM for small tables. So the functions that use FSM to compute the free space can return zero for such tables. This was previously also possible for the cases where the vacuum has not been triggered to update FSM. This commit updates the comments in code and documentation to reflect this behavior. Author: John Naylor Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/CACPNZCtba-3m1q3A8gxA_vxg=T7gQf7gMbpR4Ciy5LntY-j+0Q@mail.gmail.com
* Suppress another case of MSVC warning 4146.Noah Misch2019-02-16
|
* In imath.h, replace stdint.h usage with c.h equivalents.Noah Misch2019-02-16
| | | | | | | | | As things stood, buildfarm member dory failed. MSVC versions lacking stdint.h are unusable for building PostgreSQL, but pg_config.h.win32 doesn't know that. Even so, we support other systems lacking stdint.h, including buildfarm member gaur. Per a suggestion from Tom Lane. Discussion: https://postgr.es/m/9598.1550353336@sss.pgh.pa.us
* Import changes from IMath versions (1.3, 1.29].Noah Misch2019-02-16
| | | | | | | | | | | | | | Upstream fixed bugs over the years, but none are confirmed to have affected pgcrypto. We're better off naively tracking upstream than reactively maintaining a twelve-year-old snapshot of upstream. Add a header comment describing the synchronization procedure. Discard use of INVERT_COMPARE_RESULT(); the domain of the comparisons in question is {-1,0,1}, controlled entirely by code in imath.c. Andrew Gierth reviewed the Makefile change. Tom Lane reviewed the synchronization procedure description. Discussion: https://postgr.es/m/20190203035704.GA6226@rfd.leadboat.com
* Allow user control of CTE materialization, and change the default behavior.Tom Lane2019-02-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically we've always materialized the full output of a CTE query, treating WITH as an optimization fence (so that, for example, restrictions from the outer query cannot be pushed into it). This is appropriate when the CTE query is INSERT/UPDATE/DELETE, or is recursive; but when the CTE query is non-recursive and side-effect-free, there's no hazard of changing the query results by pushing restrictions down. Another argument for materialization is that it can avoid duplicate computation of an expensive WITH query --- but that only applies if the WITH query is called more than once in the outer query. Even then it could still be a net loss, if each call has restrictions that would allow just a small part of the WITH query to be computed. Hence, let's change the behavior for WITH queries that are non-recursive and side-effect-free. By default, we will inline them into the outer query (removing the optimization fence) if they are called just once. If they are called more than once, we will keep the old behavior by default, but the user can override this and force inlining by specifying NOT MATERIALIZED. Lastly, the user can force the old behavior by specifying MATERIALIZED; this would mainly be useful when the query had deliberately been employing WITH as an optimization fence to prevent a poor choice of plan. Andreas Karlsson, Andrew Gierth, David Fetter Discussion: https://postgr.es/m/87sh48ffhb.fsf@news-spur.riddles.org.uk