aboutsummaryrefslogtreecommitdiff
path: root/src/backend/executor/execParallel.c
Commit message (Collapse)AuthorAge
* Change the name of the Result Cache node to MemoizeDavid Rowley2021-07-14
| | | | | | | | | | | "Result Cache" was never a great name for this node, but nobody managed to come up with another name that anyone liked enough. That was until David Johnston mentioned "Node Memoization", which Tom Lane revised to just "Memoize". People seem to like "Memoize", so let's do the rename. Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/20210708165145.GG1176@momjian.us Backpatch-through: 14, where Result Cache was introduced
* adjust query id feature to use pg_stat_activity.query_idBruce Momjian2021-04-20
| | | | | | | | | | | Previously, it was pg_stat_activity.queryid to match the pg_stat_statements queryid column. This is an adjustment to patch 4f0b0966c8. This also adjusts some of the internal function calls to match. Catversion bumped. Reported-by: Álvaro Herrera, Julien Rouhaud Discussion: https://postgr.es/m/20210408032704.GA7498@alvherre.pgsql
* Revert "Cope with NULL query string in ExecInitParallelPlan()."Tom Lane2021-04-15
| | | | | | | | This reverts commit b3ee4c503872f3d0a5d6a7cbde48815f555af15b. We don't need it in the wake of the preceding commit, which added an upstream check that the querystring isn't null. Discussion: https://postgr.es/m/2197698.1617984583@sss.pgh.pa.us
* Fixes for query_id featureBruce Momjian2021-04-08
| | | | | | | | | | | | | | | | | | Ignore parallel workers in pg_stat_statements Oversight in 4f0b0966c8 which exposed queryid in parallel workers. Counters are aggregated by the main backend process so parallel workers would report duplicated activity, and could also report activity for the wrong entry as they are only aware of the top level queryid. Fix thinko in pg_stat_get_activity when retrieving the queryid. Remove unnecessary call to pgstat_report_queryid(). Reported-by: Amit Kapila, Andres Freund, Thomas Munro Discussion: https://postgr.es/m/20210408051735.lfbdzun5zdlax5gd@alap3.anarazel.de p634GTSOqnDW86Owrn6qDAVosC5dJjXjp7BMfc5Gz1Q@mail.gmail.com Author: Julien Rouhaud
* Cope with NULL query string in ExecInitParallelPlan().Andres Freund2021-04-07
| | | | | | | | | It's far from clear that this is the right approach - but a good portion of the buildfarm has been red for a few hours, on the last day of the CF. And this fixes at least the obvious crash. So let's go with that for now. Discussion: https://postgr.es/m/20210407225806.majgznh4lk34hjvu%40alap3.anarazel.de
* Make use of in-core query id added by commit 5fd9dfa5f5Bruce Momjian2021-04-07
| | | | | | | | | | | | | | | | | | | | | | | Use the in-core query id computation for pg_stat_activity, log_line_prefix, and EXPLAIN VERBOSE. Similar to other fields in pg_stat_activity, only the queryid from the top level statements are exposed, and if the backends status isn't active then the queryid from the last executed statements is displayed. Add a %Q placeholder to include the queryid in log_line_prefix, which will also only expose top level statements. For EXPLAIN VERBOSE, if a query identifier has been computed, either by enabling compute_query_id or using a third-party module, display it. Bump catalog version. Discussion: https://postgr.es/m/20210407125726.tkvjdbw76hxnpwfi@nol Author: Julien Rouhaud Reviewed-by: Alvaro Herrera, Nitin Jadhav, Zhihong Yu
* Add Result Cache executor node (take 2)David Rowley2021-04-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here we add a new executor node type named "Result Cache". The planner can include this node type in the plan to have the executor cache the results from the inner side of parameterized nested loop joins. This allows caching of tuples for sets of parameters so that in the event that the node sees the same parameter values again, it can just return the cached tuples instead of rescanning the inner side of the join all over again. Internally, result cache uses a hash table in order to quickly find tuples that have been previously cached. For certain data sets, this can significantly improve the performance of joins. The best cases for using this new node type are for join problems where a large portion of the tuples from the inner side of the join have no join partner on the outer side of the join. In such cases, hash join would have to hash values that are never looked up, thus bloating the hash table and possibly causing it to multi-batch. Merge joins would have to skip over all of the unmatched rows. If we use a nested loop join with a result cache, then we only cache tuples that have at least one join partner on the outer side of the join. The benefits of using a parameterized nested loop with a result cache increase when there are fewer distinct values being looked up and the number of lookups of each value is large. Also, hash probes to lookup the cache can be much faster than the hash probe in a hash join as it's common that the result cache's hash table is much smaller than the hash join's due to result cache only caching useful tuples rather than all tuples from the inner side of the join. This variation in hash probe performance is more significant when the hash join's hash table no longer fits into the CPU's L3 cache, but the result cache's hash table does. The apparent "random" access of hash buckets with each hash probe can cause a poor L3 cache hit ratio for large hash tables. Smaller hash tables generally perform better. The hash table used for the cache limits itself to not exceeding work_mem * hash_mem_multiplier in size. We maintain a dlist of keys for this cache and when we're adding new tuples and realize we've exceeded the memory budget, we evict cache entries starting with the least recently used ones until we have enough memory to add the new tuples to the cache. For parameterized nested loop joins, we now consider using one of these result cache nodes in between the nested loop node and its inner node. We determine when this might be useful based on cost, which is primarily driven off of what the expected cache hit ratio will be. Estimating the cache hit ratio relies on having good distinct estimates on the nested loop's parameters. For now, the planner will only consider using a result cache for parameterized nested loop joins. This works for both normal joins and also for LATERAL type joins to subqueries. It is possible to use this new node for other uses in the future. For example, to cache results from correlated subqueries. However, that's not done here due to some difficulties obtaining a distinct estimation on the outer plan to calculate the estimated cache hit ratio. Currently we plan the inner plan before planning the outer plan so there is no good way to know if a result cache would be useful or not since we can't estimate the number of times the subplan will be called until the outer plan is generated. The functionality being added here is newly introducing a dependency on the return value of estimate_num_groups() during the join search. Previously, during the join search, we only ever needed to perform selectivity estimations. With this commit, we need to use estimate_num_groups() in order to estimate what the hit ratio on the result cache will be. In simple terms, if we expect 10 distinct values and we expect 1000 outer rows, then we'll estimate the hit ratio to be 99%. Since cache hits are very cheap compared to scanning the underlying nodes on the inner side of the nested loop join, then this will significantly reduce the planner's cost for the join. However, it's fairly easy to see here that things will go bad when estimate_num_groups() incorrectly returns a value that's significantly lower than the actual number of distinct values. If this happens then that may cause us to make use of a nested loop join with a result cache instead of some other join type, such as a merge or hash join. Our distinct estimations have been known to be a source of trouble in the past, so the extra reliance on them here could cause the planner to choose slower plans than it did previous to having this feature. Distinct estimations are also fairly hard to estimate accurately when several tables have been joined already or when a WHERE clause filters out a set of values that are correlated to the expressions we're estimating the number of distinct value for. For now, the costing we perform during query planning for result caches does put quite a bit of faith in the distinct estimations being accurate. When these are accurate then we should generally see faster execution times for plans containing a result cache. However, in the real world, we may find that we need to either change the costings to put less trust in the distinct estimations being accurate or perhaps even disable this feature by default. There's always an element of risk when we teach the query planner to do new tricks that it decides to use that new trick at the wrong time and causes a regression. Users may opt to get the old behavior by turning the feature off using the enable_resultcache GUC. Currently, this is enabled by default. It remains to be seen if we'll maintain that setting for the release. Additionally, the name "Result Cache" is the best name I could think of for this new node at the time I started writing the patch. Nobody seems to strongly dislike the name. A few people did suggest other names but no other name seemed to dominate in the brief discussion that there was about names. Let's allow the beta period to see if the current name pleases enough people. If there's some consensus on a better name, then we can change it before the release. Please see the 2nd discussion link below for the discussion on the "Result Cache" name. Author: David Rowley Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie Tested-By: Konstantin Knizhnik Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
* Revert b6002a796David Rowley2021-04-01
| | | | | | | | | | | | | This removes "Add Result Cache executor node". It seems that something weird is going on with the tracking of cache hits and misses as highlighted by many buildfarm animals. It's not yet clear what the problem is as other parts of the plan indicate that the cache did work correctly, it's just the hits and misses that were being reported as 0. This is especially a bad time to have the buildfarm so broken, so reverting before too many more animals go red. Discussion: https://postgr.es/m/CAApHDvq_hydhfovm4=izgWs+C5HqEeRScjMbOgbpC-jRAeK3Yw@mail.gmail.com
* Add Result Cache executor nodeDavid Rowley2021-04-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here we add a new executor node type named "Result Cache". The planner can include this node type in the plan to have the executor cache the results from the inner side of parameterized nested loop joins. This allows caching of tuples for sets of parameters so that in the event that the node sees the same parameter values again, it can just return the cached tuples instead of rescanning the inner side of the join all over again. Internally, result cache uses a hash table in order to quickly find tuples that have been previously cached. For certain data sets, this can significantly improve the performance of joins. The best cases for using this new node type are for join problems where a large portion of the tuples from the inner side of the join have no join partner on the outer side of the join. In such cases, hash join would have to hash values that are never looked up, thus bloating the hash table and possibly causing it to multi-batch. Merge joins would have to skip over all of the unmatched rows. If we use a nested loop join with a result cache, then we only cache tuples that have at least one join partner on the outer side of the join. The benefits of using a parameterized nested loop with a result cache increase when there are fewer distinct values being looked up and the number of lookups of each value is large. Also, hash probes to lookup the cache can be much faster than the hash probe in a hash join as it's common that the result cache's hash table is much smaller than the hash join's due to result cache only caching useful tuples rather than all tuples from the inner side of the join. This variation in hash probe performance is more significant when the hash join's hash table no longer fits into the CPU's L3 cache, but the result cache's hash table does. The apparent "random" access of hash buckets with each hash probe can cause a poor L3 cache hit ratio for large hash tables. Smaller hash tables generally perform better. The hash table used for the cache limits itself to not exceeding work_mem * hash_mem_multiplier in size. We maintain a dlist of keys for this cache and when we're adding new tuples and realize we've exceeded the memory budget, we evict cache entries starting with the least recently used ones until we have enough memory to add the new tuples to the cache. For parameterized nested loop joins, we now consider using one of these result cache nodes in between the nested loop node and its inner node. We determine when this might be useful based on cost, which is primarily driven off of what the expected cache hit ratio will be. Estimating the cache hit ratio relies on having good distinct estimates on the nested loop's parameters. For now, the planner will only consider using a result cache for parameterized nested loop joins. This works for both normal joins and also for LATERAL type joins to subqueries. It is possible to use this new node for other uses in the future. For example, to cache results from correlated subqueries. However, that's not done here due to some difficulties obtaining a distinct estimation on the outer plan to calculate the estimated cache hit ratio. Currently we plan the inner plan before planning the outer plan so there is no good way to know if a result cache would be useful or not since we can't estimate the number of times the subplan will be called until the outer plan is generated. The functionality being added here is newly introducing a dependency on the return value of estimate_num_groups() during the join search. Previously, during the join search, we only ever needed to perform selectivity estimations. With this commit, we need to use estimate_num_groups() in order to estimate what the hit ratio on the result cache will be. In simple terms, if we expect 10 distinct values and we expect 1000 outer rows, then we'll estimate the hit ratio to be 99%. Since cache hits are very cheap compared to scanning the underlying nodes on the inner side of the nested loop join, then this will significantly reduce the planner's cost for the join. However, it's fairly easy to see here that things will go bad when estimate_num_groups() incorrectly returns a value that's significantly lower than the actual number of distinct values. If this happens then that may cause us to make use of a nested loop join with a result cache instead of some other join type, such as a merge or hash join. Our distinct estimations have been known to be a source of trouble in the past, so the extra reliance on them here could cause the planner to choose slower plans than it did previous to having this feature. Distinct estimations are also fairly hard to estimate accurately when several tables have been joined already or when a WHERE clause filters out a set of values that are correlated to the expressions we're estimating the number of distinct value for. For now, the costing we perform during query planning for result caches does put quite a bit of faith in the distinct estimations being accurate. When these are accurate then we should generally see faster execution times for plans containing a result cache. However, in the real world, we may find that we need to either change the costings to put less trust in the distinct estimations being accurate or perhaps even disable this feature by default. There's always an element of risk when we teach the query planner to do new tricks that it decides to use that new trick at the wrong time and causes a regression. Users may opt to get the old behavior by turning the feature off using the enable_resultcache GUC. Currently, this is enabled by default. It remains to be seen if we'll maintain that setting for the release. Additionally, the name "Result Cache" is the best name I could think of for this new node at the time I started writing the patch. Nobody seems to strongly dislike the name. A few people did suggest other names but no other name seemed to dominate in the brief discussion that there was about names. Let's allow the beta period to see if the current name pleases enough people. If there's some consensus on a better name, then we can change it before the release. Please see the 2nd discussion link below for the discussion on the "Result Cache" name. Author: David Rowley Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu Tested-By: Konstantin Knizhnik Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
* Update copyright for 2021Bruce Momjian2021-01-02
| | | | Backpatch-through: 9.5
* Create ResultRelInfos later in InitPlan, index them by RT index.Heikki Linnakangas2020-10-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of allocating all the ResultRelInfos upfront in one big array, allocate them in ExecInitModifyTable(). es_result_relations is now an array of ResultRelInfo pointers, rather than an array of structs, and it is indexed by the RT index. This simplifies things: we get rid of the separate concept of a "result rel index", and don't need to set it in setrefs.c anymore. This also allows follow-up optimizations (not included in this commit yet) to skip initializing ResultRelInfos for target relations that were not needed at runtime, and removal of the es_result_relation_info pointer. The EState arrays of regular result rels and root result rels are merged into one array. Similarly, the resultRelations and rootResultRelations lists in PlannedStmt are merged into one. It's not actually clear to me why they were kept separate in the first place, but now that the es_result_relations array is indexed by RT index, it certainly seems pointless. The PlannedStmt->resultRelations list is now only needed for ExecRelationIsTargetRelation(). One visible effect of this change is that ExecRelationIsTargetRelation() will now return 'true' also for the partition root, if a partitioned table is updated. That seems like a good thing, although the function isn't used in core code, and I don't see any reason for an FDW to call it on a partition root. Author: Amit Langote Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGEmiib8FLiHMhKB%2BCH5dRgHSLc5N5wnvc4kym%2BZYpQEQ%40mail.gmail.com
* Fix EXPLAIN ANALYZE for parallel HashAgg plansDavid Rowley2020-06-19
| | | | | | | | | | | | | | | Since 1f39bce02, HashAgg nodes have had the ability to spill to disk when memory consumption exceeds work_mem. That commit added new properties to EXPLAIN ANALYZE to show the maximum memory usage and disk usage, however, it didn't quite go as far as showing that information for parallel workers. Since workers may have experienced something very different from the main process, we should show this information per worker, as is done in Sort. Reviewed-by: Justin Pryzby Reviewed-by: Jeff Davis Discussion: https://postgr.es/m/CAApHDvpEKbfZa18mM1TD7qV6PG+w97pwCWq5tVD0dX7e11gRJw@mail.gmail.com Backpatch-through: 13, where the hashagg spilling code was added.
* Implement Incremental SortTomas Vondra2020-04-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Incremental Sort is an optimized variant of multikey sort for cases when the input is already sorted by a prefix of the requested sort keys. For example when the relation is already sorted by (key1, key2) and we need to sort it by (key1, key2, key3) we can simply split the input rows into groups having equal values in (key1, key2), and only sort/compare the remaining column key3. This has a number of benefits: - Reduced memory consumption, because only a single group (determined by values in the sorted prefix) needs to be kept in memory. This may also eliminate the need to spill to disk. - Lower startup cost, because Incremental Sort produce results after each prefix group, which is beneficial for plans where startup cost matters (like for example queries with LIMIT clause). We consider both Sort and Incremental Sort, and decide based on costing. The implemented algorithm operates in two different modes: - Fetching a minimum number of tuples without check of equality on the prefix keys, and sorting on all columns when safe. - Fetching all tuples for a single prefix group and then sorting by comparing only the remaining (non-prefix) keys. We always start in the first mode, and employ a heuristic to switch into the second mode if we believe it's beneficial - the goal is to minimize the number of unnecessary comparions while keeping memory consumption below work_mem. This is a very old patch series. The idea was originally proposed by Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the patch was taken over by James Coleman, who wrote and rewrote most of the current code. There were many reviewers/contributors since 2013 - I've done my best to pick the most active ones, and listed them in this commit message. Author: James Coleman, Alexander Korotkov Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
* Add infrastructure to track WAL usage.Amit Kapila2020-04-04
| | | | | | | | | | | | | | | | | | | | | | This allows gathering the WAL generation statistics for each statement execution. The three statistics that we collect are the number of WAL records, the number of full page writes and the amount of WAL bytes generated. This helps the users who have write-intensive workload to see the impact of I/O due to WAL. This further enables us to see approximately what percentage of overall WAL is due to full page writes. In the future, we can extend this functionality to allow us to compute the the exact amount of WAL data due to full page writes. This patch in itself is just an infrastructure to compute WAL usage data. The upcoming patches will expose this data via explain, auto_explain, pg_stat_statements and verbose (auto)vacuum output. Author: Kirill Bychik, Julien Rouhaud Reviewed-by: Dilip Kumar, Fujii Masao and Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
* Update copyrights for 2020Bruce Momjian2020-01-01
| | | | Backpatch-through: update all files in master, backpatch legal files through 9.4
* Further adjust EXPLAIN's choices of table alias names.Tom Lane2019-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch causes EXPLAIN to always assign a separate table alias to the parent RTE of an append relation (inheritance set); before, such RTEs were ignored if not actually scanned by the plan. Since the child RTEs now always have that same alias to start with (cf. commit 55a1954da), the net effect is that the parent RTE usually gets the alias used or implied by the query text, and the children all get that alias with "_N" appended. (The exception to "usually" is if there are duplicate aliases in different subtrees of the original query; then some of those original RTEs will also have "_N" appended.) This results in more uniform output for partitioned-table plans than we had before: the partitioned table itself gets the original alias, and all child tables have aliases with "_N", rather than the previous behavior where one of the children would get an alias without "_N". The reason for giving the parent RTE an alias, even if it isn't scanned by the plan, is that we now use the parent's alias to qualify Vars that refer to an appendrel output column and appear above the Append or MergeAppend that computes the appendrel. But below the append, Vars refer to some one of the child relations, and are displayed that way. This seems clearer than the old behavior where a Var that could carry values from any child relation was displayed as if it referred to only one of them. While at it, change ruleutils.c so that the code paths used by EXPLAIN deal in Plan trees not PlanState trees. This effectively reverts a decision made in commit 1cc29fe7c, which seemed like a good idea at the time to make ruleutils.c consistent with explain.c. However, it's problematic because we'd really like to allow executor startup pruning to remove all the children of an append node when possible, leaving no child PlanState to resolve Vars against. (That's not done here, but will be in the next patch.) This requires different handling of subplans and initplans than before, but is otherwise a pretty straightforward change. Discussion: https://postgr.es/m/001001d4f44b$2a2cca50$7e865ef0$@lab.ntt.co.jp
* Make the order of the header file includes consistent in backend modules.Amit Kapila2019-11-12
| | | | | | | | | | | Similar to commits 7e735035f2 and dddf4cdc33, this commit makes the order of header file inclusion consistent for backend modules. In the passing, removed a couple of duplicate inclusions. Author: Vignesh C Reviewed-by: Kuntal Ghosh and Amit Kapila Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
* Fix inconsistency in comments atop ExecParallelEstimate.Amit Kapila2019-06-07
| | | | | | | | | | | | | When this code was initially introduced in commit d1b7c1ff, the structure used was SharedPlanStateInstrumentation, but later when it got changed to Instrumentation structure in commit b287df70, we forgot to update the comment. Reported-by: Wu Fei Author: Wu Fei Reviewed-by: Amit Kapila Backpatch-through: 9.6 Discussion: https://postgr.es/m/52E6E0843B9D774C8C73D6CF64402F0562215EB2@G08CNEXMBPEKD02.g08.fujitsu.local
* Phase 2 pgindent run for v12.Tom Lane2019-05-22
| | | | | | | | | Switch to 2.1 version of pg_bsd_indent. This formats multiline function declarations "correctly", that is with additional lines of parameter declarations indented to match where the first line's left parenthesis is. Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
* Initial pgindent run for v12.Tom Lane2019-05-22
| | | | | | | | This is still using the 2.0 version of pg_bsd_indent. I thought it would be good to commit this separately, so as to document the differences between 2.0 and 2.1 behavior. Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
* Enable parallel query with SERIALIZABLE isolation.Thomas Munro2019-03-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the SERIALIZABLE isolation level prevented parallel query from being used. Allow the two features to be used together by sharing the leader's SERIALIZABLEXACT with parallel workers. An extra per-SERIALIZABLEXACT LWLock is introduced to make it safe to share, and new logic is introduced to coordinate the early release of the SERIALIZABLEXACT required for the SXACT_FLAG_RO_SAFE optimization, as follows: The first backend to observe the SXACT_FLAG_RO_SAFE flag (set by some other transaction) will 'partially release' the SERIALIZABLEXACT, meaning that the conflicts and locks it holds are released, but the SERIALIZABLEXACT itself will remain active because other backends might still have a pointer to it. Whenever any backend notices the SXACT_FLAG_RO_SAFE flag, it clears its own MySerializableXact variable and frees local resources so that it can skip SSI checks for the rest of the transaction. In the special case of the leader process, it transfers the SERIALIZABLEXACT to a new variable SavedSerializableXact, so that it can be completely released at the end of the transaction after all workers have exited. Remove the serializable_okay flag added to CreateParallelContext() by commit 9da0cc35, because it's now redundant. Author: Thomas Munro Reviewed-by: Haribabu Kommi, Robert Haas, Masahiko Sawada, Kevin Grittner Discussion: https://postgr.es/m/CAEepm=0gXGYhtrVDWOTHS8SQQy_=S9xo+8oCxGLWZAOoeJ=yzQ@mail.gmail.com
* Refactor planner's header files.Tom Lane2019-01-29
| | | | | | | | | | | | | | | | | | | | | | | | Create a new header optimizer/optimizer.h, which exposes just the planner functions that can be used "at arm's length", without need to access Paths or the other planner-internal data structures defined in nodes/relation.h. This is intended to provide the whole planner API seen by most of the rest of the system; although FDWs still need to use additional stuff, and more thought is also needed about just what selfuncs.c should rely on. The main point of doing this now is to limit the amount of new #include baggage that will be needed by "planner support functions", which I expect to introduce later, and which will be in relevant datatype modules rather than anywhere near the planner. This commit just moves relevant declarations into optimizer.h from other header files (a couple of which go away because everything got moved), and adjusts #include lists to match. There's further cleanup that could be done if we want to decide that some stuff being exposed by optimizer.h doesn't belong in the planner at all, but I'll leave that for another day. Discussion: https://postgr.es/m/11460.1548706639@sss.pgh.pa.us
* Fix misc typos in comments.Heikki Linnakangas2019-01-23
| | | | | | Spotted mostly by Fabien Coelho. Discussion: https://www.postgresql.org/message-id/alpine.DEB.2.21.1901230947050.16643@lancre
* Update copyright for 2019Bruce Momjian2019-01-02
| | | | Backpatch-through: certain files through 9.4
* Remove some unnecessary fields from Plan trees.Tom Lane2018-10-07
| | | | | | | | | | | | | | | | | | | | In the wake of commit f2343653f, we no longer need some fields that were used before to control executor lock acquisitions: * PlannedStmt.nonleafResultRelations can go away entirely. * partitioned_rels can go away from Append, MergeAppend, and ModifyTable. However, ModifyTable still needs to know the RT index of the partition root table if any, which was formerly kept in the first entry of that list. Add a new field "rootRelation" to remember that. rootRelation is partly redundant with nominalRelation, in that if it's set it will have the same value as nominalRelation. However, the latter field has a different purpose so it seems best to keep them distinct. Amit Langote, reviewed by David Rowley and Jesper Pedersen, and whacked around a bit more by me Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
* Fix issues around EXPLAIN with JIT.Andres Freund2018-10-03
| | | | | | | | | | | | | | | | | I (Andres) was more than a bit hasty in committing 33001fd7a7072d48327 after last minute changes, leading to a number of problems (jit output was only shown for JIT in parallel workers, and just EXPLAIN without ANALYZE didn't work). Lukas luckily found these issues quickly. Instead of combining instrumentation in in standard_ExecutorEnd(), do so on demand in the new ExplainPrintJITSummary(). Also update a documentation example of the JIT output, changed in 52050ad8ebec8d831. Author: Lukas Fittl, with minor changes by me Discussion: https://postgr.es/m/CAP53PkxmgJht69pabxBXJBM+0oc6kf3KHMborLP7H2ouJ0CCtQ@mail.gmail.com Backpatch: 11, where JIT compilation was introduced
* Remove obsolete commentAlvaro Herrera2018-09-25
| | | | | The documented shortcoming was actually fixed in 4c728f3829 so the comment is not true anymore.
* Collect JIT instrumentation from workers.Andres Freund2018-09-25
| | | | | | | | | | | | | | | | Previously, when using parallel query, EXPLAIN (ANALYZE)'s JIT compilation timings did not include the overhead from doing so on the workers. Fix that. We do so by simply aggregating the cost of doing JIT compilation on workers and the leader together. Arguably that's not quite accurate, because the total time spend doing so is spent in parallel - but it's hard to do much better. For additional detail, when VERBOSE is specified, the stats for workers are displayed separately. Author: Amit Khandekar and Andres Freund Discussion: https://postgr.es/m/CAJ3gD9eLrz51RK_gTkod+71iDcjpB_N8eC6vU2AW-VicsAERpQ@mail.gmail.com Backpatch: 11-
* Fix failure with initplans used conditionally during EvalPlanQual rechecks.Tom Lane2018-09-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The EvalPlanQual machinery assumes that any initplans (that is, uncorrelated sub-selects) used during an EPQ recheck would have already been evaluated during the main query; this is implicit in the fact that execPlan pointers are not copied into the EPQ estate's es_param_exec_vals. But it's possible for that assumption to fail, if the initplan is only reached conditionally. For example, a sub-select inside a CASE expression could be reached during a recheck when it had not been previously, if the CASE test depends on a column that was just updated. This bug is old, appearing to date back to my rewrite of EvalPlanQual in commit 9f2ee8f28, but was not detected until Kyle Samson reported a case. To fix, force all not-yet-evaluated initplans used within the EPQ plan subtree to be evaluated at the start of the recheck, before entering the EPQ environment. This could be inefficient, if such an initplan is expensive and goes unused again during the recheck --- but that's piling one layer of improbability atop another. It doesn't seem worth adding more complexity to prevent that, at least not in the back branches. It was convenient to use the new-in-v11 ExecEvalParamExecParams function to implement this, but I didn't like either its name or the specifics of its API, so revise that. Back-patch all the way. Rather than rewrite the patch to avoid depending on bms_next_member() in the oldest branches, I chose to back-patch that function into 9.4 and 9.3. (This isn't the first time back-patches have needed that, and it exhausted my patience.) I also chose to back-patch some test cases added by commits 71404af2a and 342a1ffa2 into 9.4 and 9.3, so that the 9.x versions of eval-plan-qual.spec are all the same. Andrew Gierth diagnosed the problem and contributed the added test cases, though the actual code changes are by me. Discussion: https://postgr.es/m/A033A40A-B234-4324-BE37-272279F7B627@tripadvisor.com
* Match the buffer usage tracking for leader and worker backends.Amit Kapila2018-08-03
| | | | | | | | | | | | | In the leader backend, we don't track the buffer usage for ExecutorStart phase whereas in worker backend we track it for ExecutorStart phase as well. This leads to different value for buffer usage stats for the parallel and non-parallel query. Change the code so that worker backend also starts tracking buffer usage after ExecutorStart. Author: Amit Kapila and Robert Haas Reviewed-by: Robert Haas and Andres Freund Backpatch-through: 9.6 where this code was introduced Discussion: https://postgr.es/m/86137f17-1dfb-42f9-7421-82fd786b04a1@anayrat.info
* Basic planner and executor integration for JIT.Andres Freund2018-03-22
| | | | | | | | | | | | | | | | | | | | | This adds simple cost based plan time decision about whether JIT should be performed. jit_above_cost, jit_optimize_above_cost are compared with the total cost of a plan, and if the cost is above them JIT is performed / optimization is performed respectively. For that PlannedStmt and EState have a jitFlags (es_jit_flags) field that stores information about what JIT operations should be performed. EState now also has a new es_jit field, which can store a JitContext. When there are no errors the context is released in standard_ExecutorEnd(). It is likely that the default values for jit_[optimize_]above_cost will need to be adapted further, but in my test these values seem to work reasonably. Author: Andres Freund, with feedback by Peter Eisentraut Discussion: https://postgr.es/m/20170901064131.tazjxwus3k2w3ybh@alap3.anarazel.de
* Support parallel btree index builds.Robert Haas2018-02-02
| | | | | | | | | | | | | | | | | | | | | | To make this work, tuplesort.c and logtape.c must also support parallelism, so this patch adds that infrastructure and then applies it to the particular case of parallel btree index builds. Testing to date shows that this can often be 2-3x faster than a serial index build. The model for deciding how many workers to use is fairly primitive at present, but it's better than not having the feature. We can refine it as we get more experience. Peter Geoghegan with some help from Rushabh Lathia. While Heikki Linnakangas is not an author of this patch, he wrote other patches without which this feature would not have been possible, and therefore the release notes should possibly credit him as an author of this feature. Reviewed by Claudio Freire, Heikki Linnakangas, Thomas Munro, Tels, Amit Kapila, me. Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
* Update copyright for 2018Bruce Momjian2018-01-02
| | | | Backpatch-through: certain files through 9.3
* Add parallel-aware hash joins.Andres Freund2017-12-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel Hash Join with Parallel Hash. While hash joins could already appear in parallel queries, they were previously always parallel-oblivious and had a partial subplan only on the outer side, meaning that the work of the inner subplan was duplicated in every worker. After this commit, the planner will consider using a partial subplan on the inner side too, using the Parallel Hash node to divide the work over the available CPU cores and combine its results in shared memory. If the join needs to be split into multiple batches in order to respect work_mem, then workers process different batches as much as possible and then work together on the remaining batches. The advantages of a parallel-aware hash join over a parallel-oblivious hash join used in a parallel query are that it: * avoids wasting memory on duplicated hash tables * avoids wasting disk space on duplicated batch files * divides the work of building the hash table over the CPUs One disadvantage is that there is some communication between the participating CPUs which might outweigh the benefits of parallelism in the case of small hash tables. This is avoided by the planner's existing reluctance to supply partial plans for small scans, but it may be necessary to estimate synchronization costs in future if that situation changes. Another is that outer batch 0 must be written to disk if multiple batches are required. A potential future advantage of parallel-aware hash joins is that right and full outer joins could be supported, since there is a single set of matched bits for each hashtable, but that is not yet implemented. A new GUC enable_parallel_hash is defined to control the feature, defaulting to on. Author: Thomas Munro Reviewed-By: Andres Freund, Robert Haas Tested-By: Rafia Sabih, Prabhat Sahu Discussion: https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
* When passing query strings to workers, pass the terminating \0.Robert Haas2017-12-20
| | | | | | | | | Otherwise, when the query string is read, we might trailing garbage beyond the end, unless there happens to be a \0 there by good luck. Report and patch by Thomas Munro. Reviewed by Rafia Sabih. Discussion: http://postgr.es/m/CAEepm=2SJs7X+_vx8QoDu8d1SMEOxtLhxxLNzZun_BvNkuNhrw@mail.gmail.com
* Try again to fix accumulation of parallel worker instrumentation.Robert Haas2017-12-19
| | | | | | | | | | | | | | | | | When a Gather or Gather Merge node is started and stopped multiple times, accumulate instrumentation data only once, at the end, instead of after each execution, to avoid recording inflated totals. Commit 778e78ae9fa51e58f41cbdc72b293291d02d8984, the previous attempt at a fix, instead reset the state after every execution, which worked for the general instrumentation data but had problems for the additional instrumentation specific to Sort and Hash nodes. Report by hubert depesz lubaczewski. Analysis and fix by Amit Kapila, following a design proposal from Thomas Munro, with a comment tweak by me. Discussion: http://postgr.es/m/20171127175631.GA405@depesz.com
* Fix crashes on plans with multiple Gather (Merge) nodes.Robert Haas2017-12-18
| | | | | | | | | | | | | | | | es_query_dsa turns out to be broken by design, because it supposes that there is only one DSA for the whole query, whereas there is actually one per Gather (Merge) node. For now, work around that problem by setting and clearing the pointer around the sections of code that might need it. It's probably a better idea to get rid of es_query_dsa altogether in favor of having each node keep track individually of which DSA is relevant, but that seems like more than we would want to back-patch. Thomas Munro, reviewed and tested by Andreas Seltenreich, Amit Kapila, and by me. Discussion: http://postgr.es/m/CAEepm=1U6as=brnVvMNixEV2tpi8NuyQoTmO8Qef0-VV+=7MDA@mail.gmail.com
* Revert "Fix accumulation of parallel worker instrumentation."Robert Haas2017-12-13
| | | | | | | This reverts commit 2c09a5c12a66087218c7f8cba269cd3de51b9b82. Per further discussion, that doesn't seem to be the best possible fix. Discussion: http://postgr.es/m/CAA4eK1LW2aFKzY3=vwvc=t-juzPPVWP2uT1bpx_MeyEqnM+p8g@mail.gmail.com
* Support Parallel Append plan nodes.Robert Haas2017-12-05
| | | | | | | | | | | | | | | | | | | | When we create an Append node, we can spread out the workers over the subplans instead of piling on to each subplan one at a time, which should typically be a bit more efficient, both because the startup cost of any plan executed entirely by one worker is paid only once and also because of reduced contention. We can also construct Append plans using a mix of partial and non-partial subplans, which may allow for parallelism in places that otherwise couldn't support it. Unfortunately, this patch doesn't handle the important case of parallelizing UNION ALL by running each branch in a separate worker; the executor infrastructure is added here, but more planner work is needed. Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and Rajkumar Raghuwanshi. Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
* Fix accumulation of parallel worker instrumentation.Robert Haas2017-12-05
| | | | | | | | | | | | | When a Gather or Gather Merge node is started and stopped multiple times, the old code wouldn't reset the shared state between executions, potentially resulting in dramatically inflated instrumentation data for nodes beneath it. (The per-worker instrumentation ended up OK, I think, but the overall totals were inflated.) Report by hubert depesz lubaczewski. Analysis and fix by Amit Kapila, reviewed and tweaked a bit by me. Discussion: http://postgr.es/m/20171127175631.GA405@depesz.com
* Fix EXPLAIN ANALYZE of hash join when the leader doesn't participate.Andres Freund2017-12-05
| | | | | | | | | | | | | | | | | | | | If a hash join appears in a parallel query, there may be no hash table available for explain.c to inspect even though a hash table may have been built in other processes. This could happen either because parallel_leader_participation was set to off or because the leader happened to hit the end of the outer relation immediately (even though the complete relation is not empty) and decided not to build the hash table. Commit bf11e7ee introduced a way for workers to exchange instrumentation via the DSM segment for Sort nodes even though they are not parallel-aware. This commit does the same for Hash nodes, so that explain.c has a way to find instrumentation data from an arbitrary participant that actually built the hash table. Author: Thomas Munro Reviewed-By: Andres Freund Discussion: https://postgr.es/m/CAEepm%3D3DUQC2-z252N55eOcZBer6DPdM%3DFzrxH9dZc5vYLsjaA%40mail.gmail.com
* Pass eflags down to parallel workers.Robert Haas2017-11-20
| | | | | | | | | | | | | | | | | | Currently, there are no known consequences of this oversight, so no back-patch. Several of the EXEC_FLAG_* constants aren't usable in parallel mode anyway, and potential problems related to the presence or absence of OIDs (see EXEC_FLAG_WITH_OIDS, EXEC_FLAG_WITHOUT_OIDS) seem at present to be masked by the unconditional projection step performed by Gather and Gather Merge. In general, however, it seems important that all participants agree on the values of these flags, which modify executor behavior globally, and a pending patch to skip projection in Gather (Merge) would be outright broken in certain cases without this fix. Patch by me, based on investigation of a test case provided by Amit Kapila. This patch was also reviewed by Amit Kapila. Discussion: http://postgr.es/m/CA+TgmoZ0ZL=cesZFq8c9NnfK6bqy-wwUd3_74iYGodYrSoQ7Fw@mail.gmail.com
* Provide DSM segment to ExecXXXInitializeWorker functions.Andres Freund2017-11-16
| | | | | | | | | | | | Previously, executor nodes running in parallel worker processes didn't have access to the dsm_segment object used for parallel execution. In order to support resource management based on DSM segment lifetime, they need that. So create a ParallelWorkerContext object to hold it and pass it to all InitializeWorker functions. Author: Thomas Munro Reviewed-By: Andres Freund Discussion: https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
* Pass InitPlan values to workers via Gather (Merge).Robert Haas2017-11-16
| | | | | | | | | | | | | | | | | | | | | | | | | | If a PARAM_EXEC parameter is used below a Gather (Merge) but the InitPlan that computes it is attached to or above the Gather (Merge), force the value to be computed before starting parallelism and pass it down to all workers. This allows us to use parallelism in cases where it previously would have had to be rejected as unsafe. We do - in this case - lose the optimization that the value is only computed if it's actually used. An alternative strategy would be to have the first worker that needs the value compute it, but one downside of that approach is that we'd then need to select a parallel-safe path to compute the parameter value; it couldn't for example contain a Gather (Merge) node. At some point in the future, we might want to consider both approaches. Independent of that consideration, there is a great deal more work that could be done to make more kinds of PARAM_EXEC parameters parallel-safe. This infrastructure could be used to allow a Gather (Merge) on the inner side of a nested loop (although that's not a very appealing plan) and cases where the InitPlan is attached below the Gather (Merge) could be addressed as well using various techniques. But this is a good start. Amit Kapila, reviewed and revised by me. Reviewing and testing from Kuntal Ghosh, Haribabu Kommi, and Tushar Ahuja. Discussion: http://postgr.es/m/CAA4eK1LV0Y1AUV4cUCdC+sYOx0Z0-8NAJ2Pd9=UKsbQ5Sr7+JQ@mail.gmail.com
* Track in the plan the types associated with PARAM_EXEC parameters.Robert Haas2017-11-13
| | | | | | | | | | | | Up until now, we only tracked the number of parameters, which was sufficient to allocate an array of Datums of the appropriate size, but not sufficient to, for example, know how to serialize a Datum stored in one of those slots. An upcoming patch wants to do that, so add this tracking to make it possible. Patch by me, reviewed by Tom Lane and Amit Kapila. Discussion: http://postgr.es/m/CA+TgmoYqpxDKn8koHdW8BEKk8FMUL0=e8m2Qe=M+r0UBjr3tuQ@mail.gmail.com
* pg_stat_statements: Widen query IDs from 32 bits to 64 bits.Robert Haas2017-10-11
| | | | | | | | | | | | | | | | | | This takes advantage of the infrastructure introduced by commit 81c5e46c490e2426db243eada186995da5bb0ba7 to greatly reduce the likelihood that two different queries will end up with the same query ID. It's still possible, of course, but whereas before it the chances of a collision reached 25% around 50,000 queries, it will now take more than 3 billion queries. Backward incompatibility: Because the type exposed at the SQL level is int8, users may now see negative query IDs in the pg_stat_statements view (and also, query IDs more than 4 billion, which was the old limit). Patch by me, reviewed by Michael Paquier and Peter Geoghegan. Discussion: http://postgr.es/m/CA+TgmobG_Kp4cBKFmsznUAaM1GWW6hhRNiZC0KjRMOOeYnz5Yw@mail.gmail.com
* Remove TupleDesc remapping logic from tqueue.c.Andres Freund2017-09-14
| | | | | | | | | | With the introduction of a shared memory record typmod registry, it is no longer necessary to remap record typmods when sending tuples between backends so most of tqueue.c can be removed. Author: Thomas Munro Reviewed-By: Andres Freund Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
* Reduce excessive dereferencing of function pointersPeter Eisentraut2017-09-07
| | | | | | | | | | | | It is equivalent in ANSI C to write (*funcptr) () and funcptr(). These two styles have been applied inconsistently. After discussion, we'll use the more verbose style for plain function pointer variables, to make it clear that it's a variable, and the shorter style when the function pointer is in a struct (s.func() or s->func()), because then it's clear that it's not a plain function name, and otherwise the excessive punctuation makes some of those invocations hard to read. Discussion: https://www.postgresql.org/message-id/f52c16db-14ed-757d-4b48-7ef360b1631d@2ndquadrant.com
* Improve division of labor between execParallel.c and nodeGather[Merge].c.Tom Lane2017-09-01
| | | | | | | | | | | | | | | | Move the responsibility for creating/destroying TupleQueueReaders into execParallel.c, to avoid duplicative coding in nodeGather.c and nodeGatherMerge.c. Also, instead of having DestroyTupleQueueReader do shm_mq_detach, do it in the caller (which is now only ExecParallelFinish). This means execParallel.c does both the attaching and detaching of the tuple-queue-reader shm_mqs, which seems less weird than the previous arrangement. These changes also eliminate a vestigial memory leak (of the pei->tqueue array). It's now demonstrable that rescans of Gather or GatherMerge don't leak memory. Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
* Separate reinitialization of shared parallel-scan state from ExecReScan.Tom Lane2017-08-30
| | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the parallel executor logic did reinitialization of shared state within the ExecReScan code for parallel-aware scan nodes. This is problematic, because it means that the ExecReScan call has to occur synchronously (ie, during the parent Gather node's ReScan call). That is swimming very much against the tide so far as the ExecReScan machinery is concerned; the fact that it works at all today depends on a lot of fragile assumptions, such as that no plan node between Gather and a parallel-aware scan node is parameterized. Another objection is that because ExecReScan might be called in workers as well as the leader, hacky extra tests are needed in some places to prevent unwanted shared-state resets. Hence, let's separate this code into two functions, a ReInitializeDSM call and the ReScan call proper. ReInitializeDSM is called only in the leader and is guaranteed to run before we start new workers. ReScan is returned to its traditional function of resetting only local state, which means that ExecReScan's usual habits of delaying or eliminating child rescan calls are safe again. As with the preceding commit 7df2c1f8d, it doesn't seem to be necessary to make these changes in 9.6, which is a good thing because the FDW and CustomScan APIs are impacted. Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com