aboutsummaryrefslogtreecommitdiff
path: root/src/backend/executor/nodeGather.c
Commit message (Collapse)AuthorAge
* Fix mistaken failure to allow parallelism in corner case.Robert Haas2017-10-27
| | | | | | | | | | | | | | | If we try to run a parallel plan in serial mode because, for example, it's going to be scanned via a cursor, but for some reason we're already in parallel mode (for example because an outer query is running in parallel), we'd incorrectly try to launch workers. Fix by adding a flag to the EState, so that we can be certain that ExecutePlan() and ExecGather()/ExecGatherMerge() will have the same idea about whether we are executing serially or in parallel. Report and fix by Amit Kapila with help from Kuntal Ghosh. A few tweaks by me. Discussion: http://postgr.es/m/CAA4eK1+_BuZrmVCeua5Eqnm4Co9DAXdM5HPAOE2J19ePbR912Q@mail.gmail.com
* Improve division of labor between execParallel.c and nodeGather[Merge].c.Tom Lane2017-09-01
| | | | | | | | | | | | | | | | Move the responsibility for creating/destroying TupleQueueReaders into execParallel.c, to avoid duplicative coding in nodeGather.c and nodeGatherMerge.c. Also, instead of having DestroyTupleQueueReader do shm_mq_detach, do it in the caller (which is now only ExecParallelFinish). This means execParallel.c does both the attaching and detaching of the tuple-queue-reader shm_mqs, which seems less weird than the previous arrangement. These changes also eliminate a vestigial memory leak (of the pei->tqueue array). It's now demonstrable that rescans of Gather or GatherMerge don't leak memory. Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
* Code review for nodeGatherMerge.c.Tom Lane2017-08-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Comment the fields of GatherMergeState, and organize them a bit more sensibly. Comment GMReaderTupleBuffer more usefully too. Improve assorted other comments that were obsolete or just not very good English. Get rid of the use of a GMReaderTupleBuffer for the leader process; that was confusing, since only the "done" field was used, and that in a way redundant with need_to_scan_locally. In gather_merge_init, avoid calling load_tuple_array for already-known-exhausted workers. I'm not sure if there's a live bug there, but the case is unlikely to be well tested due to timing considerations. Remove some useless code, such as duplicating the tts_isempty test done by TupIsNull. Remove useless initialization of ps.qual, replacing that with an assertion that we have no qual to check. (If we did, the code would fail to check it.) Avoid applying heap_copytuple to a null tuple. While that fails to crash, it's confusing and it makes the code less legible not more so IMO. Propagate a couple of these changes into nodeGather.c, as well. Back-patch to v10, partly because of the possibility that the gather_merge_init change is fixing a live bug, but mostly to keep the branches in sync to ease future bug fixes.
* Separate reinitialization of shared parallel-scan state from ExecReScan.Tom Lane2017-08-30
| | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the parallel executor logic did reinitialization of shared state within the ExecReScan code for parallel-aware scan nodes. This is problematic, because it means that the ExecReScan call has to occur synchronously (ie, during the parent Gather node's ReScan call). That is swimming very much against the tide so far as the ExecReScan machinery is concerned; the fact that it works at all today depends on a lot of fragile assumptions, such as that no plan node between Gather and a parallel-aware scan node is parameterized. Another objection is that because ExecReScan might be called in workers as well as the leader, hacky extra tests are needed in some places to prevent unwanted shared-state resets. Hence, let's separate this code into two functions, a ReInitializeDSM call and the ReScan call proper. ReInitializeDSM is called only in the leader and is guaranteed to run before we start new workers. ReScan is returned to its traditional function of resetting only local state, which means that ExecReScan's usual habits of delaying or eliminating child rescan calls are safe again. As with the preceding commit 7df2c1f8d, it doesn't seem to be necessary to make these changes in 9.6, which is a good thing because the FDW and CustomScan APIs are impacted. Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
* Force rescanning of parallel-aware scan nodes below a Gather[Merge].Tom Lane2017-08-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ExecReScan machinery contains various optimizations for postponing or skipping rescans of plan subtrees; for example a HashAgg node may conclude that it can re-use the table it built before, instead of re-reading its input subtree. But that is wrong if the input contains a parallel-aware table scan node, since the portion of the table scanned by the leader process is likely to vary from one rescan to the next. This explains the timing-dependent buildfarm failures we saw after commit a2b70c89c. The established mechanism for showing that a plan node's output is potentially variable is to mark it as depending on some runtime Param. Hence, to fix this, invent a dummy Param (one that has a PARAM_EXEC parameter number, but carries no actual value) associated with each Gather or GatherMerge node, mark parallel-aware nodes below that node as dependent on that Param, and arrange for ExecReScanGather[Merge] to flag that Param as changed whenever the Gather[Merge] node is rescanned. This solution breaks an undocumented assumption made by the parallel executor logic, namely that all rescans of nodes below a Gather[Merge] will happen synchronously during the ReScan of the top node itself. But that's fundamentally contrary to the design of the ExecReScan code, and so was doomed to fail someday anyway (even if you want to argue that the bug being fixed here wasn't a failure of that assumption). A follow-on patch will address that issue. In the meantime, the worst that's expected to happen is that given very bad timing luck, the leader might have to do all the work during a rescan, because workers think they have nothing to do, if they are able to start up before the eventual ReScan of the leader's parallel-aware table scan node has reset the shared scan state. Although this problem exists in 9.6, there does not seem to be any way for it to manifest there. Without GatherMerge, it seems that a plan tree that has a rescan-short-circuiting node below Gather will always also have one above it that will short-circuit in the same cases, preventing the Gather from being rescanned. Hence we won't take the risk of back-patching this change into 9.6. But v10 needs it. Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
* Move ExecProcNode from dispatch to function pointer based model.Andres Freund2017-07-30
| | | | | | | | | | | | | | | | | | | | | | This allows us to add stack-depth checks the first time an executor node is called, and skip that overhead on following calls. Additionally it yields a nice speedup. While it'd probably have been a good idea to have that check all along, it has become more important after the new expression evaluation framework in b8d7f053c5c2bf2a7e - there's no stack depth check in common paths anymore now. We previously relied on ExecEvalExpr() being executed somewhere. We should move towards that model for further routines, but as this is required for v10, it seems better to only do the necessary (which already is quite large). Author: Andres Freund, Tom Lane Reported-By: Julien Rouhaud Discussion: https://postgr.es/m/22833.1490390175@sss.pgh.pa.us https://postgr.es/m/b0af9eaa-130c-60d0-9e4e-7a135b1e0c76@dalibo.com
* Move interrupt checking from ExecProcNode() to executor nodes.Andres Freund2017-07-30
| | | | | | | | | | | | | | | | | In a followup commit ExecProcNode(), and especially the large switch it contains, will largely be replaced by a function pointer directly to the correct node. The node functions will then get invoked by a thin inline function wrapper. To avoid having to include miscadmin.h in headers - CHECK_FOR_INTERRUPTS() - move the interrupt checks into the individual executor routines. While looking through all executor nodes, I noticed a number of arguably missing interrupt checks, add these too. Author: Andres Freund, Tom Lane Reviewed-By: Tom Lane Discussion: https://postgr.es/m/22833.1490390175@sss.pgh.pa.us
* Phase 2 of pgindent updates.Tom Lane2017-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
* Post-PG 10 beta1 pgindent runBruce Momjian2017-05-17
| | | | perltidy run not included.
* Faster expression evaluation and targetlist projection.Andres Freund2017-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces the old, recursive tree-walk based evaluation, with non-recursive, opcode dispatch based, expression evaluation. Projection is now implemented as part of expression evaluation. This both leads to significant performance improvements, and makes future just-in-time compilation of expressions easier. The speed gains primarily come from: - non-recursive implementation reduces stack usage / overhead - simple sub-expressions are implemented with a single jump, without function calls - sharing some state between different sub-expressions - reduced amount of indirect/hard to predict memory accesses by laying out operation metadata sequentially; including the avoidance of nearly all of the previously used linked lists - more code has been moved to expression initialization, avoiding constant re-checks at evaluation time Future just-in-time compilation (JIT) has become easier, as demonstrated by released patches intended to be merged in a later release, for primarily two reasons: Firstly, due to a stricter split between expression initialization and evaluation, less code has to be handled by the JIT. Secondly, due to the non-recursive nature of the generated "instructions", less performance-critical code-paths can easily be shared between interpreted and compiled evaluation. The new framework allows for significant future optimizations. E.g.: - basic infrastructure for to later reduce the per executor-startup overhead of expression evaluation, by caching state in prepared statements. That'd be helpful in OLTPish scenarios where initialization overhead is measurable. - optimizing the generated "code". A number of proposals for potential work has already been made. - optimizing the interpreter. Similarly a number of proposals have been made here too. The move of logic into the expression initialization step leads to some backward-incompatible changes: - Function permission checks are now done during expression initialization, whereas previously they were done during execution. In edge cases this can lead to errors being raised that previously wouldn't have been, e.g. a NULL array being coerced to a different array type previously didn't perform checks. - The set of domain constraints to be checked, is now evaluated once during expression initialization, previously it was re-built every time a domain check was evaluated. For normal queries this doesn't change much, but e.g. for plpgsql functions, which caches ExprStates, the old set could stick around longer. The behavior around might still change. Author: Andres Freund, with significant changes by Tom Lane, changes by Heikki Linnakangas Reviewed-By: Tom Lane, Heikki Linnakangas Discussion: https://postgr.es/m/20161206034955.bh33paeralxbtluv@alap3.anarazel.de
* Shut down Gather's children before shutting down Gather itself.Robert Haas2017-02-22
| | | | | | | | | | | | | It turns out that the original shutdown order here does not work well. Multiple people attempting to develop further parallel query patches have discovered that they need to do cleanup before the DSM goes away, and you can't do that if the parent node gets cleaned up first. Patch by me, reviewed by KaiGai Kohei and Dilip Kumar. Discussion: http://postgr.es/m/CA+TgmoY6bOc1YnhcAQnMfCBDbsJzROQ3sYxSAL-SYB5tMJcTKg@mail.gmail.com Discussion: http://postgr.es/m/9A28C8860F777E439AA12E8AEA7694F8012AEB82@BPXM15GP.gisp.nec.co.jp Discussion: http://postgr.es/m/CA+TgmoYuPOc=+xrG1v0fCsoLbKAab9F1ddOeaaiLMzKOiBar1Q@mail.gmail.com
* Remove no-longer-needed loop in ExecGather().Tom Lane2017-01-22
| | | | | | | | | Coverity complained quite properly that commit ea15e1867 had introduced unreachable code into ExecGather(); to wit, it was no longer possible to iterate the final for-loop more or less than once. So remove the for(). In passing, clean up a couple of comments, and make better use of a local variable.
* Remove obsoleted code relating to targetlist SRF evaluation.Andres Freund2017-01-19
| | | | | | | | | | | | | Since 69f4b9c plain expression evaluation (and thus normal projection) can't return sets of tuples anymore. Thus remove code dealing with that possibility. This will require adjustments in external code using ExecEvalExpr()/ExecProject() - that should neither be hard nor very common. Author: Andres Freund and Tom Lane Discussion: https://postgr.es/m/20160822214023.aaxz5l4igypowyri@alap3.anarazel.de
* Update copyright via script for 2017Bruce Momjian2017-01-03
|
* Ensure gatherstate->nextreader is properly initialized.Robert Haas2016-12-05
| | | | | | | | | The previously code worked OK as long as a Gather node was never rescanned, or if it was rescanned, as long as it got at least as many workers on rescan as it had originally. But if the number of workers ever decreased on a rescan, then it could crash. Andreas Seltenreich
* Extend framework from commit 53be0b1ad to report latch waits.Robert Haas2016-10-04
| | | | | | | | | | | | | | | | | | | | | | WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now taken an additional wait_event_info parameter; legal values are defined in pgstat.h. This makes it possible to uniquely identify every point in the core code where we are waiting for a latch; extensions can pass WAIT_EXTENSION. Because latches were the major wait primitive not previously covered by this patch, it is now possible to see information in pg_stat_activity on a large number of important wait events not previously addressed, such as ClientRead, ClientWrite, and SyncRep. Unfortunately, many of the wait events added by this patch will fail to appear in pg_stat_activity because they're only used in background processes which don't currently appear in pg_stat_activity. We should fix this either by creating a separate view for such information, or else by deciding to include them in pg_stat_activity after all. Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and Thomas Munro.
* Don't CHECK_FOR_INTERRUPTS between WaitLatch and ResetLatch.Tom Lane2016-08-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | This coding pattern creates a race condition, because if an interesting interrupt happens after we've checked InterruptPending but before we reset our latch, the latch-setting done by the signal handler would get lost, and then we might block at WaitLatch in the next iteration without ever noticing the interrupt condition. You can put the CHECK_FOR_INTERRUPTS before WaitLatch or after ResetLatch, but not between them. Aside from fixing the bugs, add some explanatory comments to latch.h to perhaps forestall the next person from making the same mistake. In HEAD, also replace gather_readnext's direct call of HandleParallelMessages with CHECK_FOR_INTERRUPTS. It does not seem clean or useful for this one caller to bypass ProcessInterrupts and go straight to HandleParallelMessages; not least because that fails to consider the InterruptPending flag, resulting in useless work both here (if InterruptPending isn't set) and in the next CHECK_FOR_INTERRUPTS call (if it is). This thinko seems to have been introduced in the initial coding of storage/ipc/shm_mq.c (commit ec9037df2), and then blindly copied into all the subsequent parallel-query support logic. Back-patch relevant hunks to 9.4 to extirpate the error everywhere. Discussion: <1661.1469996911@sss.pgh.pa.us>
* Fix worst memory leaks in tqueue.c.Tom Lane2016-07-29
| | | | | | | | | | | | | | | | | | | | | | | | | TupleQueueReaderNext() leaks like a sieve if it has to do any tuple disassembly/reconstruction. While we could try to clean up its allocations piecemeal, it seems like a better idea just to insist that it should be run in a short-lived memory context, so that any transient space goes away automatically. I chose to have nodeGather.c switch into its existing per-tuple context before the call, rather than inventing a separate context inside tqueue.c. This is sufficient to stop all leakage in the simple case I exhibited earlier today (see link below), but it does not deal with leaks induced in more complex cases by tqueue.c's insistence on using TopMemoryContext for data that it's not actually trying hard to keep track of. That issue is intertwined with another major source of inefficiency, namely failure to cache lookup results across calls, so it seems best to deal with it separately. In passing, improve some comments, and modify gather_readnext's method for deciding when it's visited all the readers so that it's more obviously correct. (I'm not actually convinced that the previous code *is* correct in the case of a reader deletion; it certainly seems fragile.) Discussion: <32763.1469821037@sss.pgh.pa.us>
* pgindent run for 9.6Robert Haas2016-06-09
|
* Tweak EXPLAIN for parallel query to show workers launched.Robert Haas2016-04-15
| | | | | | | | | The previous display was sort of confusing, because it didn't distinguish between the number of workers that we planned to launch and the number that actually got launched. This has already confused several people, so display both numbers and label them clearly. Julien Rouhaud, reviewed by me.
* Minor optimizations based on ParallelContext having nworkers_launched.Robert Haas2016-03-04
| | | | | | | | | | | Originally, we didn't have nworkers_launched, so code that used parallel contexts had to be preprared for the possibility that not all of the workers requested actually got launched. But now we can count on knowing the number of workers that were successfully launched, which can shave off a few cycles and simplify some code slightly. Amit Kapila, reviewed by Haribabu Kommi, per a suggestion from Peter Geoghegan.
* Fix spelling mistakes.Robert Haas2016-01-14
| | | | Same patch submitted independently by David Rowley and Peter Geoghegan.
* Update copyright for 2016Bruce Momjian2016-01-02
| | | | Backpatch certain files through 9.1
* Read from the same worker repeatedly until it returns no tuple.Robert Haas2015-12-23
| | | | | | | | | | | | | The original coding read tuples from workers in round-robin fashion, but performance testing shows that it works much better to read enough to empty one queue before moving on to the next. I believe the reason for this is that, with the old approach, we could easily wake up a worker repeatedly to write only one new tuple into the shm_mq each time. With this approach, by the time the process gets scheduled, it has a decent chance of being able to fill the entire buffer in one go. Patch by me. Dilip Kumar helped with performance testing.
* Fix obsolete comment.Robert Haas2015-11-30
| | | | | | It's amazing how fast things become obsolete these days. Amit Langote
* Avoid server crash when worker registration fails at execution time.Robert Haas2015-11-20
| | | | | | | | | | The previous coding attempts to destroy the DSM in this case, but child nodes might have stored data there and still be holding onto pointers in this case. So don't do that. Also, free the reader array instead of leaking it. Extracted from two different patch versions both by Amit Kapila.
* Avoid aggregating worker instrumentation multiple times.Robert Haas2015-11-18
| | | | Amit Kapila, per design ideas from me.
* Add missing "static" qualifier.Tom Lane2015-11-10
| | | | Per buildfarm member pademelon.
* Fix rebasing mistake in nodeGather.cRobert Haas2015-11-09
| | | | | | | | | The patches committed as 6e71dd7ce9766582da453f493bc371d64977282f and 3a1f8611f2582df0a16bcd35caed2e1526387643 were developed in parallel but dependent on each other in a way that I failed to notice. This patch to fix the problem was prepared by Amit Kapila.
* Modify tqueue infrastructure to support transient record types.Robert Haas2015-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | Commit 4a4e6893aa080b9094dadbe0e65f8a75fee41ac6, which introduced this mechanism, failed to account for the fact that the RECORD pseudo-type uses transient typmods that are only meaningful within a single backend. Transferring such tuples without modification between two cooperating backends does not work. This commit installs a system for passing the tuple descriptors over the same shm_mq being used to send the tuples themselves. The two sides might not assign the same transient typmod to any given tuple descriptor, so we must also substitute the appropriate receiver-side typmod for the one used by the sender. That adds some CPU overhead, but still seems better than being unable to pass records between cooperating parallel processes. Along the way, move the logic for handling multiple tuple queues from tqueue.c to nodeGather.c; tqueue.c now provides a TupleQueueReader, which reads from a single queue, rather than a TupleQueueFunnel, which potentially reads from multiple queues. This change was suggested previously as a way to make sure that nodeGather.c rather than tqueue.c had policy control over the order in which to read from queues, but it wasn't clear to me until now how good an idea it was. typmod mapping needs to be performed separately for each queue, and it is much simpler if the tqueue.c code handles that and leaves multiplexing multiple queues to higher layers of the stack.
* Update parallel executor support to reuse the same DSM.Robert Haas2015-10-30
| | | | | | | | | | | | | | | | Commit b0b0d84b3d663a148022e900ebfc164284a95f55 purported to make it possible to relaunch workers using the same parallel context, but it had an unpleasant race condition: we might reinitialize after the workers have sent their last control message but before they have dettached the DSM, leaving to crashes. Repair by introducing a new ParallelContext operation, ReinitializeParallelDSM. Adjust execParallel.c to use this new support, so that we can rescan a Gather node by relaunching workers but without needing to recreate the DSM. Amit Kapila, with some adjustments by me. Extracted from latest parallel sequential scan patch.
* Make Gather node projection-capable.Robert Haas2015-10-28
| | | | | | | | | The original Gather code failed to mark a Gather node as not able to do projection, but it couldn't, even though it did call initialize its projection info via ExecAssignProjectionInfo. There doesn't seem to be any good reason for this node not to have projection capability, so clean things up so that it does. Without this, plans using Gather nodes might need to carry extra Result nodes to do projection.
* Fix typos in comments.Robert Haas2015-10-22
| | | | CharSyam
* Add header comments to execParallel.c and nodeGather.c.Robert Haas2015-10-22
| | | | | Patch by me, per a note from Simon Riggs. Reviewed by Amit Kapila and Amit Langote.
* Rewrite interaction of parallel mode with parallel executor support.Robert Haas2015-10-16
| | | | | | | | | | | | | | In the previous coding, before returning from ExecutorRun, we'd shut down all parallel workers. This was dead wrong if ExecutorRun was called with a non-zero tuple count; it had the effect of truncating the query output. To fix, give ExecutePlan control over whether to enter parallel mode, and have it refuse to do so if the tuple count is non-zero. Rewrite the Gather logic so that it can cope with being called outside parallel mode. Commit 7aea8e4f2daa4b39ca9d1309a0c4aadb0f7ed81b is largely to blame for this problem, though this patch modifies some subsequently-committed code which relied on the guarantees it purported to make.
* Add missing "static" specifier.Tom Lane2015-10-03
| | | | Per buildfarm (pademelon, at least, doesn't like this).
* Add a Gather executor node.Robert Haas2015-09-30
A Gather executor node runs any number of copies of a plan in an equal number of workers and merges all of the results into a single tuple stream. It can also run the plan itself, if the workers are unavailable or haven't started up yet. It is intended to work with the Partial Seq Scan node which will be added in future commits. It could also be used to implement parallel query of a different sort by itself, without help from Partial Seq Scan, if the single_copy mode is used. In that mode, a worker executes the plan, and the parallel leader does not, merely collecting the worker's results. So, a Gather node could be inserted into a plan to split the execution of that plan across two processes. Nested Gather nodes aren't currently supported, but we might want to add support for that in the future. There's nothing in the planner to actually generate Gather nodes yet, so it's not quite time to break out the champagne. But we're getting close. Amit Kapila. Some designs suggestions were provided by me, and I also reviewed the patch. Single-copy mode, documentation, and other minor changes also by me.