postgresql - postgresql mirror

	Commit message (Collapse)	Author	Age
...
*	Add sortsupport for gist_btree opclasses, for faster index builds.	Heikki Linnakangas	2021-04-07
\| \| \| \| \| \| \| \| \|	Commit 16fa9b2b30 introduced a faster way to build GiST indexes, by sorting all the data. This commit adds the sortsupport functions needed to make use of that feature for btree_gist. Author: Andrey Borodin Discussion: https://www.postgresql.org/message-id/2F3F7265-0D22-44DB-AD71-8554C743D943@yandex-team.ru
*	Fix use of cursor sensitivity terminology	Peter Eisentraut	2021-04-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Documentation and comments in code and tests have been using the terms sensitive/insensitive cursor incorrectly relative to the SQL standard. (Cursor sensitivity is only relevant for changes made in the same transaction as the cursor, not for concurrent changes in other sessions.) Moreover, some of the behavior of PostgreSQL is incorrect according to the SQL standard, confusing the issue further. (WHERE CURRENT OF changes are not visible in insensitive cursors, but they should be.) This change corrects the terminology and removes the claim that sensitive cursors are supported. It also adds a test case that checks the insensitive behavior in a "correct" way, using a change command not using WHERE CURRENT OF. Finally, it adds the ASENSITIVE cursor option to select the default asensitive behavior, per SQL standard. There are no changes to cursor behavior in this patch. Discussion: https://www.postgresql.org/message-id/flat/96ee8b30-9889-9e1b-b053-90e10c050e85%40enterprisedb.com
*	Message improvement	Peter Eisentraut	2021-04-07
\| \| \| \| \|	The previous wording contained a superfluous comma. Adjust phrasing for grammatical correctness and clarity.
*	Remove redundant memset(0) calls for page init of some index AMs	Michael Paquier	2021-04-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Bloom, GIN, GiST and SP-GiST rely on PageInit() to initialize the contents of a page, and this routine fills entirely a page with zeros for a size of BLCKSZ, including the special space. Those index AMs have been using an extra memset() call to fill with zeros the special page space, or even the whole page, which is not necessary as PageInit() already does this work, so let's remove them. GiST was not doing this extra call, but has commented out a system call that did so since 6236991. While on it, remove one MAXALIGN() for SP-GiST as PageInit() takes care of that. This makes the whole page initialization logic more consistent across all index AMs. Author: Bharath Rupireddy Reviewed-by: Vignesh C, Mahendra Singh Thalor Discussion: https://postgr.es/m/CALj2ACViOo2qyaPT7krWm4LRyRTw9kOXt+g6PfNmYuGA=YHj9A@mail.gmail.com
*	Add some information about authenticated identity via log_connections	Michael Paquier	2021-04-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The "authenticated identity" is the string used by an authentication method to identify a particular user. In many common cases, this is the same as the PostgreSQL username, but for some third-party authentication methods, the identifier in use may be shortened or otherwise translated (e.g. through pg_ident user mappings) before the server stores it. To help administrators see who has actually interacted with the system, this commit adds the capability to store the original identity when authentication succeeds within the backend's Port, and generates a log entry when log_connections is enabled. The log entries generated look something like this (where a local user named "foouser" is connecting to the database as the database user called "admin"): LOG: connection received: host=[local] LOG: connection authenticated: identity="foouser" method=peer (/data/pg_hba.conf:88) LOG: connection authorized: user=admin database=postgres application_name=psql Port->authn_id is set according to the authentication method: bsd: the PostgreSQL username (aka the local username) cert: the client's Subject DN gss: the user principal ident: the remote username ldap: the final bind DN pam: the PostgreSQL username (aka PAM username) password (and all pw-challenge methods): the PostgreSQL username peer: the peer's pw_name radius: the PostgreSQL username (aka the RADIUS username) sspi: either the down-level (SAM-compatible) logon name, if compat_realm=1, or the User Principal Name if compat_realm=0 The trust auth method does not set an authenticated identity. Neither does clientcert=verify-full. Port->authn_id could be used for other purposes, like a superuser-only extra column in pg_stat_activity, but this is left as future work. PostgresNode::connect_{ok,fails}() have been modified to let tests check the backend log files for required or prohibited patterns, using the new log_like and log_unlike parameters. This uses a method based on a truncation of the existing server log file, like issues_sql_like(). Tests are added to the ldap, kerberos, authentication and SSL test suites. Author: Jacob Champion Reviewed-by: Stephen Frost, Magnus Hagander, Tom Lane, Michael Paquier Discussion: https://postgr.es/m/c55788dd1773c521c862e8e0dddb367df51222be.camel@vmware.com
*	Postpone some more stuff out of ExecInitModifyTable.	Tom Lane	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Delay creation of the projections for INSERT and UPDATE tuples until they're needed. This saves a pretty fair amount of work when only some of the partitions are actually touched. The logic associated with identifying junk columns in UPDATE/DELETE is moved to another loop, allowing removal of one loop over the target relations; but it didn't actually change at all. Extracted from a larger patch, which seemed to me to be too messy to push in one commit. Amit Langote, reviewed at different times by Heikki Linnakangas and myself Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
*	Postpone some stuff out of ExecInitModifyTable.	Tom Lane	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Arrange to do some things on-demand, rather than immediately during executor startup, because there's a fair chance of never having to do them at all: * Don't open result relations' indexes until needed. * Don't initialize partition tuple routing, nor the child-to-root tuple conversion map, until needed. This wins in UPDATEs on partitioned tables when only some of the partitions will actually receive updates; with larger partition counts the savings is quite noticeable. Also, we can remove some sketchy heuristics in ExecInitModifyTable about whether to set up tuple routing. Also, remove execPartition.c's private hash table tracking which partitions were already opened by the ModifyTable node. Instead use the hash added to ModifyTable itself by commit 86dc90056. To allow lazy computation of the conversion maps, we now set ri_RootResultRelInfo in all child ResultRelInfos. We formerly set it only in some, not terribly well-defined, cases. This has user-visible side effects in that now more error messages refer to the root relation instead of some partition (and provide error data in the root's column order, too). It looks to me like this is a strict improvement in consistency, so I don't have a problem with the output changes visible in this commit. Extracted from a larger patch, which seemed to me to be too messy to push in one commit. Amit Langote, reviewed at different times by Heikki Linnakangas and myself Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
*	Increment xactCompletionCount during subtransaction abort.	Andres Freund	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Snapshot caching, introduced in 623a9ba79b, did not increment xactCompletionCount during subtransaction abort. That could lead to an older snapshot being reused. That is, at least as far as I can see, not a correctness issue (for MVCC snapshots there's no difference between "in progress" and "aborted"). The only difference between the old and new snapshots would be a newer ->xmax. While HeapTupleSatisfiesMVCC makes the same visibility determination, reusing the old snapshot leads HeapTupleSatisfiesMVCC to not set HEAP_XMIN_INVALID. Which subsequently causes the kill_prior_tuple optimization to not kick in (via HeapTupleIsSurelyDead() returning false). The performance effects of doing the same index-lookups over and over again is how the issue was discovered... Fix the issue by incrementing xactCompletionCount in XidCacheRemoveRunningXids. It already acquires ProcArrayLock exclusively, making that an easy proposition. Add a test to ensure that kill_prior_tuple prevents index growth when it involves aborted subtransaction of the current transaction. Author: Andres Freund Discussion: https://postgr.es/m/20210406043521.lopeo7bbigad3n6t@alap3.anarazel.de Discussion: https://postgr.es/m/20210317055718.v6qs3ltzrformqoa%40alap3.anarazel.de
*	Remove tupgone special case from vacuumlazy.c.	Peter Geoghegan	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Retry the call to heap_prune_page() in rare cases where there is disagreement between the heap_prune_page() call and the call to HeapTupleSatisfiesVacuum() that immediately follows. Disagreement is possible when a concurrently-aborted transaction makes a tuple DEAD during the tiny window between each step. This was the only case where a tuple considered DEAD by VACUUM still had storage following pruning. VACUUM's definition of dead tuples is now uniformly simple and unambiguous: dead tuples from each page are always LP_DEAD line pointers that were encountered just after we performed pruning (and just before we considered freezing remaining items with tuple storage). Eliminating the tupgone=true special case enables INDEX_CLEANUP=off style skipping of index vacuuming that takes place based on flexible, dynamic criteria. The INDEX_CLEANUP=off case had to know about skipping indexes up-front before now, due to a subtle interaction with the special case (see commit dd695979) -- this was a special case unto itself. Now there are no special cases. And so now it won't matter when or how we decide to skip index vacuuming: it won't affect how pruning behaves, and it won't be affected by any of the implementation details of pruning or freezing. Also remove XLOG_HEAP2_CLEANUP_INFO records. These are no longer necessary because we now rely entirely on heap pruning taking care of recovery conflicts. There is no longer any need to generate recovery conflicts for DEAD tuples that pruning just missed. This also means that heap vacuuming now uses exactly the same strategy for recovery conflicts as index vacuuming always has: REDO routines never need to process a latestRemovedXid from the WAL record, since earlier REDO of the WAL record from pruning is sufficient in all cases. The generic XLOG_HEAP2_CLEAN record type is now split into two new record types to reflect this new division (these are called XLOG_HEAP2_PRUNE and XLOG_HEAP2_VACUUM). Also stop acquiring a super-exclusive lock for heap pages when they're vacuumed during VACUUM's second heap pass. A regular exclusive lock is enough. This is correct because heap page vacuuming is now strictly a matter of setting the LP_DEAD line pointers to LP_UNUSED. No other backend can have a pointer to a tuple located in a pinned buffer that can be invalidated by a concurrent heap page vacuum operation. Heap vacuuming can now be thought of as conceptually similar to index vacuuming and conceptually dissimilar to heap pruning. Heap pruning now has sole responsibility for anything involving the logical contents of the database (e.g., managing transaction status information, recovery conflicts, considering what to do with HOT chains). Index vacuuming and heap vacuuming are now only concerned with recycling garbage items from physical data structures that back the logical database. Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record changes. Credit for the idea of retrying pruning a page to avoid the tupgone case goes to Andres Freund. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
*	Fix missing #include in nodeResultCache.h.	Tom Lane	2021-04-06
\| \| \| \|	Per cpluspluscheck.
*	Fix handling of clauses incompatible with extended statistics	Tomas Vondra	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \|	Handling of incompatible clauses while applying extended statistics was a bit confused - while handling a mix of compatible and incompatible clauses it sometimes incorrectly treated the incompatible clauses as compatible, resulting in a crash. Fixed by reworking the code applying the selected statistics object to make it easier to understand, and adding a proper compatibility check. Reported-by: David Rowley Discussion: https://postgr.es/m/CAApHDvpYT10-nkSp8xXe-nbO3jmoaRyRFHbzh-RWMfAJynqgpQ%40mail.gmail.com
*	Refactor lazy_scan_heap() loop.	Peter Geoghegan	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a lazy_scan_heap() subsidiary function that handles heap pruning and tuple freezing: lazy_scan_prune(). This is a great deal cleaner. The code that remains in lazy_scan_heap()'s per-block loop can now be thought of as code that either comes before or after the call to lazy_scan_prune(), which is now the clear focal point. This division is enforced by the way in which we now manage state. lazy_scan_prune() outputs state (using its own struct) that describes what to do with the page following pruning and freezing (e.g., visibility map maintenance, recording free space in the FSM). It doesn't get passed any special instructional state from the preamble code, though. Also cleanly separate the logic used by a VACUUM with INDEX_CLEANUP=off from the logic used by single-heap-pass VACUUMs. The former case is now structured as the omission of index and heap vacuuming by a two pass VACUUM. The latter case goes back to being used only when the table happens to have no indexes (just as it was before commit a96c41fe). This structure is much more natural, since the whole point of INDEX_CLEANUP=off is to skip the index and heap vacuuming that would otherwise take place. The single-heap-pass case doesn't skip any useful work, though -- it just does heap pruning and heap vacuuming together when the table happens to have no indexes. Both of these changes are preparation for an upcoming patch that generalizes the mechanism used by INDEX_CLEANUP=off. The later patch will allow VACUUM to give up on index and heap vacuuming dynamically, as problems emerge (e.g., with wraparound), so that an affected VACUUM operation can finish up as soon as possible. Also fix a very old bug in single-pass VACUUM VERBOSE output. We were reporting the number of tuples deleted via pruning as a direct substitute for reporting the number of LP_DEAD items removed in a function that deals with the second pass over the heap. But that doesn't work at all -- they're two different things. To fix, start tracking the total number of LP_DEAD items encountered during pruning, and use that in the report instead. A single pass VACUUM will always vacuum away whatever LP_DEAD items a heap page has immediately after it is pruned, so the total number of LP_DEAD items encountered during pruning equals the total number vacuumed-away. (They are _not_ equal in the INDEX_CLEANUP=off case, but that's okay because skipping index vacuuming is now a totally orthogonal concept to one-pass VACUUM.) Also stop reporting the count of LP_UNUSED items in VACUUM VERBOSE output. This makes the output of VACUUM VERBOSE more consistent with log_autovacuum's output (because it never showed information about LP_UNUSED items). VACUUM VERBOSE reported LP_UNUSED items left behind by the last VACUUM, and LP_UNUSED items created via pruning HOT chains during the current VACUUM (it never included LP_UNUSED items left behind by the current VACUUM's second pass over the heap). This makes it useless as an indicator of line pointer bloat, which must have been the original intention. (Like the first VACUUM VERBOSE issue, this issue was arguably an oversight in commit 282d2a03, which added the heap-only tuple optimization.) Finally, stop reporting empty_pages in VACUUM VERBOSE output, and start reporting pages_removed instead. This also makes the output of VACUUM VERBOSE more consistent with log_autovacuum's output (which does not show empty_pages, but does show pages_removed). An empty page isn't meaningfully different to a page that is almost empty, or a page that is empty but for only a small number of remaining LP_UNUSED items. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
*	Clean up treatment of missing default and CHECK-constraint records.	Tom Lane	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Andrew Gierth reported that it's possible to crash the backend if no pg_attrdef record is found to match an attribute that has atthasdef set. AttrDefaultFetch warns about this situation, but then leaves behind a relation tupdesc that has null "adbin" pointer(s), which most places don't guard against. We considered promoting the warning to an error, but throwing errors during relcache load is pretty drastic: it effectively locks one out of using the relation at all. What seems better is to leave the load-time behavior as a warning, but then throw an error in any code path that wants to use a default and can't find it. This confines the error to a subset of INSERT/UPDATE operations on the table, and in particular will at least allow a pg_dump to succeed. Also, we should fix AttrDefaultFetch to not leave any null pointers in the tupdesc, because that just creates an untested bug hazard. While at it, apply the same philosophy of "warn at load, throw error only upon use of the known-missing info" to CHECK constraints. CheckConstraintFetch is very nearly the same logic as AttrDefaultFetch, but for reasons lost in the mists of time, it was throwing ERROR for the same cases that AttrDefaultFetch treats as WARNING. Make the two functions more nearly alike. In passing, get rid of potentially-O(N^2) loops in equalTupleDesc by making AttrDefaultFetch sort the entries after fetching them, so that equalTupleDesc can assume that entries in two equal tupdescs must be in matching order. (CheckConstraintFetch already was sorting CHECK constraints, but equalTupleDesc hadn't been told about it.) There's some argument for back-patching this, but with such a small number of field reports, I'm content to fix it in HEAD. Discussion: https://postgr.es/m/87pmzaq4gx.fsf@news-spur.riddles.org.uk
*	Stop archive recovery if WAL generated with wal_level=minimal is found.	Fujii Masao	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously if hot standby was enabled, archive recovery exited with an error when it found WAL generated with wal_level=minimal. But if hot standby was disabled, it just reported a warning and continued in that case. Which could lead to data loss or errors during normal operation. A warning was emitted, but users could easily miss that and not notice this serious situation until they encountered the actual errors. To improve this situation, this commit changes archive recovery so that it exits with FATAL error when it finds WAL generated with wal_level=minimal whatever the setting of hot standby. This enables users to notice the serious situation soon. The FATAL error is thrown if archive recovery starts from a base backup taken before wal_level is changed to minimal. When archive recovery exits with the error, if users have a base backup taken after setting wal_level to higher than minimal, they can recover the database by starting archive recovery from that newer backup. But note that if such backup doesn't exist, there is no easy way to complete archive recovery, which may make the database server unstartable and users may lose whole database. The commit adds the note about this risk into the document. Even in the case of unstartable database server, previously by just disabling hot standby users could avoid the error during archive recovery, forcibly start up the server and salvage data from it. But note that this commit makes this procedure unavailable at all. Author: Takamichi Osumi Reviewed-by: Laurenz Albe, Kyotaro Horiguchi, David Steele, Fujii Masao Discussion: https://postgr.es/m/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90@OSBPR01MB4888.jpnprd01.prod.outlook.com
*	Adjust input value to WaitEventSetWait() in ExecAppendAsyncEventWait().	Etsuro Fujita	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \|	Adjust the number of events given to WaitEventSetWait() so that it doesn't exceed the maximum number of events in the WaitEventSet given to that function (set->nevents_space) in hopes of making the buildfarm green. Per valgrind failure report from Tom Lane and the buildfarm. Author: Etsuro Fujita Discussion: https://postgr.es/m/3411577.1617289776%40sss.pgh.pa.us
*	ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION	Peter Eisentraut	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \|	At present, if we want to update publications in a subscription, we can use SET PUBLICATION. However, it requires supplying all publications that exists and the new publications. If we want to add new publications, it's inconvenient. The new syntax only supplies the new publications. When the refresh is true, it only refreshes the new publications. Author: Japin Li <japinli@hotmail.com> Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/MEYP282MB166939D0D6C480B7FBE7EFFBB6BC0@MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
*	Change return type of EXTRACT to numeric	Peter Eisentraut	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The previous implementation of EXTRACT mapped internally to date_part(), which returned type double precision (since it was implemented long before the numeric type existed). This can lead to imprecise output in some cases, so returning numeric would be preferrable. Changing the return type of an existing function is a bit risky, so instead we do the following: We implement a new set of functions, which are now called "extract", in parallel to the existing date_part functions. They work the same way internally but use numeric instead of float8. The EXTRACT construct is now mapped by the parser to these new extract functions. That way, dumps of views etc. from old versions (which would use date_part) continue to work unchanged, but new uses will map to the new extract functions. Additionally, the reverse compilation of EXTRACT now reproduces the original syntax, using the new mechanism introduced in 40c24bfef92530bd846e111c1742c2a54441c62c. The following minor changes of behavior result from the new implementation: - The column name from an isolated EXTRACT call is now "extract" instead of "date_part". - Extract from date now rejects inappropriate field names such as HOUR. It was previously mapped internally to extract from timestamp, so it would silently accept everything appropriate for timestamp. - Return values when extracting fields with possibly fractional values, such as second and epoch, now have the full scale that the value has internally (so, for example, '1.000000' instead of just '1'). Reported-by: Petr Fedorov <petr.fedorov@phystech.edu> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/42b73d2d-da12-ba9f-570a-420e0cce19d9@phystech.edu
*	Fix typo in pgstat.c.	Fujii Masao	2021-04-06
\| \| \| \| \| \| \|	Introduced by 9868167500. Author: Vignesh C Discussion: https://postgr.es/m/CALDaNm1DqgaLBAJrtGznKk1sR1mH-augmp7LfGvxWwTUhah+rg@mail.gmail.com
*	Add function to log the memory contexts of specified backend process.	Fujii Masao	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 3e98c0bafb added pg_backend_memory_contexts view to display the memory contexts of the backend process. However its target process is limited to the backend that is accessing to the view. So this is not so convenient when investigating the local memory bloat of other backend process. To improve this situation, this commit adds pg_log_backend_memory_contexts() function that requests to log the memory contexts of the specified backend process. This information can be also collected by calling MemoryContextStats(TopMemoryContext) via a debugger. But this technique cannot be used in some environments because no debugger is available there. So, pg_log_backend_memory_contexts() allows us to see the memory contexts of specified backend more easily. Only superusers are allowed to request to log the memory contexts because allowing any users to issue this request at an unbounded rate would cause lots of log messages and which can lead to denial of service. On receipt of the request, at the next CHECK_FOR_INTERRUPTS(), the target backend logs its memory contexts at LOG_SERVER_ONLY level, so that these memory contexts will appear in the server log but not be sent to the client. It logs one message per memory context. Because if it buffers all memory contexts into StringInfo to log them as one message, which may require the buffer to be enlarged very much and lead to OOM error since there can be a large number of memory contexts in a backend. When a backend process is consuming huge memory, logging all its memory contexts might overrun available disk space. To prevent this, now this patch limits the number of child contexts to log per parent to 100. As with MemoryContextStats(), it supposes that practical cases where the log gets long will typically be huge numbers of siblings under the same parent context; while the additional debugging value from seeing details about individual siblings beyond 100 will not be large. There was another proposed patch to add the function to return the memory contexts of specified backend as the result sets, instead of logging them, in the discussion. However that patch is not included in this commit because it had several issues to address. Thanks to Tatsuhito Kasahara, Andres Freund, Tom Lane, Tomas Vondra, Michael Paquier, Kyotaro Horiguchi and Zhihong Yu for the discussion. Bump catalog version. Author: Atsushi Torikoshi Reviewed-by: Kyotaro Horiguchi, Zhihong Yu, Fujii Masao Discussion: https://postgr.es/m/0271f440ac77f2a4180e0e56ebd944d1@oss.nttdata.com
*	Allow pgoutput to send logical decoding messages.	Amit Kapila	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The output plugin accepts a new parameter (messages) that controls if logical decoding messages are written into the replication stream. It is useful for those clients that use pgoutput as an output plugin and needs to process messages that were written by pg_logical_emit_message(). Although logical streaming replication protocol supports logical decoding messages now, logical replication does not use this feature yet. Author: David Pirotte, Euler Taveira Reviewed-by: Euler Taveira, Andres Freund, Ashutosh Bapat, Amit Kapila Discussion: https://postgr.es/m/CADK3HHJ-+9SO7KuRLH=9Wa1rAo60Yreq1GFNkH_kd0=CdaWM+A@mail.gmail.com
*	Refactor function parse_output_parameters.	Amit Kapila	2021-04-06
\| \| \| \| \| \| \| \| \| \| \|	Instead of using multiple parameters in parse_ouput_parameters function signature, use the struct PGOutputData that encapsulates all pgoutput options. It will be useful for future work where we need to add other options in pgoutput. Author: Euler Taveira Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/CADK3HHJ-+9SO7KuRLH=9Wa1rAo60Yreq1GFNkH_kd0=CdaWM+A@mail.gmail.com
*	Allocate access strategy in parallel VACUUM workers.	Peter Geoghegan	2021-04-05
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 49f49def took entirely the wrong approach to fixing this issue. Just allocate a local buffer access strategy in each individual worker instead of trying to propagate state. This state was never propagated by parallel VACUUM in the first place. It looks like the only reason that this worked following commit 40d964ec was that it involved static global variables, which are initialized to 0 per the C standard. A more comprehensive fix may be necessary, even on HEAD. This fix should at least get the buildfarm green once again. Thanks once again to Thomas Munro for continued off-list assistance with the issue.
*	Support INCLUDE'd columns in SP-GiST.	Tom Lane	2021-04-05
\| \| \| \| \| \| \| \| \| \| \| \| \|	Not much to say here: does what it says on the tin. We steal a previously-always-zero bit from the nextOffset field of leaf index tuples in order to track whether there is a nulls bitmap. Otherwise it works about like included columns in other index types. Pavel Borisov, reviewed by Andrey Borodin and Anastasia Lubennikova, and rather heavily editorialized on by me Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
*	Propagate parallel VACUUM's buffer access strategy.	Peter Geoghegan	2021-04-05
\| \| \| \| \| \| \| \| \| \| \| \| \|	Parallel VACUUM relied on global variable state from the leader process being propagated to workers on fork(). Commit b4af70cb removed most uses of global variables inside vacuumlazy.c, but did not account for the buffer access strategy state. To fix, propagate the state through shared memory instead. Per buildfarm failures on elver, curculio, and morepork. Many thanks to Thomas Munro for off-list assistance with this issue.
*	Simplify state managed by VACUUM.	Peter Geoghegan	2021-04-05
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reorganize the state struct used by VACUUM -- group related items together to make it easier to understand. Also stop relying on stack variables inside lazy_scan_heap() -- move those into the state struct instead. Doing things this way simplifies large groups of related functions whose function signatures had a lot of unnecessary redundancy. Switch over to using int64 for the struct fields used to count things that are reported to the user via log_autovacuum and VACUUM VERBOSE output. We were using double, but that doesn't seem to have any advantages. Using int64 makes it possible to add assertions that verify that the first pass over the heap (pruning) encounters precisely the same number of LP_DEAD items that get deleted from indexes later on, in the second pass over the heap. These assertions will be added in later commits. Finally, adjust the signatures of functions with IndexBulkDeleteResult pointer arguments in cases where there was ambiguity about whether or not the argument relates to a single index or all indexes. Functions now use the idiom that both ambulkdelete() and amvacuumcleanup() have always used (where appropriate): accept a mutable IndexBulkDeleteResult pointer argument, and return a result IndexBulkDeleteResult pointer to caller. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkeOSYwC6KNckbhk2b1aNnWum6Yyn0NKP9D-Hq1LGTDPw@mail.gmail.com
*	Add pg_read_all_data and pg_write_all_data roles	Stephen Frost	2021-04-05
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A commonly requested use-case is to have a role who can run an unfettered pg_dump without having to explicitly GRANT that user access to all tables, schemas, et al, without that role being a superuser. This address that by adding a "pg_read_all_data" role which implicitly gives any member of this role SELECT rights on all tables, views and sequences, and USAGE rights on all schemas. As there may be cases where it's also useful to have a role who has write access to all objects, pg_write_all_data is also introduced and gives users implicit INSERT, UPDATE and DELETE rights on all tables, views and sequences. These roles can not be logged into directly but instead should be GRANT'd to a role which is able to log in. As noted in the documentation, if RLS is being used then an administrator may (or may not) wish to set BYPASSRLS on the login role which these predefined roles are GRANT'd to. Reviewed-by: Georgios Kokolatos Discussion: https://postgr.es/m/20200828003023.GU29590@tamriel.snowman.net
*	Shut down transaction tracking at startup process exit.	Fujii Masao	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Maxim Orlov reported that the shutdown of standby server could result in the following assertion failure. The cause of this issue was that, when the shutdown caused the startup process to exit, recovery-time transaction tracking was not shut down even if it's already initialized, and some locks the tracked transactions were holding could not be released. At this situation, if other process was invoked and the PGPROC entry that the startup process used was assigned to it, it found such unreleased locks and caused the assertion failure, during the initialization of it. TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))" This commit fixes this issue by making the startup process shut down transaction tracking and release all locks, at the exit of it. Back-patch to all supported branches. Reported-by: Maxim Orlov Author: Fujii Masao Reviewed-by: Maxim Orlov Discussion: https://postgr.es/m/ad4ce692cc1d89a093b471ab1d969b0b@postgrespro.ru
*	Fix typo in collationcmds.c	Michael Paquier	2021-04-05
\| \| \| \| \| \| \|	Introduced by 51e225d. Author: Anton Voloshin Discussion: https://postgr.es/m/05477da0-703c-7de7-998c-5879738e8f39@postgrespro.ru
*	Fix more confusion in SP-GiST.	Tom Lane	2021-04-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	spg_box_quad_leaf_consistent unconditionally returned the leaf datum as leafValue, even though in its usage for poly_ops that value is of completely the wrong type. In versions before 12, that was harmless because the core code did nothing with leafValue in non-index-only scans ... but since commit 2a6368343, if we were doing a KNN-style scan, spgNewHeapItem would unconditionally try to copy the value using the wrong datatype parameters. Said copying is a waste of time and space if we're not going to return the data, but it accidentally failed to fail until I fixed the datatype confusion in ac9099fc1. Hence, change spgNewHeapItem to not copy the datum unless we're actually going to return it later. This saves cycles and dodges the question of whether lossy opclasses are returning the right type. Also change spg_box_quad_leaf_consistent to not return data that might be of the wrong type, as insurance against somebody introducing a similar bug into the core code in future. It seems like a good idea to back-patch these two changes into v12 and v13, although I'm afraid to change spgNewHeapItem's mistaken idea of which datatype to use in those branches. Per buildfarm results from ac9099fc1. Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
*	Fix confusion in SP-GiST between attribute type and leaf storage type.	Tom Lane	2021-04-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	According to the documentation, the attType passed to the opclass config function (and also relied on by the core code) is the type of the heap column or expression being indexed. But what was actually being passed was the type stored for the index column. This made no difference for user-defined SP-GiST opclasses, because we weren't allowing the STORAGE clause of CREATE OPCLASS to be used, so the two types would be the same. But it's silly not to allow that, seeing that the built-in poly_ops opclass has a different value for opckeytype than opcintype, and that if you want to do lossy storage then the types must really be different. (Thus, user-defined opclasses doing lossy storage had to lie about what type is in the index.) Hence, remove the restriction, and make sure that we use the input column type not opckeytype where relevant. For reasons of backwards compatibility with existing user-defined opclasses, we can't quite insist that the specified leafType match the STORAGE clause; instead just add an amvalidate() warning if they don't match. Also fix some bugs that would only manifest when trying to return index entries when attType is different from attLeafType. It's not too surprising that these have not been reported, because the only usual reason for such a difference is to store the leaf value lossily, rendering index-only scans impossible. Add a src/test/modules module to exercise cases where attType is different from attLeafType and yet index-only scan is supported. Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
*	Fix bug in brin_minmax_multi_union	Tomas Vondra	2021-04-04
\| \| \| \| \| \| \| \| \| \| \|	When calling sort_expanded_ranges() we need to remember the return value, because the function sorts and also deduplicates the ranges. So the number of ranges may decrease. brin_minmax_multi_union failed to do that, which resulted in crashes due to bogus ranges (equal minval/maxval but not marked as compacted). Reported-by: Jaime Casanova Discussion: https://postgr.es/m/20210404052550.GA4376%40ahch-to
*	Fix order of parameters in BRIN minmax-multi calls	Tomas Vondra	2021-04-04
\| \| \| \| \| \| \| \| \| \| \| \|	The BRIN minmax-multi consistent function incorrectly assumed it can lookup an operator, and then swap the arguments to get the commutator. For example <(a,b) would be called as <(b,a) to get >(a,b). This works when the arguments are of the same type, but with cross-type opclasses this fails. We can't swap <(float4,float8) arguments, for example. Fixed by passing arguments in the right order. Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com
*	Fix BRIN minmax-multi distance for inet type	Tomas Vondra	2021-04-04
\| \| \| \| \| \| \| \| \| \|	The distance calculation ignored the mask, unlike the inet comparator, which resulted in negative distance in some cases. Fixed by applying the mask in brin_minmax_multi_distance_inet. I've considered simply calling inetmi() to calculate the delta, but that does not consider mask either. Reviewed-by: Zhihong Yu Discussion: https://postgr.es/m/1a0a7b9d-9bda-e3a2-7fa4-88f15042a051%40enterprisedb.com
*	Fix BRIN minmax-multi distance for timetz type	Tomas Vondra	2021-04-04
\| \| \| \| \| \| \| \| \|	The distance calculation ignored the time zone, so the result of (b-a) might have ended negative even if (b > a). Fixed by considering the time zone difference. Reported-by: Jaime Casanova Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com
*	Fix BRIN minmax-multi distance for interval type	Tomas Vondra	2021-04-04
\| \| \| \| \| \| \| \| \| \| \| \|	The distance calculation for interval type was treating months as having 31 days, which is inconsistent with the interval comparator (using 30 days). Due to this it was possible to get negative distance (b-a) when (a<b), trigerring an assert. Fixed by adopting the same logic as interval_cmp_value. Reported-by: Jaime Casanova Discussion: https://postgr.es/m/CAJKUy5jKH0Xhneau2mNftNPtTy-BVgQfXc8zQkEvRvBHfeUThQ%40mail.gmail.com
*	Improve efficiency of wait event reporting, remove proc.h dependency.	Andres Freund	2021-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pgstat_report_wait_start() and pgstat_report_wait_end() required two conditional branches so far. One to check if MyProc is NULL, the other to check if pgstat_track_activities is set. As wait events are used around comparatively lightweight operations, and are inlined (reducing branch predictor effectiveness), that's not great. The dependency on MyProc has a second disadvantage: Low-level subsystems, like storage/file/fd.c, report wait events, but architecturally it is preferable for them to not depend on inter-process subsystems like proc.h (defining PGPROC). After this change including pgstat.h (nor obviously its sub-components like backend_status.h, wait_event.h, ...) does not pull in IPC related headers anymore. These goals, efficiency and abstraction, are achieved by having pgstat_report_wait_start/end() not interact with MyProc, but instead a new my_wait_event_info variable. At backend startup it points to a local variable, removing the need to check for MyProc being NULL. During process initialization my_wait_event_info is redirected to MyProc->wait_event_info. At shutdown this is reversed. Because wait event reporting now does not need to know about where the wait event is stored, it does not need to know about PGPROC anymore. The removal of the branch for checking pgstat_track_activities is simpler: Don't check anymore. The cost due to the branch are often higher than the store - and even if not, pgstat_track_activities is rarely disabled. The main motivator to commit this work now is that removing the (indirect) pgproc.h include from pgstat.h simplifies a patch to move statistics reporting to shared memory (which still has a chance to get into 14). Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20210402194458.2vu324hkk2djq6ce@alap3.anarazel.de
*	Split backend status and progress related functionality out of pgstat.c.	Andres Freund	2021-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Backend status (supporting pg_stat_activity) and command progress (supporting pg_stat_progress*) related code is largely independent from the rest of pgstat.[ch] (supporting views like pg_stat_all_tables that accumulate data over time). See also a333476b925. This commit doesn't rename the function names to make the distinction from the rest of pgstat_ clearer - that'd be more invasive and not clearly beneficial. If we were to decide to do such a rename at some point, it's better done separately from moving the code as well. Robert's review was of an earlier version. Reviewed-By: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/20210316195440.twxmlov24rr2nxrg@alap3.anarazel.de
*	Refactor HMAC implementations	Michael Paquier	2021-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Similarly to the cryptohash implementations, this refactors the existing HMAC code into a single set of APIs that can be plugged with any crypto libraries PostgreSQL is built with (only OpenSSL currently). If there is no such libraries, a fallback implementation is available. Those new APIs are designed similarly to the existing cryptohash layer, so there is no real new design here, with the same logic around buffer bound checks and memory handling. HMAC has a dependency on cryptohashes, so all the cryptohash types supported by cryptohash{_openssl}.c can be used with HMAC. This refactoring is an advantage mainly for SCRAM, that included its own implementation of HMAC with SHA256 without relying on the existing crypto libraries even if PostgreSQL was built with their support. This code has been tested on Windows and Linux, with and without OpenSSL, across all the versions supported on HEAD from 1.1.1 down to 1.0.1. I have also checked that the implementations are working fine using some sample results, a custom extension of my own, and doing cross-checks across different major versions with SCRAM with the client and the backend. Author: Michael Paquier Reviewed-by: Bruce Momjian Discussion: https://postgr.es/m/X9m0nkEJEzIPXjeZ@paquier.xyz
*	Do not rely on pgstat.h to indirectly include storage/ headers.	Andres Freund	2021-04-02
\| \| \| \| \| \| \| \|	An upcoming patch might remove the (now indirect) proc.h include (which in turn includes other headers), and it's cleaner for the modified files to include their dependencies directly anyway... Discussion: https://postgr.es/m/20210402194458.2vu324hkk2djq6ce@alap3.anarazel.de
*	Split wait event related code from pgstat.[ch] into wait_event.[ch].	Andres Freund	2021-04-02
\| \| \| \| \| \| \| \| \| \| \| \|	The wait event related code is independent from the rest of the pgstat.[ch] code, of nontrivial size and changes on a regular basis. Put it into its own set of files. As there doesn't seem to be a good pre-existing directory for code like this, add src/backend/utils/activity. Reviewed-By: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/20210316195440.twxmlov24rr2nxrg@alap3.anarazel.de
*	Remove useless Asserts in Result Cache code	David Rowley	2021-04-03
\| \| \| \| \| \| \| \| \| \| \| \|	Testing if an unsigned variable is >= 0 is pretty pointless. There's likely enough code in remove_cache_entry() to verify the cache memory accounting is correct in assert enabled builds. These Asserts were not adding much extra cover, even if they had been checking >= 0 on a signed variable. Reported-by: Andres Freund Discussion: https://postgr.es/m/20210402204734.6mo3nfacnljlicgn@alap3.anarazel.de
*	Detect POLLHUP/POLLRDHUP while running queries.	Thomas Munro	2021-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Provide a new GUC check_client_connection_interval that can be used to check whether the client connection has gone away, while running very long queries. It is disabled by default. For now this uses a non-standard Linux extension (also adopted by at least one other OS). POLLRDHUP is not defined by POSIX, and other OSes don't have a reliable way to know if a connection was closed without actually trying to read or write. In future we might consider trying to send a no-op/heartbeat message instead, but that could require protocol changes. Author: Sergey Cherkashin <s.cherkashin@postgrespro.ru> Author: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Tatsuo Ishii <ishii@sraoss.co.jp> Reviewed-by: Konstantin Knizhnik <k.knizhnik@postgrespro.ru> Reviewed-by: Zhihong Yu <zyu@yugabyte.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Maksim Milyutin <milyutinma@gmail.com> Reviewed-by: Tsunakawa, Takayuki/綱川貴之 <tsunakawa.takay@fujitsu.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (much earlier version) Discussion: https://postgr.es/m/77def86b27e41f0efcba411460e929ae%40postgrespro.ru
*	Strip file names reported in error messages on Windows, too.	Tom Lane	2021-04-02
\| \| \| \| \| \| \| \| \| \| \| \|	Commit dd136052b established a policy that error message FILE items should include only the base name of the reporting source file, for uniformity and succinctness. We now observe that some Windows compilers use backslashes in __FILE__ strings, so truncate at backslashes as well. This is expected to fix some platform variation in the results of the new libpq_pipeline test module. Discussion: https://postgr.es/m/3650140.1617372290@sss.pgh.pa.us
*	Add support for NullIfExpr in eval_const_expressions	Peter Eisentraut	2021-04-02
\| \| \| \| \|	Author: Hou Zhijie <houzj.fnst@cn.fujitsu.com> Discussion: https://www.postgresql.org/message-id/flat/7ea5ce773bbc4eea9ff1a381acd3b102@G08CNEXMBPEKD05.g08.fujitsu.local
*	Fix pgstat_report_replslot() to use proper data types for its arguments.	Fujii Masao	2021-04-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The caller of pgstat_report_replslot() passes int64 values to the function. Also the function stores those values in PgStat_Counter (i.e., int64) fields of PgStat_MsgReplSlot struct. But previously the function used "int" as the data types of some arguments for those values, which could lead to the overflow of values. To avoid this risk, this commit fixes pgstat_report_replslot() to use PgStat_Counter type for the arguments. Since they are the statistics counters, PgStat_Counter, the data type used for counters, is used for them instead of int64. Reported-by: Vignesh C Author: Vignesh C Reviewed-by: Jeevan Ladhe, Fujii Masao Discussion: https://postgr.es/m/CALDaNm080OpG=ZwOb0i8EyChH5SyHAMFWJCKaKTXmrfvJLbgaA@mail.gmail.com
*	Add Result Cache executor node (take 2)	David Rowley	2021-04-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Here we add a new executor node type named "Result Cache". The planner can include this node type in the plan to have the executor cache the results from the inner side of parameterized nested loop joins. This allows caching of tuples for sets of parameters so that in the event that the node sees the same parameter values again, it can just return the cached tuples instead of rescanning the inner side of the join all over again. Internally, result cache uses a hash table in order to quickly find tuples that have been previously cached. For certain data sets, this can significantly improve the performance of joins. The best cases for using this new node type are for join problems where a large portion of the tuples from the inner side of the join have no join partner on the outer side of the join. In such cases, hash join would have to hash values that are never looked up, thus bloating the hash table and possibly causing it to multi-batch. Merge joins would have to skip over all of the unmatched rows. If we use a nested loop join with a result cache, then we only cache tuples that have at least one join partner on the outer side of the join. The benefits of using a parameterized nested loop with a result cache increase when there are fewer distinct values being looked up and the number of lookups of each value is large. Also, hash probes to lookup the cache can be much faster than the hash probe in a hash join as it's common that the result cache's hash table is much smaller than the hash join's due to result cache only caching useful tuples rather than all tuples from the inner side of the join. This variation in hash probe performance is more significant when the hash join's hash table no longer fits into the CPU's L3 cache, but the result cache's hash table does. The apparent "random" access of hash buckets with each hash probe can cause a poor L3 cache hit ratio for large hash tables. Smaller hash tables generally perform better. The hash table used for the cache limits itself to not exceeding work_mem * hash_mem_multiplier in size. We maintain a dlist of keys for this cache and when we're adding new tuples and realize we've exceeded the memory budget, we evict cache entries starting with the least recently used ones until we have enough memory to add the new tuples to the cache. For parameterized nested loop joins, we now consider using one of these result cache nodes in between the nested loop node and its inner node. We determine when this might be useful based on cost, which is primarily driven off of what the expected cache hit ratio will be. Estimating the cache hit ratio relies on having good distinct estimates on the nested loop's parameters. For now, the planner will only consider using a result cache for parameterized nested loop joins. This works for both normal joins and also for LATERAL type joins to subqueries. It is possible to use this new node for other uses in the future. For example, to cache results from correlated subqueries. However, that's not done here due to some difficulties obtaining a distinct estimation on the outer plan to calculate the estimated cache hit ratio. Currently we plan the inner plan before planning the outer plan so there is no good way to know if a result cache would be useful or not since we can't estimate the number of times the subplan will be called until the outer plan is generated. The functionality being added here is newly introducing a dependency on the return value of estimate_num_groups() during the join search. Previously, during the join search, we only ever needed to perform selectivity estimations. With this commit, we need to use estimate_num_groups() in order to estimate what the hit ratio on the result cache will be. In simple terms, if we expect 10 distinct values and we expect 1000 outer rows, then we'll estimate the hit ratio to be 99%. Since cache hits are very cheap compared to scanning the underlying nodes on the inner side of the nested loop join, then this will significantly reduce the planner's cost for the join. However, it's fairly easy to see here that things will go bad when estimate_num_groups() incorrectly returns a value that's significantly lower than the actual number of distinct values. If this happens then that may cause us to make use of a nested loop join with a result cache instead of some other join type, such as a merge or hash join. Our distinct estimations have been known to be a source of trouble in the past, so the extra reliance on them here could cause the planner to choose slower plans than it did previous to having this feature. Distinct estimations are also fairly hard to estimate accurately when several tables have been joined already or when a WHERE clause filters out a set of values that are correlated to the expressions we're estimating the number of distinct value for. For now, the costing we perform during query planning for result caches does put quite a bit of faith in the distinct estimations being accurate. When these are accurate then we should generally see faster execution times for plans containing a result cache. However, in the real world, we may find that we need to either change the costings to put less trust in the distinct estimations being accurate or perhaps even disable this feature by default. There's always an element of risk when we teach the query planner to do new tricks that it decides to use that new trick at the wrong time and causes a regression. Users may opt to get the old behavior by turning the feature off using the enable_resultcache GUC. Currently, this is enabled by default. It remains to be seen if we'll maintain that setting for the release. Additionally, the name "Result Cache" is the best name I could think of for this new node at the time I started writing the patch. Nobody seems to strongly dislike the name. A few people did suggest other names but no other name seemed to dominate in the brief discussion that there was about names. Let's allow the beta period to see if the current name pleases enough people. If there's some consensus on a better name, then we can change it before the release. Please see the 2nd discussion link below for the discussion on the "Result Cache" name. Author: David Rowley Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie Tested-By: Konstantin Knizhnik Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
*	Rethink handling of pass-by-value leaf datums in SP-GiST.	Tom Lane	2021-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The existing convention in SP-GiST is that any pass-by-value datatype is stored in Datum representation, i.e. it's of width sizeof(Datum) even when typlen is less than that. This is okay, or at least it's too late to change it, for prefix datums and node-label datums in inner (upper) tuples. But it's problematic for leaf datums, because we'd prefer those to be stored in Postgres' standard on-disk representation so that we can easily extend leaf tuples to carry additional "included" columns. I believe, however, that we can get away with just up and changing that. This would be an unacceptable on-disk-format break, but there are two big mitigating factors: 1. It seems quite unlikely that there are any SP-GiST opclasses out there that use pass-by-value leaf datatypes. Certainly none of the ones in core do, nor has codesearch.debian.net heard of any. Given what SP-GiST is good for, it's hard to conceive of a use-case where the leaf-level values would be both small and fixed-width. (As an example, if you wanted to index text values with the leaf level being just a byte, then every text string would have to be represented with one level of inner tuple per preceding byte, which would be horrendously space-inefficient and slow to access. You always want to use as few inner-tuple levels as possible, leaving as much as possible in the leaf values.) 2. Even granting that you have such an index, this change only breaks things on big-endian machines. On little-endian, the high order bytes of the Datum format will now just appear to be alignment padding space. So, change the code to store pass-by-value leaf datums in their usual on-disk form. Inner-tuple datums are not touched. This is extracted from a larger patch that intends to add support for "included" columns. I'm committing it separately for visibility in our commit logs. Pavel Borisov and Tom Lane, reviewed by Andrey Borodin Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
*	Rename Default Roles to Predefined Roles	Stephen Frost	2021-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \|	The term 'default roles' wasn't quite apt as these roles aren't able to be modified or removed after installation, so rename them to be 'Predefined Roles' instead, adding an entry into the newly added Obsolete Appendix to help users of current releases find the new documentation. Bruce Momjian and Stephen Frost Discussion: https://postgr.es/m/157742545062.1149.11052653770497832538%40wrigleys.postgresql.org and https://www.postgresql.org/message-id/20201120211304.GG16415@tamriel.snowman.net
*	Fix internal extract(timezone_minute) formulas	Peter Eisentraut	2021-04-01
\| \| \| \| \| \| \| \| \| \|	Through various refactorings over time, the extract(timezone_minute from time with time zone) and extract(timezone_minute from timestamp with time zone) implementations ended up with two different but equally nonsensical formulas by using SECS_PER_MINUTE and MINS_PER_HOUR interchangeably. Since those two are of course both the same number, the formulas do work, but for readability, fix them to be semantically correct.
*	Do COPY FROM encoding conversion/verification in larger chunks.	Heikki Linnakangas	2021-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This gives a small performance gain, by reducing the number of calls to the conversion/verification function, and letting it work with larger inputs. Also, reorganizing the input pipeline makes it easier to parallelize the input parsing: after the input has been converted to the database encoding, the next stage of finding the newlines can be done in parallel, because there cannot be any newline chars "embedded" in multi-byte characters in the encodings that we support as server encodings. This changes behavior in one corner case: if client and server encodings are the same single-byte encoding (e.g. latin1), previously the input would not be checked for zero bytes ('\0'). Any fields containing zero bytes would be truncated at the zero. But if encoding conversion was needed, the conversion routine would throw an error on the zero. After this commit, the input is always checked for zeros. Reviewed-by: John Naylor Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01%40iki.fi