path: root/src
...
* Remove extra whitespace in pg_upgrade status message.  (Nathan Bossart, 2024-09-25)

  There's no need to add another level of indentation to this status
  message.  pg_log() will put it in the right place.  Oversight in
  commit 347758b120.

  Reviewed-by: Daniel Gustafsson
  Discussion: https://postgr.es/m/ZunW7XHLd2uTts4f%40nathan
  Backpatch-through: 17

* Turn 'if' condition around to avoid Svace complaint  (Alvaro Herrera, 2024-09-25)

  The unwritten assumption of this code is that events->head and
  events->tail are either both NULL (an empty list) or both set.  The
  code was testing events->head for nullness and using that as a cue to
  dereference events->tail, which annoys the Svace static code analyzer.
  We can silence it by testing the events->tail member instead, and
  adding an assertion about events->head to ensure it's all consistent.

  This code is very old and, as far as we know, there's never been a bug
  report related to it, so there's no need to backpatch.

  This was found by the ALT Linux Team using Svace.

  Author: Alexander Kuznetsov <kuznetsovam@altlinux.org>
  Discussion: https://postgr.es/m/6d0323c3-3f5d-4137-af73-98a5ab90e77c@altlinux.org

* vacuumdb: Skip temporary tables in query to build list of relations  (Michael Paquier, 2024-09-25)

  Running vacuumdb as a non-superuser while another user has created a
  temporary table would lead to a mid-flight permission failure,
  interrupting the operation.  vacuum_rel() skips temporary relations of
  other backends, and it makes no sense for vacuumdb to know about these
  relations, so let's switch it to ignore temporary relations entirely.

  Adding a qual based on relpersistence to the query simplifies the
  generation of its WHERE clause in vacuum_one_database(), hence the
  removal of "has_where".

  Author: VaibhaveS, Michael Paquier
  Reviewed-by: Fujii Masao
  Discussion: https://postgr.es/m/CAM_eQjwfAR=y3G1fGyS1U9FTmc+FyJm9amNfY2QCZBnDDbNPZg@mail.gmail.com
  Backpatch-through: 12

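  Illustrative sketch only; vacuumdb's real catalog query is more
  elaborate, but the new qual boils down to filtering on relpersistence:

      -- 't' marks temporary relations in pg_class.relpersistence.
      SELECT c.oid::regclass
      FROM pg_catalog.pg_class c
      WHERE c.relkind IN ('r', 'm')      -- tables and materialized views
        AND c.relpersistence != 't';     -- skip all temporary relations
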
* For inplace update durability, make heap_update() callers wait.  (Noah Misch, 2024-09-24)

  The previous commit fixed some ways of losing an inplace update.  It
  remained possible to lose one when a backend working toward a
  heap_update() copied a tuple into memory just before inplace update of
  that tuple.  In catalogs eligible for inplace update, use
  LOCKTAG_TUPLE to govern admission to the steps of copying an old
  tuple, modifying it, and issuing heap_update().  This includes MERGE
  commands.  To avoid changing most of the pg_class DDL, don't require
  LOCKTAG_TUPLE when holding a relation lock sufficient to exclude
  inplace updaters.

  Back-patch to v12 (all supported versions).  In v13 and v12, "UPDATE
  pg_class" or "UPDATE pg_database" can still lose an inplace update.
  The v14+ UPDATE fix needs commit
  86dc90056dfdbd9d1b891718d2e5614e3e432f35, and it wasn't worth
  reimplementing that fix without such infrastructure.

  Reviewed by Nitin Motiani and (in earlier versions) Heikki Linnakangas.

  Discussion: https://postgr.es/m/20231027214946.79.nmisch@google.com

* Fix data loss at inplace update after heap_update().  (Noah Misch, 2024-09-24)

  As previously-added tests demonstrated, heap_inplace_update() could
  instead update an unrelated tuple of the same catalog.  It could lose
  the update.  Losing relhasindex=t was a source of index corruption.
  Inplace-updating commands like VACUUM will now wait for heap_update()
  commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
  long-running GRANT already hurts VACUUM progress more just by keeping
  an XID running.  The VACUUM will behave like a DELETE or UPDATE
  waiting for the uncommitted change.

  For implementation details, start at the
  systable_inplace_update_begin() header comment and README.tuplock.

  Back-patch to v12 (all supported versions).  In back branches, retain
  a deprecated heap_inplace_update(), for extensions.

  Reported by Smolkin Grigory.  Reviewed by Nitin Motiani, (in earlier
  versions) Heikki Linnakangas, and (in earlier versions) Alexander
  Lakhin.

  Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bG2mgqecErqR5gGGOA@mail.gmail.com

* Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.  (Noah Misch, 2024-09-24)

  The current use always releases this locktag.  A planned use will
  continue that intent.  It will involve more areas of code, making
  unlock omissions easier.  Warn under debug_assertions, like we do for
  various resource leaks.

  Back-patch to v12 (all supported versions), the plan for the commit of
  the new use.

  Reviewed by Heikki Linnakangas.

  Discussion: https://postgr.es/m/20240512232923.aa.nmisch@google.com

* Allow length=-1 for NUL-terminated input to pg_strncoll(), etc.  (Jeff Davis, 2024-09-24)

  Like ICU, allow a length of -1 to be specified for NUL-terminated
  arguments to pg_strncoll(), pg_strnxfrm(), and pg_strnxfrm_prefix().

  Simplifies the code and comments.

  Discussion: https://postgr.es/m/2d758e07dff26bcc7cbe2aec57431329bfe3679a.camel@j-davis.com

* Fix psql describe commands' handling of ACL columns for old servers.  (Tom Lane, 2024-09-24)

  Commit d1379ebf4 carelessly broke printACLColumn for pre-9.4 servers,
  by using the cardinality() function, which we introduced in 9.4.  We
  expect psql's describe-related commands to work back to 9.2, so this
  is bad.  Use the longstanding array_length() function instead.

  Per report from Christoph Berg.  Back-patch to v17.

  Discussion: https://postgr.es/m/ZvLXYglRS6hMMhtr@msg.df7cb.de

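  For illustration, the two calls count entries of a one-dimensional
  ACL array; only their availability and empty-array behavior differ:

      -- cardinality() exists only on 9.4+; array_length() works on the
      -- older servers psql still supports.  (They differ for empty
      -- arrays: array_length() returns NULL where cardinality() is 0.)
      SELECT array_length(relacl, 1) FROM pg_class WHERE relname = 'pg_class';
      SELECT cardinality(relacl)     FROM pg_class WHERE relname = 'pg_class';
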
* Tighten up make_libc_collator() and make_icu_collator().  (Jeff Davis, 2024-09-24)

  Ensure that error paths within these functions do not leak a collator,
  and return the result rather than using an out parameter.  (Error
  paths in the caller may still result in a leaked collator, which will
  be addressed separately.)

  In make_libc_collator(), if the first newlocale() succeeds and the
  second one fails, close the first locale_t object.

  The function make_icu_collator() doesn't have any external callers, so
  change it to be static.

  Discussion: https://postgr.es/m/54d20e812bd6c3e44c10eddcd757ec494ebf1803.camel@j-davis.com

* Add further excludes to headerscheck  (Peter Eisentraut, 2024-09-24)

  Some header files under contrib/isn/ are not meant to be included
  independently, and they fail -Wmissing-variable-declarations when
  doing so.

  Reported-by: Thomas Munro <thomas.munro@gmail.com>
  Discussion: https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BYVt5MBD-w0HyHpsGb4U8RNge3DvAbDmOFy_epGhZ2Mg%40mail.gmail.com#aba3226c6dd493923bd6ce95d25a2d77

* Neaten up our choices of SQLSTATEs for XML-related errors.  (Tom Lane, 2024-09-24)

  When our XML-handling modules were first written, the SQL standard
  lacked any error codes that were particularly intended for XML error
  conditions.  Unsurprisingly, this led to some rather random choices of
  errcodes in those modules.  Now the standard has a whole SQLSTATE
  class, "Class 10 - XQuery Error", with a reasonably large selection of
  relevant-looking errcodes.

  In this patch I've chosen one fairly generic code defined by the
  standard, 10608 = invalid_argument_for_xquery, and used it where it
  seemed appropriate.  I've also made an effort to replace
  ERRCODE_INTERNAL_ERROR everywhere it was not clearly reporting a
  coding problem; in particular, many of the existing uses look like
  they can fairly be reported as ERRCODE_OUT_OF_MEMORY.

  It might be interesting to try to map libxml2's error codes into the
  standard's new collection, but I've not undertaken that here.

  Discussion: https://postgr.es/m/417250.1726341268@sss.pgh.pa.us

* Update obsolete nbtree array preprocessing comments.  (Peter Geoghegan, 2024-09-24)

  The array->scan_key references fixed up at the end of preprocessing
  start out as offsets into the arrayKeyData[] array (the array returned
  by _bt_preprocess_array_keys at the start of preprocessing that
  involves array scan keys).  Offsets into the arrayKeyData[] array are
  no longer guaranteed to be valid offsets into our original
  scan->keyData[] input scan key array, but comments describing the
  array->scan_key references still talked about scan->keyData[].  Update
  those comments.

  Oversight in commit b5249741.

* Add ONLY support for VACUUM and ANALYZE  (David Rowley, 2024-09-24)

  Since autovacuum does not trigger an ANALYZE for partitioned tables,
  users must perform these manually.  However, performing a manual
  ANALYZE on a partitioned table would always result in recursively
  analyzing each partition, and that could be undesirable as autovacuum
  takes care of that.  For partitioned tables that contain a large
  number of partitions, having to analyze each partition could take an
  unreasonably long time, especially so for tables with a large number
  of columns.

  Here we allow the ONLY keyword to prefix the name of the table to
  allow users to have ANALYZE skip processing partitions.  This option
  can also be used with VACUUM, but there is no work to do if VACUUM
  ONLY is used on a partitioned table.

  This commit also changes the behavior of VACUUM and ANALYZE for
  inheritance parents.  Previously inheritance child tables would not be
  processed when operating on the parent.  Now, by default we *do*
  operate on the child tables.  ONLY can be used to obtain the old
  behavior.  The release notes should note this as an incompatibility.
  The default behavior has not changed for partitioned tables as these
  always recursively processed the partitions.

  Author: Michael Harris <harmic@gmail.com>
  Discussion: https://postgr.es/m/CADofcAWATx_haD=QkSxHbnTsAe6+e0Aw8Eh4H8cXyogGvn_kOg@mail.gmail.com
  Discussion: https://postgr.es/m/CADofcAXVbD0yGp_EaC9chmzsOoSai3jcfBCnyva3j0RRdRvMVA@mail.gmail.com
  Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
  Reviewed-by: Melih Mutlu <m.melihmutlu@gmail.com>
  Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
  Reviewed-by: jian he <jian.universality@gmail.com>
  Reviewed-by: David Rowley <dgrowleyml@gmail.com>

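  Usage sketch, with a hypothetical partitioned table name:

      ANALYZE ONLY measurements;            -- parent's stats only; skip partitions
      ANALYZE measurements;                 -- recurses into every partition
      VACUUM (ANALYZE) ONLY measurements;   -- ONLY is accepted by VACUUM as well
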
* Remove ATT_TABLE for ALTER TABLE ... ATTACH/DETACH  (Michael Paquier, 2024-09-24)

  Attempting these commands on a non-partitioned table would result in a
  failure when creating the relation in transformPartitionCmd().  This
  change makes it possible to throw an error earlier, with a much better
  error message, thanks to d69a3f4d70b7.

  The extra test cases are from me.  Note that FINALIZE uses a different
  subcommand and had no coverage for its failure path with
  non-partitioned tables.

  Author: Álvaro Herrera, Michael Paquier
  Reviewed-by: Nathan Bossart
  Discussion: https://postgr.es/m/202409190803.tnis52adt2n5@alvherre.pgsql

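  A hypothetical illustration of the earlier failure; the table and
  partition names are invented, and the exact message wording may
  differ:

      CREATE TABLE plain_table (a int);
      ALTER TABLE plain_table ATTACH PARTITION some_part FOR VALUES IN (1);
      -- ERROR:  ALTER action ATTACH PARTITION cannot be performed on
      --         relation "plain_table"
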
* jsonapi: fix memory leakage during OOM error recovery.  (Tom Lane, 2024-09-23)

  Coverity pointed out that inc_lex_level() would leak memory (not to
  mention corrupt the pstack data structure) if some but not all of its
  three REALLOC's failed.  To fix, store successfully-updated pointers
  back into the pstack struct immediately.

  Oversight in 0785d1b8b, so no need for back-patch.

* Fix asserts in fast-path locking code  (Tomas Vondra, 2024-09-23)

  Commit c4d5cb71d229 introduced a couple of asserts in the fast-path
  locking code, upsetting Coverity.

  The assert in InitProcGlobal() is clearly wrong, as it assigns instead
  of checking the value.  This is harmless, but doesn't check anything.

  The asserts in FAST_PATH_ macros are written as if for signed values,
  but the macros are only called for unsigned ones.  That makes the
  check for (val >= 0) useless.  Checks written as ((uint32) x < max)
  work for both signed and unsigned values.  Negative values should wrap
  to values greater than INT32_MAX.

  Per Coverity, report by Tom Lane.

  Reported-by: Tom Lane
  Discussion: https://postgr.es/m/2891628.1727019959@sss.pgh.pa.us

* Add memory/disk usage for more executor nodes.  (Tatsuo Ishii, 2024-09-23)

  This commit is similar to 95d6e9af07, expanding the idea to CTE scan,
  table function scan and recursive union scan nodes so that the maximum
  tuplestore memory or disk usage is shown with the EXPLAIN ANALYZE
  command.

  Also adjust show_storage_info() so that it accepts storage type and
  storage size arguments instead of a Tuplestorestate.  This allows the
  node types to share the formatting code using show_storage_info().
  Due to this, show_material_info() and show_windowagg_info() are also
  modified.

  Reviewed-by: David Rowley
  Discussion: https://postgr.es/m/20240918.211246.1127161704188186085.ishii%40postgresql.org

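  A usage sketch for the CTE scan case; the storage line's exact
  wording may differ:

      EXPLAIN (ANALYZE, COSTS OFF)
      WITH big AS MATERIALIZED (SELECT g FROM generate_series(1, 100000) g)
      SELECT count(*) FROM big;
      -- The CTE Scan node now reports something like:
      --   Storage: Memory  Maximum Storage: NNNNkB
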
* Remove pg_authid's TOAST table.  (Nathan Bossart, 2024-09-21)

  pg_authid's only varlena column is rolpassword, which unfortunately
  cannot be de-TOASTed during authentication because we haven't selected
  a database yet and cannot read pg_class.  By removing pg_authid's
  TOAST table, attempts to set password hashes that require out-of-line
  storage will fail with a "row is too big" error instead.  We may want
  to provide a more user-friendly error in the future, but for now let's
  just remove the useless TOAST table.

  Bumps catversion.

  Reported-by: Alexander Lakhin
  Reviewed-by: Tom Lane, Michael Paquier
  Discussion: https://postgr.es/m/89e8649c-eb74-db25-7945-6d6b23992394%40gmail.com

* Increase the number of fast-path lock slots  (Tomas Vondra, 2024-09-21)

  Replace the fixed-size array of fast-path locks with arrays, sized on
  startup based on max_locks_per_transaction.  This allows using
  fast-path locking for workloads that need more locks.

  The fast-path locking introduced in 9.2 allowed each backend to
  acquire a small number (16) of weak relation locks cheaply.  If a
  backend needs to hold more locks, it has to insert them into the
  shared lock table.  This is considerably more expensive, and may be
  subject to contention (especially on many-core systems).

  The limit of 16 fast-path locks was always rather low, because we have
  to lock all relations - not just tables, but also indexes, views, etc.
  For planning we need to lock all relations that might be used in the
  plan, not just those that actually get used in the final plan.  So
  even with rather simple queries and schemas, we often need
  significantly more than 16 locks.

  As partitioning gets used more widely, and the number of partitions
  increases, this limit is trivial to hit.  Complex queries may easily
  use hundreds or even thousands of locks.  For workloads doing a lot of
  I/O this is not noticeable, but for workloads accessing only data in
  RAM, the access to the shared lock table may be a serious issue.

  This commit removes the hard-coded limit on the number of fast-path
  locks.  Instead, the size of the fast-path arrays is calculated at
  startup, and can be set much higher than the original 16-lock limit.
  The overall fast-path locking protocol remains unchanged.

  The variable-sized fast-path arrays can no longer be part of PGPROC,
  but are allocated as a separate chunk of shared memory and then
  referenced from the PGPROC entries.

  The fast-path slots are organized as a 16-way set associative cache.
  You can imagine it as a hash table of 16-slot "groups".  Each relation
  is mapped to exactly one group using hash(relid), and the group is
  then processed using linear search, just like the original fast-path
  cache.  With only 16 entries this is cheap, with good locality.

  Treating this as a simple hash table with open addressing would not be
  efficient, especially once the hash table gets almost full.  The usual
  remedy is to grow the table, but we can't do that here easily.  The
  access would also be more random, with worse locality.

  The fast-path arrays are sized using the max_locks_per_transaction
  GUC.  We try to have enough capacity for the number of locks specified
  in the GUC, using the traditional 2^n formula, with an upper limit of
  1024 lock groups (i.e. 16k locks).  The default value of
  max_locks_per_transaction is 64, which means those instances will have
  64 fast-path slots.

  The main purpose of the max_locks_per_transaction GUC is to size the
  shared lock table.  It is often set to the "average" number of locks
  needed by backends, with some backends using significantly more locks.
  This should not be a major issue, however.  Some backends may have to
  insert locks into the shared lock table, but there can't be too many
  of them, limiting the contention.

  To get more fast-path slots, the only solution is to increase the GUC,
  even if the shared lock table already has sufficient capacity.  That
  is not free, especially in terms of memory usage (the shared lock
  table entries are fairly large).  It should only happen on machines
  with plenty of memory, though.

  In the future we may consider a separate GUC for the number of
  fast-path slots, but let's try without one first.

  Reviewed-by: Robert Haas, Jakub Wartak
  Discussion: https://postgr.es/m/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com

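  Since the arrays are sized at server start, raising the GUC now also
  raises each backend's fast-path capacity; a usage sketch:

      -- Requires a server restart to take effect; capacity follows the
      -- 2^n formula described above, capped at 1024 groups (16k slots).
      ALTER SYSTEM SET max_locks_per_transaction = 1024;
      -- ... restart the server, then confirm:
      SHOW max_locks_per_transaction;
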
* Refactor handling of nbtree array redundancies.  (Peter Geoghegan, 2024-09-21)

  Teach _bt_preprocess_array_keys to eliminate redundant array equality
  scan keys directly, rather than just marking them as redundant.  Its
  _bt_preprocess_keys caller is no longer required to ignore input scan
  keys that were marked redundant in this way.  Oversights like the one
  fixed by commit f22e17f7 are no longer possible.

  The new scheme also makes it easier for _bt_preprocess_keys to output
  a so.keyData[] scan key array with _more_ scan keys than it was passed
  in its scan.keyData[] input scan key array.  An upcoming patch that
  adds skip scan optimizations to nbtree will take advantage of this.

  In passing, remove and rename certain _bt_preprocess_keys variables to
  make the difference between our input scan key array and our output
  scan key array clearer.

  Author: Peter Geoghegan <pg@bowt.ie>
  Reviewed-By: Tomas Vondra <tomas@vondra.me>
  Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com

* Improve Asserts checking relation matching in parallel scans.  (Tom Lane, 2024-09-20)

  table_beginscan_parallel and index_beginscan_parallel contain Asserts
  checking that the relation a worker will use in a parallel scan is the
  same one the leader intended.  However, they were checking for
  relation OID match, which was not strong enough to detect the mismatch
  problem fixed in 126ec0bc7.  What would be strong enough is to compare
  relfilenodes instead.  Arguably, that's a saner definition anyway,
  since a scan surely operates on a physical relation not a logical one.

  Hence, store and compare RelFileLocators not relation OIDs.  Also
  ensure that index_beginscan_parallel checks the index identity, not
  just the table identity.

  Discussion: https://postgr.es/m/2127254.1726789524@sss.pgh.pa.us

* Alphabetize #include directives in pg_checksums.c.  (Nathan Bossart, 2024-09-20)

  Author: Michael Banck
  Discussion: https://postgr.es/m/66edaed0.050a0220.32a9ba.42c8%40mx.google.com

* Fix nbtree pgstats accounting with parallel scans.  (Peter Geoghegan, 2024-09-20)

  Commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution, made
  parallel index scans work with the new design for arrays via explicit
  scheduling of primitive index scans.  Under this scheme a parallel
  index scan with array keys will perform the same number of index
  descents as an equivalent serial index scan (barring corner cases
  where an individual parallel worker discovers that it can advance the
  scan's array keys without anybody needing to perform another descent
  of the index to get to the relevant page on the leaf level).

  Despite all this, the pgstats accounting wasn't updated; it continued
  to increment the total number of index scans for the rel once per
  _bt_first call, no matter the details.  As a result, the number of
  (primitive) index scans could be over-counted during parallel scans.

  To fix, delay incrementing the count of index scans until after we've
  established that another descent of the index (using either _bt_search
  or _bt_endpoint) is required.  That way
  pg_stat_user_tables.idx_scan always advances in the same way,
  regardless of whether or not the scan makes use of parallelism.

  Oversight in commit 5bf748b8.

  Author: Peter Geoghegan <pg@bowt.ie>
  Reviewed-By: Tomas Vondra <tomas@vondra.me>
  Discussion: https://postgr.es/m/CAH2-Wz=E7XrkvscBN0U6V81NK3Q-dQOmivvbEsjG-zwEfDdFpg@mail.gmail.com
  Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com
  Backpatch: 17-, where nbtree SAOP execution was enhanced.

* Add parameter "connstr" to PostgreSQL::Test::Cluster::background_psqlMichael Paquier2024-09-20
| | | | | | | | Like for Cluster::psql, this can be handy to force the use of a connection string with some values overriden, like a "host". Author: Aidar Imamov Discussion: https://postgr.es/m/ecacb079efc533aed3c234cbcb5b07b6@postgrespro.ru
* Restore relmapper state early enough in parallel workers.  (Tom Lane, 2024-09-19)

  We need to do RestoreRelationMap before loading catalog-derived state,
  else the worker may end up with catalog relcache entries containing
  stale relfilenode data.  Move up RestoreReindexState too, on the
  principle that that should also happen before we do much of any
  catalog access.

  I think ideally these things would happen even before InitPostgres,
  but there are various problems standing in the way of that, notably
  that the relmapper thinks "active" mappings should be discarded at
  transaction end.  The implication of this is that InitPostgres and
  RestoreLibraryState will see the same catalog state as an independent
  backend would see, which is probably fine; at least, it's been like
  that all along.

  Per report from Justin Pryzby.  There is a case to be made that this
  should be back-patched.  But given the lack of complaints before
  6e086fa2e and the short amount of time remaining before 17.0 wraps,
  I'll just put it in HEAD for now.

  Discussion: https://postgr.es/m/ZuoU_8EbSTE14o1U@pryzbyj2023

* psql: Add tests for repeated calls of \bind[_named]  (Michael Paquier, 2024-09-20)

  The implementation assumes that on multiple calls of these
  meta-commands the last one wins.  Multiple \g calls in-between mean
  multiple executions.

  There were no tests to check these properties, hence let's add
  something.

  Author: Jelte Fennema-Nio, Michael Paquier
  Discussion: https://postgr.es/m/CAGECzQSTE7CoM=Gst56Xj8pOvjaPr09+7jjtWqTC40pGETyAuA@mail.gmail.com

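  A hypothetical psql session showing the last-one-wins behavior being
  tested:

      -- The later \bind overrides the earlier one, so this executes with
      -- the parameter 2, not 1; a \g between the two would execute once
      -- with 1 first.
      SELECT $1::int \bind 1 \bind 2 \g
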
* Improve Perl script which adds commit links to release notes  (Bruce Momjian, 2024-09-19)

  Reported-by: Andrew Dunstan
  Discussion: https://postgr.es/m/b2465837-56df-4794-a0b5-5e6ed44ed870@dunslane.net
  Author: Andrew Dunstan
  Backpatch-through: 12

* Add UpgradeTaskProcessCB to typedefs.list  (Alexander Korotkov, 2024-09-19)

  While it doesn't directly influence indentation right now, add it for
  uniformity.

* Fix order of includes in src/bin/pg_upgrade/info.c  (Alexander Korotkov, 2024-09-19)

* Move pg_wal_replay_wait() to xlogfuncs.c  (Alexander Korotkov, 2024-09-19)

  This commit moves the pg_wal_replay_wait() procedure to be a neighbor
  of WAL-related functions in xlogfuncs.c.  The implementation of LSN
  waiting continues to reside in the same place.

  By proposal from Michael Paquier.

  Reported-by: Peter Eisentraut
  Discussion: https://postgr.es/m/18c0fa64-0475-415e-a1bd-665d922c5201%40eisentraut.org

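  For context, a usage sketch of the procedure being moved; the LSN
  value is illustrative, and a timeout argument can also be given:

      -- On a standby: block until replay reaches the given LSN.
      CALL pg_wal_replay_wait('0/306EE20');
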
* psql: Clean up more aggressively state of \bind[_named], \parse and \close  (Michael Paquier, 2024-09-19)

  This fixes a couple of issues with the psql meta-commands mentioned
  above when called repeatedly:

  - The statement name is reset for each call.  If a command errors out,
    its send_mode would still be set, causing an incorrect path to be
    taken when processing a query.  For \bind_named, this could trigger
    an assertion failure, as a statement name is always expected for
    this meta-command.  This issue was introduced by d55322b0da60.

  - The memory allocated for bind parameters can be leaked.  This bug
    has existed since 5b66de3433e2 and was made worse by d55322b0da60;
    it is also possible to leak memory with \bind in v16 and v17.  That
    requires a fix that will be done on the affected branches
    separately.  This issue is taken care of here for HEAD.

  This patch tightens the cleanup of the state used for the extended
  protocol meta-commands (bind parameters, send mode, statement name) by
  doing it before running each meta-command, on top of doing it once a
  query has been processed.  This avoids any leaks and the
  inconsistencies when mixing calls, by refactoring the cleanup into a
  single routine used in all the code paths where this step is required.

  Reported-by: Alexander Lakhin
  Author: Anthonin Bonnefoy
  Discussion: https://postgr.es/m/2e5b89af-a351-ff0a-000c-037ac28314ab@gmail.com

* Introduce ATT_PARTITIONED_TABLE in tablecmds.c  (Michael Paquier, 2024-09-19)

  Partitioned tables and normal tables have been relying on ATT_TABLE in
  ATSimplePermissions() to produce error messages that depend on the
  relation's relkind, because both relkinds currently support the same
  set of ALTER TABLE subcommands.

  A patch to restrict SET LOGGED/UNLOGGED for partitioned tables is
  under discussion, and introducing ATT_PARTITIONED_TABLE makes
  subcommand restrictions for partitioned tables easier to deal with, so
  let's add one.  There is no functional change.

  Author: Michael Paquier
  Reviewed-by: Nathan Bossart
  Discussion: https://postgr.es/m/Zt6cDnwSvnuLLnak@paquier.xyz

* Optimize tuplestore usage for WITH RECURSIVE CTEs  (David Rowley, 2024-09-19)

  nodeRecursiveunion.c makes use of two tuplestores and, until now,
  would delete and recreate one of these tuplestores after every
  recursive iteration.

  Here we adjust that behavior and instead reuse one of the existing
  tuplestores and just empty it of all tuples using tuplestore_clear().

  This saves some free/malloc roundtrips and has shown a 25-30%
  performance improvement for queries that perform very little work
  between recursive iterations.

  This also paves the way to add some EXPLAIN ANALYZE telemetry output
  for recursive common table expressions, similar to what was done in
  1eff8279d and 95d6e9af0.  Previously, calling tuplestore_end() would
  have caused the maximum storage space used to be lost.

  Reviewed-by: Tatsuo Ishii
  Discussion: https://postgr.es/m/CAApHDvr9yW0YRiK8A2J7nvyT8g17YzbSfOviEWrghazKZbHbig@mail.gmail.com

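  A sketch of the query shape that benefits, i.e. minimal work per
  iteration:

      -- Each iteration produces one cheap row, so the per-iteration
      -- tuplestore overhead dominates; queries of this shape showed the
      -- 25-30% improvement described above.
      WITH RECURSIVE t(n) AS (
          SELECT 1
          UNION ALL
          SELECT n + 1 FROM t WHERE n < 1000000
      )
      SELECT count(*) FROM t;
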
* Add TOAST table to pg_index.  (Nathan Bossart, 2024-09-18)

  This change allows pg_index rows to use out-of-line storage for the
  "indexprs" and "indpred" columns, which enables use-cases such as very
  large index expressions.

  This system catalog was previously not given a TOAST table due to a
  fear of circularity issues (see commit 96cdeae07f).  Testing has not
  revealed any such problems, and it seems unlikely that the entries for
  system indexes could ever need out-of-line storage.  In any case, it
  is still early in the v18 development cycle, so committing this now
  will hopefully increase the chances of finding any unexpected problems
  prior to release.

  Bumps catversion.

  Reported-by: Jonathan Katz
  Reviewed-by: Tom Lane
  Discussion: https://postgr.es/m/b611015f-b423-458c-aa2d-be0e655cc1b4%40postgresql.org

* Add some sanity checks in executor for query ID reporting  (Michael Paquier, 2024-09-18)

  This commit adds three sanity checks in code paths of the executor
  where it is possible to use hooks, checking that a query ID is
  reported in pg_stat_activity if compute_query_id is enabled:
  - ExecutorRun()
  - ExecutorFinish()
  - ExecutorEnd()

  This causes the test in pg_stat_statements added in 933848d16dc9 to
  complain immediately in ExecutorRun().  The idea behind this commit is
  to help extensions detect when they are missing query ID reports as a
  query goes through the executor.  Perhaps this will prove to be a bad
  idea, but let's see where this experiment goes in v18 and newer
  versions.

  Reviewed-by: Sami Imseih
  Discussion: https://postgr.es/m/ZuJb5xCKHH0A9tMN@paquier.xyz

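  The invariant being asserted, seen from SQL (a sketch):

      -- With compute_query_id enabled, a backend running a query should
      -- always expose a query ID in pg_stat_activity:
      SET compute_query_id = on;
      SELECT query_id IS NOT NULL AS has_query_id
      FROM pg_stat_activity
      WHERE pid = pg_backend_pid();
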
* Extend PgStat_HashKey.objid from 4 to 8 bytes  (Michael Paquier, 2024-09-18)

  This opens the possibility to define keys for more types of statistics
  kinds in PgStat_HashKey, the first case being 8-byte query IDs for
  statistics like pg_stat_statements.

  This increases the size of PgStat_HashKey from 12 to 16 bytes, while
  PgStatShared_HashEntry, the entry stored in the dshash for pgstats,
  keeps the same size due to alignment.

  xl_xact_stats_item, which tracks the stats items to drop in commit WAL
  records, is increased from 12 to 16 bytes.  Note that individual
  chunks in commit WAL records should be multiples of sizeof(int), hence
  8-byte object IDs are stored as two uint32, based on a suggestion from
  Heikki Linnakangas.

  While on it, the field of PgStat_HashKey is renamed from "objoid" to
  "objid", as for some stats kinds this field does not refer to OIDs but
  just IDs, like for replication slot stats.

  This commit bumps the following format variables:
  - PGSTAT_FILE_FORMAT_ID, as PgStat_HashKey is written to the stats
    file for non-serialized stats kinds in the dshash table.
  - XLOG_PAGE_MAGIC for the changes in xl_xact_stats_item.
  - Catalog version, for the SQL function pg_stat_have_stats().

  Reviewed-by: Bertrand Drouvot
  Discussion: https://postgr.es/m/ZsvTS9EW79Up8I62@paquier.xyz

* Don't enter parallel mode when holding interrupts.  (Noah Misch, 2024-09-17)

  Doing so caused the leader to hang in wait_event=ParallelFinish, which
  required an immediate shutdown to resolve.  Back-patch to v12 (all
  supported versions).

  Francesco Degrassi

  Discussion: https://postgr.es/m/CAC-SaSzHUKT=vZJ8MPxYdC_URPfax+yoA1hKTcF4ROz_Q6z0_Q@mail.gmail.com

* Add missing query ID reporting in extended query protocol  (Michael Paquier, 2024-09-18)

  This commit adds query ID reports for two code paths when processing
  extended query protocol messages:
  - When receiving a bind message, setting it to the first Query
    retrieved from the plan cache.
  - When receiving an execute message, setting it to the first
    PlannedStmt stored in a portal.

  An advantage of this method is that it is able to cover all the types
  of portals handled in the extended query protocol, particularly these
  two cases where the report done in ExecutorStart() is not enough
  (neither is an addition in ExecutorRun(), actually, for the second
  point):
  - Multiple execute messages, with multiple ExecutorRun().
  - Portal with execute/fetch messages, like a query with a RETURNING
    clause and a fetch size that stores the tuples in a first execute
    message going through ExecutorStart() and ExecutorRun(), followed by
    one or more execute messages doing only fetches from the tuplestore
    created in the first message.  This corresponds to the case where
    execute_is_fetch is set, for example.

  Note that the query ID reporting done in ExecutorStart() is still
  necessary, as an EXECUTE requires it.  Query ID reporting is
  optimistic and more calls to pgstat_report_query_id() don't matter, as
  the first report takes priority except if the report is forced.  The
  comment in ExecutorStart() is adjusted to reflect better the reality
  with the extended query protocol.

  The test added in pg_stat_statements is a courtesy of Robert Haas.
  This uses psql's \bind metacommand, hence this part is backpatched
  down to v16.

  Reported-by: Kaido Vaikla, Erik Wienhold
  Author: Sami Imseih
  Reviewed-by: Jian He, Andrei Lepikhov, Michael Paquier
  Discussion: https://postgr.es/m/CA+427g8DiW3aZ6pOpVgkPbqK97ouBdf18VLiHFesea2jUk3XoQ@mail.gmail.com
  Discussion: https://postgr.es/m/CA+TgmoZxtnf_jZ=VqBSyaU8hfUkkwoJCJ6ufy4LGpXaunKrjrg@mail.gmail.com
  Discussion: https://postgr.es/m/1391613709.939460.1684777418070@office.mailbox.org
  Backpatch-through: 14

* Allow ReadStream to be consumed as raw block numbers.  (Thomas Munro, 2024-09-18)

  Commits 041b9680 and 6377e12a changed the interface of
  scan_analyze_next_block() to take a ReadStream instead of a
  BlockNumber and a BufferAccessStrategy, and to return a value to
  indicate when the stream has run out of blocks.  This caused
  integration problems for at least one known extension that uses
  specially encoded BlockNumber values that map to different underlying
  storage, because acquire_sample_rows() sets up the stream so that
  read_stream_next_buffer() reads blocks from the main fork of the
  relation's SMgrRelation.

  Provide read_stream_next_block(), as a way for such an extension to
  access the stream of raw BlockNumbers directly and forward them to its
  own ReadBuffer() calls after decoding, as it could in earlier
  releases.  The new function returns the BlockNumber and
  BufferAccessStrategy that were previously passed directly to
  scan_analyze_next_block().

  Alternatively, an extension could wrap the stream of BlockNumbers in
  another ReadStream with a callback that performs any decoding required
  to arrive at real storage manager BlockNumber values, so that it could
  benefit from the I/O combining and concurrency provided by
  read_stream.c.

  Another class of table access method that does nothing in
  scan_analyze_next_block() because it is not block-oriented could use
  this function to control the number of block sampling loops.  It could
  match the previous behavior with "return
  read_stream_next_block(stream, &bas) != InvalidBlockNumber".

  Ongoing work is expected to provide better ANALYZE support for table
  access methods that don't behave like heapam with respect to storage
  blocks, but that will be for future releases.

  Back-patch to 17.

  Reported-by: Mats Kindahl <mats@timescale.com>
  Reviewed-by: Mats Kindahl <mats@timescale.com>
  Discussion: https://postgr.es/m/CA%2B14425%2BCcm07ocG97Fp%2BFrD9xUXqmBKFvecp0p%2BgV2YYR258Q%40mail.gmail.com

* Repair pg_upgrade for identity sequences with non-default persistence.  (Tom Lane, 2024-09-17)

  Since we introduced unlogged sequences in v15, identity sequences have
  defaulted to having the same persistence as their owning table.
  However, it is possible to change that with ALTER SEQUENCE, and
  pg_dump tries to preserve the logged-ness of sequences when it doesn't
  match (as indeed it wouldn't for an unlogged table from before v15).

  The fly in the ointment is that ALTER SEQUENCE SET [UN]LOGGED fails in
  binary-upgrade mode, because it needs to assign a new relfilenode,
  which we cannot permit in that mode.  Thus, trying to pg_upgrade a
  database containing a mismatching identity sequence failed.

  To fix, add syntax to ADD/ALTER COLUMN GENERATED AS IDENTITY to allow
  the sequence's persistence to be set correctly at creation, and use
  that instead of ALTER SEQUENCE SET [UN]LOGGED in pg_dump.  (I tried to
  make SET [UN]LOGGED work without any pg_dump modifications, but that
  seems too fragile to be a desirable answer.  This way should be
  markedly faster anyhow.)

  In passing, document the previously-undocumented SEQUENCE NAME option
  that pg_dump also relies on for identity sequences; I see no value in
  trying to pretend it doesn't exist.

  Per bug #18618 from Anthony Hsu.  Back-patch to v15 where we invented
  this stuff.

  Discussion: https://postgr.es/m/18618-d4eb26d669ed110a@postgresql.org

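  A hedged sketch of the new syntax, with the persistence keyword given
  among the identity sequence options; the table and sequence names are
  illustrative:

      -- Sketch: keep the identity sequence LOGGED even though the owning
      -- table is unlogged, the kind of mismatch pg_dump can now
      -- reproduce at creation time.
      ALTER TABLE t
        ADD COLUMN id int GENERATED ALWAYS AS IDENTITY
            (SEQUENCE NAME t_id_seq LOGGED);
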
* Ensure standby promotion point in 043_wal_replay_wait.pl  (Alexander Korotkov, 2024-09-17)

  This commit ensures the standby will be promoted at least at the
  primary insert LSN we have just observed.  We use pg_switch_wal() to
  force the insert LSN to be written, then wait for the standby to catch
  up.

  Reported-by: Alexander Lakhin
  Discussion: https://postgr.es/m/1d7b08f2-64a2-77fb-c666-c9a74c68eeda%40gmail.com

* Minor cleanup related to pg_wal_replay_wait() procedure  (Alexander Korotkov, 2024-09-17)

  * Rename $node_standby1 to $node_standby in 043_wal_replay_wait.pl, as
    there is only one standby.
  * Remove useless debug printing in 043_wal_replay_wait.pl.
  * Fix typo in one check description in 043_wal_replay_wait.pl.
  * Fix some wording in comments and documentation.

  Reported-by: Alexander Lakhin
  Discussion: https://postgr.es/m/1d7b08f2-64a2-77fb-c666-c9a74c68eeda%40gmail.com
  Reviewed-by: Alexander Lakhin

* Avoid parallel nbtree index scan hangs with SAOPs.  (Peter Geoghegan, 2024-09-17)

  Commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution, made
  parallel index scans work with the new design for arrays via explicit
  scheduling of primitive index scans.  A backend that successfully
  scheduled the scan's next primitive index scan saved its backend-local
  array keys in shared memory.  Any backend could pick up the scheduled
  primitive scan within _bt_first.  This scheme decouples scheduling a
  primitive scan from starting the scan (by performing another descent
  of the index via a _bt_search call from _bt_first) to make things
  robust.

  The scheme had a deadlock hazard, at least when the leader process
  participated in the scan.  _bt_parallel_seize had a code path that
  made backends that were not in an immediate position to start a
  scheduled primitive index scan wait for some other backend to do so
  instead.  Under the right circumstances, the leader process could wait
  here forever: the leader would wait for any other backend to start the
  primitive scan, while every worker was busy waiting on the leader to
  consume tuples from the scan's tuple queue.

  To fix, don't wait for a scheduled primitive index scan to be started
  by some other eligible backend from within _bt_parallel_seize (when
  the calling backend isn't in a position to do so itself).  Return
  false instead, while recording that the scan has a scheduled primitive
  index scan in backend-local state.  This leaves the backend in the
  same state as the existing case where a backend schedules (or tries to
  schedule) another primitive index scan from within
  _bt_advance_array_keys, before calling _bt_parallel_seize.
  _bt_parallel_seize already handles that case by returning false
  without waiting, and without unsetting the backend-local state.
  Leaving the backend in this state enables it to start a previously
  scheduled primitive index scan once it gets back to _bt_first.

  Oversight in commit 5bf748b8.

  Matthias van de Meent, with tweaks by me.

  Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
  Reported-By: Tomas Vondra <tomas@vondra.me>
  Reviewed-By: Peter Geoghegan <pg@bowt.ie>
  Discussion: https://postgr.es/m/CAH2-WzmMGaPa32u9x_FvEbPTUkP5e95i=QxR8054nvCRydP-sw@mail.gmail.com
  Backpatch: 17-, where nbtree SAOP execution was enhanced.

* Add temporal FOREIGN KEY constraints  (Peter Eisentraut, 2024-09-17)

  Add a PERIOD clause to foreign key constraint definitions.  This is
  supported for range and multirange types.  Temporal foreign keys check
  for range containment instead of equality.

  This feature matches the behavior of the SQL standard's temporal
  foreign keys, but it works on PostgreSQL's native ranges instead of
  SQL's "periods", which don't exist in PostgreSQL (yet).

  Reference actions ON {UPDATE,DELETE} {CASCADE,SET NULL,SET DEFAULT}
  are not supported yet.

  (previously committed as 34768ee3616, reverted by 8aee330af55; this is
  essentially unchanged from those)

  Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
  Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
  Reviewed-by: jian he <jian.universality@gmail.com>
  Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com

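  A hedged syntax sketch; table and column names are illustrative, and
  the referenced table needs a temporal primary key, as in the next
  entry below:

      -- Sketch: the PERIOD part is checked by range containment, not
      -- equality; a booking's validity must lie within the room row's.
      CREATE TABLE booking (
          room_id  int,
          validity daterange,
          FOREIGN KEY (room_id, PERIOD validity)
              REFERENCES room (id, PERIOD validity)
      );
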
* Add temporal PRIMARY KEY and UNIQUE constraints  (Peter Eisentraut, 2024-09-17)

  Add a WITHOUT OVERLAPS clause to PRIMARY KEY and UNIQUE constraints.
  These are backed by GiST indexes instead of B-tree indexes, since they
  are essentially exclusion constraints with = for the scalar parts of
  the key and && for the temporal part.

  (previously committed as 46a0cd4cefb, reverted by 8aee330af55; the new
  part is this:)

  Because 'empty' && 'empty' is false, the temporal PK/UQ constraint
  allowed duplicates, which is confusing to users and breaks internal
  expectations.  For instance, when GROUP BY checks functional
  dependencies on the PK, it allows selecting other columns from the
  table, but in the presence of duplicate keys you could get the value
  from any of their rows.  So we need to forbid empties.

  This all means that at the moment we can only support ranges and
  multiranges for temporal PK/UQs, unlike the original patch (above).
  Documentation and tests for this are added.  But this could
  conceivably be extended by introducing some more general support for
  the notion of "empty" for other types.

  Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
  Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
  Reviewed-by: jian he <jian.universality@gmail.com>
  Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com

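  A usage sketch with illustrative names:

      -- Scalar part compared with =, range part with &&, backed by a
      -- GiST index; overlapping validity for the same id is rejected,
      -- and empty ranges are forbidden per the note above.
      CREATE TABLE room (
          id       int,
          validity daterange,
          PRIMARY KEY (id, validity WITHOUT OVERLAPS)
      );
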
* Add stratnum GiST support function  (Peter Eisentraut, 2024-09-17)

  This is support function 12 for the GiST AM and translates
  "well-known" RT*StrategyNumber values into whatever strategy number is
  used by the opclass (since no particular numbers are actually
  required).  We will use this to support temporal PRIMARY
  KEY/UNIQUE/FOREIGN KEY/FOR PORTION OF functionality.

  This commit adds two implementations, one for internal GiST opclasses
  (just an identity function) and another for btree_gist opclasses.  It
  updates btree_gist from 1.7 to 1.8, adding the support function for
  all its opclasses.

  (previously committed as 6db4598fcb8, reverted by 8aee330af55; this is
  essentially unchanged from those)

  Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
  Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
  Reviewed-by: jian he <jian.universality@gmail.com>
  Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com

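  Existing installations pick up the new support function through the
  usual extension update path; a usage sketch:

      -- Upgrade btree_gist so its opclasses gain support function 12:
      ALTER EXTENSION btree_gist UPDATE TO '1.8';
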
* Add memory/disk usage for Window aggregate nodes in EXPLAIN.  (Tatsuo Ishii, 2024-09-17)

  This commit is similar to 1eff8279d and expands the idea to Window
  aggregate nodes so that users can know how much memory or disk the
  tuplestore used.

  This commit uses the newly introduced tuplestore_get_stats() to
  inquire this information and adds some additional output in EXPLAIN
  ANALYZE to display the information for the Window aggregate node.

  Reviewed-by: David Rowley, Ashutosh Bapat, Maxim Orlov, Jian He
  Discussion: https://postgr.es/m/20240706.202254.89740021795421286.ishii%40postgresql.org

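  A usage sketch; the storage line's exact wording may differ:

      EXPLAIN (ANALYZE, COSTS OFF)
      SELECT sum(g) OVER (ORDER BY g)
      FROM generate_series(1, 10000) g;
      -- The WindowAgg node now includes a line along the lines of:
      --   Storage: Memory  Maximum Storage: NNNkB
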
* Fix redefinition of typedef.  (Nathan Bossart, 2024-09-16)

  Per buildfarm members sifaka and longfin, clang with
  -Wtypedef-redefinition warns of a duplicate typedef unless building
  with C11.

  Oversight in commit 40e2e5e92b.

* pg_upgrade: Parallelize encoding conversion check.  (Nathan Bossart, 2024-09-16)

  This commit makes use of the new task framework in pg_upgrade to
  parallelize the check for incompatible user-defined encoding
  conversions, i.e., those defined on servers older than v14.  This step
  will now process multiple databases concurrently when pg_upgrade's
  --jobs option is provided a value greater than 1.

  Reviewed-by: Daniel Gustafsson, Ilya Gladyshev
  Discussion: https://postgr.es/m/20240516211638.GA1688936%40nathanxps13

* pg_upgrade: Parallelize WITH OIDS check.  (Nathan Bossart, 2024-09-16)

  This commit makes use of the new task framework in pg_upgrade to
  parallelize the check for tables declared WITH OIDS.  This step will
  now process multiple databases concurrently when pg_upgrade's --jobs
  option is provided a value greater than 1.

  Reviewed-by: Daniel Gustafsson, Ilya Gladyshev
  Discussion: https://postgr.es/m/20240516211638.GA1688936%40nathanxps13