aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAge
...
* Fix a couple of memory leaks in pg_restore.cAndrew Dunstan2025-04-06
| | | | per complaint from Coverity.
* Relax ordering-related hardcoded btree requirements in planningPeter Eisentraut2025-04-06
| | | | | | | | | | | | | | | | | | | | There were several places in ordering-related planning where a requirement for btree was hardcoded but an amcanorder index could suffice. This fixes that. We just need to do the necessary mapping between strategy numbers and compare types and adjust some related APIs so that this works independent of btree strategy numbers. For instance, non-btree amcanorder indexes can now be used to support sorting and merge joins. Also, predtest.c works independent of btree strategy numbers now. To avoid performance regressions, some details on btree and other built-in index types are still hardcoded as shortcuts, but other index types now have access to the same features by providing the required flags and callbacks. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* Revert "Put enable_self_join_elimination into postgresql.conf.sample"Alexander Korotkov2025-04-06
| | | | | | | This reverts commit c2d329260cd8. Reported-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/D292EB44-806E-439A-82A4-491A1BA59E7A%40yesql.se
* Put enable_self_join_elimination into postgresql.conf.sampleAlexander Korotkov2025-04-06
| | | | | | | | | | fc069a3a6319 implements Self-Join Elimination (SJE) and provides a new GUC variable: enable_self_join_elimination. This commit adds enable_self_join_elimination to the postgresql.conf.sample, as it was forgotten in the original commit. Discussion: https://postgr.es/m/CAHewXN%3D%2Bghd6O6im46q7j2u6c3H6vkXtXmF%3D_v4CfGSnjje8PA%40mail.gmail.com Author: Tender Wang <tndrwang@gmail.com>
* Compute CRC32C using AVX-512 instructions where availableJohn Naylor2025-04-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous implementation of CRC32C on x86 relied on the native CRC32 instruction from the SSE 4.2 extension, which operates on up to 8 bytes at a time. We can get a substantial speedup by using carryless multiplication on SIMD registers, processing 64 bytes per loop iteration. Shorter inputs fall back to ordinary CRC instructions. On Intel Tiger Lake hardware (2020), CRC is now 50% faster for inputs between 64 and 112 bytes, and 3x faster for 256 bytes. The VPCLMULQDQ instruction on 512-bit registers has been available on Intel hardware since 2019 and AMD since 2022. There is an older variant for 128-bit registers, but at least on Zen 2 it performs worse than normal CRC instructions for short inputs. We must now do a runtime check, even for builds that target SSE 4.2. This doesn't matter in practice for WAL (arguably the most critical case), because since commit e2809e3a1 the final computation with the 20-byte WAL header is inlined and unrolled when targeting that extension. Compared with two direct function calls, testing showed equal or slightly faster performance in performing an indirect function call on several dozen bytes followed by inlined instructions on constant input of 20 bytes. The MIT-licensed implementation was generated with the "generate" program from https://github.com/corsix/fast-crc32/ Based on: "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction" V. Gopal, E. Ozturk, et al., 2009 Co-authored-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com> Co-authored-by: Paul Amonson <paul.d.amonson@intel.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Reviewed-by: Matthew Sterrett <matthewsterrett2@gmail.com> (earlier version) Tested-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com> Tested-by: David Rowley <<dgrowleyml@gmail.com>> (earlier version) Discussion: https://postgr.es/m/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192@BL1PR11MB5304.namprd11.prod.outlook.com Discussion: https://postgr.es/m/PH8PR11MB82869FF741DFA4E9A029FF13FBF72@PH8PR11MB8286.namprd11.prod.outlook.com
* Quote filename in error messageDaniel Gustafsson2025-04-05
| | | | | | | | | | Project standard is to quote filenames in error and log messages, which commit 2da74d8d640 missed in two error messages. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/20250404.120328.103562371975971823.horikyota.ntt@gmail.com
* Fix parse_cte.c's failure to examine sub-WITHs in DML statements.Tom Lane2025-04-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | makeDependencyGraphWalker thought that only SelectStmt nodes could contain a WithClause. Which was true in our original implementation of WITH, but astonishingly we missed updating this code when we added the ability to attach WITH to INSERT/UPDATE/DELETE (and later MERGE). Moreover, since it was coded to deliberately block recursion to a WithClause, even updating raw_expression_tree_walker didn't save it. The upshot of this was that we didn't see references to outer CTE names appearing within an inner WITH, and would neither complain about disallowed recursion nor account for such references when sorting CTEs into a usable order. The lack of complaints about this is perhaps not so surprising, because typical usage of WITH wouldn't hit either case. Still, it's pretty broken; failing to detect recursion here leads to assert failures or worse later on. Fix by factoring out the processing of sub-WITHs into a new function WalkInnerWith, and invoking that for all the statement types that can have WITH. Bug: #18878 Reported-by: Yu Liang <luy70@psu.edu> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18878-a26fa5ab6be2f2cf@postgresql.org Backpatch-through: 13
* Add modern SHA-2 based password hashes to pgcrypto.Álvaro Herrera2025-04-05
| | | | | | | | | | | This adapts the publicly available reference implementation on https://www.akkadia.org/drepper/SHA-crypt.txt and adds the new hash algorithms sha256crypt and sha512crypt to crypt() and gen_salt() respectively. Author: Bernd Helmle <mailings@oopsware.de> Reviewed-by: Japin Li <japinli@hotmail.com> Discussion: https://postgr.es/m/c763235a2757e2f5f9e3e27268b9028349cef659.camel@oopsware.de
* Avoid double transformation of json_array()'s subquery.Tom Lane2025-04-05
| | | | | | | | | | | | | | | | | | | | | | | transformJsonArrayQueryConstructor() applied transformStmt() to the same subquery tree twice. While this causes no issue in many cases, there are some where it causes a coredump, thanks to the parser's habit of scribbling on its input. Fix by making a copy before the first transformation (compare 0f43083d1). This is quite brute-force, but then so is the whole business of transforming the input twice. Per discussion in the bug thread, this implementation of json_array() parsing should be replaced completely. But that will take some work and will surely not be back-patchable, so for the moment let's take the easy way out. Oversight in 7081ac46a. Back-patch to v16 where that came in. Bug: #18877 Reported-by: Yu Liang <luy70@psu.edu> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18877-c3c3ad75845833bb@postgresql.org Backpatch-through: 16
* Clean up from commit 1495eff7bdbAndrew Dunstan2025-04-05
| | | | | Fix some comments, and remove the hacky way of quoting database names in favor of appendStringLiteralConn.
* Set log_statement=none in t/002_pg_upgrade.plÁlvaro Herrera2025-04-05
| | | | | | | | | This should make the test a wee bit faster on high-load machines (e.g., when running under valgrind). Per complaint from Andres Freund. Discussion: https://postgr.es/m/cwbcyjp2ts7o7xgy5y5gwtcd4zltvncsj67el7xgci7xbwrhlu@k363vk5tce4g
* pg_dump: Tiny header cleanupÁlvaro Herrera2025-04-05
| | | | | | | | | | | | | | | In commits 9c02e3a986da and 8ec0aaeae094, Nathan added a duplicate TocEntry typedef forward declaration (plus assorted #ifdef hackery to avoid C99 preprocessor issues) to deal with some very old untidyness regarding DefnDumperPtr function prototype being located in pg_backup.h. But there's no reason to have the DefnDumperPtr typedef (and the accompanying DataDumperPtr typedef) in that file at all; they are better placed in pg_backup_archiver.h, the internal header, because they are only used internally. That also requires zero #ifdef hackery, so move them there. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/202504042140.qo66ggw6wzsz@alvherre.pgsql
* pg_dump: Fix query for gathering attribute stats on older versions.Nathan Bossart2025-04-04
| | | | | | | | | | | | | | | Commit 9c02e3a986 taught pg_dump to retrieve attribute statistics for 64 relations at a time. pg_dump supports dumping from v9.2 and newer versions, but our query for retrieving statistics for multiple relations uses WITH ORDINALITY and multi-argument UNNEST(), both of which were introduced in v9.4. To fix, we resort to gathering statistics for a single relation at a time on versions older than v9.4. Per buildfarm member crake. Author: Corey Huinker <corey.huinker@gmail.com> Discussion: https://postgr.es/m/Z_BcWVMvlUIJ_iuZ%40nathan
* Repair misbehavior with duplicate entries in FK SET column lists.Tom Lane2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since v15 we've had an option to apply a foreign key constraint's ON DELETE SET DEFAULT or SET NULL action to just some of the referencing columns. There was not a check for duplicate entries in the list of columns-to-set, though. That caused a potential memory stomp in CreateConstraintEntry(), which incautiously assumed that the list of columns-to-set couldn't be longer than the number of key columns. Even after fixing that, the case doesn't work because you get an error like "multiple assignments to same column" from the SQL command that is generated to do the update. We could either raise an error for duplicate columns or silently suppress the dups, and after a bit of thought I chose to do the latter. This is motivated by the fact that duplicates in the FK column list are legal, so it's not real clear why duplicates in the columns-to-set list shouldn't be. Of course there's no need to actually set the column more than once. I left in the fix in CreateConstraintEntry() too, just because it didn't seem like such low-level code ought to be making assumptions about what it's handed. Bug: #18879 Reported-by: Yu Liang <luy70@psu.edu> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18879-259fc59d072bd4d7@postgresql.org Backpatch-through: 15
* functions.c: copy trees from source_list before parse analysis etc.Tom Lane2025-04-04
| | | | | | | | | | | | | | | | | | | | | | This is yet another bit of fallout from the fact that backend/parser (like other code) feels free to scribble on the parse tree it's handed. In this case that resulted in modifying the relatively-short-lived copy in the cached function's source_list. That would be fine since we only need each source_list tree once ... except that if the parser fails after making some changes, the function cache entry remains as-is and will still be there if the user tries to execute the function again. Then we have problems because we're feeding a non-pristine tree to the parser. The most expedient fix is a quick copyObject(). I considered other answers like somehow marking the cache entry invalid temporarily, but that would add complexity and I'm not sure it's worth it. In typical scenarios we'd only do this once per function query per session. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/6d442183-102c-498a-81d1-eeeb086cdc5a@gmail.com
* Fix a couple of error messages and tests for themAndrew Dunstan2025-04-04
| | | | oversights in 1495eff7bdb and 289f74d0cb2. Mea culpa.
* Prevent redeclaration of typedef TocEntry.Nathan Bossart2025-04-04
| | | | | | | | | Commit 9c02e3a986 added a forward declaration for this typedef that caused redeclarations, which is not valid in C99. To fix, add some preprocessor guards to avoid a redefinition, as is done elsewhere (e.g., commit 382092a0cd). Per buildfarm.
* Add more TAP tests for pg_dumpallAndrew Dunstan2025-04-04
| | | | | Author: Matheus Alcantara <matheusssilv97@gmail.com> Author: Mahendra Singh Thalor <mahi6run@gmail.com>
* Non text modes for pg_dumpall, correspondingly change pg_restoreAndrew Dunstan2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | pg_dumpall acquires a new -F/--format option, with the same meanings as pg_dump. The default is p, meaning plain text. For any other value, a directory is created containing two files, globals.data and map.dat. The first contains SQL for restoring the global data, and the second contains a map from oids to database names. It will also contain a subdirectory called databases, inside which it will create archives in the specified format, named using the database oids. In these casess the -f argument is required. If pg_restore encounters a directory containing globals.dat, and no toc.dat, it restores the global settings and then restores each database. pg_restore acquires two new options: -g/--globals-only which suppresses restoration of any databases, and --exclude-database which inhibits restoration of particualr database(s) in the same way the same option works in pg_dumpall. Author: Mahendra Singh Thalor <mahi6run@gmail.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: jian he <jian.universality@gmail.com> Reviewed-by: Srinath Reddy <srinath2133@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/cb103623-8ee6-4ba5-a2c9-f32e3a4933fa@dunslane.net
* add new list type simple_oid_string_list to fe-utils/simple_listAndrew Dunstan2025-04-04
| | | | | | | | This type contains both an oid and a string. This will be used in forthcoming changes to pg_restore. Author: Andrew Dunstan <andrew@dunslane.net>
* Move common pg_dump code related to connections to a new fileAndrew Dunstan2025-04-04
| | | | | | | | | ConnectDatabase is used by pg_dumpall, pg_restore and pg_dump so move common code to new file. new file name: connectdb.c Author: Mahendra Singh Thalor <mahi6run@gmail.com>
* Remove unused function parameters in pg_backup_archiver.c.Nathan Bossart2025-04-04
| | | | | | | | | Thanks to commit 9c02e3a986, which modified some of the changes from commit a0a4601765, we can remove the now-unused ArchiveHandle parameter from _tocEntryRestorePass() and move_to_ready_heap(). Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/Z-3x2AnPCP331JA3%40nathan
* pg_dump: Retrieve attribute statistics in batches.Nathan Bossart2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, pg_dump gathers attribute statistics with a query per relation, which can cause pg_dump to take significantly longer, especially when there are many relations. This commit addresses this by teaching pg_dump to gather attribute statistics for 64 relations at a time. Some simple tests showed this was the optimal batch size, but performance may vary depending on the workload. Our lookahead code determines the next batch of relations by searching the TOC sequentially for relevant entries. This approach assumes that we will dump all such entries in TOC order, which unfortunately isn't true for dump formats that use RestoreArchive(). RestoreArchive() does multiple passes through the TOC and selectively dumps certain groups of entries each time. This is particularly problematic for index stats and a subset of matview stats; both are in SECTION_POST_DATA, but matview stats that depend on matview data are dumped in RESTORE_PASS_POST_ACL, while all other stats are dumped in RESTORE_PASS_MAIN. To handle this, this commit moves all statistics data entries in SECTION_POST_DATA to RESTORE_PASS_POST_ACL, which ensures that we always dump them in TOC order. A convenient side effect of this change is that we can revert a decent chunk of commit a0a4601765, but that is left for a follow-up commit. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/CADkLM%3Dc%2Br05srPy9w%2B-%2BnbmLEo15dKXYQ03Q_xyK%2BriJerigLQ%40mail.gmail.com
* pg_dump: Reduce memory usage of dumps with statistics.Nathan Bossart2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now, pg_dump stores all generated commands for statistics in memory. These commands can be quite large and therefore can significantly increase pg_dump's memory footprint. To fix, wait until we are about to write out the commands before generating them, and be sure to free the commands after writing. This is implemented via a new defnDumper callback that works much like the dataDumper one but is specifically designed for TOC entries. Custom dumps that include data might write the TOC twice (to update data offset information), which would ordinarily cause pg_dump to run the attribute statistics queries twice. However, as a hack, we save the length of the written-out entry in the first pass and skip over it in the second. While there is no known technical issue with executing the queries multiple times and rewriting the results, it's expensive and feels risky, so let's avoid it. As an exception, we _do_ execute the queries twice for the tar format. This format does a second pass through the TOC to generate the restore.sql file. pg_restore doesn't use this file, so even if the second round of queries returns different results than the first, it won't corrupt the output; the archive and restore.sql file will just have different content. A follow-up commit will teach pg_dump to gather attribute statistics in batches, which our testing indicates more than makes up for the added expense of running the queries twice. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/CADkLM%3Dc%2Br05srPy9w%2B-%2BnbmLEo15dKXYQ03Q_xyK%2BriJerigLQ%40mail.gmail.com
* Skip second WriteToc() call for custom-format dumps without data.Nathan Bossart2025-04-04
| | | | | | | | | | | Presently, "pg_dump --format=custom" calls WriteToc() twice. The second call updates the data offset information, which allegedly makes parallel pg_restore significantly faster. However, if we're not dumping any data, there are no data offsets to update, so we can skip this step. Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/Z9c1rbzZegYQTOQE%40nathan
* Use streaming read I/O in autoprewarmMelanie Plageman2025-04-04
| | | | | | | | | | | | | | | | Make a read stream for each valid fork of each valid relation represented in the autoprewarm dump file and prewarm those blocks through the read stream API instead of by directly invoking ReadBuffer(). Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru> (earlier versions) Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> (earlier versions) Reviewed-by: Matheus Alcantara <mths.dev@pm.me> (earlier versions) Discussion: https://postgr.es/m/flat/CAN55FZ3n8Gd%2BhajbL%3D5UkGzu_aHGRqnn%2BxktXq2fuds%3D1AOR6Q%40mail.gmail.com
* Refactor autoprewarm_database_main() in preparation for read streamMelanie Plageman2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Autoprewarm prewarms blocks from a dump file representing the contents of shared buffers at the time it was dumped. It uses a sorted array of BlockInfoRecords, each representing a block from one of the cluster's databases and tables. autoprewarm_database_main() prewarms all the blocks from a single database. It is optimized to ensure we don't try to open the same relation or fork over and over again if it has been dropped or is invalid. The main loop handled this by carefully setting various local variables to sentinel values when a run of blocks should be skipped. This method won't work with the read stream API. The read stream callback must be able to advance the current position in the BlockInfoRecord array to allow for reading ahead additional blocks, however a read stream maps 1-1 with a relation and fork combination. So, the main loop in autoprewarm_database_main() must also advance the position in the array of BlockInfoRecords to skip invalid relations and forks. This split control doesn't fit well with the current flow control in autoprewarm_database_main() To make it compatible with the read stream API, change autoprewarm_database_main() to explicitly fast-forward in the BlockInfoRecords array past the blocks belonging to an invalid relation or fork. This commit only implements the new control flow -- it does not use the read stream API. Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/flat/CAN55FZ3n8Gd%2BhajbL%3D5UkGzu_aHGRqnn%2BxktXq2fuds%3D1AOR6Q%40mail.gmail.com
* Remove superfluous autoprewarm checkMelanie Plageman2025-04-04
| | | | | | | | | | | | | autoprewarm_database_main() prewarms blocks from the same database. It is passed an array of sorted BlockInfoRecords and a start and stop index into the array. The range represented should include only blocks belonging to global objects or blocks from a single database. Remove an unnecessary check that the current block is from the same database and add an assert to ensure this invariant remains. Doing so removes a special case that makes future refactoring to accommodate read streamifying autoprewarm easier. Noticed off-list by Andres Freund
* Avoid extra index searches through preprocessing.Peter Geoghegan2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Transform low_compare and high_compare nbtree skip array inequalities (with opclasses that offer skip support) in such a way as to allow _bt_first to consistently apply later keys when it descends the tree. This can lower the number of index searches for multi-column scans that use a ">" key on one of the index's prefix columns (or use a "<" key, when scanning backwards) when it precedes some later lower-order key. For example, an index qual "WHERE a > 5 AND b = 2" will now be converted to "WHERE a >= 6 AND b = 2" by a new preprocessing step that takes place after low_compare and high_compare have been finalized. That way, the initial call to _bt_first can use "WHERE a >= 6 AND b = 2" to find an initial position, rather than just using "WHERE a > 5" -- "b = 2" can be applied during every _bt_first call. There's a decent chance that this will allow such a scan to avoid the extra search that might otherwise be needed to determine the lowest "a" value still satisfying "WHERE a > 5". The transformation process can only lower the total number of index pages read when the use of a more restrictive set of initial positioning keys in _bt_first actually allows the scan to land on some later leaf page directly, relative to the unoptimized case (or on an earlier leaf page directly, when scanning backwards). But the savings can really add up in cases where an affected skip array comes after some other array. For example, a scan indexqual "WHERE x IN (1, 2, 3) AND y > 5 AND z = 2" can save as many as 3 _bt_first calls by applying the new transformation to its "y" array (up to 1 extra search can be avoided per "x" element). Follow-up to commit 92fe23d9, which added nbtree skip scan. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=FJ78K3WsF3iWNxWnUCY9f=Jdg3QPxaXE=uYUbmuRz5Q@mail.gmail.com
* Improve nbtree skip scan primitive scan scheduling.Peter Geoghegan2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't allow nbtree scans with skip arrays to end any primitive scan on its first leaf page without giving some consideration to how many times the scan's arrays advanced while changing at least one skip array (though continue not caring about the number of array advancements that only affected SAOP arrays, even during skip scans with SAOP arrays). Now when a scan performs more than 3 such array advancements in the course of reading a single leaf page, it is taken as a signal that the next page is unlikely to be skippable. We'll therefore continue the ongoing primitive index scan, at least until we can perform a recheck against the next page's finaltup. Testing has shown that this new heuristic occasionally makes all the difference with skip scans that were expected to rely on the "passed first page" heuristic added by commit 9a2e2a28. Without it, there is a remaining risk that certain kinds of skip scans will never quite manage to clear the initial hurdle of performing a primitive scan that lasts beyond its first leaf page (or that such a skip scan will only clear that initial hurdle when it has already wasted noticeably-many cycles due to inefficient primitive scan scheduling). Follow-up to commits 92fe23d9 and 9a2e2a28. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com
* pg_recvlogical: Add --failover option.Masahiko Sawada2025-04-04
| | | | | | | | | | | This new option instructs pg_recvlogical to create the logical replication slot with the failover option enabled. It can be used in conjunction with the --create-slot option. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Michael Banck <mbanck@gmx.net> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/OSCPR01MB14966C54097FC83AF19F3516BF5AC2@OSCPR01MB14966.jpnprd01.prod.outlook.com
* Oversight in commit b81ffa13e3.Jeff Davis2025-04-04
| | | | Should warn if a materialized view may be affected, as well.
* Further optimize nbtree search scan key comparisons.Peter Geoghegan2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Postgres 17 commit e0b1ee17 added two complementary optimizations to nbtree: the "prechecked" and "firstmatch" optimizations. _bt_readpage was made to avoid needlessly evaluating keys that are guaranteed to be satisfied by applying page-level context. "prechecked" did this for keys required in the current scan direction, while "firstmatch" did it for keys required in the opposite-to-scan direction only. The "prechecked" design had a number of notable issues. It didn't account for the fact that an = array scan key's sk_argument field might need to advance at the point of the page precheck (it didn't check the precheck tuple against the key's array, only the key's sk_argument, which needlessly made it ineffective in cases involving stepping to a page having advanced the scan's arrays using a truncated high key). "prechecked" was also completely ineffective when only one scan key wasn't guaranteed to be satisfied by every tuple (it didn't recognize that it was still safe to avoid evaluating other, earlier keys). The "firstmatch" optimization had similar limitations. It could only be applied after _bt_readpage found its first matching tuple, regardless of why any earlier tuples failed to satisfy the scan's index quals. This allowed unsatisfied non-required scan keys to impede the optimization. Replace both optimizations with a new optimization, without any of these limitations: the "startikey" optimization. Affected _bt_readpage calls generate a page-level key offset ("startikey"), that their _bt_checkkeys calls can then start at. This is an offset to the first key that isn't known to be satisfied by every tuple on the page. Although this is independently useful work, its main goal is to avoid performance regressions with index scans that use skip arrays, but still never manage to skip over irrelevant leaf pages. We must avoid wasting CPU cycles on overly granular skip array maintenance in these cases. The new "startikey" optimization helps with this by selectively disabling array maintenance for the duration of a _bt_readpage call. This has no lasting consequences for the scan's array keys (they'll still reliably track the scan's progress through the index's key space whenever the scan is "between pages"). Skip scan adds skip arrays during preprocessing using simple, static rules, and decides how best to navigate/apply the scan's skip arrays dynamically, at runtime. The "startikey" optimization enables this approach. As a result of all this, the planner doesn't need to generate distinct, competing index paths (one path for skip scan, another for an equivalent traditional full index scan). The overall effect is to make scan runtime close to optimal, even when the planner works off an incorrect cardinality estimate. Scans will also perform well given a skipped column with data skew: individual groups of pages with many distinct values (in respect of a skipped column) can be read about as efficiently as before -- without the scan being forced to give up on skipping over other groups of pages that are provably irrelevant. Many scans that cannot possibly skip will still benefit from the use of skip arrays, since they'll allow the "startikey" optimization to be as effective as possible (by allowing preprocessing to mark all the scan's keys as required). A scan that uses a skip array on "a" for a qual "WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now, even when every tuple read by the scan has its own distinct "a" value. However, there are still some remaining regressions, affecting certain trickier cases. Scans whose index quals have several range skip arrays, each on some high cardinality column, can still be slower than they were before the introduction of skip scan -- even with the new "startikey" optimization. There are also known regressions affecting very selective index scans that use a skip array. The underlying issue with such selective scans is that they never get as far as reading a second leaf page, and so will never get a chance to consider applying the "startikey" optimization. In principle, all regressions could be avoided by teaching preprocessing to not add skip arrays whenever they aren't expected to help, but it seems best to err on the side of robust performance. Follow-up to commit 92fe23d9, which added nbtree skip scan. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com
* Add nbtree skip scan optimization.Peter Geoghegan2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Teach nbtree multi-column index scans to opportunistically skip over irrelevant sections of the index given a query with no "=" conditions on one or more prefix index columns. When nbtree is passed input scan keys derived from a predicate "WHERE b = 5", new nbtree preprocessing steps output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys. That is, preprocessing generates a "skip array" (and an output scan key) for the omitted prefix column "a", which makes it safe to mark the scan key on "b" as required to continue the scan. The scan is therefore able to repeatedly reposition itself by applying both the "a" and "b" keys. A skip array has "elements" that are generated procedurally and on demand, but otherwise works just like a regular ScalarArrayOp array. Preprocessing can freely add a skip array before or after any input ScalarArrayOp arrays. Index scans with a skip array decide when and where to reposition the scan using the same approach as any other scan with array keys. This design builds on the design for array advancement and primitive scan scheduling added to Postgres 17 by commit 5bf748b8. Testing has shown that skip scans of an index with a low cardinality skipped prefix column can be multiple orders of magnitude faster than an equivalent full index scan (or sequential scan). In general, the cardinality of the scan's skipped column(s) limits the number of leaf pages that can be skipped over. The core B-Tree operator classes on most discrete types generate their array elements with the help of their own custom skip support routine. This infrastructure gives nbtree a way to generate the next required array element by incrementing (or decrementing) the current array value. It can reduce the number of index descents in cases where the next possible indexable value frequently turns out to be the next value stored in the index. Opclasses that lack a skip support routine fall back on having nbtree "increment" (or "decrement") a skip array's current element by setting the NEXT (or PRIOR) scan key flag, without directly changing the scan key's sk_argument. These sentinel values behave just like any other value from an array -- though they can never locate equal index tuples (they can only locate the next group of index tuples containing the next set of non-sentinel values that the scan's arrays need to advance to). A skip array's range is constrained by "contradictory" inequality keys. For example, a skip array on "x" will only generate the values 1 and 2 given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip array qual usually has near-identical performance characteristics to a comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However, improved performance isn't guaranteed. Much depends on physical index characteristics. B-Tree preprocessing is optimistic about skipping working out: it applies static, generic rules when determining where to generate skip arrays, which assumes that the runtime overhead of maintaining skip arrays will pay for itself -- or lead to only a modest performance loss. As things stand, these assumptions are much too optimistic: skip array maintenance will lead to unacceptable regressions with unsympathetic queries (queries whose scan can't skip over many irrelevant leaf pages). An upcoming commit will address the problems in this area by enhancing _bt_readpage's approach to saving cycles on scan key evaluation, making it work in a way that directly considers the needs of = array keys (particularly = skip array keys). Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Aleksander Alekseev <aleksander@timescale.com> Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru> Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
* Stabilize regression test from c0962a113.Tom Lane2025-04-04
| | | | | | | | Per buildfarm. Co-authored-by: Alena Rybakina <a.rybakina@postgrespro.ru> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/srnuqlttuimzmvoulhsrbgvj4vnul6b65osswvua7sfkqsvmuy@yg7apybpxp34
* Fix autoprewarm neglect of tablespacesMelanie Plageman2025-04-04
| | | | | | | | | | | | | | | While prewarming blocks from a dump file, autoprewarm_database_main() mistakenly ignored tablespace when detecting the beginning of the next relation to prewarm. Because RelFileNumbers are only unique within a tablespace, autoprewarm could miss prewarming blocks from a relation with the same RelFileNumber in a different tablespace. Though this situation is likely rare in practice, it's best to make the code correct. Do so by explicitly checking for the RelFileNumber when detecting a new relation. Reported-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/97c36982-603b-494a-95f4-aaf2a12ac27e%40iki.fi
* Add commit e1a8b1ad58 to .git-blame-ignore-revs.Nathan Bossart2025-04-04
|
* Re-pgindent pg_largeobject.c after commit 0d6c477664.Nathan Bossart2025-04-04
|
* Convert 'x IN (VALUES ...)' to 'x = ANY ...' then appropriateAlexander Korotkov2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | This commit implements the automatic conversion of 'x IN (VALUES ...)' into ScalarArrayOpExpr. That simplifies the query tree, eliminating the appearance of an unnecessary join. Since VALUES describes a relational table, and the value of such a list is a table row, the optimizer will likely face an underestimation problem due to the inability to estimate cardinality through MCV statistics. The cardinality evaluation mechanism can work with the array inclusion check operation. If the array is small enough (< 100 elements), it will perform a statistical evaluation element by element. We perform the transformation in the convert_ANY_sublink_to_join() if VALUES RTE is proper and the transformation is convertible. The conversion is only possible for operations on scalar values, not rows. Also, we currently support the transformation only when it ends up with a constant array. Otherwise, the evaluation of non-hashed SAOP might be slower than the corresponding Hash Join with VALUES. Discussion: https://postgr.es/m/0184212d-1248-4f1f-a42d-f5cb1c1976d2%40tantorlabs.com Author: Alena Rybakina <a.rybakina@postgrespro.ru> Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Ivan Kush <ivan.kush@tantorlabs.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
* Extract make_SAOP_expr() function from match_orclause_to_indexcol()Alexander Korotkov2025-04-04
| | | | | | | | | | | | | | | This commit extracts the code to generate ScalarArrayOpExpr on top of the list of expressions from match_orclause_to_indexcol() into a separate function make_SAOP_expr(). This function was extracted to be used in optimization for conversion of 'x IN (VALUES ...)' to 'x = ANY ...'. make_SAOP_expr() is placed in clauses.c file as only two additional headers were needed there compared with other places. Discussion: https://postgr.es/m/0184212d-1248-4f1f-a42d-f5cb1c1976d2%40tantorlabs.com Author: Alena Rybakina <a.rybakina@postgrespro.ru> Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Ivan Kush <ivan.kush@tantorlabs.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
* Fix crash/valgrind errorPeter Eisentraut2025-04-04
| | | | | | Fix for commit 9ef1851685b: We have to skip indexes where sortopfamily is NULL. This takes the place of the previous btree check. Detected by valgrind on the buildfarm.
* docs: Clarify that NULL arg to set_config() means reset to defaultHeikki Linnakangas2025-04-04
| | | | | | Author: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com> Discussion: https://www.postgresql.org/message-id/CAKFQuwY0SK6JdCci1VJX6xsztRXgGeVEY-grkENZx%2B3CZpyPcQ@mail.gmail.com
* Relax assertion in finding correct GiST parentHeikki Linnakangas2025-04-04
| | | | | | | | | | | | | | | | | | Commit 28d3c2ddcf introduced an assertion that if the memorized downlink location in the insertion stack isn't valid, the parent's LSN should've changed too. Turns out that was too strict. In gistFindCorrectParent(), if we walk right, we update the parent's block number and clear its memorized 'downlinkoffnum'. That triggered the assertion on next call to gistFindCorrectParent(), if the parent needed to be split too. Relax the assertion, so that it's OK if downlinkOffnum is InvalidOffsetNumber. Backpatch to v13-, all supported versions. The assertion was added in commit 28d3c2ddcf in v12. Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://www.postgresql.org/message-id/18396-03cac9beb2f7aac3@postgresql.org
* Allow "COPY table TO" command to copy rows from materialized views.Fujii Masao2025-04-04
| | | | | | | | | | | | | | | | | | | Previously, "COPY table TO" command worked only with plain tables and did not support materialized views, even when they were populated and had physical storage. To copy rows from materialized views, "COPY (query) TO" command had to be used, instead. This commit extends "COPY table TO" to support populated materialized views directly, improving usability and performance, as "COPY table TO" is generally faster than "COPY (query) TO". Note that copying from unpopulated materialized views will still result in an error. Author: jian he <jian.universality@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Vignesh C <vignesh21@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CACJufxHVxnyRYy67hiPePNCPwVBMzhTQ6FaL9_Te5On9udG=yg@mail.gmail.com
* Support non-btree indexes in get_actual_variable_range()Peter Eisentraut2025-04-04
| | | | | | | | | | | | | | This was previously not supported because the btree strategy numbers were hardcoded. Now we can support this for any index that has the required strategy mapping support and the required operators. If an index scan used for get_actual_variable_range() requires recheck, we now just ignore it instead of erroring out. With btree we knew this couldn't happen, but now it might. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* Extend ALTER DEFAULT PRIVILEGES to define default privileges for large objects.Fujii Masao2025-04-04
| | | | | | | | | | | | | | | | | | | | | | | | Previously, ALTER DEFAULT PRIVILEGES did not support large objects. This meant that to grant privileges to users other than the owner, permissions had to be manually assigned each time a large object was created, which was inconvenient. This commit extends ALTER DEFAULT PRIVILEGES to allow defining default access privileges for large objects. With this change, specified privileges will automatically apply to newly created large objects, making privilege management more efficient. As a side effect, this commit introduces the new keyword OBJECTS since it's used in the syntax of ALTER DEFAULT PRIVILEGES. Original patch by Haruka Takatsuka, with some fixes and tests by Yugo Nagata, and rebased by Laurenz Albe. Author: Takatsuka Haruka <harukat@sraoss.co.jp> Co-authored-by: Yugo Nagata <nagata@sraoss.co.jp> Co-authored-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Masao Fujii <masao.fujii@gmail.com> Discussion: https://postgr.es/m/20240424115242.236b499b2bed5b7a27f7a418@sraoss.co.jp
* Use standard die() signal handler in walreceiverHeikki Linnakangas2025-04-04
| | | | | | | | | | | | | | | | | | | | This gets rid of the bespoken ProcessWalRcvInterrupts() function, which lets walreceiver terminate at any CHECK_FOR_INTERRUPTS() call. And it's less code anyway. We can now use the standard libpqsrv_connect_params() libpq wrapper from libpq-be-fe-helpers.h, removing more code. We attempted to do that earlier already in commit 728f86fec6, but that was reverted because it didn't call ProcessWalRcvInterrupts() and therefore didn't react to shutdown requests. Now that ProcessWalRcvInterrupts() is gone, it works. As stated in that commit, this also leads to libpqwalreceiver reserving file descriptors for libpq conncetions, which is nice. Author: Andres Freund <andres@anarazel.de> (the earlier commit) Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Yura Sokolov <y.sokolov@postgrespro.ru>
* Convert PathKey to use CompareTypePeter Eisentraut2025-04-04
| | | | | | | | | | | | | | | Change the PathKey struct to use CompareType to record the sort direction instead of hardcoding btree strategy numbers. The CompareType is then converted to the index-type-specific strategy when the plan is created. This reduces the number of places btree strategy numbers are hardcoded, and it's a self-contained subset of a larger effort to allow non-btree indexes to behave like btrees. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* doc: Clarify the system value for sslrootcertDaniel Gustafsson2025-04-04
| | | | | | | | | | | | | | The documentation for the special value "system" for sslrootcert could be misinterpreted to mean the default operating system CA store, which it may be, but it's defined to be the default CA store of the SSL lib used. Backpatch down to v16 where support for the system value was added. Author: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: George MacKerron <george@mackerron.co.uk> Discussion: https://postgr.es/m/B3CBBAA3-6EA3-4AB7-8619-4BBFAB93DDB4@yesql.se Backpatch-through: 16
* pg_createsubscriber: Improve error messages.Amit Kapila2025-04-04
| | | | | | | | | | | Consistently, an option name is used in the error messages where applicable. Also, change the code to use pg_fatal() instead of a combination of pg_log_error() and exit(). Author: vignesh C <vignesh21@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CALDaNm0HxF1RH27LP7VisLzNsSJbssy8a64M5p6UduDaBq6-ag@mail.gmail.com