path: root/src/backend/access
* Add default_char_signedness field to ControlFileData. (Masahiko Sawada, 2025-02-21)

  The signedness of the 'char' type in C is implementation-dependent. For instance, 'signed char' is used by default on x86 CPUs, while 'unsigned char' is used on AArch64 CPUs. Previously, we accidentally let C implementation signedness affect persistent data. This led to inconsistent results when comparing char data across different platforms.

  This commit introduces a new 'default_char_signedness' field in ControlFileData to store the signedness of the 'char' type. While this change does not encourage the use of 'char' without explicitly specifying its signedness, this field can be used as a hint to ensure consistent behavior for pre-v18 data files that store data sorted by the 'char' type on disk (e.g., GIN and GiST indexes), especially in cross-platform replication scenarios.

  Newly created database clusters unconditionally set the default char signedness to true. pg_upgrade (with an upcoming commit) changes this flag for clusters if the source database cluster has signedness=false. As a result, the signedness=false setting will become rare over time. If we had known about the problem during the last development cycle that forced initdb (v8.3), we would have made all clusters signed or all clusters unsigned. Making pg_upgrade the only source of signedness=false will cause the population of database clusters to converge toward that retrospective ideal.

  Bump catalog version (for the catalog changes) and PG_CONTROL_VERSION (for the additions in ControlFileData).

  Reviewed-by: Noah Misch <noah@leadboat.com>
  Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com

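  Illustrative example (not from the commit): whether plain "char" can hold negative values is implementation-defined, so the byte 0xFF reads back as -1 where char is signed and as 255 where it is unsigned, and any on-disk ordering derived from plain char comparisons differs between such platforms.

      #include <stdio.h>

      int
      main(void)
      {
          /* Plain "char" signedness is implementation-defined. */
          char        c = (char) 0xFF;

          /*
           * Prints -1 where char is signed (typical on x86) and 255 where
           * it is unsigned (typical on AArch64), so sort orders based on
           * plain char are not portable across platforms.
           */
          printf("%d\n", (int) c);
          return 0;
      }
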
* Drop opcintype from index AM strategy translation API (Peter Eisentraut, 2025-02-21)

  The type argument wasn't actually necessary. It was a remnant of converting the API of the gist strategy translation from using opclass to using opfamily+opcintype (commits c09e5a6a016, 622f678c102). For looking up the gist translation function, we used the convention "amproclefttype = amprocrighttype = opclass's opcintype" (see pg_amproc.h). But each operator family should only have one translation function, and getting the right type for the lookup is sometimes cumbersome and fragile, so this is all unnecessarily complicated.

  To simplify this, change the gist strategy support procedure to take "any", "any" as arguments. (This is arbitrary but seems intuitive. The alternative of using InvalidOid as argument(s) upsets various DDL commands, so it's not practical.) Then we don't need opcintype for the lookup, and we can remove it from all the API layers introduced by commit c09e5a6a016.

  This also adds some more documentation about the correct signature of the gist support function and adds more checks in gistvalidate(). This was previously underspecified. (It relied implicitly on the convention mentioned above.)

  Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

* Remove various unnecessary (char *) casts (Peter Eisentraut, 2025-02-20)

  Remove a number of (char *) casts that are unnecessary. Or in some cases, rewrite the code to make the purpose of the cast clearer.

  Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
  Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org

* Fix FATAL message for invalid recovery timeline at beginning of recovery (Michael Paquier, 2025-02-20)

  If the requested recovery timeline is not reachable, the logged checkpoint and timeline should be the values read from the backup_label when it is defined. The message generated used the values from the control file in this case, which is fine when recovering from the control file without a backup_label, but not if there is a backup_label.

  Issue introduced in ee994272ca50. v15 has introduced xlogrecovery.c and more simplifications in this area (4a92a1c3d1c3, a27048cbcb58), making this change a bit simpler to think about, so backpatch only down to this version.

  Author: David Steele <david@pgbackrest.org>
  Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
  Reviewed-by: Benoit Lobréau <benoit.lobreau@dalibo.com>
  Discussion: https://postgr.es/m/c3d617d4-1696-4aa7-8a4d-5a7d19cc5618@pgbackrest.org
  Backpatch-through: 15

* Correct relation size estimate with low fillfactor (Tomas Vondra, 2025-02-19)

  Since commit 29cf61ade3, table_block_relation_estimate_size() considers fillfactor when estimating the number of rows in a relation before the first ANALYZE. The formula, however, did not consider that tuples may be larger than the available space determined by fillfactor, ending up with a density of 0. This ultimately means the relation was estimated to contain a single row.

  The executor, however, places at least one tuple per page, even with very low fillfactor values, so the density should be at least 1. Fixed by clamping the density estimate using clamp_row_est().

  Reported by Heikki Linnakangas. Fix by me, with a regression test inspired by an example provided by Heikki. Backpatch to 17, where the issue was introduced.

  Reported-by: Heikki Linnakangas
  Backpatch-through: 17
  Discussion: https://postgr.es/m/2bf9d973-7789-4937-a7ca-0af9fb49c71e@iki.fi

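  Illustrative sketch of the clamping idea (hypothetical helper, not the actual table_block_relation_estimate_size() code):

      /*
       * Estimate tuples per page for a given fillfactor; the executor
       * places at least one tuple per page, so never report less than 1.
       */
      static double
      tuples_per_page_estimate(double usable_bytes_per_page,
                               double fillfactor, double tuple_width)
      {
          double      space = usable_bytes_per_page * (fillfactor / 100.0);
          double      density = space / tuple_width;

          /* Wide tuples plus a low fillfactor can push this below 1.0. */
          if (density < 1.0)
              density = 1.0;
          return density;
      }
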
* Fix crash in brininsertcleanup during logical replication. (Tom Lane, 2025-02-19)

  Logical replication crashes if the subscriber's partitioned table has a BRIN index. There are two independently blamable causes, and this patch fixes both:

  1. brininsertcleanup fails if called twice for the same IndexInfo, because it half-destroys its BrinInsertState but leaves it still linked from ii_AmCache. brininsert would also fail in that state, so it's pretty hard to see any advantage to this coding. Fully remove the BrinInsertState, instead, so that a new brininsert call would create a new cache.

  2. A logical replication subscriber sometimes does ExecOpenIndices twice on the same ResultRelInfo, followed by doing ExecCloseIndices twice; the second call reaches the brininsertcleanup bug. Quite aside from tickling unexpected cases in aminsertcleanup methods, this seems very wasteful, because the IndexInfos built in the first ExecOpenIndices call are just lost during the second call, and have to be rebuilt at possibly-nontrivial cost. We should establish a coding rule that you don't do that.

  The problematic coding is that when the target table is partitioned, apply_handle_tuple_routing calls ExecFindPartition which does ExecOpenIndices (and expects that ExecCleanupTupleRouting will close the indexes again). Using the ResultRelInfo made by ExecFindPartition, it calls apply_handle_delete_internal or apply_handle_insert_internal, both of which think they need to do ExecOpenIndices/ExecCloseIndices for themselves. They do in the main non-partitioned code paths, but not here. The simplest fix is to pull their ExecOpenIndices/ExecCloseIndices calls out and put them in the call sites for the non-partitioned cases. (We could have refactored apply_handle_update_internal similarly, but I did not do so today because there's no bug there: the partitioned code path doesn't call it.) Also, remove the always-duplicative open/close calls within apply_handle_tuple_routing itself.

  Since brininsertcleanup and indeed the whole aminsertcleanup mechanism are new in v17, there's no observable bug in older branches. A case could be made for trying to avoid these duplicative open/close calls in the older branches, but for now it seems not worth the trouble and risk of new bugs.

  Bug: #18815
  Reported-by: Sergey Belyashov <sergey.belyashov@gmail.com>
  Discussion: https://postgr.es/m/18815-2a0407cc7f40b327@postgresql.org
  Backpatch-through: 17

* Invalidate inactive replication slots. (Amit Kapila, 2025-02-19)

  This commit introduces the idle_replication_slot_timeout GUC, which allows inactive slots to be invalidated at the time of checkpoint. Because checkpoints happen at checkpoint_timeout intervals, there can be some lag between when the idle_replication_slot_timeout was exceeded and when the slot invalidation is triggered at the next checkpoint. To avoid such lags, users can force a checkpoint to promptly invalidate inactive slots.

  Note that the idle timeout invalidation mechanism is not applicable for slots that do not reserve WAL or for slots on the standby server that are synced from the primary server (i.e., standby slots having the 'synced' field set to 'true'). Synced slots are always considered to be inactive because they don't perform logical decoding to produce changes.

  Slots can become inactive for a long period if a subscriber is down due to a system error or is inaccessible because of network issues. If such a situation persists, it might be more practical to recreate the subscriber rather than attempt to recover the node and wait for it to catch up, which could be time-consuming. Also, external tools could create replication slots (e.g., for migrations or upgrades) and fail to remove them if an error occurs, leaving behind unused slots that take up space and resources. Manually cleaning them up can be tedious and error-prone, and without intervention, these lingering slots can cause unnecessary WAL retention and system bloat.

  As the duration of idle_replication_slot_timeout is in minutes, any test using it would be time-consuming. We are planning to commit a follow-up patch for tests by using the injection point framework.

  Author: Nisha Moond <nisha.moond412@gmail.com>
  Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
  Reviewed-by: Peter Smith <smithpb2250@gmail.com>
  Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
  Reviewed-by: Vignesh C <vignesh21@gmail.com>
  Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
  Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
  Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
  Discussion: https://postgr.es/m/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
  Discussion: https://postgr.es/m/OS0PR01MB5716C131A7D80DAE8CB9E88794FC2@OS0PR01MB5716.jpnprd01.prod.outlook.com

* Reduce scope of heap vacuum per_buffer_data (Melanie Plageman, 2025-02-18)

  Move lazy_scan_heap()'s per_buffer_data variable into a tighter scope. In lazy_scan_heap()'s phase I heap vacuuming, the read stream API returns a pointer to the next block number to vacuum. As long as read_stream_next_buffer() returns a valid buffer, per_buffer_data should always be valid.

  Move per_buffer_data into a tighter scope and make sure it is reset to NULL on each iteration so that we get a core dump instead of bogus data from a previous block if something goes wrong in the read stream API.

  Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/626104.1739729538%40sss.pgh.pa.us

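  A rough sketch of the tightened scope (illustrative only; process_block() is a hypothetical stand-in for the real per-block work):

      for (;;)
      {
          void       *per_buffer_data = NULL;    /* reset every iteration */
          Buffer      buf;

          buf = read_stream_next_buffer(stream, &per_buffer_data);
          if (!BufferIsValid(buf))
              break;

          /*
           * per_buffer_data is only meaningful while buf is valid; keeping
           * it scoped to the loop means a stale pointer from a previous
           * block fails loudly instead of silently feeding bogus data.
           */
          process_block(buf, per_buffer_data);
      }
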
* Revert: Get rid of WALBufMappingLock (Alexander Korotkov, 2025-02-17)

  This commit reverts 6a2275b895. A buildfarm failure on batta spotted a concurrency issue, which requires further investigation.

* Add information about WAL buffers full to VACUUM/ANALYZE (VERBOSE) (Michael Paquier, 2025-02-17)

  This commit adds the information about the number of times WAL buffers have been full to the logs generated by VACUUM/ANALYZE (VERBOSE) and in the logs generated by autovacuum, complementing the existing information stored by WalUsage.

  This is the last part of the backend code where the value of wal_buffers_full can be reported, similarly to all the other fields of WalUsage. 320545bfcfee and ce5bcc4a9f26 have done the same for EXPLAIN and pgss.

  Author: Bertrand Drouvot
  Reviewed-by: Ilia Evdokimov
  Discussion: https://postgr.es/m/Z6SOha5YFFgvpwQY@ip-10-97-1-34.eu-west-3.compute.internal

* Move wal_buffers_full from PgStat_PendingWalStats to WalUsage (Michael Paquier, 2025-02-17)

  wal_buffers_full was introduced in pg_stat_wal in 8d9a935965f to provide information useful for tuning the GUC wal_buffers. WalUsage was introduced before that in df3b181499.

  Moving this field is proving to be beneficial for several reasons:
  - This information can now be made available in more layers, providing more granularity than just pg_stat_wal, on a per-query basis: EXPLAIN, pgss and VACUUM/ANALYZE logs.
  - A patch is under discussion to provide statistics for WAL at backend level, and this move simplifies a bit the handling of pending statistics. The remaining data in PgStat_PendingWalStats now relates to write/sync counters and times, with equivalents present in pg_stat_io, which backend statistics are already able to track. So this should cut all the dependencies between PgStat_PendingWalStats and WAL stats at backend level.

  As of this change, wal_buffers_full only shows in pg_stat_wal.

  Author: Bertrand Drouvot
  Reviewed-by: Ilia Evdokimov
  Discussion: https://postgr.es/m/Z6SOha5YFFgvpwQY@ip-10-97-1-34.eu-west-3.compute.internal

* Get rid of WALBufMappingLock (Alexander Korotkov, 2025-02-17)

  Allow multiple backends to initialize WAL buffers concurrently. This way `MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in parallel without taking a single LWLock in exclusive mode.

  The new algorithm works as follows:
  * reserve a page for initialization using XLogCtl->InitializeReserved,
  * ensure the page is written out,
  * once the page is initialized, try to advance XLogCtl->InitializedUpTo and signal to waiters using the XLogCtl->InitializedUpToCondVar condition variable,
  * repeat the previous steps until we reserve initialization up to the target WAL position,
  * wait until concurrent initialization finishes using XLogCtl->InitializedUpToCondVar.

  Now multiple backends can concurrently reserve pages, initialize them, and advance XLogCtl->InitializedUpTo to point to the latest initialized page.

  Author: Yura Sokolov <y.sokolov@postgrespro.ru>
  Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
  Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>

* Add delay time to VACUUM/ANALYZE (VERBOSE) and autovacuum logs. (Nathan Bossart, 2025-02-14)

  Commit bb8dff9995 added this information to the pg_stat_progress_vacuum and pg_stat_progress_analyze system views. This commit adds the same information to the output of VACUUM and ANALYZE with the VERBOSE option and to the autovacuum logs.

  Suggested-by: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
  Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
  Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal

* Use PqMsg_Progress macro in HandleParallelMessage(). (Nathan Bossart, 2025-02-14)

  Commit a99cc6c6b4 introduced the PqMsg_Progress macro but missed updating HandleParallelMessage() accordingly.

  Backpatch-through: 17

* Use streaming read I/O in VACUUM's third phase (Melanie Plageman, 2025-02-14)

  Make vacuum's third phase (its second pass over the heap), which reaps dead items collected in the first phase and marks them as reusable, use the read stream API. This commit adds a new read stream callback, vacuum_reap_lp_read_stream_next(), that looks ahead in the TidStore and returns the next block number to read for vacuum.

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
  Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
  Discussion: https://postgr.es/m/CA%2BhUKGKN3oy0bN_3yv8hd78a4%2BM1tJC9z7mD8%2Bf%2ByA%2BGeoFUwQ%40mail.gmail.com

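  The callback follows the read stream API's block-number callback shape; a simplified sketch (the dead-item iterator details here are hypothetical, not the actual vacuum_reap_lp_read_stream_next() body):

      static BlockNumber
      reap_read_stream_next(ReadStream *stream,
                            void *callback_private_data,
                            void *per_buffer_data)
      {
          ReapState  *state = callback_private_data;  /* hypothetical state */
          BlockNumber blkno;

          /* Look ahead in the dead-item store for the next block to visit. */
          if (!dead_items_next_block(state, &blkno))  /* hypothetical helper */
              return InvalidBlockNumber;              /* ends the stream */

          return blkno;
      }
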
* Use streaming read I/O in VACUUM's first phase (Melanie Plageman, 2025-02-14)

  Make vacuum's first phase, which prunes and freezes tuples and records dead TIDs, use the read stream API by converting heap_vac_scan_next_block() to a read stream callback.

  Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
  Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
  Discussion: https://postgr.es/m/CAAKRu_aLwANZpxHc0tC-6OT0OQT4TftDGkKAO5yigMUOv_Tcsw%40mail.gmail.com

* Convert heap_vac_scan_next_block() boolean parameters to flags (Melanie Plageman, 2025-02-14)

  The read stream API only allows one piece of extra per-block state to be passed back to the API user (per_buffer_data). lazy_scan_heap() needs two pieces of per-buffer data: whether or not the block was all-visible in the visibility map and whether or not it was eagerly scanned. Convert these two pieces of information to flags so that they can be populated by heap_vac_scan_next_block() and returned to lazy_scan_heap().

  A future commit will turn heap_vac_scan_next_block() into the read stream callback for heap phase I vacuuming.

  Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
  Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
  Discussion: https://postgr.es/m/CAAKRu_bmx33jTqATP5GKNFYwAg02a9dDtk4U_ciEjgBHZSVkOQ%40mail.gmail.com

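  Illustrative sketch of packing the two booleans into one flags byte carried through per_buffer_data (the names here are hypothetical, not necessarily those used in vacuumlazy.c):

      #define BLK_ALL_VISIBLE_IN_VM   (1 << 0)
      #define BLK_EAGERLY_SCANNED     (1 << 1)

      uint8       blk_flags = 0;

      if (all_visible_according_to_vm)
          blk_flags |= BLK_ALL_VISIBLE_IN_VM;
      if (was_eager_scanned)
          blk_flags |= BLK_EAGERLY_SCANNED;

      /* The single per-buffer slot now carries both facts. */
      *((uint8 *) per_buffer_data) = blk_flags;
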
* Remove unnecessary (char *) casts [xlog] (Peter Eisentraut, 2025-02-13)

  Remove (char *) casts no longer needed after the XLogRegisterData() and XLogRegisterBufData() argument type change.

  Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
  Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org

* XLogRegisterData, XLogRegisterBufData void * argument for binary data (Peter Eisentraut, 2025-02-13)

  Change the XLogRegisterData() and XLogRegisterBufData() functions to take void * for binary data instead of char *. This will remove the need for numerous casts (done in a separate commit for clarity).

  Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
  Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org

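  The resulting call-site pattern, sketched with a typical WAL record struct:

      /* Before: the parameter was "char *", so callers had to cast. */
      XLogRegisterData((char *) &xlrec, sizeof(xlrec));

      /*
       * After: the parameter is "void *", so the cast can simply go
       * (the actual cast removals are done in the follow-up commits).
       */
      XLogRegisterData(&xlrec, sizeof(xlrec));
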
* Remove unnecessary (char *) casts [mem] (Peter Eisentraut, 2025-02-12)

  Remove (char *) casts around memory functions such as memcmp(), memcpy(), or memset() where the cast is useless. Since these functions don't take char * arguments anyway, these casts are at best complicated casts to (void *), about which see commit 7f798aca1d5.

  Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
  Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org

* Add is_analyze parameter to vacuum_delay_point(). (Nathan Bossart, 2025-02-11)

  This function is used in both vacuum and analyze code paths, and a follow-up commit will require distinguishing between the two. This commit forces callers to specify whether they are in a vacuum or analyze path, but it does not use that information for anything yet.

  Author: Nathan Bossart <nathandbossart@gmail.com>
  Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
  Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal

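  Sketch of the call-site pattern after this change (the flag is only recorded here; the follow-up commit puts it to use):

      /* in a VACUUM code path */
      vacuum_delay_point(false);

      /* in an ANALYZE code path */
      vacuum_delay_point(true);
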
* Eagerly scan all-visible pages to amortize aggressive vacuum (Melanie Plageman, 2025-02-11)

  Aggressive vacuums must scan every unfrozen tuple in order to advance the relfrozenxid/relminmxid. Because data is often vacuumed before it is old enough to require freezing, relations may build up a large backlog of pages that are set all-visible but not all-frozen in the visibility map. When an aggressive vacuum is triggered, all of these pages must be scanned. These pages have often been evicted from shared buffers and even from the kernel buffer cache. Thus, aggressive vacuums often incur large amounts of extra I/O at the expense of foreground workloads.

  To amortize the cost of aggressive vacuums, eagerly scan some all-visible but not all-frozen pages during normal vacuums.

  All-visible pages that are eagerly scanned and set all-frozen in the visibility map are counted as successful eager freezes, and those not frozen are counted as failed eager freezes.

  If too many eager scans fail in a row, eager scanning is temporarily suspended until a later portion of the relation is reached. The number of failures tolerated is configurable globally and per table.

  To effectively amortize aggressive vacuums, we cap the number of successes as well. Capping eager freeze successes also limits the amount of potentially wasted work if these pages are modified again before the next aggressive vacuum. Once we reach the maximum number of blocks successfully eager frozen, eager scanning is disabled for the remainder of the vacuum of the relation.

  Original design idea from Robert Haas, with enhancements from Andres Freund, Tomas Vondra, and me.

  Reviewed-by: Robert Haas <robertmhaas@gmail.com>
  Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
  Reviewed-by: Andres Freund <andres@anarazel.de>
  Reviewed-by: Robert Treat <rob@xzilla.net>
  Reviewed-by: Bilal Yavuz <byavuz81@gmail.com>
  Discussion: https://postgr.es/m/flat/CAAKRu_ZF_KCzZuOrPrOqjGVe8iRVWEAJSpzMgRQs%3D5-v84cXUg%40mail.gmail.com

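  A rough sketch of the success/failure bookkeeping described above (field and helper names are hypothetical, not the actual vacuumlazy.c code):

      if (was_eager_scanned)
      {
          if (page_set_all_frozen)
          {
              /* Cap successes to bound the extra work per normal vacuum. */
              if (++eager_scan_successes >= eager_scan_success_cap)
                  eager_scanning_enabled = false;     /* off for the rest */
          }
          else
          {
              /* Too many failures in a row: back off for a while. */
              if (++eager_scan_failures >= eager_scan_max_failures)
                  suspend_eager_scanning_until_later_region();
          }
      }
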
* Virtual generated columns (Peter Eisentraut, 2025-02-07)

  This adds a new variant of generated columns that are computed on read (like a view, unlike the existing stored generated columns, which are computed on write, like a materialized view).

  The syntax for the column definition is

      ... GENERATED ALWAYS AS (...) VIRTUAL

  and VIRTUAL is also optional. VIRTUAL is the default rather than STORED to match various other SQL products. (The SQL standard makes no specification about this, but it also doesn't know about VIRTUAL or STORED.) (Also, virtual views are the default, rather than materialized views.)

  Virtual generated columns are stored in tuples as null values. (A very early version of this patch had the ambition to not store them at all. But so much stuff breaks or gets confused if you have tuples where a column in the middle is completely missing. This is a compromise, and it still saves space over being forced to use stored generated columns. If we ever find a way to improve this, a bit of pg_upgrade cleverness could allow for upgrades to a newer scheme.)

  The capabilities and restrictions of virtual generated columns are mostly the same as for stored generated columns. In some cases, this patch keeps virtual generated columns more restricted than they might technically need to be, to keep the two kinds consistent. Some of that could maybe be relaxed later after separate careful considerations.

  Some functionality that is currently not supported, but could possibly be added as incremental features, some easier than others:
  - index on or using a virtual column - hence also no unique constraints on virtual columns
  - extended statistics on virtual columns
  - foreign-key constraints on virtual columns
  - not-null constraints on virtual columns (check constraints are supported)
  - ALTER TABLE / DROP EXPRESSION
  - virtual column cannot have domain type
  - virtual columns are not supported in logical replication

  The tests in generated_virtual.sql have been copied over from generated_stored.sql with the keyword replaced. This way we can make sure the behavior is mostly aligned, and the differences can be visible. Some tests for currently not supported features are currently commented out.

  Reviewed-by: Jian He <jian.universality@gmail.com>
  Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
  Tested-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
  Discussion: https://www.postgresql.org/message-id/flat/a368248e-69e4-40be-9c07-6c3b5880b0a6@eisentraut.org

* Introduce autovacuum_vacuum_max_threshold. (Nathan Bossart, 2025-02-05)

  One way autovacuum chooses tables to vacuum is by comparing the number of updated or deleted tuples with a value calculated using autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor. The threshold specifies the base value for comparison, and the scale factor specifies the fraction of the table size to add to it. This strategy ensures that smaller tables are vacuumed after fewer updates/deletes than larger tables, which is reasonable in many cases but can result in infrequent vacuums on very large tables. This is undesirable for a couple of reasons, such as very large tables incurring a huge amount of bloat between vacuums.

  This new parameter provides a way to set a limit on the value calculated with autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor so that very large tables are vacuumed more frequently. By default, it is set to 100,000,000 tuples, but it can be disabled by setting it to -1. It can also be adjusted for individual tables by changing storage parameters.

  Author: Nathan Bossart <nathandbossart@gmail.com>
  Co-authored-by: Frédéric Yhuel <frederic.yhuel@dalibo.com>
  Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
  Reviewed-by: Robert Haas <robertmhaas@gmail.com>
  Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
  Reviewed-by: Michael Banck <mbanck@gmx.net>
  Reviewed-by: Joe Conway <mail@joeconway.com>
  Reviewed-by: Sami Imseih <samimseih@gmail.com>
  Reviewed-by: David Rowley <dgrowleyml@gmail.com>
  Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
  Reviewed-by: Vinícius Abrahão <vinnix.bsd@gmail.com>
  Reviewed-by: Robert Treat <rob@xzilla.net>
  Reviewed-by: Alena Rybakina <a.rybakina@postgrespro.ru>
  Discussion: https://postgr.es/m/956435f8-3b2f-47a6-8756-8c54ded61802%40dalibo.com

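  Sketch of the capped trigger computation (hypothetical helper, not the actual autovacuum.c code). With the default threshold of 50 and scale factor of 0.2, a table of 5 billion rows would otherwise need roughly 1 billion dead tuples before autovacuum triggers; the new cap brings that down to 100 million:

      static float4
      vacuum_trigger_threshold(float4 reltuples, float4 base_thresh,
                               float4 scale_factor, float4 max_thresh)
      {
          float4      thresh = base_thresh + scale_factor * reltuples;

          /* autovacuum_vacuum_max_threshold = -1 disables the cap */
          if (max_thresh >= 0 && thresh > max_thresh)
              thresh = max_thresh;
          return thresh;
      }
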
* Add data for WAL in pg_stat_io and backend statistics (Michael Paquier, 2025-02-04)

  This commit adds WAL IO stats to both the pg_stat_io view and per-backend IO statistics (pg_stat_get_backend_io()). This change is possible since f92c854cf406, as WAL IO is not counted in blocks in some code paths where its stats data is measured (like WAL read in xlogreader.c).

  IOContext gains IOCONTEXT_INIT and IOObject gains IOOBJECT_WAL, with the following combinations allowed:
  - IOOBJECT_WAL/IOCONTEXT_NORMAL is used to track I/O operations done on already-created WAL segments.
  - IOOBJECT_WAL/IOCONTEXT_INIT is used for tracking I/O operations done when initializing WAL segments.

  The core changes are done in pg_stat_io.c; backend statistics inherit them. Backend statistics and pg_stat_io are now available for the WAL writer, the WAL receiver and the WAL summarizer processes.

  I/O timing data is controlled by the GUC track_io_timing, like the existing data of pg_stat_io, for consistency. The timings related to IOOBJECT_WAL show up if the GUC is enabled (disabled by default).

  Bump pgstats file version, due to the additions in IOObject and IOContext, impacting the amount of data written for the fixed-numbered IO stats kind in the pgstats file.

  Author: Nazir Bilal Yavuz
  Reviewed-by: Bertrand Drouvot, Nitin Jadhav, Amit Kapila, Michael Paquier, Melanie Plageman, Bharath Rupireddy
  Discussion: https://postgr.es/m/CAN55FZ3AiQ+ZMxUuXnBpd0Rrh1YhwJ5FudkHg=JU0P+-W8T4Vg@mail.gmail.com

* Integrate GistTranslateCompareType() into IndexAmTranslateCompareType() (Peter Eisentraut, 2025-02-03)

  This turns GistTranslateCompareType() into a callback function of the gist index AM instead of a standalone function. The existing callers are changed to use IndexAmTranslateCompareType(). This then makes that code not hardcoded toward gist.

  This means in particular that the temporal keys code is now independent of gist. Also, this generalizes commit 74edabce7a3, so index access methods other than the previously hardcoded ones could now work as REPLICA IDENTITY in a logical replication subscriber.

  Author: Mark Dilger <mark.dilger@enterprisedb.com>
  Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
  Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

* Fix typo in xlog.c (Michael Paquier, 2025-02-03)

  "recovery" is not a verb. Introduced in 68cb5af46cd8.

* Convert strategies to and from compare types (Peter Eisentraut, 2025-02-02)

  For each index AM, provide a mapping between operator strategies and the system-wide generic concept of a comparison type. For example, for btree, BTLessStrategyNumber maps to and from COMPARE_LT.

  Numerous places in the planner and executor think directly in terms of btree strategy numbers (and a few in terms of hash strategy numbers). These should be converted over subsequent commits to think in terms of CompareType instead. (This commit doesn't make any use of this API yet.)

  Author: Mark Dilger <mark.dilger@enterprisedb.com>
  Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
  Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

* Add get_opfamily_name() function (Peter Eisentraut, 2025-02-01)

  This refactors and simplifies various existing code to make use of the new function.

  Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com>
  Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

* Rename GistTranslateStratnum() to GistTranslateCompareType() (Peter Eisentraut, 2025-02-01)

  Follow-up to commit 630f9a43cec. The previous name had become confusing, because the function doesn't actually translate a strategy number; it translates a CompareType into a strategy number. We might add the inverse at some point, which would then probably be called something like GistTranslateStratnum.

  Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com>
  Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

* Get rid of our dependency on type "long" for memory size calculations. (Tom Lane, 2025-01-31)

  Consistently use "Size" (or size_t, or in some places int64 or double) as the type for variables holding memory allocation sizes. In most places variables' data types were fine already, but we had an ancient habit of computing bytes from kilobytes-units GUCs with code like "work_mem * 1024L". That risks overflow on Win64 where they did not make "long" as wide as "size_t". We worked around that by restricting such GUCs' ranges, so you couldn't set work_mem et al higher than 2GB on Win64.

  This patch removes that restriction, after replacing such calculations with "work_mem * (Size) 1024" or variants of that.

  It should be noted that this patch was constructed by searching outwards from the GUCs that have MAX_KILOBYTES as upper limit. So I can't positively guarantee there are no other places doing memory-size arithmetic in int or long variables. I do however feel pretty confident that increasing MAX_KILOBYTES on Win64 is safe now. Also, nothing in our code should be dealing in multiple-gigabyte allocations without authorization from a relevant GUC, so it seems pretty likely that this search caught everything that could be at risk of overflow.

  Author: Vladlen Popolitov <v.popolitov@postgrespro.ru>
  Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/1a01f0-66ec2d80-3b-68487680@27595217

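  The before/after pattern described above, side by side:

      /* Old: "long" is only 32 bits on Win64, so this could overflow. */
      long        mem_bytes_old = work_mem * 1024L;

      /* New: do the arithmetic in Size (size_t) from the start. */
      Size        mem_bytes = work_mem * (Size) 1024;
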
* Fix grammatical typos around possessive "its" (John Naylor, 2025-01-29)

  Some places spelled it "it's", which is short for "it is". In passing, fix a couple other nearby grammatical errors.

  Author: Jacob Brazeal <jacob.brazeal@gmail.com>
  Discussion: https://postgr.es/m/CA+COZaAO8g1KJCV0T48=CkJMjAnnfTGLWOATz+2aCh40c2Nm+g@mail.gmail.com

* Track per-relation cumulative time spent in [auto]vacuum and [auto]analyze (Michael Paquier, 2025-01-28)

  This commit adds four fields to the statistics of relations, aggregating the amount of time spent for each operation on a relation:
  - total_vacuum_time, for manual vacuum.
  - total_autovacuum_time, for vacuum done by the autovacuum daemon.
  - total_analyze_time, for manual analyze.
  - total_autoanalyze_time, for analyze done by the autovacuum daemon.

  This gives users the option to derive the average time spent for these operations with the help of the related "count" fields.

  Bump catalog version (for the catalog changes) and PGSTAT_FILE_FORMAT_ID (for the additions in PgStat_StatTabEntry).

  Author: Sami Imseih
  Reviewed-by: Bertrand Drouvot, Michael Paquier
  Discussion: https://postgr.es/m/CAA5RZ0uVOGBYmPEeGF2d1B_67tgNjKx_bKDuL+oUftuoz+=Y1g@mail.gmail.com

* At update of non-LP_NORMAL TID, fail instead of corrupting page header. (Noah Misch, 2025-01-25)

  The right mix of DDL and VACUUM could corrupt a catalog page header such that PageIsVerified() durably fails, requiring a restore from backup. This affects only catalogs that both have a syscache and have DDL code that uses syscache tuples to construct updates. One of the test permutations shows a variant not yet fixed.

  This makes !TransactionIdIsValid(TM_FailureData.xmax) possible with TM_Deleted. I think core and PGXN are indifferent to that.

  Per bug #17821 from Alexander Lakhin. Back-patch to v13 (all supported versions). The test case is v17+, since it uses INJECTION_POINT.

  Discussion: https://postgr.es/m/17821-dd8c334263399284@postgresql.org

* Merge copies of converting an XID to a FullTransactionId. (Noah Misch, 2025-01-25)

  Assume twophase.c is the performance-sensitive caller, and preserve its choice of unlikely() branch hint. Add some retrospective rationale for that choice. Back-patch to v17, for the next commit to use it.

  Reviewed (in earlier versions) by Michael Paquier.

  Discussion: https://postgr.es/m/17821-dd8c334263399284@postgresql.org
  Discussion: https://postgr.es/m/20250116010051.f3.nmisch@google.com

* Add some const decorations (htup.h) (Peter Eisentraut, 2025-01-23)

  Discussion: https://www.postgresql.org/message-id/flat/5b558da8-99fb-0a99-83dd-f72f05388517@enterprisedb.com

* Add some more use of Page/PageData rather than char * (Peter Eisentraut, 2025-01-20)

  Discussion: https://www.postgresql.org/message-id/flat/692ee0da-49da-4d32-8dca-da224cc2800e@eisentraut.org

* Fix header check for continuation records where standbys could be stuck (Michael Paquier, 2025-01-20)

  XLogPageRead() checks immediately for an invalid WAL record header on a standby, to be able to handle the case of continuation records that need to be read across two different sources. As written, the check was too generic, applying to any target LSN. Based on an analysis by Kyotaro Horiguchi, what really matters is to make sure that the page header is checked when attempting to read an LSN at the boundary of a segment, to handle the case of a continuation record that spans multiple pages across multiple segments, as WAL receivers, when spawned, request WAL from the beginning of a segment. This fix has been proposed by Kyotaro Horiguchi.

  This could cause standbys to loop infinitely when dealing with a continuation record during a timeline jump, in the case where the contents of the record in the follow-up page are invalid.

  Some regression tests are added to check such scenarios, able to reproduce the original problem. In the test, the contents of a continuation record are overwritten with junk zeros on its follow-up page, and replayed on standbys. This is inspired by 039_end_of_wal.pl, and is enough to show how standbys should react on promotion by not being stuck. Without the fix, the test would fail with a timeout. The test to reproduce the problem has been written by Alexander Kukushkin.

  The original check has been introduced in 066871980183, for a similar problem.

  Author: Kyotaro Horiguchi, Alexander Kukushkin
  Reviewed-by: Michael Paquier
  Discussion: https://postgr.es/m/CAFh8B=mozC+e1wGJq0H=0O65goZju+6ab5AU7DEWCSUA2OtwDg@mail.gmail.com
  Backpatch-through: 13

* Revert recent changes related to handling of 2PC files at recovery (Michael Paquier, 2025-01-17)

  This commit reverts 8f67f994e8ea (down to v13) and c3de0f9eed38 (down to v17), as these are proving to not be completely correct regarding two aspects:
  - In v17 and newer branches, c3de0f9eed38's check for epoch handling is incorrect, and does not correctly handle frozen epochs. A logic closer to widen_snapshot_xid() should be used. The 2PC code should try to integrate deeper with FullTransactionIds, 5a1dfde8334b being not enough.
  - In v13 and newer branches, 8f67f994e8ea is a workaround for the real issue, which is that we should not attempt CLOG lookups without reaching consistency. This exists since 728bd991c3c4, and this is reachable with ProcessTwoPhaseBuffer() called by restoreTwoPhaseData() at the beginning of recovery.

  Per discussion with Noah Misch.

  Discussion: https://postgr.es/m/20250116010051.f3.nmisch@google.com
  Backpatch-through: 13

* Add and use BitmapHeapScanDescData struct (Melanie Plageman, 2025-01-16)

  Move the several members of HeapScanDescData which are specific to Bitmap Heap Scans into a new struct, BitmapHeapScanDescData, which inherits from HeapScanDescData.

  This reduces the size of the HeapScanDescData for other types of scans and will allow us to add additional bitmap heap scan-specific members in the future without fear of bloating the HeapScanDescData.

  Reviewed-by: Tomas Vondra
  Discussion: https://postgr.es/m/c736f6aa-8b35-4e20-9621-62c7c82e2168%40vondra.me

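  The general shape of the new struct, as a simplified sketch (field naming is illustrative and the bitmap-specific members are elided):

      typedef struct BitmapHeapScanDescData
      {
          HeapScanDescData rs_heap_scan;  /* "inherits" from the heap scan
                                           * descriptor: placed first so a
                                           * pointer to one can be used as a
                                           * pointer to the other */

          /* ... bitmap heap scan-specific members ... */
      } BitmapHeapScanDescData;

      typedef struct BitmapHeapScanDescData *BitmapHeapScanDesc;
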
* Fix nbtree contradictory array element comment. (Peter Geoghegan, 2025-01-16)

  Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution.

* Add more general summary to vacuumlazy.c (Melanie Plageman, 2025-01-15)

  Add more comments at the top of vacuumlazy.c on the heap relation vacuuming implementation. Previously vacuumlazy.c only had details related to dead TID storage. This commit adds a more general summary to help future developers understand the heap relation vacuum design and implementation at a high level.

  Reviewed-by: Alena Rybakina, Robert Haas, Andres Freund, Bilal Yavuz
  Discussion: https://postgr.es/m/flat/CAAKRu_ZF_KCzZuOrPrOqjGVe8iRVWEAJSpzMgRQs%3D5-v84cXUg%40mail.gmail.com

* Change gist stratnum function to use CompareType (Peter Eisentraut, 2025-01-15)

  This changes commit 7406ab623fe in that the gist strategy number mapping support function is changed to use the CompareType enum as input, instead of the "well-known" RT*StrategyNumber strategy numbers. This is a bit cleaner, since you are not dealing with two sets of strategy numbers. Also, this will enable us to subsume this system into a more general system of using CompareType to define operator semantics across index methods.

  Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

* Fix potential integer overflow in bringetbitmap() (Michael Paquier, 2025-01-14)

  This function returns an "int64" as result, but it stored the number of pages to add to the index scan bitmap in an "int", multiplying its final result by 10. For a relation large enough, this can theoretically overflow if counting more than (INT32_MAX / 10) pages, knowing that the number of pages is upper-bounded by MaxBlockNumber.

  To avoid the overflow, this commit redefines "totalpages", used to calculate the result, to be an "int64" rather than an "int".

  Reported-by: Evgeniy Gorbanyov
  Author: James Hunter
  Discussion: https://www.postgresql.org/message-id/07704817-6fa0-460c-b1cf-cd18f7647041@basealt.ru
  Backpatch-through: 13

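  Sketch of the overflow and the fix (illustrative, not the verbatim brin.c code):

      static int64
      bringetbitmap_result_sketch(int64 pages_added_to_bitmap)
      {
          /*
           * Previously "int totalpages": totalpages * 10 was then computed
           * in 32-bit arithmetic and could overflow once more than
           * INT32_MAX / 10 pages were counted.
           */
          int64       totalpages = pages_added_to_bitmap;

          return totalpages * 10;
      }
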
* Move nbtree preprocessing into new .c file. (Peter Geoghegan, 2025-01-13)

  Quite a bit of code within nbtutils.c is only called during nbtree preprocessing. Move that code into a new .c file, nbtpreprocesskeys.c. Also reorder some of the functions within the new file for clarity.

  This commit has no functional impact. It is strictly mechanical.

  Author: Peter Geoghegan <pg@bowt.ie>
  Suggested-by: Heikki Linnakangas <hlinnaka@iki.fi>
  Discussion: https://postgr.es/m/CAH2-WznwNn1BDOpWxHBUK1f3Rdw8pO9UCenWXnvT=n9GO8GnLA@mail.gmail.com
  Discussion: https://postgr.es/m/86930045-5df5-494a-b4f1-815bc3fbcce0%40iki.fi

* Add support for NOT ENFORCED in CHECK constraints (Peter Eisentraut, 2025-01-11)

  This adds support for the NOT ENFORCED/ENFORCED flag for constraints, with support for check constraints. The plan is to eventually support this for foreign key constraints, where it is typically more useful.

  Note that CHECK constraints do not currently support ALTER operations, so changing the enforceability of an existing constraint isn't possible without dropping and recreating it. This could be added later.

  Author: Amul Sul <amul.sul@enterprisedb.com>
  Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
  Reviewed-by: jian he <jian.universality@gmail.com>
  Tested-by: Triveni N <triveni.n@enterprisedb.com>
  Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com

* Make verify_compact_attribute available in non-assert builds (David Rowley, 2025-01-11)

  6f3820f37 adjusted the assert-enabled validation of the CompactAttribute to call a new external function to perform the validation. That commit made the function available only when building with USE_ASSERT_CHECKING, and because TupleDescCompactAttr() is a static inline function, the call to verify_compact_attribute() was compiled into any extension which uses TupleDescCompactAttr(). This caused issues when such an assert-enabled extension was loaded into a PostgreSQL build without asserts enabled, because the function was unavailable in core.

  To fix this, make verify_compact_attribute() available unconditionally, but make it do nothing unless building with USE_ASSERT_CHECKING.

  Author: Andrew Kane <andrew@ankane.org>
  Reviewed-by: David Rowley <dgrowleyml@gmail.com>
  Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/CAOdR5yHfMEMW00XGo=v1zCVUS6Huq2UehXdvKnwtXPTcZwXhmg@mail.gmail.com

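  The resulting arrangement, sketched (the parameter list is illustrative):

      /* The symbol exists in every build, so extensions can always link to it... */
      void
      verify_compact_attribute(TupleDesc tupdesc, int attnum)
      {
      #ifdef USE_ASSERT_CHECKING
          /*
           * ...but the cross-checking of the CompactAttribute against the
           * authoritative attribute data only happens in assert builds.
           */
      #endif
      }
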
* Fix obsolete nbtree README left link remarks. (Peter Geoghegan, 2025-01-10)

  Oversight in commit 1bd4bc85, which made nbtree backwards scans operate off of a copy of each page's left link as of the time of its call to _bt_readpage.

* Fix SLRU bank selection code (Álvaro Herrera, 2025-01-09)

  The originally submitted code (using bit masking) was correct when the number of slots was restricted to be a power of two -- but that limitation was removed during development that led to commit 53c2a97a9266, which made the bank selection code incorrect. This led to always using a smaller number of banks than available.

  Change said code to use integer modulo instead, which works correctly with an arbitrary number of banks.

  It's likely that we could improve on this to avoid runtime use of integer division. But with this change we're, at least, not wasting memory on unused banks, and more banks mean less contention, which is likely to have a much higher performance impact than a single instruction's latency.

  Author: Yura Sokolov <y.sokolov@postgrespro.ru>
  Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
  Discussion: https://postgr.es/m/9444dc46-ca47-43ed-9058-89c456316306@postgrespro.ru

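  Before/after sketch of the bank selection, based on the commit description (not the verbatim slru.c code):

      static inline int
      slru_bank_for_page(int64 pageno, int nbanks)
      {
          /*
           * Old approach, correct only when nbanks is a power of two:
           *     return (int) (pageno & (nbanks - 1));
           */
          return (int) (pageno % nbanks);     /* works for any bank count */
      }
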
* Improve nbtree unsatisfiable RowCompare detection. (Peter Geoghegan, 2025-01-07)

  Move nbtree's detection of RowCompare quals that are unsatisfiable due to having a NULL in their first row element: rather than detecting these cases at the point where _bt_first builds its insertion scan key, do so earlier, during preprocessing proper. This brings the RowCompare case in line with every other case involving an unsatisfiable-due-to-NULL qual. nbtree now consistently detects such unsatisfiable quals -- even when they happen to involve a key that isn't examined by _bt_first at all. Affected cases thereby avoid useless full index scans that cannot possibly return any matching rows.

  Author: Peter Geoghegan <pg@bowt.ie>
  Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
  Discussion: https://postgr.es/m/CAH2-WzmySVXst2hFrOATC-zw1Byg1XC-jYUS314=mzuqsNwk+Q@mail.gmail.com