path: root/src/backend/access/transam/xlog.c
Commit message (author, date)
...
* Split XLogCtl->LogwrtResult into separate struct members (Alvaro Herrera, 2024-04-03)
  After this change we have XLogCtl->logWriteResult and ->logFlushResult. There's no functional change, other than the fact that the assignment from shared memory to local is no longer done via struct assignment, but instead using a macro that copies each member separately.
  The current representation is inconvenient going forward; notably, we would like to add a new member "Copy" (to keep track of the last position copied into WAL buffers), so the symmetry between the values in shared memory vs. those in local would be lost. This also gives us freedom to later change the concurrency model for the values in shared memory: we can make them use atomics instead of relying on the info_lck spinlock.
  Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
  Discussion: https://postgr.es/m/202404031119.cd2kugjk2vho@alvherre.pgsql
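The member-wise copy described in this commit can be sketched roughly as follows. This is a simplified illustration, not the code from xlog.c: the names `SharedLogCtl`, `LocalLogwrtResult`, `REFRESH_LOGWRT_RESULT` and `refresh_logwrt_result` are hypothetical, and the locking around the shared values is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the WAL position type */

/* Hypothetical shared-memory side: two separate members, no sub-struct. */
typedef struct SharedLogCtl
{
    XLogRecPtr  logWriteResult;
    XLogRecPtr  logFlushResult;
} SharedLogCtl;

/* Hypothetical backend-local copy, still grouped as a struct. */
typedef struct LocalLogwrtResult
{
    XLogRecPtr  Write;
    XLogRecPtr  Flush;
} LocalLogwrtResult;

/* Copy each member separately instead of using struct assignment. */
#define REFRESH_LOGWRT_RESULT(target, shared) \
    do { \
        (target).Write = (shared)->logWriteResult; \
        (target).Flush = (shared)->logFlushResult; \
    } while (0)

/* Function wrapper so the macro can be exercised easily. */
LocalLogwrtResult
refresh_logwrt_result(const SharedLogCtl *shared)
{
    LocalLogwrtResult local;

    assert(shared != NULL);
    REFRESH_LOGWRT_RESULT(local, shared);
    return local;
}
```

Because each member is copied on its own line, a further member on the shared side (such as the "Copy" position mentioned above) would not force the local struct to mirror it.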
* Add error codes to some PANIC/FATAL errors reports (Daniel Gustafsson, 2024-04-03)
  This adds errcodes to a set of PANIC and FATAL errors in xlog.c and relcache.c, which previously had no errcode at all set, in order to make fleetwide analysis of errorlogs easier. There are many more ereport/elogs left which could benefit from having an errcode but this at least makes a dent in the issue.
  Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
  Discussion: https://postgr.es/m/CAN55FZ1k8LgLEqncPGmz_fWnrobV6bjABOTH4tOWta6xNcPQig@mail.gmail.com
* Implement pg_wal_replay_wait() stored procedure (Alexander Korotkov, 2024-04-02)
  pg_wal_replay_wait() is to be used on standby and specifies waiting for the specific WAL location to be replayed before starting the transaction. This option is useful when the user makes some data changes on primary and needs a guarantee to see these changes on standby.
  The queue of waiters is stored in the shared memory array sorted by LSN. During replay of WAL, waiters whose LSNs are already replayed are deleted from the shared memory array and woken up by setting their latches.
  pg_wal_replay_wait() needs to wait without any snapshot held. Otherwise, the snapshot could prevent the replay of WAL records, implying a kind of self-deadlock. This is why it is only possible to implement pg_wal_replay_wait() as a procedure working in a non-atomic context, not a function.
  Catversion is bumped.
  Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
  Author: Kartyshov Ivan, Alexander Korotkov
  Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
  Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
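The "array sorted by LSN" idea in this commit can be illustrated with a small sketch. Everything here is an assumption made for illustration: the fixed capacity, the `Waiter`/`WaitQueue` types, the function names, and the use of an integer id in place of a latch. Shared-memory allocation and locking are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define MAX_WAITERS 8           /* hypothetical fixed capacity */

typedef struct Waiter
{
    XLogRecPtr  lsn;            /* WAL position this waiter needs replayed */
    int         id;             /* stands in for the waiter's latch */
} Waiter;

typedef struct WaitQueue
{
    int         n;
    Waiter      items[MAX_WAITERS];     /* kept sorted by ascending lsn */
} WaitQueue;

/* Insert a waiter, keeping the array sorted; false if the queue is full. */
bool
wait_queue_add(WaitQueue *q, XLogRecPtr lsn, int id)
{
    int         i;

    assert(q != NULL);
    if (q->n >= MAX_WAITERS)
        return false;
    /* Shift larger entries one slot to the right. */
    for (i = q->n; i > 0 && q->items[i - 1].lsn > lsn; i--)
        q->items[i] = q->items[i - 1];
    q->items[i].lsn = lsn;
    q->items[i].id = id;
    q->n++;
    return true;
}

/*
 * Remove every waiter whose LSN has been replayed and report its id in
 * woken[] (which must have room for MAX_WAITERS entries).  Returns the
 * number of waiters removed.
 */
int
wait_queue_wake(WaitQueue *q, XLogRecPtr replayed, int *woken)
{
    int         k = 0;
    int         i;

    assert(q != NULL && woken != NULL);
    /* Sorted order means all satisfied waiters sit at the front. */
    while (k < q->n && q->items[k].lsn <= replayed)
    {
        woken[k] = q->items[k].id;
        k++;
    }
    for (i = k; i < q->n; i++)
        q->items[i - k] = q->items[i];
    q->n -= k;
    return k;
}
```

Keeping the array sorted means the replay side only has to look at the front of the queue to decide whether anyone needs waking.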
* Remove unused #include's from backend .c files (Peter Eisentraut, 2024-03-04)
  as determined by include-what-you-use (IWYU)
  While IWYU also suggests to *add* a bunch of #include's (which is its main purpose), this patch does not do that. In some cases, a more specific #include replaces another less specific one.
  Some manual adjustments of the automatic result:
  - IWYU currently doesn't know about includes that provide global variable declarations (like -Wmissing-variable-declarations), so those includes are being kept manually.
  - All includes for port(ability) headers are being kept for now, to play it safe.
  - No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch from exploding in size.
  Note that this patch touches just *.c files, so nothing declared in header files changes in hidden ways.
  As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma annotations are added to handle a special case there.
  Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
* Add regression test for restart points during promotion (Michael Paquier, 2024-03-04)
  This test serves as a way to demonstrate how to use the features introduced in 37b369dc67bc, while providing coverage for 7863ee4def65 that caused the startup process to throw "PANIC: could not locate a valid checkpoint record" when starting recovery. The test checks that a node is able to properly restart following a crash when a restart point was finishing across a promotion, with an injection point added in the middle of CreateRestartPoint() to stop the restartpoint in flight. Note that this test fails when 7863ee4def65 is reverted.
  Kyotaro Horiguchi is the original author of this test, that has been originally posted on the thread where 7863ee4def65 was discussed. I have just upgraded and polished it to rely on injection points, making it much cheaper to reproduce the failure.
  This test requires injection points to be enabled in the builds, hence meson and ./configure need an update to pass this knowledge down to the test. The name of the new injection point follows the same naming convention as 6a1ea02c491d. The Makefile's EXTRA_INSTALL of recovery TAP tests is updated to include modules/injection_points.
  Author: Kyotaro Horiguchi, Michael Paquier
  Reviewed-by: Andrey Borodin, Bertrand Drouvot
  Discussion: https://postgr.es/m/ZdLuxBk5hGpol91B@paquier.xyz
* Convert unloggedLSN to an atomic variable. (Nathan Bossart, 2024-02-29)
  Currently, this variable is an XLogRecPtr protected by a spinlock. By converting it to an atomic variable, we can remove the spinlock, which saves a small amount of shared memory space. Since this code is not performance-critical, we use atomic operations with full barrier semantics to make it easy to reason about correctness.
  Author: John Morris
  Reviewed-by: Michael Paquier, Robert Haas, Andres Freund, Stephen Frost, Bharath Rupireddy
  Discussion: https://postgr.es/m/BYAPR13MB26772534335255E50318C574A0409%40BYAPR13MB2677.namprd13.prod.outlook.com
  Discussion: https://postgr.es/m/MN2PR13MB2688FD8B757316CB5C54C8A2A0DDA%40MN2PR13MB2688.namprd13.prod.outlook.com
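A minimal sketch of a spinlock-free 64-bit counter of this kind, using C11 `<stdatomic.h>` as a stand-in for PostgreSQL's own atomics layer. The type `FakeLSNState` and both function names are hypothetical. The default sequentially consistent ordering of `atomic_fetch_add` is used here as a rough analogue of the "full barrier semantics" the commit mentions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical shared state: one 64-bit atomic, no spinlock next to it. */
typedef struct FakeLSNState
{
    _Atomic uint64_t unloggedLSN;
} FakeLSNState;

/* Set the starting value; call once before any concurrent use. */
void
fake_lsn_init(FakeLSNState *state, XLogRecPtr start)
{
    assert(state != NULL);
    atomic_init(&state->unloggedLSN, start);
}

/*
 * Return the current value and advance the counter by one.  The plain
 * atomic_fetch_add() is sequentially consistent, so no extra barriers
 * are needed around it.
 */
XLogRecPtr
fake_lsn_next(FakeLSNState *state)
{
    assert(state != NULL);
    return atomic_fetch_add(&state->unloggedLSN, 1);
}
```

Each caller gets a distinct value even under concurrency, which is the property the spinlock previously provided.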
* Remove superfluous 'pgprocno' field from PGPROC (Heikki Linnakangas, 2024-02-22)
  It was always just the index of the PGPROC entry from the beginning of the proc array. Introduce a macro to compute it from the pointer instead.
  Reviewed-by: Andres Freund
  Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
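Computing an array index from an element pointer, as this commit describes, comes down to pointer subtraction. The sketch below uses hypothetical names (`Proc`, `ProcTable`, `PROC_NUMBER`, `proc_number`) rather than the real PGPROC definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced process entry. */
typedef struct Proc
{
    int         pid;
} Proc;

/* Hypothetical container holding the array of all entries. */
typedef struct ProcTable
{
    Proc       *allProcs;
    size_t      count;
} ProcTable;

/* Index of an entry = distance from the start of the array. */
#define PROC_NUMBER(table, proc) ((int) ((proc) - &(table)->allProcs[0]))

int
proc_number(const ProcTable *table, const Proc *proc)
{
    assert(table != NULL && proc != NULL);
    /* The pointer must point into the array for the result to be valid. */
    assert(proc >= table->allProcs && proc < table->allProcs + table->count);
    return PROC_NUMBER(table, proc);
}
```

Since the index is derivable from the pointer, storing it in every entry is redundant, which is the motivation given above.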
* Add assert to WALReadFromBuffers(). (Jeff Davis, 2024-02-16)
  Per suggestion from Andres.
  Discussion: https://postgr.es/m/20240214025508.6mcblauossthvaw3@awork3.anarazel.de
* Read WAL directly from WAL buffers. (Jeff Davis, 2024-02-12)
  If available, read directly from WAL buffers, avoiding the need to go through the filesystem. Only for physical replication for now, but can be expanded to other callers.
  In preparation for replicating unflushed WAL data.
  Author: Bharath Rupireddy
  Discussion: https://postgr.es/m/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
  Reviewed-by: Andres Freund, Alvaro Herrera, Nathan Bossart, Dilip Kumar, Nitin Jadhav, Melih Mutlu, Kyotaro Horiguchi
* Update copyright for 2024 (Bruce Momjian, 2024-01-03)
  Reported-by: Michael Paquier
  Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz
  Backpatch-through: 12
* Fix incorrect data type choices in some read and write calls. (Tom Lane, 2023-12-27)
  Recently-introduced code in reconstruct.c was using "unsigned" to store the result of read(), pg_pread(), or write(). This is completely bogus: it breaks subsequent tests for the result being negative, as we're being reminded of by a chorus of buildfarm warnings. Switch to "int" as was doubtless intended. (There are several other uses of "unsigned" in this file that also look poorly chosen to me, but for now I'm just trying to clean up the buildfarm.)
  A larger problem is that "int" is not necessarily wide enough to hold the result: per POSIX, all these functions return ssize_t. In places where the requested read or write length clearly fits in int, that's academic. It may be academic anyway as long as we constrain individual data files to 1GB, since even a readv or writev-like operation would then not be responsible for transferring more than 1GB. Nonetheless it seems like trouble waiting to happen, so I made a pass over readv and writev calls and fixed the result variables where that seemed appropriate. We might want to think about changing some of the fd.c functions to return ssize_t too, for future-proofing; but I didn't tackle that here.
  Discussion: https://postgr.es/m/1672202.1703441340@sss.pgh.pa.us
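The point about result types can be shown with a small POSIX sketch. The helper `read_exact` is hypothetical and not taken from the PostgreSQL tree; it only illustrates why the result of read() belongs in an `ssize_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes from fd in a single call.
 * Returns 0 if exactly len bytes were read, 1 on a short read or EOF,
 * and -1 on error.
 */
int
read_exact(int fd, void *buf, size_t len)
{
    ssize_t     rc;             /* signed, and the type read() returns */

    assert(buf != NULL);
    rc = read(fd, buf, len);

    /*
     * With "unsigned rc" this comparison could never be true, because
     * the -1 error return would have wrapped to a large positive value.
     */
    if (rc < 0)
        return -1;

    return ((size_t) rc == len) ? 0 : 1;
}
```

Calling it with an invalid descriptor such as -1 takes the error branch, which is exactly the branch an unsigned result variable would silently disable.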
* Add a new WAL summarizer process. (Robert Haas, 2023-12-20)
  When active, this process writes WAL summary files to $PGDATA/pg_wal/summaries. Each summary file contains information for a certain range of LSNs on a certain TLI. For each relation, it stores a "limit block" which is 0 if a relation is created or destroyed within a certain range of WAL records, or otherwise the shortest length to which the relation was truncated during that range of WAL records, or otherwise InvalidBlockNumber. In addition, it stores a list of blocks which have been modified during that range of WAL records, but excluding blocks which were removed by truncation after they were modified and never subsequently modified again.
  In other words, it tells us which blocks need to be copied in case of an incremental backup covering that range of WAL records. But this doesn't yet add the capability to actually perform an incremental backup; the next patch will do that.
  A new parameter summarize_wal enables or disables this new background process. The background process also automatically deletes summary files that are older than wal_summarize_keep_time, if that parameter has a non-zero value and the summarizer is configured to run.
  Patch by me, with some design help from Dilip Kumar and Andres Freund. Reviewed by Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut, and Álvaro Herrera.
  Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
* Additional write barrier in AdvanceXLInsertBuffer(). (Jeff Davis, 2023-12-19)
  First, mark the xlblocks member with InvalidXLogRecPtr, then issue a write barrier, then initialize it. That ensures that the xlblocks member doesn't appear valid while the contents are being initialized.
  In preparation for reading WAL buffer contents without a lock.
  Author: Bharath Rupireddy
  Discussion: https://postgr.es/m/CALj2ACVfFMfqD5oLzZSQQZWfXiJqd-NdX0_317veP6FuB31QWA@mail.gmail.com
  Reviewed-by: Andres Freund
* Use 64-bit atomics for xlblocks array elements. (Jeff Davis, 2023-12-19)
  In preparation for reading the contents of WAL buffers without a lock. Also, avoids the previously-needed comment in GetXLogBuffer() explaining why it's safe from torn reads.
  Author: Bharath Rupireddy
  Discussion: https://postgr.es/m/CALj2ACVfFMfqD5oLzZSQQZWfXiJqd-NdX0_317veP6FuB31QWA@mail.gmail.com
  Reviewed-by: Andres Freund
* Remove trace_recovery_messages (Michael Paquier, 2023-12-11)
  This GUC was intended as a debugging help in the 9.0 era when hot standby and streaming replication were being developed, able to offer more information at LOG level rather than DEBUGn. There are more tools available these days that are able to offer rather equivalent information, like pg_waldump introduced in 9.3. It is not obvious how this facility is useful these days, so let's remove it.
  Author: Bharath Rupireddy
  Discussion: https://postgr.es/m/ZXEXEAUVFrvpquSd@paquier.xyz
* Rename ShmemVariableCache to TransamVariables (Heikki Linnakangas, 2023-12-08)
  The old name was misleading: It's not a cache, the values kept in the struct are the authoritative source.
  Reviewed-by: Tristan Partin, Richard Guo
  Discussion: https://www.postgresql.org/message-id/6537d63d-4bb5-46f8-9b5d-73a8ba4720ab@iki.fi
* Fix compilation on Windows with WAL_DEBUG (Michael Paquier, 2023-12-06)
  This has been broken since b060dbe0001a that has reworked the callback mechanism of XLogReader, most likely unnoticed because any form of development involving WAL happens on platforms where this compiles fine.
  Author: Bharath Rupireddy
  Discussion: https://postgr.es/m/CALj2ACVF14WKQMFwcJ=3okVDhiXpuK5f7YdT+BdYXbbypMHqWA@mail.gmail.com
  Backpatch-through: 13
* Apply quotes more consistently to GUC names in logs (Michael Paquier, 2023-11-30)
  Quotes are applied to GUCs in a very inconsistent way across the code base, with a mix of double quotes or no quotes used. This commit removes double quotes around all the GUC names that are obviously referred to as parameters with non-English words (use of underscore, mixed case, etc).
  This is the result of a discussion with Álvaro Herrera, Nathan Bossart, Laurenz Albe, Peter Eisentraut, Tom Lane and Daniel Gustafsson.
  Author: Peter Smith
  Discussion: https://postgr.es/m/CAHut+Pv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w@mail.gmail.com
* Reduce rate of walwriter wakeups due to async commits. (Heikki Linnakangas, 2023-11-27)
  XLogSetAsyncXactLSN(), called at asynchronous commit, would wake up walwriter every time the LSN advances, but walwriter doesn't actually do anything unless it has at least 'wal_writer_flush_after' full blocks of WAL to write. Repeatedly waking up walwriter to do nothing is a waste of CPU cycles in both walwriter and the backends doing the wakeups. To fix, apply the same logic in XLogSetAsyncXactLSN() to decide whether to wake up walwriter, as walwriter uses to determine if it has any work to do.
  In passing, rename the misleadingly named 'flushbytes' local variable to 'flushblocks'.
  Author: Andres Freund, Heikki Linnakangas
  Discussion: https://www.postgresql.org/message-id/20231024230929.vsc342baqs7kmbte@awork3.anarazel.de
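The wakeup decision described here can be sketched as a pure function. This is a simplification under stated assumptions: the 8192-byte block size, the function name `should_wake_walwriter`, and its parameters are all hypothetical, and any additional conditions in the real code (such as time-based flushing) are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define WAL_BLOCK_SIZE 8192     /* assumed WAL block size in bytes */

/*
 * Decide whether waking walwriter is worthwhile: only if the number of
 * full WAL blocks between the flushed position and the requested write
 * position has reached the flush-after threshold (counted in blocks).
 */
bool
should_wake_walwriter(XLogRecPtr write_request, XLogRecPtr flushed,
                      int flush_after_blocks)
{
    uint64_t    flushblocks;

    assert(write_request >= flushed);
    assert(flush_after_blocks >= 0);

    /* Count whole blocks, not bytes, hence the name "flushblocks". */
    flushblocks = write_request / WAL_BLOCK_SIZE - flushed / WAL_BLOCK_SIZE;

    return flushblocks >= (uint64_t) flush_after_blocks;
}
```

With a threshold of 128 blocks, a backend that has only advanced the position by a few hundred bytes would skip the wakeup, avoiding the wasted cycles the commit message describes.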
* C comment: clarify that WAL files can be _recycled_ or removed (Bruce Momjian, 2023-11-25)
  Reported-by: Michael Paquier
  Discussion: https://postgr.es/m/CAB7nPqSDdF0heotQU3gsepgqx+9c+6KjLd3R6aNYH7KKfDd2ig@mail.gmail.com
  Author: Michael Paquier
  Backpatch-through: master
* Prohibit max_slot_wal_keep_size to value other than -1 during upgrade. (Amit Kapila, 2023-11-10)
  We don't want existing slots in the old cluster to get invalidated during the upgrade. During an upgrade, we set this variable to -1 via the command line in an attempt to prevent such invalidations, but users have ways to override it. This patch ensures the value is not overridden by the user.
  Author: Kyotaro Horiguchi
  Reviewed-by: Peter Smith, Alvaro Herrera
  Discussion: http://postgr.es/m/20231027.115759.2206827438943188717.horikyota.ntt@gmail.com
* Introduce pg_stat_checkpointer (Michael Paquier, 2023-10-30)
  Historically, the statistics of the checkpointer have been always part of pg_stat_bgwriter. This commit removes a few columns from pg_stat_bgwriter, and introduces pg_stat_checkpointer with equivalent, renamed columns (plus a new one for the reset timestamp):
  - checkpoints_timed -> num_timed
  - checkpoints_req -> num_requested
  - checkpoint_write_time -> write_time
  - checkpoint_sync_time -> sync_time
  - buffers_checkpoint -> buffers_written
  The fields of PgStat_CheckpointerStats and its SQL functions are renamed to match with the new field names, for consistency. Note that background writer and checkpointer have been split into two different processes in commits 806a2aee3791 and bf405ba8e460. The pgstat structures were already split, making this change straight-forward.
  Bump catalog version.
  Author: Bharath Rupireddy
  Reviewed-by: Bertrand Drouvot, Andres Freund, Michael Paquier
  Discussion: https://postgr.es/m/CALj2ACVxX2ii=66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg@mail.gmail.com
* Change struct tablespaceinfo's oid member from 'char *' to 'Oid' (Robert Haas, 2023-10-23)
  This shouldn't change behavior except in the unusual case where there are files in the tablespace directory that have entirely numeric names but are nevertheless not possible names for a tablespace directory, either because their names have leading zeroes that shouldn't be there, or the value is actually zero, or because the value is too large to represent as an OID.
  In those cases, the directory would previously have made it into the list of tablespaceinfo objects and no longer will. Thus, base backups will now ignore such directories, instead of treating them as legitimate tablespace directories. Similarly, if entries for such tablespaces occur in a tablespace_map file, they will now be rejected as erroneous, instead of being honored.
  This is infrastructure for future work that wants to be able to know the tablespace of each relation that is part of a backup *as an OID*. By strengthening the up-front validation, we don't have to worry about weird cases later, and can more easily avoid repeated string->integer conversions.
  Patch by me, reviewed by David Steele.
  Discussion: http://postgr.es/m/CA+TgmoZNVeBzoqDL8xvr-nkaepq815jtDR4nJzPew7=3iEuM1g@mail.gmail.com
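The validation rules listed in this commit (entirely numeric, no leading zeroes, non-zero, representable as an OID) can be sketched as a standalone parser. The function name `parse_tablespace_oid` is hypothetical, and a 32-bit unsigned OID is assumed; this is not the code used by the backup machinery.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a directory name as a tablespace OID.  Returns true and stores
 * the value in *oid only if the name is a plausible OID spelling.
 */
bool
parse_tablespace_oid(const char *name, uint32_t *oid)
{
    const char *p;
    unsigned long long value;

    assert(name != NULL && oid != NULL);

    /*
     * Reject empty names and anything starting with '0'.  That covers
     * both the value zero and names with leading zeroes.
     */
    if (name[0] == '\0' || name[0] == '0')
        return false;

    /* Every character must be a decimal digit. */
    for (p = name; *p != '\0'; p++)
    {
        if (!isdigit((unsigned char) *p))
            return false;
    }

    /* Reject values too large to represent as a 32-bit OID. */
    errno = 0;
    value = strtoull(name, NULL, 10);
    if (errno == ERANGE || value > UINT32_MAX)
        return false;

    *oid = (uint32_t) value;
    return true;
}
```

Doing this check once, up front, is what lets later code carry the tablespace around as a number instead of re-parsing a string.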
* During online checkpoints, insert XLOG_CHECKPOINT_REDO at redo point. (Robert Haas, 2023-10-19)
  This allows tools that read the WAL sequentially to identify (possible) redo points when they're reached, rather than only being able to detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly much later in the WAL stream. There are other possible applications as well; see the discussion links below.
  Any redo location that precedes the checkpoint location should now point to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this.
  While adjusting the code in CreateCheckPoint() for this patch, I made it call WALInsertLockAcquireExclusive a bit later than before, since there appears to be no need for it to be held while checking whether the system is idle, whether this is an end-of-recovery checkpoint, or what the current timeline is.
  Bump XLOG_PAGE_MAGIC.
  Patch by me, based in part on earlier work from Dilip Kumar. Review by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier.
  Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
  Discussion: http://postgr.es/m/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de
  Discussion: http://postgr.es/m/CA+hUKG+b2ego8=YNW2Ohe9QmSiReh1-ogrv8V_WZpJTqP3O+2w@mail.gmail.com
* Improve the naming in wal_sync_method code. (Nathan Bossart, 2023-10-13)
  * sync_method is renamed to wal_sync_method.
  * sync_method_options[] is renamed to wal_sync_method_options[].
  * assign_xlog_sync_method() is renamed to assign_wal_sync_method().
  * The names of the available synchronization methods are now prefixed with "WAL_SYNC_METHOD_" and have been moved into a WalSyncMethod enum.
  * PLATFORM_DEFAULT_SYNC_METHOD is renamed to PLATFORM_DEFAULT_WAL_SYNC_METHOD, and DEFAULT_SYNC_METHOD is renamed to DEFAULT_WAL_SYNC_METHOD.
  These more descriptive names help distinguish the code for wal_sync_method from the code for DataDirSyncMethod (e.g., the recovery_init_sync_method configuration parameter and the --sync-method option provided by several frontend utilities). This change also prevents name collisions between the aforementioned sets of code. Since this only improves the naming of internal identifiers, there should be no behavior change.
  Author: Maxim Orlov
  Discussion: https://postgr.es/m/CACG%3DezbL1gwE7_K7sr9uqaCGkWhmvRTcTEnm3%2BX1xsRNwbXULQ%40mail.gmail.com
* Add wait events for checkpoint delay mechanism. (Thomas Munro, 2023-10-13)
  When MyProc->delayChkptFlags is set to temporarily block phase transitions in a concurrent checkpoint, the checkpointer enters a sleep-poll loop to wait for the flag to be cleared. We should show that as a wait event in the pg_stat_activity view.
  Reviewed-by: Robert Haas <robertmhaas@gmail.com>
  Reviewed-by: Michael Paquier <michael@paquier.xyz>
  Discussion: https://postgr.es/m/CA%2BhUKGL7Whi8iwKbzkbn_1fixH3Yy8aAPz7mfq6Hpj7FeJrKMg%40mail.gmail.com
* Unify two isLogSwitch tests in XLogInsertRecord. (Robert Haas, 2023-10-12)
  An upcoming patch wants to introduce an additional special case in this function. To keep that as cheap as possible, minimize the amount of branching that we do based on whether this is an XLOG_SWITCH record. Additionally, and also in the interest of keeping the overhead of special-case code paths as low as possible, apply likely() to the non-XLOG_SWITCH case, since only a very tiny fraction of WAL records will be XLOG_SWITCH records.
  Patch by me, reviewed by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier.
  Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
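A branch hint of the kind mentioned above is commonly built on the GCC/Clang builtin `__builtin_expect`. The sketch below defines its own `likely()` with a portable fallback; the `InsertPath` enum and `choose_insert_path` function are hypothetical and only show a single test on the log-switch flag with the common case marked as expected.

```c
#include <assert.h>
#include <stdbool.h>

/* Branch hint: tell the compiler the condition is usually true. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x) __builtin_expect((x) != 0, 1)
#else
#define likely(x) ((x) != 0)
#endif

typedef enum InsertPath
{
    INSERT_PATH_NORMAL,         /* ordinary WAL record */
    INSERT_PATH_LOG_SWITCH      /* rare segment-switch record */
} InsertPath;

/* One test decides the path; the common case is marked as likely. */
InsertPath
choose_insert_path(bool is_log_switch)
{
    if (likely(!is_log_switch))
        return INSERT_PATH_NORMAL;

    return INSERT_PATH_LOG_SWITCH;
}
```

The hint does not change behavior; it only lets the compiler lay out the code so that the frequent path falls through without a taken branch.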
* Rename variable for code clarity (Daniel Gustafsson, 2023-09-15)
  When tracking IO timing for WAL, the duration is what we calculate based on the start and end timestamps, it's not what the variable contains. Rename the timestamp variable to end to better communicate what it contains. Original patch by Krishnakumar with additional hacking to fix another occurrence by me.
  Author: Krishnakumar R <kksrcv001@gmail.com>
  Discussion: https://postgr.es/m/CAPMWgZ9f9o8awrQpjo8oxnNQ=bMDVPx00NE0QcDzvHD_ZrdLPw@mail.gmail.com
* Quote filenames in error messages (Daniel Gustafsson, 2023-09-14)
  The majority of all filenames are quoted in user facing error and log messages, but a few were still printed without quotes. While these filenames do not risk causing any ambiguity as their format is strict, quote them anyway to be consistent across all logs.
  Also concatenate a message to keep it one line to make it easier to grep for in the code.
  Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
  Reviewed-by: Michael Paquier <michael@paquier.xyz>
  Discussion: https://postgr.es/m/080EEABE-6645-4A46-AB20-6285ADAC44FE@yesql.se
* Flush logical slots to disk during a shutdown checkpoint if required. (Amit Kapila, 2023-09-14)
  It's entirely possible for a logical slot to have a confirmed_flush LSN higher than the last value saved on disk while not being marked as dirty. Currently, it is not a major problem but a later patch adding support for the upgrade of slots relies on that value being properly flushed to disk.
  It can also help avoid processing the same transactions again in some boundary cases after the clean shutdown and restart. Say, we process some transactions for which we didn't send anything downstream (the changes got filtered) but the confirm_flush LSN is updated due to keepalives. As we don't flush the latest value of confirm_flush LSN, it may lead to processing the same changes again without this patch.
  The approach taken by this patch has been suggested by Ashutosh Bapat.
  Author: Vignesh C, Julien Rouhaud, Kuroda Hayato
  Reviewed-by: Amit Kapila, Dilip Kumar, Michael Paquier, Ashutosh Bapat, Peter Smith, Hou Zhijie
  Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
  Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
* Make error messages about WAL segment size more consistent (Peter Eisentraut, 2023-08-28)
  Make the primary messages more compact and make the detail messages uniform. In initdb.c and pg_resetwal.c, use the newish option_parse_int() to simplify some of the option parsing. For the backend GUC wal_segment_size, add a GUC check hook to do the verification instead of coding it in bootstrap.c. This might be overkill, but that way the check is in the right place and it becomes more self-documenting.
  In passing, make pg_controldata use the logging API for warning messages.
  Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
  Discussion: https://www.postgresql.org/message-id/flat/9939aa8a-d7be-da2c-7715-0a0b5535a1f7@eisentraut.org
* Document more assumptions of LWLock variable changes with WAL inserts (Michael Paquier, 2023-07-26)
  This commit adds a few comments about what LWLockWaitForVar() relies on when a backend waits for a variable update on its LWLocks for WAL insertions up to an expected LSN.
  First, LWLockWaitForVar() does not include a memory barrier, relying on a spinlock taken at the beginning of WaitXLogInsertionsToFinish(). This was hidden behind two layers of routines in lwlock.c. This assumption is now documented at the top of LWLockWaitForVar(), and detailed a bit more within LWLockConflictsWithVar().
  Second, document why WaitXLogInsertionsToFinish() does not include memory barriers, relying on a spinlock at its top, which is, per Andres' input, fine for two different reasons, both depending on the fact that the caller of WaitXLogInsertionsToFinish() is waiting for a LSN up to a certain value.
  This area's documentation and assumptions could be improved more in the future, but at least that's a beginning.
  Author: Bharath Rupireddy, Andres Freund
  Reviewed-by: Michael Paquier
  Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
* Optimize WAL insertion lock acquisition and release with some atomics (Michael Paquier, 2023-07-25)
  The WAL insertion lock variable insertingAt is currently being read and written with the help of the LWLock wait list lock to avoid any read of torn values. This wait list lock can become a point of contention on highly concurrent write workloads.
  This commit switches insertingAt to a 64b atomic variable that provides torn-free reads/writes. On platforms without 64b atomic support, the fallback implementation uses spinlocks to provide the same guarantees for the values read. LWLockWaitForVar(), through LWLockConflictsWithVar(), reads the new value to check if it still needs to wait with a u64 atomic operation. LWLockUpdateVar() updates the variable before waking up the waiters with an exchange_u64 (full memory barrier). LWLockReleaseClearVar() now uses also an exchange_u64 to reset the variable. Before this commit, all these steps relied on LWLockWaitListLock() and LWLockWaitListUnlock().
  This reduces contention on LWLock wait list lock and improves performance of highly-concurrent write workloads. Here are some numbers using pg_logical_emit_message() (HEAD at d6677b93) with various arbitrary record lengths and clients up to 1k on a rather-large machine (64 vCPUs, 512GB of RAM, 16 cores per socket, 2 sockets), in terms of TPS numbers coming from pgbench:
   message_size_b     |     16 |     64 |    256 |   1024
  --------------------+--------+--------+--------+--------
   patch_4_clients    |  83830 |  82929 |  80478 |  73131
   patch_16_clients   | 267655 | 264973 | 250566 | 213985
   patch_64_clients   | 380423 | 378318 | 356907 | 294248
   patch_256_clients  | 360915 | 354436 | 326209 | 263664
   patch_512_clients  | 332654 | 321199 | 287521 | 240128
   patch_1024_clients | 288263 | 276614 | 258220 | 217063
   patch_2048_clients | 252280 | 243558 | 230062 | 192429
   patch_4096_clients | 212566 | 213654 | 205951 | 166955
   head_4_clients     |  83686 |  83766 |  81233 |  73749
   head_16_clients    | 266503 | 265546 | 249261 | 213645
   head_64_clients    | 366122 | 363462 | 341078 | 261707
   head_256_clients   | 132600 | 132573 | 134392 | 165799
   head_512_clients   | 118937 | 114332 | 116860 | 150672
   head_1024_clients  | 133546 | 115256 | 125236 | 151390
   head_2048_clients  | 137877 | 117802 | 120909 | 138165
   head_4096_clients  | 113440 | 115611 | 120635 | 114361
  Bharath has been measuring similar improvements, where the limit of the WAL insertion lock begins to be felt when more than 256 concurrent clients are involved in this specific workload.
  An extra patch has been discussed to introduce a fast-exit path in LWLockUpdateVar() when there are no waiters, still this does not influence the write-heavy workload cases discussed as there are always waiters. This will be considered separately.
  Author: Bharath Rupireddy
  Reviewed-by: Nathan Bossart, Andres Freund, Michael Paquier
  Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
* Enable archiving in recovery TAP test 009_twophase.pl (Michael Paquier, 2023-06-20)
  This is a follow-up of f663b00, that has been committed to v13 and v14, tweaking the TAP test for two-phase transactions so as it provides coverage for the bug that has been fixed. This change is done in its own commit for clarity, as v15 and HEAD did not show the problematic behavior, still missed coverage for it.
  While on it, this adds a comment about the dependency of the last partial segment rename and RecoverPreparedTransactions() at the end of recovery, as that can be easy to miss.
  Author: Michael Paquier
  Reviewed-by: Kyotaro Horiguchi
  Discussion: https://postgr.es/m/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at
  Backpatch-through: 13
* Pre-beta mechanical code beautification. (Tom Lane, 2023-05-19)
  Run pgindent, pgperltidy, and reformat-dat-files.
  This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code.
  Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
* Prevent underflow in KeepLogSeg(). (Nathan Bossart, 2023-04-27)
  The call to XLogGetReplicationSlotMinimumLSN() might return a greater LSN than the one given to the function. Subsequent segment number calculations might then underflow, which could result in unexpected behavior when removing or recycling WAL files. This was introduced with max_slot_wal_keep_size in c655077639. To fix, skip the block of code for replication slots if the LSN is greater.
  Reported-by: Xu Xingwang
  Author: Kyotaro Horiguchi
  Reviewed-by: Junwang Zhao
  Discussion: https://postgr.es/m/17903-4288d439dee856c6%40postgresql.org
  Backpatch-through: 13
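The underflow and its fix can be sketched with unsigned 64-bit segment numbers. This is a simplified model, not KeepLogSeg() itself: the function `oldest_segment_to_keep`, its parameters, and the use of 0 to mean "no slot" are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint64_t XLogSegNo;

/*
 * Compute the oldest WAL segment that must be kept.
 *
 * recptr         current WAL position
 * slot_min_lsn   oldest position needed by any slot (0 = no slot)
 * segment_size   WAL segment size in bytes
 * max_keep_segs  cap on how many segments slots may hold back
 */
XLogSegNo
oldest_segment_to_keep(XLogRecPtr recptr, XLogRecPtr slot_min_lsn,
                       uint64_t segment_size, uint64_t max_keep_segs)
{
    XLogSegNo   curr_segno;
    XLogSegNo   segno;

    assert(segment_size > 0);

    curr_segno = recptr / segment_size;
    segno = curr_segno;

    /*
     * Only apply the slot logic when the slot position is behind recptr.
     * Without the "<" test, a slot position ahead of recptr would make
     * "curr_segno - segno" wrap around to a huge unsigned value.
     */
    if (slot_min_lsn != 0 && slot_min_lsn < recptr)
    {
        segno = slot_min_lsn / segment_size;

        /* Safe subtraction: segno <= curr_segno holds here. */
        if (curr_segno - segno > max_keep_segs)
            segno = curr_segno - max_keep_segs;
    }

    return segno;
}
```

With 16 MB segments, a current position in segment 10 and a slot position in segment 12, the guard makes the function return 10 instead of computing `10 - 12` in unsigned arithmetic.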
* Fix various typos and incorrect/outdated name references (David Rowley, 2023-04-19)
  Author: Alexander Lakhin
  Discussion: https://postgr.es/m/699beab4-a6ca-92c9-f152-f559caf6dc25@gmail.com
* Fix pg_basebackup with in-place tablespaces some more. (Robert Haas, 2023-04-18)
  Commit c6f2f01611d4f2c412e92eb7893f76fa590818e8 purported to make this work, but problems remained. In a plain-format backup, the files from an in-place tablespace got included in the tar file for the main tablespace, which is wrong but it's not clear that it has any user-visible consequences. In a tar-format backup, the TABLESPACE_MAP option is used, and so we never iterated over pg_tblspc and thus never backed up the in-place tablespaces anywhere at all.
  To fix this, reverse the changes in that commit, so that when we scan pg_tblspc during a backup, we create tablespaceinfo objects even for in-place tablespaces. We set the field that would normally contain the absolute pathname to the relative path pg_tblspc/${TSOID}, and that's good enough to make basebackup.c happy without any further changes.
  However, pg_basebackup needs a couple of adjustments to make it work. First, it needs to understand that a relative path for a tablespace means it's an in-place tablespace. Second, it needs to tolerate the situation where restoring the main tablespace tries to create pg_tblspc or a subdirectory and finds that it already exists, because we restore user-defined tablespaces before the main tablespace.
  Since in-place tablespaces are only intended for use in development and testing, no back-patch.
  Patch by me, reviewed by Thomas Munro and Michael Paquier.
  Discussion: http://postgr.es/m/CA+TgmobwvbEp+fLq2PykMYzizcvuNv0a7gPMJtxOTMOuuRLMHg@mail.gmail.com
* Allow logical decoding on standbysAndres Freund2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unsurprisingly, this requires wal_level = logical to be set on the primary and standby. The infrastructure added in 26669757b6a ensures that slots are invalidated if the primary's wal_level is lowered. Creating a slot on a standby waits for a xl_running_xact record to be processed. If the primary is idle (and thus not emitting xl_running_xact records), that can take a while. To make that faster, this commit also introduces the pg_log_standby_snapshot() function. By executing it on the primary, completion of slot creation on the standby can be accelerated. Note that logical decoding on a standby does not itself enforce that required catalog rows are not removed. The user has to use physical replication slots + hot_standby_feedback or other measures to prevent that. If catalog rows required for a slot are removed, the slot is invalidated. See 6af1793954e for an overall design of logical decoding on a standby. Bumps catversion, for the addition of the pg_log_standby_snapshot() function. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> (in an older version) Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-By: Robert Haas <robertmhaas@gmail.com>
* For cascading replication, wake physical and logical walsenders separatelyAndres Freund2023-04-08
| | | | | | | | | | | | | | | | | | | | | | Physical walsenders can't send data until it's been flushed; logical walsenders can't decode and send data until it's been applied. On the standby, the WAL is flushed first, which will only wake up physical walsenders; and then applied, which will only wake up logical walsenders. Previously, all walsenders were awakened when the WAL was flushed. That was fine for logical walsenders on the primary; but on the standby the flushed WAL would not have been applied yet, so logical walsenders were awakened too early. Per idea from Jeff Davis and Amit Kapila. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Reviewed-By: Jeff Davis <pgsql@j-davis.com> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com
* Handle logical slot conflicts on standbyAndres Freund2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During WAL replay on the standby, when a conflict with a logical slot is identified, invalidate such slots. There are two sources of conflicts: 1) Using the information added in 6af1793954e, logical slots are invalidated if required rows are removed 2) wal_level on the primary server is reduced to below logical Uses the infrastructure introduced in the prior commit. FIXME: add commit reference. Change InvalidatePossiblyObsoleteSlot() to use a recovery conflict to interrupt use of a slot, if called in the startup process. The new recovery conflict is added to pg_stat_database_conflicts, as confl_active_logicalslot. See 6af1793954e for an overall design of logical decoding on a standby. Bumps catversion for the addition of the pg_stat_database_conflicts column. Bumps PGSTAT_FILE_FORMAT_ID for the same reason. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
* Support invalidating replication slots due to horizon and wal_levelAndres Freund2023-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Needed for logical decoding on a standby. Slots need to be invalidated because of the horizon if rows required for logical decoding are removed. If the primary's wal_level is lowered from 'logical', logical slots on the standby need to be invalidated. The new invalidation methods will be used in a subsequent commit. Logical slots that have been invalidated can be identified via the new pg_replication_slots.conflicting column. See 6af1793954e for an overall design of logical decoding on a standby. Bumps catversion for the addition of the new pg_replication_slots column. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
* Add io_direct setting (developer-only).Thomas Munro2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | Provide a way to ask the kernel to use O_DIRECT (or local equivalent) where available for data and WAL files, to avoid or minimize kernel caching. This hurts performance currently and is not intended for end users yet. Later proposed work would introduce our own I/O clustering, read-ahead, etc to replace the facilities the kernel disables with this option. The only user-visible change, if the developer-only GUC is not used, is that this commit also removes the obscure logic that would activate O_DIRECT for the WAL when wal_sync_method=open_[data]sync and wal_level=minimal (which also requires max_wal_senders=0). Those are non-default and unlikely settings, and this behavior wasn't (correctly) documented. The same effect can be achieved with io_direct=wal. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
* Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.Thomas Munro2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a later commit, we need the addresses of user space buffers to be well aligned. The exact requirements vary by OS and file system (typically sectors and/or memory pages). The address alignment size is set to 4096, which is enough for currently known systems: it matches modern sectors and common memory page size. There is no standard governing O_DIRECT's requirements so we might eventually have to reconsider this with more information from the field or future systems. Aligning I/O buffers on memory pages is also known to improve regular buffered I/O performance. Three classes of I/O buffers for regular data pages are adjusted: (1) Heap buffers are now allocated with the new palloc_aligned() or MemoryContextAllocAligned() functions introduced by commit 439f6175. (2) Stack buffers now use a new struct PGIOAlignedBlock to respect PG_IO_ALIGN_SIZE, if possible with this compiler. (3) The buffer pool is also aligned in shared memory. WAL buffers were already aligned on XLOG_BLCKSZ. It's possible for XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGN_SIZE and thus for O_DIRECT WAL writes to fail to be well aligned, but that's a pre-existing condition and will be addressed by a later commit. BufFiles are not yet addressed (there's no current plan to use O_DIRECT for those, but they could potentially get some incidental speedup even in plain buffered I/O operations through better alignment). If we can't align stack objects suitably using the compiler extensions we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to 0. 
This avoids the need to consider systems that have O_DIRECT but can't align stack objects the way we want; such systems could in theory be supported with more work but we don't currently know of any such machines, so it's easier to pretend there is no O_DIRECT support instead. That's an existing and tested class of system. Add assertions that all buffers passed into smgrread(), smgrwrite() and smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack alignment tricks may be unavailable) or the block size has been set too small to allow arrays of buffers to be all aligned. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
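The alignment requirement described in this message can be sketched in standard C11. This is only an illustration under stated assumptions: `IO_ALIGN_SIZE`, `BLOCK_SIZE`, `AlignedBlock`, and the helper functions are hypothetical names, and `aligned_alloc()` stands in for PostgreSQL's own aligned allocators, which are not reproduced here.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

#define IO_ALIGN_SIZE 4096   /* assumed alignment, matching the value in the message */
#define BLOCK_SIZE    8192   /* assumed block size for this sketch */

/* A block type whose storage is forced onto an IO_ALIGN_SIZE boundary. */
typedef union AlignedBlock
{
    alignas(IO_ALIGN_SIZE) char data[BLOCK_SIZE];
} AlignedBlock;

/* True if the address is a multiple of IO_ALIGN_SIZE. */
static int
is_io_aligned(const void *p)
{
    return ((uintptr_t) p % IO_ALIGN_SIZE) == 0;
}

/* Heap case: the size passed to aligned_alloc must be a multiple of the alignment. */
static void *
alloc_io_blocks(size_t nblocks)
{
    return aligned_alloc(IO_ALIGN_SIZE, nblocks * BLOCK_SIZE);
}

/* Stack case: relies on the compiler honoring alignas for automatic objects. */
static int
stack_block_is_aligned(void)
{
    AlignedBlock block;

    block.data[0] = 0;
    return is_io_aligned(block.data);
}

/* Heap case check: allocate two blocks and verify the address. */
static int
heap_block_is_aligned(void)
{
    void *p = alloc_io_blocks(2);
    int   ok = (p != NULL) && is_io_aligned(p);

    free(p);
    return ok;
}
```

Whether a compiler honors a 4096-byte alignment for stack objects is exactly the uncertainty the commit message mentions; the sketch assumes it does.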
* pg_stat_wal: Accumulate time as instr_time instead of microsecondsAndres Freund2023-03-30
| | | | | | | | | | | | | | | | | | | In instr_time.h it is stated that: * When summing multiple measurements, it's recommended to leave the * running sum in instr_time form (ie, use INSTR_TIME_ADD or * INSTR_TIME_ACCUM_DIFF) and convert to a result format only at the end. The reason for that is that converting to microseconds is not cheap, and can lose precision. Therefore this commit changes 'PendingWalStats' to use 'instr_time' instead of 'PgStat_Counter' while accumulating 'wal_write_time' and 'wal_sync_time'. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/1feedb83-7aa9-cb4b-5086-598349d3f555@gmail.com
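The precision argument can be shown with a small sketch that does not use instr_time itself. It assumes measurements are held as nanosecond counts in an `int64_t`; the function names are hypothetical. Converting each sample to microseconds before adding truncates every sample, while converting the running sum once truncates only at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Keep the running sum in native (nanosecond) resolution, convert once. */
static int64_t
sum_then_convert_us(const int64_t *samples_ns, size_t n)
{
    int64_t total_ns = 0;

    for (size_t i = 0; i < n; i++)
        total_ns += samples_ns[i];
    return total_ns / 1000;
}

/* Convert every sample to microseconds first: each division truncates. */
static int64_t
convert_then_sum_us(const int64_t *samples_ns, size_t n)
{
    int64_t total_us = 0;

    for (size_t i = 0; i < n; i++)
        total_us += samples_ns[i] / 1000;
    return total_us;
}

/* Microseconds lost by per-sample conversion for 1000 samples of 1500 ns. */
static int64_t
lost_microseconds_demo(void)
{
    int64_t samples[1000];

    for (size_t i = 0; i < 1000; i++)
        samples[i] = 1500;
    return sum_then_convert_us(samples, 1000) -
           convert_then_sum_us(samples, 1000);
}
```

In this toy case a third of the measured time disappears with per-sample conversion, and the late-conversion variant also performs one division instead of a thousand.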
* Revise pg_pwrite_zeros()Michael Paquier2023-03-06
| | | | | | | | | | | | | | | | | The following changes are made to pg_pwrite_zeros(), the API able to write series of zeros using vectored I/O: - Addition of an "offset" parameter, to write the size from this position (the 'p' of "pwrite" seems to mean position, though POSIX does not outline that directly), hence the name of the routine is incorrect if it is not able to handle offsets. - Avoid memset() of "zbuffer" on every call. - Avoid initialization of the whole IOV array if not needed. - Group the trailing write() call with the main write() call, simplifying the function logic. Author: Andres Freund Reviewed-by: Michael Paquier, Bharath Rupireddy Discussion: https://postgr.es/m/20230215005525.mrrlmqrxzjzhaipl@awork3.anarazel.de
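The four points listed in this message can be sketched as follows. This is an illustration, not PostgreSQL's pg_pwrite_zeros(): `pwrite_zeros_sketch`, `zero_fill_roundtrip`, `ZBUF_SIZE`, and `MAX_IOV` are hypothetical, and the sketch assumes a platform that provides `pwritev()` (for example Linux or the BSDs). The zero buffer has static storage so it is never cleared per call, only the iovec entries actually needed are filled in, and the final short chunk goes through the same `pwritev()` call as the full ones.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define ZBUF_SIZE 4096   /* assumed size of the shared zero buffer */
#define MAX_IOV   8      /* assumed number of iovecs per call */

/* Static storage is zero-initialized once; no memset() per call. */
static const char zbuffer[ZBUF_SIZE];

/* Write "size" zero bytes at "offset"; returns size, or -1 with errno set. */
static ssize_t
pwrite_zeros_sketch(int fd, size_t size, off_t offset)
{
    size_t remaining = size;

    while (remaining > 0)
    {
        struct iovec iov[MAX_IOV];
        int     iovcnt = 0;
        size_t  batch = 0;
        ssize_t written;

        /* Fill in only as many entries as this batch needs. */
        while (iovcnt < MAX_IOV && batch < remaining)
        {
            size_t len = remaining - batch;

            if (len > ZBUF_SIZE)
                len = ZBUF_SIZE;
            iov[iovcnt].iov_base = (void *) zbuffer;  /* only read by pwritev */
            iov[iovcnt].iov_len = len;
            batch += len;
            iovcnt++;
        }

        written = pwritev(fd, iov, iovcnt, offset);
        if (written < 0)
        {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (written == 0)
        {
            errno = ENOSPC;   /* treat "no progress" as out of space */
            return -1;
        }
        remaining -= (size_t) written;   /* partial writes are retried */
        offset += written;
    }
    return (ssize_t) size;
}

/* Write zeros into a temporary file and read them back; 1 on success. */
static int
zero_fill_roundtrip(size_t size, off_t offset)
{
    char  path[] = "/tmp/zeros_sketch_XXXXXX";
    int   fd = mkstemp(path);
    char *buf;
    int   ok;

    if (fd < 0)
        return 0;
    unlink(path);
    buf = malloc(size);
    ok = buf != NULL
        && pwrite_zeros_sketch(fd, size, offset) == (ssize_t) size
        && lseek(fd, 0, SEEK_END) == offset + (off_t) size
        && pread(fd, buf, size, offset) == (ssize_t) size;
    for (size_t i = 0; ok && i < size; i++)
        ok = (buf[i] == 0);
    free(buf);
    close(fd);
    return ok;
}
```

The round-trip helper exists only to exercise the sketch; it checks that the file was extended to `offset + size` and that the written range reads back as zeros.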
* Don't leak descriptors into subprograms.Thomas Munro2023-03-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Open long-lived data and WAL file descriptors with O_CLOEXEC. This flag was introduced by SUSv4 (POSIX.1-2008), and by now all of our target Unix systems have it. Our open() implementation for Windows already had that behavior, so provide a dummy O_CLOEXEC flag on that platform. For now, callers of open() and the "thin" wrappers in fd.c that deal in raw descriptors need to pass in O_CLOEXEC explicitly if desired. This commit does that for WAL files, and automatically for everything accessed via VFDs including SMgrRelation and BufFile. (With more discussion we might decide to turn it on automatically for the thin open()-wrappers too to avoid risk of missing places that need it, but these are typically used for short-lived descriptors where we don't expect to fork/exec, and it's remotely possible that extensions could be using these APIs and passing descriptors to subprograms deliberately, so that hasn't been done here.) Do the same for sockets and the postmaster pipe with FD_CLOEXEC. (Later commits might use modern interfaces to remove these extra fcntl() calls and more where possible, but we'll need them as a fallback for a couple of systems, so do it that way in this initial commit.) With this change, subprograms executed for archiving, copying etc will no longer have access to the server's descriptors, other than the ones that we decide to pass down. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
* Revert refactoring of restore command code to shell_restore.cMichael Paquier2023-02-06
| | | | | | | | | | | | | | | | | | | | | This reverts commits 24c35ec and 57169ad. PreRestoreCommand() and PostRestoreCommand() need to be put closer to the system() call calling a restore_command, as they enable in_restore_command for the startup process which would in turn trigger an immediate proc_exit() in the SIGTERM handler. Perhaps we could get rid of this behavior entirely, but 24c35ec has made the window where the flag is enabled much larger than it was, and any Postgres-like actions (palloc, etc.) taken by code paths while the flag is enabled could lead to more severe issues in the shutdown processing. Note that curculio has shown that there are many more problems in this area, unrelated to this change, actually, hence the issues related to that had better be addressed first. Keeping the code of HEAD in line with the stable branches should make that a bit easier. Per discussion with Andres Freund and Nathan Bossart. Discussion: https://postgr.es/m/Y979NR3U5VnWrTwB@paquier.xyz
* Zero initialize uses of instr_time about to trigger compiler warningsAndres Freund2023-01-20
| | | | | | | | | These are all not necessary from a correctness POV. However, in the near future instr_time will be simplified to an int64, at which point gcc would otherwise start to warn about the changed places. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfciua@awork3.anarazel.de
* Improve comment about GetWALAvailability's WALAVAIL_REMOVED code.Tom Lane2023-01-19
| | | | | | Sirisha Chamarthi and Kyotaro Horiguchi Discussion: https://postgr.es/m/CAKrAKeXt-=bgm=d+EDmcC9kWoikp8kbVb3LH0K3K+AGGsykpHQ@mail.gmail.com