aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/transam
Commit message (Collapse)AuthorAge
* Delay recovery mode LOG after reading backup_label and/or checkpoint recordMichael Paquier2023-10-30
| | | | | | | | | | | | | | | | | | | | | | When beginning recovery, a LOG is displayed by the startup process to show which recovery mode will be used depending on the .signal file(s) set in the data folder, like "standby mode", recovery up to a given target type and value, or archive recovery. A different patch is under discussion to simplify the startup code by requiring the presence of recovery.signal and/or standby.signal when a backup_label file is read. Delaying a bit this LOG ensures that the correct recovery mode would be reported, and putting it at this position does not make it lose its value. While on it, this commit adds a few comments documenting a bit more the initial recovery steps and their dependencies, and fixes an incorrect comment format. This introduces no behavior changes. Extracted from a larger patch by me. Reviewed-by: David Steele, Bowen Shi Discussion: https://postgr.es/m/ZArVOMifjzE7f8W7@paquier.xyz
* Mention standby.signal in FATALs for checkpoint record missing at recoveryMichael Paquier2023-10-30
| | | | | | | | | | | | | | | When beginning recovery from a base backup by reading a backup_label file, it may be possible that no checkpoint record is available depending on the method used when the case backup was taken, which would prevent recovery from beginning. In this case, the FATAL messages issued, initially added by c900c15269f0f, mentioned recovery.signal as an option to do recovery but not standby.signal. Let's add it as an available option, for clarity. Per suggestion from Bowen Shi, extracted from a larger patch by me. Author: Michael Paquier Discussion: https://postgr.es/m/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8=2aw@mail.gmail.com
* Introduce pg_stat_checkpointerMichael Paquier2023-10-30
| | | | | | | | | | | | | | | | | | | | | | | | Historically, the statistics of the checkpointer have been always part of pg_stat_bgwriter. This commit removes a few columns from pg_stat_bgwriter, and introduces pg_stat_checkpointer with equivalent, renamed columns (plus a new one for the reset timestamp): - checkpoints_timed -> num_timed - checkpoints_req -> num_requested - checkpoint_write_time -> write_time - checkpoint_sync_time -> sync_time - buffers_checkpoint -> buffers_written The fields of PgStat_CheckpointerStats and its SQL functions are renamed to match with the new field names, for consistency. Note that background writer and checkpointer have been split into two different processes in commits 806a2aee3791 and bf405ba8e460. The pgstat structures were already split, making this change straight-forward. Bump catalog version. Author: Bharath Rupireddy Reviewed-by: Bertrand Drouvot, Andres Freund, Michael Paquier Discussion: https://postgr.es/m/CALj2ACVxX2ii=66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg@mail.gmail.com
* Add trailing commas to enum definitionsPeter Eisentraut2023-10-26
| | | | | | | | | | | | | | | | | | | | Since C99, there can be a trailing comma after the last value in an enum definition. A lot of new code has been introducing this style on the fly. Some new patches are now taking an inconsistent approach to this. Some add the last comma on the fly if they add a new last value, some are trying to preserve the existing style in each place, some are even dropping the last comma if there was one. We could nudge this all in a consistent direction if we just add the trailing commas everywhere once. I omitted a few places where there was a fixed "last" value that will always stay last. I also skipped the header files of libpq and ecpg, in case people want to use those with older compilers. There were also a small number of cases where the enum type wasn't used anywhere (but the enum values were), which ended up confusing pgindent a bit, so I left those alone. Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org
* Assert that buffers are marked dirty before XLogRegisterBuffer().Jeff Davis2023-10-23
| | | | | | | | | | | Enforce the rule from transam/README in XLogRegisterBuffer(), and update callers to follow the rule. Hash indexes sometimes register clean pages as a part of the locking protocol, so provide a REGBUF_NO_CHANGE flag to support that use. Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a904@iki.fi Reviewed-by: Heikki Linnakangas
* Change struct tablespaceinfo's oid member from 'char *' to 'Oid'Robert Haas2023-10-23
| | | | | | | | | | | | | | | | | | | | | | | | | | This shouldn't change behavior except in the unusual case where there are file in the tablespace directory that have entirely numeric names but are nevertheless not possible names for a tablespace directory, either because their names have leading zeroes that shouldn't be there, or the value is actually zero, or because the value is too large to represent as an OID. In those cases, the directory would previously have made it into the list of tablespaceinfo objects and no longer will. Thus, base backups will now ignore such directories, instead of treating them as legitimate tablespace directories. Similarly, if entries for such tablespaces occur in a tablespace_map file, they will now be rejected as erroneous, instead of being honored. This is infrastructure for future work that wants to be able to know the tablespace of each relation that is part of a backup *as an OID*. By strengthening the up-front validation, we don't have to worry about weird cases later, and can more easily avoid repeated string->integer conversions. Patch by me, reviewed by David Steele. Discussion: http://postgr.es/m/CA+TgmoZNVeBzoqDL8xvr-nkaepq815jtDR4nJzPew7=3iEuM1g@mail.gmail.com
* During online checkpoints, insert XLOG_CHECKPOINT_REDO at redo point.Robert Haas2023-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | This allows tools that read the WAL sequentially to identify (possible) redo points when they're reached, rather than only being able to detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly much later in the WAL stream. There are other possible applications as well; see the discussion links below. Any redo location that precedes the checkpoint location should now point to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this. While adjusting the code in CreateCheckPoint() for this patch, I made it call WALInsertLockAcquireExclusive a bit later than before, since there appears to be no need for it to be held while checking whether the system is idle, whether this is an end-of-recovery checkpoint, or what the current timeline is. Bump XLOG_PAGE_MAGIC. Patch by me, based in part on earlier work from Dilip Kumar. Review by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com Discussion: http://postgr.es/m/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de Discussion: http://postgr.es/m/CA+hUKG+b2ego8=YNW2Ohe9QmSiReh1-ogrv8V_WZpJTqP3O+2w@mail.gmail.com
* Reword messages about impending (M)XID exhaustion.Robert Haas2023-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | First, we shouldn't recommend switching to single-user mode, because that's terrible advice. Especially on newer versions where VACUUM will enter emergency mode when nearing (M)XID exhaustion, it's perfectly fine to just VACUUM in multi-user mode. Doing it that way is less disruptive and avoids disabling the safeguards that prevent actual wraparound, so recommend that instead. Second, be more precise about what is going to happen (when we're nearing the limits) or what is happening (when we actually hit them). The database doesn't shut down, nor does it refuse all commands. It refuses commands that assign whichever of XIDs and MXIDs are nearly exhausted. No back-patch. The existing hint that advises going to single-user mode is sufficiently awful advice that removing it or changing it might be justifiable even though we normally avoid changing user-facing messages in back-branches, but I (rhaas) felt that it was better to be more conservative and limit this fix to master only. Aside from the usual risk of breaking translations, people might be used to the existing message, or even have monitoring scripts that look for it. Alexander Alekseev, John Naylor, Robert Haas, reviewed at various times by Peter Geoghegan, Hannu Krosing, and Andres Freund. Discussion: http://postgr.es/m/CA+TgmoZBg95FiR9wVQPAXpGPRkacSt2okVge+PKPPFppN7sfnQ@mail.gmail.com
* Talk about assigning, rather than generating, new MultiXactIds.Robert Haas2023-10-17
| | | | | | | | | The word "assign" is used in various places internally to describe what GetNewMultiXactId does, but the user-facing messages have previously said "generate". For consistency, standardize on "assign," which seems (at least to me) to be slightly clearer. Discussion: http://postgr.es/m/CA+TgmoaoE1_i3=4-7GCTtKLVZVQ2Gh6qESW2VG1OprtycxOHMA@mail.gmail.com
* Move extra code out of the Pre/PostRestoreCommand() section.Nathan Bossart2023-10-16
| | | | | | | | | | | | | | | | | | | | If SIGTERM is received within this section, the startup process will immediately proc_exit() in the signal handler, so it is inadvisable to include any more code than is required there (as such code is unlikely to be compatible with doing proc_exit() in a signal handler). This commit moves the code recently added to this section (see 1b06d7bac9 and 7fed801135) to outside of the section. This ensures that the startup process only calls proc_exit() in its SIGTERM handler for the duration of the system() call, which is how this code worked from v8.4 to v14. Reported-by: Michael Paquier, Thomas Munro Analyzed-by: Andres Freund Suggested-by: Tom Lane Reviewed-by: Michael Paquier, Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/Y9nGDSgIm83FHcad%40paquier.xyz Discussion: https://postgr.es/m/20230223231503.GA743455%40nathanxps13 Backpatch-through: 15
* Improve the naming in wal_sync_method code.Nathan Bossart2023-10-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | * sync_method is renamed to wal_sync_method. * sync_method_options[] is renamed to wal_sync_method_options[]. * assign_xlog_sync_method() is renamed to assign_wal_sync_method(). * The names of the available synchronization methods are now prefixed with "WAL_SYNC_METHOD_" and have been moved into a WalSyncMethod enum. * PLATFORM_DEFAULT_SYNC_METHOD is renamed to PLATFORM_DEFAULT_WAL_SYNC_METHOD, and DEFAULT_SYNC_METHOD is renamed to DEFAULT_WAL_SYNC_METHOD. These more descriptive names help distinguish the code for wal_sync_method from the code for DataDirSyncMethod (e.g., the recovery_init_sync_method configuration parameter and the --sync-method option provided by several frontend utilities). This change also prevents name collisions between the aforementioned sets of code. Since this only improves the naming of internal identifiers, there should be no behavior change. Author: Maxim Orlov Discussion: https://postgr.es/m/CACG%3DezbL1gwE7_K7sr9uqaCGkWhmvRTcTEnm3%2BX1xsRNwbXULQ%40mail.gmail.com
* Add wait events for checkpoint delay mechanism.Thomas Munro2023-10-13
| | | | | | | | | | | When MyProc->delayChkptFlags is set to temporarily block phase transitions in a concurrent checkpoint, the checkpointer enters a sleep-poll loop to wait for the flag to be cleared. We should show that as a wait event in the pg_stat_activity view. Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGL7Whi8iwKbzkbn_1fixH3Yy8aAPz7mfq6Hpj7FeJrKMg%40mail.gmail.com
* Unify two isLogSwitch tests in XLogInsertRecord.Robert Haas2023-10-12
| | | | | | | | | | | | | | | | | An upcoming patch wants to introduce an additional special case in this function. To keep that as cheap as possible, minimize the amount of branching that we do based on whether this is an XLOG_SWITCH record. Additionally, and also in the interest of keeping the overhead of special-case code paths as low as possible, apply likely() to the non-XLOG_SWITCH case, since only a very tiny fraction of WAL records will be XLOG_SWITCH records. Patch by me, reviewed by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
* Reindent comment in GenericXLogFinish().Tom Lane2023-10-11
| | | | Restore pgindent cleanliness, per buildfarm member koel.
* Fix bug in GenericXLogFinish().Jeff Davis2023-10-10
| | | | | | | | Mark the buffers dirty before writing WAL. Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c094a@iki.fi Reviewed-by: Heikki Linnakangas Backpatch-through: 11
* Remove duplicate words in docs and code comments.Amit Kapila2023-10-09
| | | | | | | Additionally, add a missing "the" in a couple of places. Author: Vignesh C, Dagfinn Ilmari Mannsåker Discussion: http://postgr.es/m/CALDaNm28t+wWyPfuyqEaARS810Je=dRFkaPertaLAEJYY2cWYQ@mail.gmail.com
* Tidy-up some appendStringInfo*() usagesDavid Rowley2023-10-03
| | | | | | | | | | | | Make a few newish calls to appendStringInfo() which have no special formatting use appendStringInfoString() instead. Also, adjust usages of appendStringInfoString() which only append a string containing a single character to make use of appendStringInfoChar() instead. This makes the code marginally faster, but primarily this change is so we use the StringInfo type as it was intended to be used. Discussion: https://postgr.es/m/CAApHDvpXKQmL+r=VDNS98upqhr9yGBhv2Jw3GBFFk_wKHcB39A@mail.gmail.com
* Fail hard on out-of-memory failures in xlogreader.cMichael Paquier2023-10-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit changes the WAL reader routines so as a FATAL for the backend or exit(FAILURE) for the frontend is triggered if an allocation for a WAL record decode fails in walreader.c, rather than treating this case as bogus data, which would be equivalent to the end of WAL. The key is to avoid palloc_extended(MCXT_ALLOC_NO_OOM) in walreader.c, relying on plain palloc() calls. The previous behavior could make WAL replay finish too early than it should. For example, crash recovery finishing earlier may corrupt clusters because not all the WAL available locally was replayed to ensure a consistent state. Out-of-memory failures would show up randomly depending on the memory pressure on the host, but one simple case would be to generate a large record, then replay this record after downsizing a host, as Ethan Mertz originally reported. This relies on bae868caf222, as the WAL reader routines now do the memory allocation required for a record only once its header has been fully read and validated, making xl_tot_len trustable. Making the WAL reader react differently on out-of-memory or bogus record data would require ABI changes, so this is the safest choice for stable branches. Also, it is worth noting that 3f1ce973467a has been using a plain palloc() in this code for some time now. Thanks to Noah Misch and Thomas Munro for the discussion. Like the other commit, backpatch down to 12, leaving out v11 that will be EOL'd soon. The behavior of considering a failed allocation as bogus data comes originally from 0ffe11abd3a0, where the record length retrieved from its header was not entirely trustable. Reported-by: Ethan Mertz Discussion: https://postgr.es/m/ZRKKdI5-RRlta3aF@paquier.xyz Backpatch-through: 12
* Correct assertion and comments about XLogRecordMaxSize.Noah Misch2023-10-01
| | | | | | The largest allocation, of xl_tot_len+8192, is in allocate_recordbuf(). Discussion: https://postgr.es/m/20230812211327.GB2326466@rfd.leadboat.com
* Fix typo in src/backend/access/transam/README.Etsuro Fujita2023-09-28
|
* Fix edge-case for xl_tot_len broken by bae868ca.Thomas Munro2023-09-26
| | | | | | | | | | | | | | | | | bae868ca removed a check that was still needed. If you had an xl_tot_len at the end of a page that was too small for a record header, but not big enough to span onto the next page, we'd immediately perform the CRC check using a bogus large length. Because of arbitrary coding differences between the CRC implementations on different platforms, nothing very bad happened on common modern systems. On systems using the _sb8.c fallback we could segfault. Restore that check, add a new assertion and supply a test for that case. Back-patch to 12, like bae868ca. Tested-by: Tom Lane <tgl@sss.pgh.pa.us> Tested-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLCkTT7zYjzOxuLGahBdQ%3DMcF%3Dz5ZvrjSOnW4EDhVjT-g%40mail.gmail.com
* Don't trust unvalidated xl_tot_len.Thomas Munro2023-09-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xl_tot_len comes first in a WAL record. Usually we don't trust it to be the true length until we've validated the record header. If the record header was split across two pages, previously we wouldn't do the validation until after we'd already tried to allocate enough memory to hold the record, which was bad because it might actually be garbage bytes from a recycled WAL file, so we could try to allocate a lot of memory. Release 15 made it worse. Since 70b4f82a4b5, we'd at least generate an end-of-WAL condition if the garbage 4 byte value happened to be > 1GB, but we'd still try to allocate up to 1GB of memory bogusly otherwise. That was an improvement, but unfortunately release 15 tries to allocate another object before that, so you could get a FATAL error and recovery could fail. We can fix both variants of the problem more fundamentally using pre-existing page-level validation, if we just re-order some logic. The new order of operations in the split-header case defers all memory allocation based on xl_tot_len until we've read the following page. At that point we know that its first few bytes are not recycled data, by checking its xlp_pageaddr, and that its xlp_rem_len agrees with xl_tot_len on the preceding page. That is strong evidence that xl_tot_len was truly the start of a record that was logged. This problem was most likely to occur on a standby, because walreceiver.c recycles WAL files without zeroing out trailing regions of each page. We could fix that too, but it wouldn't protect us from rare crash scenarios where the trailing zeroes don't make it to disk. With reliable xl_tot_len validation in place, the ancient policy of considering malloc failure to indicate corruption at end-of-WAL seems quite surprising, but changing that is left for later work. Also included is a new TAP test to exercise various cases of end-of-WAL detection by writing contrived data into the WAL from Perl. Back-patch to 12. We decided not to put this change into the final release of 11. Author: Thomas Munro <thomas.munro@gmail.com> Author: Michael Paquier <michael@paquier.xyz> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> (the idea, not the code) Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Sergei Kornilov <sk@zsrv.org> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/17928-aa92416a70ff44a2%40postgresql.org
* Fix COMMIT/ROLLBACK AND CHAIN in the presence of subtransactions.Tom Lane2023-09-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In older branches, COMMIT/ROLLBACK AND CHAIN failed to propagate the current transaction's properties to the new transaction if there was any open subtransaction (unreleased savepoint). Instead, some previous transaction's properties would be restored. This is because the "if (s->chain)" check in CommitTransactionCommand examined the wrong instance of the "chain" flag and falsely concluded that it didn't need to save transaction properties. Our regression tests would have noticed this, except they used identical transaction properties for multiple tests in a row, so that the faulty behavior was not distinguishable from correct behavior. Commit 12d768e70 fixed the problem in v15 and later, but only rather accidentally, because I removed the "if (s->chain)" test to avoid a compiler warning, while not realizing that the warning was flagging a real bug. In v14 and before, remove the if-test and save transaction properties unconditionally; just as in the newer branches, that's not expensive enough to justify thinking harder. Add the comment and extra regression test to v15 and later to forestall any future recurrence, but there's no live bug in those branches. Patch by me, per bug #18118 from Liu Xiang. Back-patch to v12 where the AND CHAIN feature was added. Discussion: https://postgr.es/m/18118-4b72fcbb903aace6@postgresql.org
* Rename variable for code clarityDaniel Gustafsson2023-09-15
| | | | | | | | | | | When tracking IO timing for WAL, the duration is what we calculate based on the start and end timestamps, it's not what the variable contains. Rename the timestamp variable to end to better communicate what it contains. Original patch by Krishnakumar with additional hacking to fix another occurrence by me. Author: Krishnakumar R <kksrcv001@gmail.com> Discussion: https://postgr.es/m/CAPMWgZ9f9o8awrQpjo8oxnNQ=bMDVPx00NE0QcDzvHD_ZrdLPw@mail.gmail.com
* Quote filenames in error messagesDaniel Gustafsson2023-09-14
| | | | | | | | | | | | | | The majority of all filenames are quoted in user facing error and log messages, but a few were still printed without quotes. While these filenames do not risk causing any ambiguity as their format is strict, quote them anyways to be consistent across all logs. Also concatenate a message to keep it one line to make it easier to grep for in the code. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/080EEABE-6645-4A46-AB20-6285ADAC44FE@yesql.se
* Flush logical slots to disk during a shutdown checkpoint if required.Amit Kapila2023-09-14
| | | | | | | | | | | | | | | | | | | | | It's entirely possible for a logical slot to have a confirmed_flush LSN higher than the last value saved on disk while not being marked as dirty. Currently, it is not a major problem but a later patch adding support for the upgrade of slots relies on that value being properly flushed to disk. It can also help avoid processing the same transactions again in some boundary cases after the clean shutdown and restart. Say, we process some transactions for which we didn't send anything downstream (the changes got filtered) but the confirm_flush LSN is updated due to keepalives. As we don't flush the latest value of confirm_flush LSN, it may lead to processing the same changes again without this patch. The approach taken by this patch has been suggested by Ashutosh Bapat. Author: Vignesh C, Julien Rouhaud, Kuroda Hayato Reviewed-by: Amit Kapila, Dilip Kumar, Michael Paquier, Ashutosh Bapat, Peter Smith, Hou Zhijie Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
* Tweak pg_promote() to report failures on kill() or postmaster failuresMichael Paquier2023-08-29
| | | | | | | | | | | | | | | | | | | | | Since its introduction in 10074651e335, pg_promote() has been returning a false status in three cases: - SIGUSR1 not sent to the postmaster process. - Postmaster death during standby promotion. - Standby not promoted within the specified wait time. An application calling this function will have a hard time understanding what a false state returned actually means. Per discussion, this switches the two first states to fail rather than return a "false" status, making the second case more consistent with the existing CHECK_FOR_INTERRUPTS in the wait loop. False is only returned when the promotion is not completed within the specified time (60s by default). Author: Ashutosh Sharma Reviewed-by: Fujii Masao, Laurenz Albe, Michael Paquier Discussion: https://postgr.es/m/CAE9k0P=QTrwptL0t4J0fuBRDDjgsT-0PVKd-ikd96i1hyL7Bcg@mail.gmail.com
* Make error messages about WAL segment size more consistentPeter Eisentraut2023-08-28
| | | | | | | | | | | | | | | | Make the primary messages more compact and make the detail messages uniform. In initdb.c and pg_resetwal.c, use the newish option_parse_int() to simplify some of the option parsing. For the backend GUC wal_segment_size, add a GUC check hook to do the verification instead of coding it in bootstrap.c. This might be overkill, but that way the check is in the right place and it becomes more self-documenting. In passing, make pg_controldata use the logging API for warning messages. Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/9939aa8a-d7be-da2c-7715-0a0b5535a1f7@eisentraut.org
* Introduce macros for protocol characters.Nathan Bossart2023-08-22
| | | | | | | | | | | This commit introduces descriptively-named macros for the identifiers used in wire protocol messages. These new macros are placed in a new header file so that they can be easily used by third-party code. Author: Dave Cramer Reviewed-by: Alvaro Herrera, Tatsuo Ishii, Peter Smith, Robert Haas, Tom Lane, Peter Eisentraut, Michael Paquier Discussion: https://postgr.es/m/CADK3HHKbBmK-PKf1bPNFoMC%2BoBt%2BpD9PH8h5nvmBQskEHm-Ehw%40mail.gmail.com
* ExtendBufferedWhat -> BufferManagerRelation.Thomas Munro2023-08-23
| | | | | | | | | | | | | | | | | Commit 31966b15 invented a way for functions dealing with relation extension to accept a Relation in online code and an SMgrRelation in recovery code. It seems highly likely that future bufmgr.c interfaces will face the same problem, and need to do something similar. Generalize the names so that each interface doesn't have to re-invent the wheel. Back-patch to 16. Since extension AM authors might start using the constructor macros once 16 ships, we agreed to do the rename in 16 rather than waiting for 17. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKG%2B6tLD2BhpRWycEoti6LVLyQq457UL4ticP5xd8LqHySA%40mail.gmail.com
* Fix off-by-one in XLogRecordMaxSize check.Noah Misch2023-08-12
| | | | | | | | | pg_logical_emit_message(false, '_', repeat('x', 1069547465)) failed with self-contradictory message "WAL record would be 1069547520 bytes (of maximum 1069547520 bytes)". There's no particular benefit from allowing or denying one byte in either direction; XLogRecordMaxSize could rise a few megabytes without trouble. Hence, this is just for cleanliness. Back-patch to v16, where this check first appeared.
* Document more assumptions of LWLock variable changes with WAL insertsMichael Paquier2023-07-26
| | | | | | | | | | | | | | | | | | | | | | | | | This commit adds a few comments about what LWLockWaitForVar() relies on when a backend waits for a variable update on its LWLocks for WAL insertions up to an expected LSN. First, LWLockWaitForVar() does not include a memory barrier, relying on a spinlock taken at the beginning of WaitXLogInsertionsToFinish(). This was hidden behind two layers of routines in lwlock.c. This assumption is now documented at the top of LWLockWaitForVar(), and detailed at bit more within LWLockConflictsWithVar(). Second, document why WaitXLogInsertionsToFinish() does not include memory barriers, relying on a spinlock at its top, which is, per Andres' input, fine for two different reasons, both depending on the fact that the caller of WaitXLogInsertionsToFinish() is waiting for a LSN up to a certain value. This area's documentation and assumptions could be improved more in the future, but at least that's a beginning. Author: Bharath Rupireddy, Andres Freund Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
* Optimize WAL insertion lock acquisition and release with some atomicsMichael Paquier2023-07-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The WAL insertion lock variable insertingAt is currently being read and written with the help of the LWLock wait list lock to avoid any read of torn values. This wait list lock can become a point of contention on a highly concurrent write workloads. This commit switches insertingAt to a 64b atomic variable that provides torn-free reads/writes. On platforms without 64b atomic support, the fallback implementation uses spinlocks to provide the same guarantees for the values read. LWLockWaitForVar(), through LWLockConflictsWithVar(), reads the new value to check if it still needs to wait with a u64 atomic operation. LWLockUpdateVar() updates the variable before waking up the waiters with an exchange_u64 (full memory barrier). LWLockReleaseClearVar() now uses also an exchange_u64 to reset the variable. Before this commit, all these steps relied on LWLockWaitListLock() and LWLockWaitListUnlock(). This reduces contention on LWLock wait list lock and improves performance of highly-concurrent write workloads. Here are some numbers using pg_logical_emit_message() (HEAD at d6677b93) with various arbitrary record lengths and clients up to 1k on a rather-large machine (64 vCPUs, 512GB of RAM, 16 cores per sockets, 2 sockets), in terms of TPS numbers coming from pgbench: message_size_b | 16 | 64 | 256 | 1024 --------------------+--------+--------+--------+------- patch_4_clients | 83830 | 82929 | 80478 | 73131 patch_16_clients | 267655 | 264973 | 250566 | 213985 patch_64_clients | 380423 | 378318 | 356907 | 294248 patch_256_clients | 360915 | 354436 | 326209 | 263664 patch_512_clients | 332654 | 321199 | 287521 | 240128 patch_1024_clients | 288263 | 276614 | 258220 | 217063 patch_2048_clients | 252280 | 243558 | 230062 | 192429 patch_4096_clients | 212566 | 213654 | 205951 | 166955 head_4_clients | 83686 | 83766 | 81233 | 73749 head_16_clients | 266503 | 265546 | 249261 | 213645 head_64_clients | 366122 | 363462 | 341078 | 261707 head_256_clients | 132600 | 132573 | 134392 | 165799 head_512_clients | 118937 | 114332 | 116860 | 150672 head_1024_clients | 133546 | 115256 | 125236 | 151390 head_2048_clients | 137877 | 117802 | 120909 | 138165 head_4096_clients | 113440 | 115611 | 120635 | 114361 Bharath has been measuring similar improvements, where the limit of the WAL insertion lock begins to be felt when more than 256 concurrent clients are involved in this specific workload. An extra patch has been discussed to introduce a fast-exit path in LWLockUpdateVar() when there are no waiters, still this does not influence the write-heavy workload cases discussed as there are always waiters. This will be considered separately. Author: Bharath Rupireddy Reviewed-by: Nathan Bossart, Andres Freund, Michael Paquier Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
* Fix indentation in twophase.cMichael Paquier2023-07-18
| | | | | | | | | This has been missed in cb0cca1, noticed before buildfarm member koel has been able to complain while poking at a different patch. Like the other commit, backpatch all the way down to limit the odds of merge conflicts. Backpatch-through: 11
* Fix recovery of 2PC transaction during crash recoveryMichael Paquier2023-07-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A crash in the middle of a checkpoint with some two-phase state data already flushed to disk by this checkpoint could cause a follow-up crash recovery to recover twice the same transaction, once from what has been found in pg_twophase/ at the beginning of recovery and a second time when replaying its corresponding record. This would lead to FATAL failures in the startup process during recovery, where the same transaction would have a state recovered twice instead of once: LOG: recovering prepared transaction 731 from shared memory LOG: recovering prepared transaction 731 from shared memory FATAL: lock ExclusiveLock on object 731/0/0 is already held This issue is fixed by skipping the addition of any 2PC state coming from a record whose equivalent 2PC state file has already been loaded in TwoPhaseState at the beginning of recovery by restoreTwoPhaseData(), which is OK as long as the system has not reached a consistent state. The timing to get a messed up recovery processing is very racy, and would very unlikely happen. The thread that has reported the issue has demonstrated the bug using injection points to force a PANIC in the middle of a checkpoint. Issue introduced in 728bd99, so backpatch all the way down. Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com> Author: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com> Author: Michael Paquier Discussion: https://postgr.es/m/109e6994-b971-48cb-84f6-829646f18b4c.mengjuan.cmj@alibaba-inc.com Backpatch-through: 11
* Rename session_auth_is_superuser to current_role_is_superuser.Nathan Bossart2023-07-12
| | | | | | | | | | | This variable might've been accurately named when it was added in ea886339b8, but the name hasn't been accurate since at least the introduction of SET ROLE in e5d6b91220. The corresponding documentation was fixed in eedb068c0a. This commit renames the variable accordingly. Suggested-by: Joseph Koshakow Discussion: https://postgr.es/m/CAAvxfHc-HHzONQ2oXdvhFF9ayRnidPwK%2BfVBhRzaBWYYLVQL-g%40mail.gmail.com
* Add new parallel message type to progress reporting.Masahiko Sawada2023-07-11
| | | | | | | | | | | | | | | | | | This commit adds a new type of parallel message 'P' to allow a parallel worker to poke at a leader to update the progress. Currently it supports only incremental progress reporting but it's possible to allow for supporting of other backend progress APIs in the future. There are no users of this new message type as of this commit. That will follow in future commits. Idea from Andres Freund. Author: Sami Imseih Reviewed by: Michael Paquier, Masahiko Sawada Discussion: https://www.postgresql.org/message-id/flat/5478DFCD-2333-401A-B2F0-0D186AB09228@amazon.com
* Silence "missing contrecord" error.Thomas Munro2023-07-03
| | | | | | | | | | | | | | | | | Commit dd38ff28ad added a new error message "missing contrecord" when we fail to reassemble a record. Unfortunately that caused noisy messages to be logged by pg_waldump at end of segment, and by walsender when asked to shut down on a segment boundary. Remove the new error message, so that this condition signals end-of- WAL without a message. It's arguably a reportable condition that should not be silenced while performing crash recovery, but fixing that without introducing noise in the other cases will require more research. Back-patch to 15. Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/6a1df56e-4656-b3ce-4b7a-a9cb41df8189%40enterprisedb.com
* Enable archiving in recovery TAP test 009_twophase.plMichael Paquier2023-06-20
| | | | | | | | | | | | | | | | | This is a follow-up of f663b00, that has been committed to v13 and v14, tweaking the TAP test for two-phase transactions so as it provides coverage for the bug that has been fixed. This change is done in its own commit for clarity, as v15 and HEAD did not show the problematic behavior, still missed coverage for it. While on it, this adds a comment about the dependency of the last partial segment rename and RecoverPreparedTransactions() at the end of recovery, as that can be easy to miss. Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at Backpatch-through: 13
* Initialize 'recordXtime' to silence compiler warning.Heikki Linnakangas2023-06-06
| | | | | | | | | | In reality, recordXtime will always be set by the getRecordTimestamp call, but the compiler doesn't necessarily see that. Back-patch to all supported versions. Author: Tristan Partin Discussion: https://www.postgresql.org/message-id/CT5MN8E11U0M.1NYNCHXYUHY41@gonk
* Pre-beta mechanical code beautification.Tom Lane2023-05-19
| | | | | | | | | | | | | | | Run pgindent, pgperltidy, and reformat-dat-files. This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code. Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
* Message style improvementsPeter Eisentraut2023-05-19
|
* Fix typos in commentsMichael Paquier2023-05-02
| | | | | | | | | The changes done in this commit impact comments with no direct user-visible changes, with fixes for incorrect function, variable or structure names. Author: Alexander Lakhin Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com
* Prevent underflow in KeepLogSeg().Nathan Bossart2023-04-27
| | | | | | | | | | | | | | | The call to XLogGetReplicationSlotMinimumLSN() might return a greater LSN than the one given to the function. Subsequent segment number calculations might then underflow, which could result in unexpected behavior when removing or recyling WAL files. This was introduced with max_slot_wal_keep_size in c655077639. To fix, skip the block of code for replication slots if the LSN is greater. Reported-by: Xu Xingwang Author: Kyotaro Horiguchi Reviewed-by: Junwang Zhao Discussion: https://postgr.es/m/17903-4288d439dee856c6%40postgresql.org Backpatch-through: 13
* Re-add tracking of wait event SLRUFlushSyncMichael Paquier2023-04-26
| | | | | | | | | | | SLRUFlushSync has been accidently removed during dee663f, that has moved the flush of the SLRU files to the checkpointer, so add it back. The issue has been noticed by Thomas when checking for orphaned wait events. Author: Thomas Munro Reviewed-by: Bharath Rupireddy Discussion: https://postgr.es/m/CA+hUKGK6tqm59KuF1z+h5Y8fsWcu5v8+84kduSHwRzwjB2aa_A@mail.gmail.com
* Fix various typos and incorrect/outdated name referencesDavid Rowley2023-04-19
| | | | | Author: Alexander Lakhin Discussion: https://postgr.es/m/699beab4-a6ca-92c9-f152-f559caf6dc25@gmail.com
* Fix pg_basebackup with in-place tablespaces some more.Robert Haas2023-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c6f2f01611d4f2c412e92eb7893f76fa590818e8 purported to make this work, but problems remained. In a plain-format backup, the files from an in-place tablespace got included in the tar file for the main tablespace, which is wrong but it's not clear that it has any user-visible consequences. In a tar-format backup, the TABLESPACE_MAP option is used, and so we never iterated over pg_tblspc and thus never backed up the in-place tablespaces anywhere at all. To fix this, reverse the changes in that commit, so that when we scan pg_tblspc during a backup, we create tablespaceinfo objects even for in-place tablespaces. We set the field that would normally contain the absolute pathname to the relative path pg_tblspc/${TSOID}, and that's good enough to make basebackup.c happy without any further changes. However, pg_basebackup needs a couple of adjustments to make it work. First, it needs to understand that a relative path for a tablespace means it's an in-place tablespace. Second, it needs to tolerate the situation where restoring the main tablespace tries to create pg_tblspc or a subdirectory and finds that it already exists, because we restore user-defined tablespaces before the main tablespace. Since in-place tablespaces are only intended for use in development and testing, no back-patch. Patch by me, reviewed by Thomas Munro and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmobwvbEp+fLq2PykMYzizcvuNv0a7gPMJtxOTMOuuRLMHg@mail.gmail.com
* Avoid trying to write an empty WAL record in log_newpage_range().Tom Lane2023-04-17
| | | | | | | | | | | | | | | | | | | If the last few pages in the specified range are empty (all zero), then log_newpage_range() could try to emit an empty WAL record containing no FPIs. This at least upsets an Assert in ReserveXLogInsertLocation, and might perhaps have bad real-world consequences in non-assert builds. This has been broken since log_newpage_range() was introduced, but the case was hard if not impossible to hit before commit 3d6a98457 decided it was okay to leave VM and FSM pages intentionally zero. Nonetheless, it seems prudent to back-patch. log_newpage_range() was added in v12 but later back-patched, so this affects all supported branches. Matthias van de Meent, per report from Justin Pryzby Discussion: https://postgr.es/m/ZD1daibg4RF50IOj@telsasoft.com
* Fix incorrect format placeholdersPeter Eisentraut2023-04-12
|
* Allow logical decoding on standbysAndres Freund2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unsurprisingly, this requires wal_level = logical to be set on the primary and standby. The infrastructure added in 26669757b6a ensures that slots are invalidated if the primary's wal_level is lowered. Creating a slot on a standby waits for a xl_running_xact record to be processed. If the primary is idle (and thus not emitting xl_running_xact records), that can take a while. To make that faster, this commit also introduces the pg_log_standby_snapshot() function. By executing it on the primary, completion of slot creation on the standby can be accelerated. Note that logical decoding on a standby does not itself enforce that required catalog rows are not removed. The user has to use physical replication slots + hot_standby_feedback or other measures to prevent that. If catalog rows required for a slot are removed, the slot is invalidated. See 6af1793954e for an overall design of logical decoding on a standby. Bumps catversion, for the addition of the pg_log_standby_snapshot() function. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> (in an older version) Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: FabrÌzio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-By: Robert Haas <robertmhaas@gmail.com>