aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/transam
Commit message (Collapse)AuthorAge
* Fix snapshot builds during promotion of hot standby node with 2PCMichael Paquier2021-10-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some specific logic is done at the end of recovery when involving 2PC transactions: 1) Call RecoverPreparedTransactions(), to recover the state of 2PC transactions into memory (re-acquire locks, etc.). 2) ShutdownRecoveryTransactionEnvironment(), to move back to normal operations, mainly cleaning up recovery locks and KnownAssignedXids (including any 2PC transaction tracked previously). 3) Switch XLogCtl->SharedRecoveryState to RECOVERY_STATE_DONE, which is the tipping point for any process calling RecoveryInProgress() to check if the cluster is still in recovery or not. Any snapshot taken between steps 2) and 3) would be empty, causing any transaction relying on a snapshot at this point to potentially corrupt data as there could still be some 2PC transactions to track, with RecentXmin moving backwards on successive calls to GetSnapshotData() in the same transaction. As SharedRecoveryState is the point to take into account to know if it is safe to discard KnownAssignedXids, this commit moves step 2) after step 3), so as we can never finish with empty snapshots. This exists since the introduction of hot standby, so backpatch all the way down. The window with incorrect snapshots is extremely small, but I have seen it when running 023_pitr_prepared_xact.pl, as did buildfarm member fairywren. Thomas Munro also found it independently. Special thanks to Andres Freund for taking the time to analyze this issue. Reported-by: Thomas Munro, Michael Paquier Analyzed-by: Andres Freund Discussion: https://postgr.es/m/20210422203603.fdnh3fu2mmfp2iov@alap3.anarazel.de Backpatch-through: 9.6
* Remove unstable, unnecessary test; fix typoAlvaro Herrera2021-10-01
| | | | | | | | | | | | | Commit ff9f111bce24 added some test code that's unportable and doesn't add meaningful coverage. Remove it rather than try and get it to work everywhere. While at it, fix a typo in a log message added by the aforementioned commit. Backpatch to 14. Discussion: https://postgr.es/m/3000074.1632947632@sss.pgh.pa.us
* Fix Portal snapshot tracking to handle subtransactions properly.Tom Lane2021-10-01
| | | | | | | | | | | | | | | | | | | | | | | Commit 84f5c2908 forgot to consider the possibility that EnsurePortalSnapshotExists could run inside a subtransaction with lifespan shorter than the Portal's. In that case, the new active snapshot would be popped at the end of the subtransaction, leaving a dangling pointer in the Portal, with mayhem ensuing. To fix, make sure the ActiveSnapshot stack entry is marked with the same subtransaction nesting level as the associated Portal. It's certainly safe to do so since we won't be here at all unless the stack is empty; hence we can't create an out-of-order stack. Let's also apply this logic in the case where PortalRunUtility sets portalSnapshot, just to be sure that path can't cause similar problems. It's slightly less clear that that path can't create an out-of-order stack, so add an assertion guarding it. Report and patch by Bertrand Drouvot (with kibitzing by me). Back-patch to v11, like the previous commit. Discussion: https://postgr.es/m/ff82b8c5-77f4-3fe7-6028-fcf3303e82dd@amazon.com
* Fix WAL replay in presence of an incomplete recordAlvaro Herrera2021-09-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Physical replication always ships WAL segment files to replicas once they are complete. This is a problem if one WAL record is split across a segment boundary and the primary server crashes before writing down the segment with the next portion of the WAL record: WAL writing after crash recovery would happily resume at the point where the broken record started, overwriting that record ... but any standby or backup may have already received a copy of that segment, and they are not rewinding. This causes standbys to stop following the primary after the latter crashes: LOG: invalid contrecord length 7262 at A8/D9FFFBC8 because the standby is still trying to read the continuation record (contrecord) for the original long WAL record, but it is not there and it will never be. A workaround is to stop the replica, delete the WAL file, and restart it -- at which point a fresh copy is brought over from the primary. But that's pretty labor intensive, and I bet many users would just give up and re-clone the standby instead. A fix for this problem was already attempted in commit 515e3d84a0b5, but it only addressed the case for the scenario of WAL archiving, so streaming replication would still be a problem (as well as other things such as taking a filesystem-level backup while the server is down after having crashed), and it had performance scalability problems too; so it had to be reverted. This commit fixes the problem using an approach suggested by Andres Freund, whereby the initial portion(s) of the split-up WAL record are kept, and a special type of WAL record is written where the contrecord was lost, so that WAL replay in the replica knows to skip the broken parts. With this approach, we can continue to stream/archive segment files as soon as they are complete, and replay of the broken records will proceed across the crash point without a hitch. Because a new type of WAL record is added, users should be careful to upgrade standbys first, primaries later. Otherwise they risk the standby being unable to start if the primary happens to write such a record. A new TAP test that exercises this is added, but the portability of it is yet to be seen. This has been wrong since the introduction of physical replication, so backpatch all the way back. In stable branches, keep the new XLogReaderState members at the end of the struct, to avoid an ABI break. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Nathan Bossart <bossartn@amazon.com> Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
* Document XLOG_INCLUDE_XID a little betterAlvaro Herrera2021-09-21
| | | | | | | | | | | | I noticed that commit 0bead9af484c left this flag undocumented in XLogSetRecordFlags, which led me to discover that the flag doesn't actually do what the one comment on it said it does. Improve the situation by adding some more comments. Backpatch to 14, where the aforementioned commit appears. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/202109212119.c3nhfp64t2ql@alvherre.pgsql
* Send NOTIFY signals during CommitTransaction.Tom Lane2021-09-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Formerly, we sent signals for outgoing NOTIFY messages within ProcessCompletedNotifies, which was also responsible for sending relevant ones of those messages to our connected client. It therefore had to run during the main-loop processing that occurs just before going idle. This arrangement had two big disadvantages: * Now that procedures allow intra-command COMMITs, it would be useful to send NOTIFYs to other sessions immediately at COMMIT (though, for reasons of wire-protocol stability, we still shouldn't forward them to our client until end of command). * Background processes such as replication workers would not send NOTIFYs at all, since they never execute the client communication loop. We've had requests to allow triggers running in replication workers to send NOTIFYs, so that's a problem. To fix these things, move transmission of outgoing NOTIFY signals into AtCommit_Notify, where it will happen during CommitTransaction. Also move the possible call of asyncQueueAdvanceTail there, to ensure we don't bloat the async SLRU if a background worker sends many NOTIFYs with no one listening. We can also drop the call of asyncQueueReadAllNotifications, allowing ProcessCompletedNotifies to go away entirely. That's because commit 790026972 added a call of ProcessNotifyInterrupt adjacent to PostgresMain's call of ProcessCompletedNotifies, and that does its own call of asyncQueueReadAllNotifications, meaning that we were uselessly doing two such calls (inside two separate transactions) whenever inbound notify signals coincided with an outbound notify. We need only set notifyInterruptPending to ensure that ProcessNotifyInterrupt runs, and we're done. The existing documentation suggests that custom background workers should call ProcessCompletedNotifies if they want to send NOTIFY messages. To avoid an ABI break in the back branches, reduce it to an empty routine rather than removing it entirely. Removal will occur in v15. Although the problems mentioned above have existed for awhile, I don't feel comfortable back-patching this any further than v13. There was quite a bit of churn in adjacent code between 12 and 13. At minimum we'd have to also backpatch 51004c717, and a good deal of other adjustment would also be needed, so the benefit-to-risk ratio doesn't look attractive. Per bug #15293 from Michael Powers (and similar gripes from others). Artur Zakirov and Tom Lane Discussion: https://postgr.es/m/153243441449.1404.2274116228506175596@wrigleys.postgresql.org
* Revert "Avoid creating archive status ".ready" files too early"Alvaro Herrera2021-09-04
| | | | | | | | | | This reverts commit 515e3d84a0b5 and equivalent commits in back branches. This solution to the problem has a number of problems, so we'll try again with a different approach. Per note from Andres Freund Discussion: https://postgr.es/m/20210831042949.52eqp5xwbxgrfank@alap3.anarazel.de
* Fix broken snapshot handling in parallel workers.Robert Haas2021-08-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | Pengchengliu reported an assertion failure in a parallel woker while performing a parallel scan using an overflowed snapshot. The proximate cause is that TransactionXmin was set to an incorrect value. The underlying cause is incorrect snapshot handling in parallel.c. In particular, InitializeParallelDSM() was unconditionally calling GetTransactionSnapshot(), because I (rhaas) mistakenly thought that was always retrieving an existing snapshot whereas, at isolation levels less than REPEATABLE READ, it's actually taking a new one. So instead do this only at higher isolation levels where there actually is a single snapshot for the whole transaction. By itself, this is not a sufficient fix, because we still need to guarantee that TransactionXmin gets set properly in the workers. The easiest way to do that seems to be to install the leader's active snapshot as the transaction snapshot if the leader did not serialize a transaction snapshot. This doesn't affect the results of future GetTrasnactionSnapshot() calls since those have to take a new snapshot anyway; what we care about is the side effect of setting TransactionXmin. Report by Pengchengliu. Patch by Greg Nancarrow, except for some comment text which I supplied. Discussion: https://postgr.es/m/002f01d748ac$eaa781a0$bff684e0$@tju.edu.cn
* Avoid creating archive status ".ready" files too earlyAlvaro Herrera2021-08-23
| | | | | | | | | | | | | | | | | | | | | | | WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file is created, and the archiver may process it. If PostgreSQL crashes before it is able to write and flush the rest of the record (in the next WAL segment), the wrong version of the first segment file lingers in the archive, which causes operations such as point-in-time restores to fail. To fix this, keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. This has always been wrong, so backpatch all the way back. Author: Nathan Bossart <bossartn@amazon.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
* Refresh apply delay on reload of recovery_min_apply_delay at recoveryMichael Paquier2021-08-16
| | | | | | | | | | | | | | | | | This commit ensures that the wait interval in the replay delay loop waiting for an amount of time defined by recovery_min_apply_delay is correctly handled on reload, recalculating the delay if this GUC value is updated, based on the timestamp of the commit record being replayed. The previous behavior would be problematic for example with replay still waiting even if the delay got reduced or just cancelled. If the apply delay was increased to a larger value, the wait would have just respected the old value set, finishing earlier. Author: Soumyadeep Chakraborty, Ashwin Agrawal Reviewed-by: Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/CAE-ML+93zfr-HLN8OuxF0BjpWJ17O5dv1eMvSE5jsj9jpnAXZA@mail.gmail.com Backpatch-through: 9.6
* Update obsolete comment that still referred to CheckpointLockHeikki Linnakangas2021-07-30
| | | | | | CheckpointLock was removed in commit d18e75664a, and commit ce197e91d0 updated a leftover comment in CreateCheckPoint, but there was another copy of it in CreateRestartPoint still.
* Close yet another race condition in replication slot test codeAlvaro Herrera2021-07-29
| | | | | | | | | | | | | | | Buildfarm shows that this test has a further failure mode when a checkpoint starts earlier than expected, so we detect a "checkpoint completed" line that's not the one we want. Change the config to try and prevent this. Per buildfarm While at it, update one comment that was forgotten in commit d18e75664a2f. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20210729.162038.534808353849568395.horikyota.ntt@gmail.com
* Update minimum recovery point on truncation during WAL replay of abort record.Fujii Masao2021-07-29
| | | | | | | | | | | | | | | | If a file is truncated, we must update minRecoveryPoint. Once a file is truncated, there's no going back; it would not be safe to stop recovery at a point earlier than that anymore. Commit 7bffc9b7bf changed xact_redo_commit() so that it updates minRecoveryPoint on truncation, but forgot to change xact_redo_abort(). Back-patch to all supported versions. Reported-by: mengjuan.cmj@alibaba-inc.com Author: Fujii Masao Reviewed-by: Heikki Linnakangas Discussion: https://postgr.es/m/b029fce3-4fac-4265-968e-16f36ff4d075.mengjuan.cmj@alibaba-inc.com
* Advance old-segment horizon properly after slot invalidationAlvaro Herrera2021-07-16
| | | | | | | | | | | | | | | | When some slots are invalidated due to the max_slot_wal_keep_size limit, the old segment horizon should move forward to stay within the limit. However, in commit c6550776394e we forgot to call KeepLogSeg again to recompute the horizon after invalidating replication slots. In cases where other slots remained, the limits would be recomputed eventually for other reasons, but if all slots were invalidated, the limits would not move at all afterwards. Repair. Backpatch to 13 where the feature was introduced. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reported-by: Marcin Krupowicz <mk@071.ovh> Discussion: https://postgr.es/m/17103-004130e8f27782c9@postgresql.org
* Fix incorrect PITR message for transaction ROLLBACK PREPAREDMichael Paquier2021-06-30
| | | | | | | | | | | Reaching PITR on such a transaction would cause the generation of a LOG message mentioning a transaction committed, not aborted. Oversight in 4f1b890. Author: Simon Riggs Discussion: https://postgr.es/m/CANbhV-GJ6KijeCgdOrxqMCQ+C8QiK657EMhCy4csjrPcEUFv_Q@mail.gmail.com Backpatch-through: 9.6
* Fix race condition in TransactionGroupUpdateXidStatus().Amit Kapila2021-06-28
| | | | | | | | | | | | | | | | | | | When we cannot immediately acquire XactSLRULock in exclusive mode at commit time, we add ourselves to a list of processes that need their XIDs status update. We do this if the clog page where we need to update the current transaction status is the same as the group leader's clog page, otherwise, we allow the caller to clear it by itself. Now, when we can't add ourselves to any group, we were not clearing the current proc if it has already become a member of some group which was leading to an assertion failure when the same proc was assigned to another backend after the current backend exits. Reported-by: Alexander Lakhin Bug: 17072 Author: Amit Kapila Tested-By: Alexander Lakhin Backpatch-through: 11, where it was introduced Discussion: https://postgr.es/m/17072-2f8764857ef2c92a@postgresql.org
* Tidy up GetMultiXactIdMembers()'s behavior on errorHeikki Linnakangas2021-06-17
| | | | | | | | | | | | | | | | | | | | | | | | | One of the error paths left *members uninitialized. That's not a live bug, because most callers don't look at *members when the function returns -1, but let's be tidy. One caller, in heap_lock_tuple(), does "if (members != NULL) pfree(members)", but AFAICS it never passes an invalid 'multi' value so it should not reach that error case. The callers are also a bit inconsistent in their expectations. heap_lock_tuple() pfrees the 'members' array if it's not-NULL, others pfree() it if "nmembers >= 0", and others if "nmembers > 0". That's not a live bug either, because the function should never return 0, but add an Assert for that to make it more clear. I left the callers alone for now. I also moved the line where we set *nmembers. It wasn't wrong before, but I like to do that right next to the 'return' statement, to make it clear that it's always set on return. Also remove one unreachable return statement after ereport(ERROR), for brevity and for consistency with the similar if-block right after it. Author: Greg Nancarrow with the additional changes by me Backpatch-through: 9.6, all supported versions
* Fix outdated comment that talked about seek position of WAL file.Heikki Linnakangas2021-06-16
| | | | | | | | Since commit c24dcd0cfd, we have been using pg_pread() to read the WAL file, which doesn't change the seek position (unless we fall back to the implementation in src/port/pread.c). Update comment accordingly. Backpatch-through: 12, where we started to use pg_pread()
* Fix corner case failure of new standby to follow new primary.Robert Haas2021-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | This only happens if (1) the new standby has no WAL available locally, (2) the new standby is starting from the old timeline, (3) the promotion happened in the WAL segment from which the new standby is starting, (4) the timeline history file for the new timeline is available from the archive but the WAL files for are not (i.e. this is a race), (5) the WAL files for the new timeline are available via streaming, and (6) recovery_target_timeline='latest'. Commit ee994272ca50f70b53074f0febaec97e28f83c4e introduced this logic and was an improvement over the previous code, but it mishandled this case. If recovery_target_timeline='latest' and restore_command is set, validateRecoveryParameters() can change recoveryTargetTLI to be different from receiveTLI. If streaming is then tried afterward, expectedTLEs gets initialized with the history of the wrong timeline. It's supposed to be a list of entries explaining how to get to the target timeline, but in this case it ends up with a list of entries explaining how to get to the new standby's original timeline, which isn't right. Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi. Discussion: http://postgr.es/m/CAFiTN-sE-jr=LB8jQuxeqikd-Ux+jHiXyh4YDiZMPedgQKup0g@mail.gmail.com
* Make standby promotion reset the recovery pause state to 'not paused'.Fujii Masao2021-05-19
| | | | | | | | | | | | | | If a promotion is triggered while recovery is paused, the paused state ends and promotion continues. But previously in that case pg_get_wal_replay_pause_state() returned 'paused' wrongly while a promotion was ongoing. This commit changes a standby promotion so that it marks the recovery pause state as 'not paused' when it's triggered, to fix the issue. Author: Fujii Masao Reviewed-by: Dilip Kumar, Kyotaro Horiguchi Discussion: https://postgr.es/m/f706876c-4894-0ba5-6f4d-79803eeea21b@oss.nttdata.com
* Initial pgindent and pgperltidy run for v14.Tom Lane2021-05-12
| | | | | | | | Also "make reformat-dat-files". The only change worthy of note is that pgindent messed up the formatting of launcher.c's struct LogicalRepWorkerId, which led me to notice that that struct wasn't used at all anymore, so I just took it out.
* Revert recovery prefetching feature.Thomas Munro2021-05-10
| | | | | | | | | | | | | | | | | | This set of commits has some bugs with known fixes, but at this late stage in the release cycle it seems best to revert and resubmit next time, along with some new automated test coverage for this whole area. Commits reverted: dc88460c: Doc: Review for "Optionally prefetch referenced data in recovery." 1d257577: Optionally prefetch referenced data in recovery. f003d9f8: Add circular WAL decoding buffer. 323cbe7c: Remove read_page callback from XLogReader. Remove the new GUC group WAL_RECOVERY recently added by a55a9847, as the corresponding section of config.sgml is now reverted. Discussion: https://postgr.es/m/CAOuzzgrn7iKnFRsB4MHp3UisEQAGgZMbk_ViTN4HV4-Ksq8zCg%40mail.gmail.com
* Add some forgotten LSN_FORMAT_ARGS() in xlogreader.cMichael Paquier2021-04-24
| | | | | | | | | | 6f6f284 has introduced a specific macro to make printf()-ing of LSNs easier. This takes care of what looks like the remaining code paths that did not get the call. Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi, Tom Lane Discussion: https://postgr.es/m/YIJS9x6K8ruizN7j@paquier.xyz
* Remove use of [U]INT64_FORMAT in some translatable stringsMichael Paquier2021-04-23
| | | | | | | | %lld with (long long), or %llu with (unsigned long long) are more adapted. This is similar to 3286065. Author: Kyotaro Horiguchi Discussion: https://postgr.es/m/20210421.200000.1462448394029407895.horikyota.ntt@gmail.com
* Fix typoPeter Eisentraut2021-04-21
|
* Remove COMMIT_TS_SETTS record.Fujii Masao2021-04-12
| | | | | | | | | | | | | | | | | | | Commit 438fc4a39c prevented the WAL replay from writing COMMIT_TS_SETTS record. By this change there is no code that generates COMMIT_TS_SETTS record in PostgreSQL core. Also we can think that there are no extensions using the record because we've not received so far any complaints about the issue that commit 438fc4a39c fixed. Therefore this commit removes COMMIT_TS_SETTS record and its related code. Even without this record, the timestamp required for commit timestamp feature can be acquired from the COMMIT record. Bump WAL page magic. Reported-by: lx zou <zoulx1982@163.com> Author: Fujii Masao Reviewed-by: Alvaro Herrera Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
* Doc: Review for "Optionally prefetch referenced data in recovery."Thomas Munro2021-04-10
| | | | | | | | Typos, corrections and language improvements in the docs, and a few in code comments too. Reported-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/20210409033703.GP6592%40telsasoft.com
* Optionally prefetch referenced data in recovery.Thomas Munro2021-04-08
| | | | | | | | | | | | | | | | | | | | | | | | Introduce a new GUC recovery_prefetch, disabled by default. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Better mechanisms will follow in later work on the I/O subsystem. The GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os we allow ourselves to initiate, based on pessimistic heuristics used to infer that I/Os have begun and completed. The GUC wal_decode_buffer_size is used to limit the maximum distance we are prepared to read ahead in the WAL to find uncached blocks. Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts) Reviewed-by: Andres Freund <andres@anarazel.de> (parts) Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts) Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
* Add circular WAL decoding buffer.Thomas Munro2021-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | Teach xlogreader.c to decode its output into a circular buffer, to support optimizations based on looking ahead. * XLogReadRecord() works as before, consuming records one by one, and allowing them to be examined via the traditional XLogRecGetXXX() macros. * An alternative new interface XLogNextRecord() is added that returns pointers to DecodedXLogRecord structs that can be examined directly. * XLogReadAhead() provides a second cursor that lets you see further ahead, as long as data is available and there is enough space in the decoding buffer. This returns DecodedXLogRecord pointers to the caller, but also adds them to a queue of records that will later be consumed by XLogNextRecord()/XLogReadRecord(). The buffer's size is controlled with wal_decode_buffer_size. The buffer could potentially be placed into shared memory, for future projects. Large records that don't fit in the circular buffer are called "oversized" and allocated separately with palloc(). Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
* Remove read_page callback from XLogReader.Thomas Munro2021-04-08
| | | | | | | | | | | | | | | | | | | | | Previously, the XLogReader module would fetch new input data using a callback function. Redesign the interface so that it tells the caller to insert more data with a special return value instead. This API suits later patches for prefetching, encryption and maybe other future projects that would otherwise require continually extending the callback interface. As incidental cleanup work, move global variables readOff, readLen and readSegNo inside XlogReaderState. Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> Author: Heikki Linnakangas <hlinnaka@iki.fi> (parts of earlier version) Reviewed-by: Antonin Houska <ah@cybertec.at> Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
* Stop archive recovery if WAL generated with wal_level=minimal is found.Fujii Masao2021-04-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously if hot standby was enabled, archive recovery exited with an error when it found WAL generated with wal_level=minimal. But if hot standby was disabled, it just reported a warning and continued in that case. Which could lead to data loss or errors during normal operation. A warning was emitted, but users could easily miss that and not notice this serious situation until they encountered the actual errors. To improve this situation, this commit changes archive recovery so that it exits with FATAL error when it finds WAL generated with wal_level=minimal whatever the setting of hot standby. This enables users to notice the serious situation soon. The FATAL error is thrown if archive recovery starts from a base backup taken before wal_level is changed to minimal. When archive recovery exits with the error, if users have a base backup taken after setting wal_level to higher than minimal, they can recover the database by starting archive recovery from that newer backup. But note that if such backup doesn't exist, there is no easy way to complete archive recovery, which may make the database server unstartable and users may lose whole database. The commit adds the note about this risk into the document. Even in the case of unstartable database server, previously by just disabling hot standby users could avoid the error during archive recovery, forcibly start up the server and salvage data from it. But note that this commit makes this procedure unavailable at all. Author: Takamichi Osumi Reviewed-by: Laurenz Albe, Kyotaro Horiguchi, David Steele, Fujii Masao Discussion: https://postgr.es/m/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90@OSBPR01MB4888.jpnprd01.prod.outlook.com
* Fix bug in WAL replay of COMMIT_TS_SETTS record.Fujii Masao2021-03-25
| | | | | | | | | | | | | | | | | | Previously the WAL replay of COMMIT_TS_SETTS record called TransactionTreeSetCommitTsData() with the argument write_xlog=true, which generated and wrote new COMMIT_TS_SETTS record. This should not be acceptable because it's during recovery. This commit fixes the WAL replay of COMMIT_TS_SETTS record so that it calls TransactionTreeSetCommitTsData() with write_xlog=false and doesn't generate new WAL during recovery. Back-patch to all supported branches. Reported-by: lx zou <zoulx1982@163.com> Author: Fujii Masao Reviewed-by: Alvaro Herrera Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
* Revert "Enable parallel SELECT for "INSERT INTO ... SELECT ..."."Amit Kapila2021-03-24
| | | | | | | | | | | | | | | | | | | To allow inserts in parallel-mode this feature has to ensure that all the constraints, triggers, etc. are parallel-safe for the partition hierarchy which is costly and we need to find a better way to do that. Additionally, we could have used existing cached information in some cases like indexes, domains, etc. to determine the parallel-safety. List of commits reverted, in reverse chronological order: ed62d3737c Doc: Update description for parallel insert reloption. c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode. c5be48f092 Improve FK trigger parallel-safety check added by 05c8482f7f. e2cda3c20a Fix use of relcache TriggerDesc field introduced by commit 05c8482f7f. e4e87a32cc Fix valgrind issue in commit 05c8482f7f. 05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...". Discussion: https://postgr.es/m/E1lMiB9-0001c3-SY@gemulon.postgresql.org
* Fix timeline assignment in checkpoints with 2PC transactionsMichael Paquier2021-03-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Any transactions found as still prepared by a checkpoint have their state data read from the WAL records generated by PREPARE TRANSACTION before being moved into their new location within pg_twophase/. While reading such records, the WAL reader uses the callback read_local_xlog_page() to read a page, that is shared across various parts of the system. This callback, since 1148e22a, has introduced an update of ThisTimeLineID when reading a record while in recovery, which is potentially helpful in the context of cascading WAL senders. This update of ThisTimeLineID interacts badly with the checkpointer if a promotion happens while some 2PC data is read from its record, as, by changing ThisTimeLineID, any follow-up WAL records would be written to an timeline older than the promoted one. This results in consistency issues. For instance, a subsequent server restart would cause a failure in finding a valid checkpoint record, resulting in a PANIC, for instance. This commit changes the code reading the 2PC data to reset the timeline once the 2PC record has been read, to prevent messing up with the static state of the checkpointer. It would be tempting to do the same thing directly in read_local_xlog_page(). However, based on the discussion that has led to 1148e22a, users may rely on the updates of ThisTimeLineID when a WAL record page is read in recovery, so changing this callback could break some cases that are working currently. A TAP test reproducing the issue is added, relying on a PITR to precisely trigger a promotion with a prepared transaction still tracked. Per discussion with Heikki Linnakangas, Kyotaro Horiguchi, Fujii Masao and myself. Author: Soumyadeep Chakraborty, Jimmy Yih, Kevin Yeap Discussion: https://postgr.es/m/CAE-ML+_EjH_fzfq1F3RJ1=XaaNG=-Jz-i3JqkNhXiLAsM3z-Ew@mail.gmail.com Backpatch-through: 10
* Code review for server's handling of "tablespace map" files.Tom Lane2021-03-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While looking at Robert Foggia's report, I noticed a passel of other issues in the same area: * The scheme for backslash-quoting newlines in pathnames is just wrong; it will misbehave if the last ordinary character in a pathname is a backslash. I'm not sure why we're bothering to allow newlines in tablespace paths, but if we're going to do it we should do it without introducing other problems. Hence, backslashes themselves have to be backslashed too. * The author hadn't read the sscanf man page very carefully, because this code would drop any leading whitespace from the path. (I doubt that a tablespace path with leading whitespace could happen in practice; but if we're bothering to allow newlines in the path, it sure seems like leading whitespace is little less implausible.) Using sscanf for the task of finding the first space is overkill anyway. * While I'm not 100% sure what the rationale for escaping both \r and \n is, if the idea is to allow Windows newlines in the file then this code failed, because it'd throw an error if it saw \r followed by \n. * There's no cross-check for an incomplete final line in the map file, which would be a likely apparent symptom of the improper-escaping bug. On the generation end, aside from the escaping issue we have: * If needtblspcmapfile is true then do_pg_start_backup will pass back escaped strings in tablespaceinfo->path values, which no caller wants or is prepared to deal with. I'm not sure if there's a live bug from that, but it looks like there might be (given the dubious assumption that anyone actually has newlines in their tablespace paths). * It's not being very paranoid about the possibility of random stuff in the pg_tblspc directory. IMO we should ignore anything without an OID-like name. The escaping rule change doesn't seem back-patchable: it'll require doubling of backslashes in the tablespace_map file, which is basically a basebackup format change. The odds of that causing trouble are considerably more than the odds of the existing bug causing trouble. The rest of this seems somewhat unlikely to cause problems too, so no back-patch.
* Prevent buffer overrun in read_tablespace_map().Tom Lane2021-03-17
| | | | | | | | | | | | | | | | Robert Foggia of Trustwave reported that read_tablespace_map() fails to prevent an overrun of its on-stack input buffer. Since the tablespace map file is presumed trustworthy, this does not seem like an interesting security vulnerability, but still we should fix it just in the name of robustness. While here, document that pg_basebackup's --tablespace-mapping option doesn't work with tar-format output, because it doesn't. To make it work, we'd have to modify the tablespace_map file within the tarball sent by the server, which might be possible but I'm not volunteering. (Less-painful solutions would require changing the basebackup protocol so that the source server could adjust the map. That's not very appetizing either.)
* Make archiver process an auxiliary process.Fujii Masao2021-03-15
| | | | | | | | | | | | | | | | | | | | | | | This commit changes WAL archiver process so that it's treated as an auxiliary process and can use shared memory. This is an infrastructure patch required for upcoming shared-memory based stats collector patch series. These patch series basically need any processes including archiver that can report the statistics to access to shared memory. Since this patch itself is useful to simplify the code and when users monitor the status of archiver, it's committed separately in advance. This commit simplifies the code for WAL archiving. For example, previously backends need to signal to archiver via postmaster when they notify archiver that there are some WAL files to archive. On the other hand, this commit removes that signal to postmaster and enables backends to notify archier directly using shared latch. Also, as the side of this change, the information about archiver process becomes viewable at pg_stat_activity view. Author: Kyotaro Horiguchi Reviewed-by: Andres Freund, Álvaro Herrera, Julien Rouhaud, Tomas Vondra, Arthur Zakirov, Fujii Masao Discussion: https://postgr.es/m/20180629.173418.190173462.horiguchi.kyotaro@lab.ntt.co.jp
* Add condition variable for recovery resume.Thomas Munro2021-03-12
| | | | | | | | | Replace a sleep loop with a CV, to get a fast reaction time when recovery is resumed or the postmaster exits via standard infrastructure. Unfortunately we still need to wake up every second to perform extra polling during the recovery pause loop. Discussion: https://postgr.es/m/CA%2BhUKGK1607VmtrDUHQXrsooU%3Dap4g4R2yaoByWOOA3m8xevUQ%40mail.gmail.com
* Be clear about whether a recovery pause has taken effect.Robert Haas2021-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the code and documentation seem to have essentially assumed than a call to pg_wal_replay_pause() would take place immediately, but that's not the case, because we only check for a pause in certain places. This means that a tool that uses this function and then wants to do something else afterward that is dependent on the pause having taken effect doesn't know how long it needs to wait to be sure that no more WAL is going to be replayed. To avoid that, add a new function pg_get_wal_replay_pause_state() which returns either 'not paused', 'paused requested', or 'paused'. After calling pg_wal_replay_pause() the status will immediate change from 'not paused' to 'pause requested'; when the startup process has noticed this, the status will change to 'pause'. For backward compatibility, pg_is_wal_replay_paused() still exists and returns the same thing as before: true if a pause has been requested, whether or not it has taken effect yet; and false if not. The documentation is updated to clarify. To improve the changes that a pause request is quickly confirmed effective, adjust things so that WaitForWALToBecomeAvailable will swiftly reach a call to recoveryPausesHere() when a pause request is made. Dilip Kumar, reviewed by Simon Riggs, Kyotaro Horiguchi, Yugo Nagata, Masahiko Sawada, and Bharath Rupireddy. Discussion: http://postgr.es/m/CAFiTN-vcLLWEm8Zr%3DYK83rgYrT9pbC8VJCfa1kY9vL3AUPfu6g%40mail.gmail.com
* Enable parallel SELECT for "INSERT INTO ... SELECT ...".Amit Kapila2021-03-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Parallel SELECT can't be utilized for INSERT in the following cases: - INSERT statement uses the ON CONFLICT DO UPDATE clause - Target table has a parallel-unsafe: trigger, index expression or predicate, column default expression or check constraint - Target table has a parallel-unsafe domain constraint on any column - Target table is a partitioned table with a parallel-unsafe partition key expression or support function The planner is updated to perform additional parallel-safety checks for the cases listed above, for determining whether it is safe to run INSERT in parallel-mode with an underlying parallel SELECT. The planner will consider using parallel SELECT for "INSERT INTO ... SELECT ...", provided nothing unsafe is found from the additional parallel-safety checks, or from the existing parallel-safety checks for SELECT. While checking parallel-safety, we need to check it for all the partitions on the table which can be costly especially when we decide not to use a parallel plan. So, in a separate patch, we will introduce a GUC and or a reloption to enable/disable parallelism for Insert statements. Prior to entering parallel-mode for the execution of INSERT with parallel SELECT, a TransactionId is acquired and assigned to the current transaction state. This is necessary to prevent the INSERT from attempting to assign the TransactionId whilst in parallel-mode, which is not allowed. This approach has a disadvantage in that if the underlying SELECT does not return any rows, then the TransactionId is not used, however that shouldn't happen in practice in many cases. Author: Greg Nancarrow, Amit Langote, Amit Kapila Reviewed-by: Amit Langote, Hou Zhijie, Takayuki Tsunakawa, Antonin Houska, Bharath Rupireddy, Dilip Kumar, Vignesh C, Zhihong Yu, Amit Kapila Tested-by: Tang, Haiying Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV=qpFJrR3AcrTS3g@mail.gmail.com Discussion: https://postgr.es/m/CAJcOf-fAdj=nDKMsRhQzndm-O13NY4dL6xGcEvdX5Xvbbi0V7g@mail.gmail.com
* Track total amounts of times spent writing and syncing WAL data to disk.Fujii Masao2021-03-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds new GUC track_wal_io_timing. When this is enabled, the total amounts of time XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are counted in pg_stat_wal. This information would be useful to check how much WAL write and sync affect the performance. Enabling track_wal_io_timing will make the server query the operating system for the current time every time WAL is written or synced, which may cause significant overhead on some platforms. To avoid such additional overhead in the server with track_io_timing enabled, this commit introduces track_wal_io_timing as a separate parameter from track_io_timing. Note that WAL write and sync activity by walreceiver has not been tracked yet. This commit makes the server also track the numbers of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk, in pg_stat_wal, regardless of the setting of track_wal_io_timing. This counters can be used to calculate the WAL write and sync time per request, for example. Bump PGSTAT_FILE_FORMAT_ID. Bump catalog version. Author: Masahiro Ikeda Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston, Fujii Masao Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
* Track replication origin progress for rollbacks.Amit Kapila2021-03-08
| | | | | | | | | | | | | | | | Commit 1eb6d6527a allowed to track replica origin replay progress for 2PC but it was not complete. It misses to properly track the progress for rollback prepared especially it missed updating the code for recovery. Additionally, we need to allow tracking it on subscriber nodes where wal_level might not be logical. It is required to track decoding of 2PC which is committed in PG14 (a271a1b50e) and also nobody complained about this till now so not backpatching it. Author: Amit Kapila Reviewed-by: Michael Paquier and Ajin Cherian Discussion: https://postgr.es/m/CAA4eK1L-kHmMnSdrRW6UhRbCjR7cgh04c+6psY15qzT6ktcd+g@mail.gmail.com
* Fix some typos, grammar and style in docs and commentsMichael Paquier2021-02-24
| | | | | | | | The portions fixing the documentation are backpatched where needed. Author: Justin Pryzby Discussion: https://postgr.es/m/20210210235557.GQ20012@telsasoft.com backpatch-through: 9.6
* Simplify printing of LSNsPeter Eisentraut2021-02-23
| | | | | | | | | | Add a macro LSN_FORMAT_ARGS for use in printf-style printing of LSNs. Convert all applicable code to use it. Reviewed-by: Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://www.postgresql.org/message-id/flat/CAExHW5ub5NaTELZ3hJUCE6amuvqAtsSxc7O+uK7y4t9Rrk23cw@mail.gmail.com
* Initialize atomic variable waitStart in PGPROC, at postmaster startup.Fujii Masao2021-02-22
| | | | | | | | | | | | | | | | | Commit 46d6e5f567 added the atomic variable "waitStart" into PGPROC struct, to store the time at which wait for lock acquisition started. Previously this variable was initialized every time each backend started. Instead, this commit makes postmaster initialize it at the startup, to ensure that the variable should be initialized before any use of it. This commit also moves the code to initialize "waitStart" variable for prepare transaction, from TwoPhaseGetDummyProc() to MarkAsPreparingGuts(). Because MarkAsPreparingGuts() is more proper place to do that since it initializes other PGPROC variables. Author: Fujii Masao Reviewed-by: Atsushi Torikoshi Discussion: https://postgr.es/m/1df88660-6f08-cc6e-b7e2-f85296a2bdab@oss.nttdata.com
* Fix bug in COMMIT AND CHAIN command.Fujii Masao2021-02-19
| | | | | | | | | | | | | | | | | | This commit fixes COMMIT AND CHAIN command so that it starts new transaction immediately even if savepoints are defined within the transaction to commit. Previously COMMIT AND CHAIN command did not in that case because commit 280a408b48 forgot to make CommitTransactionCommand() handle a transaction chaining when the transaction state was TBLOCK_SUBCOMMIT. Also this commit adds the regression test for COMMIT AND CHAIN command when savepoints are defined. Back-patch to v12 where transaction chaining was added. Reported-by: Arthur Nascimento Author: Fujii Masao Reviewed-by: Arthur Nascimento, Vik Fearing Discussion: https://postgr.es/m/16867-3475744069228158@postgresql.org
* Use errmsg_internal for debug messagesPeter Eisentraut2021-02-17
| | | | | | An inconsistent set of debug-level messages was not using errmsg_internal(), thus uselessly exposing the messages to translation work. Fix those.
* Display the time when the process started waiting for the lock, in pg_locks, ↵Fujii Masao2021-02-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | take 2 This commit adds new column "waitstart" into pg_locks view. This column reports the time when the server process started waiting for the lock if the lock is not held. This information is useful, for example, when examining the amount of time to wait on a lock by subtracting "waitstart" in pg_locks from the current time, and identify the lock that the processes are waiting for very long. This feature uses the current time obtained for the deadlock timeout timer as "waitstart" (i.e., the time when this process started waiting for the lock). Since getting the current time newly can cause overhead, we reuse the already-obtained time to avoid that overhead. Note that "waitstart" is updated without holding the lock table's partition lock, to avoid the overhead by additional lock acquisition. This can cause "waitstart" in pg_locks to become NULL for a very short period of time after the wait started even though "granted" is false. This is OK in practice because we can assume that users are likely to look at "waitstart" when waiting for the lock for a long time. The first attempt of this patch (commit 3b733fcd04) caused the buildfarm member "rorqual" (built with --disable-atomics --disable-spinlocks) to report the failure of the regression test. It was reverted by commit 890d2182a2. The cause of this failure was that the atomic variable for "waitstart" in the dummy process entry created at the end of prepare transaction was not initialized. This second attempt fixes that issue. Bump catalog version. Author: Atsushi Torikoshi Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
* ReadNewTransactionId() -> ReadNextTransactionId().Thomas Munro2021-02-15
| | | | | | | | | The new name conveys the effect better, is more consistent with similar functions ReadNextMultiXactId(), ReadNextFullTransactionId(), and matches the name of the variable that it reads. Reported-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzmVR4SakBXQUdhhPpMf1aYvZCnna5%3DHKa7DAgEmBAg%2B8g%40mail.gmail.com
* Allow multiple xacts during table sync in logical replication.Amit Kapila2021-02-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the initial table data synchronization in logical replication, we use a single transaction to copy the entire table and then synchronize the position in the stream with the main apply worker. There are multiple downsides of this approach: (a) We have to perform the entire copy operation again if there is any error (network breakdown, error in the database operation, etc.) while we synchronize the WAL position between tablesync worker and apply worker; this will be onerous especially for large copies, (b) Using a single transaction in the synchronization-phase (where we can receive WAL from multiple transactions) will have the risk of exceeding the CID limit, (c) The slot will hold the WAL till the entire sync is complete because we never commit till the end. This patch solves all the above downsides by allowing multiple transactions during the tablesync phase. The initial copy is done in a single transaction and after that, we commit each transaction as we receive. To allow recovery after any error or crash, we use a permanent slot and origin to track the progress. The slot and origin will be removed once we finish the synchronization of the table. We also remove slot and origin of tablesync workers if the user performs DROP SUBSCRIPTION .. or ALTER SUBSCRIPTION .. REFERESH and some of the table syncs are still not finished. The commands ALTER SUBSCRIPTION ... REFRESH PUBLICATION and ALTER SUBSCRIPTION ... SET PUBLICATION ... with refresh option as true cannot be executed inside a transaction block because they can now drop the slots for which we have no provision to rollback. This will also open up the path for logical replication of 2PC transactions on the subscriber side. Previously, we can't do that because of the requirement of maintaining a single transaction in tablesync workers. Bump catalog version due to change of state in the catalog (pg_subscription_rel). Author: Peter Smith, Amit Kapila, and Takamichi Osumi Reviewed-by: Ajin Cherian, Petr Jelinek, Hou Zhijie and Amit Kapila Discussion: https://postgr.es/m/CAA4eK1KHJxaZS-fod-0fey=0tq3=Gkn4ho=8N4-5HWiCfu0H1A@mail.gmail.com