| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
That bit is unlogged and therefore it's wrong to consider it in WAL page
comparison.
Add a test that tickles the case, as branch testing technology allows.
This has been a problem ever since wal consistency checking was
introduced (commit a507b86900f6 for pg10), so backpatch to all supported
branches.
Author: 王海洋 (Haiyang Wang) <wanghaiyang.001@bytedance.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/CACciXAD2UvLMOhc4jX9VvOKt7DtYLr3OYRBhvOZ-jRxtzc_7Jg@mail.gmail.com
Discussion: https://postgr.es/m/CACciXADOfErX9Bx0nzE_SkdfXr6Bbpo5R=v_B6MUTEYW4ya+cg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
| |
The five commits ending at cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
closed this race condition for v15+. For v14 through v10, add a HINT to
discourage studying the cosmetic problem.
Reviewed by Kyotaro Horiguchi and David Steele.
Discussion: https://postgr.es/m/20220731061747.GA3692882@rfd.leadboat.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Crash recovery on standby may encounter missing directories
when replaying database-creation WAL records. Prior to this
patch, the standby would fail to recover in such a case;
however, the directories could be legitimately missing.
Consider the following sequence of commands:
CREATE DATABASE
DROP DATABASE
DROP TABLESPACE
If, after replaying the last WAL record and removing the
tablespace directory, the standby crashes and has to replay the
create database record again, crash recovery must be able to continue.
A fix for this problem was already attempted in 49d9cfc68bf4, but it
was reverted because of design issues. This new version is based
on Robert Haas' proposal: any missing tablespaces are created
during recovery before reaching consistency. Tablespaces
are created as real directories, and should be deleted
by later replay. CheckRecoveryConsistency ensures
they have disappeared.
The problems detected by this new code are reported as PANIC,
except when allow_in_place_tablespaces is set to ON, in which
case they are WARNING. Apart from making tests possible, this
gives users an escape hatch in case things don't go as planned.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Author: Paul Guo <paulguo@gmail.com>
Reviewed-by: Anastasia Lubennikova <lubennikovaav@gmail.com> (older versions)
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> (older versions)
Reviewed-by: Michaël Paquier <michael@paquier.xyz>
Diagnosed-by: Paul Guo <paulguo@gmail.com>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a backpatch to branches 10-14 of the following commits:
7170f2159fb2 Allow "in place" tablespaces.
c6f2f01611d4 Fix pg_basebackup with in-place tablespaces.
f6f0db4d6240 Fix pg_tablespace_location() with in-place tablespaces
7a7cd84893e0 doc: Remove mention to in-place tablespaces for pg_tablespace_location()
5344723755bd Remove unnecessary Windows-specific basebackup code.
In-place tablespaces were introduced as a testing helper mechanism, but
they are going to be used for a bugfix in WAL replay to be backpatched
to all stable branches.
I (Álvaro) had to adjust some code to account for lack of
get_dirent_type() in branches prior to 14.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Michaël Paquier <michael@paquier.xyz>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20220722081858.omhn2in5zt3g4nek@alvherre.pgsql
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We have a few commands that "can't run in a transaction block",
meaning that if they complete their processing but then we fail
to COMMIT, we'll be left with inconsistent on-disk state.
However, the existing defenses for this are only watertight for
simple query protocol. In extended protocol, we didn't commit
until receiving a Sync message. Since the client is allowed to
issue another command instead of Sync, we're in trouble if that
command fails or is an explicit ROLLBACK. In any case, sitting
in an inconsistent state while waiting for a client message
that might not come seems pretty risky.
This case wasn't reachable via libpq before we introduced pipeline
mode, but it's always been an intended aspect of extended query
protocol, and likely there are other clients that could reach it
before.
To fix, set a flag in PreventInTransactionBlock that tells
exec_execute_message to force an immediate commit. This seems
to be the approach that does least damage to existing working
cases while still preventing the undesirable outcomes.
While here, add some documentation to protocol.sgml that explicitly
says how to use pipelining. That's latent in the existing docs if
you know what to look for, but it's better to spell it out; and it
provides a place to document this new behavior.
Per bug #17434 from Yugo Nagata. It's been wrong for ages,
so back-patch to all supported branches.
Discussion: https://postgr.es/m/17434-d9f7a064ce2a88a3@postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a non-exclusive backup is canceled, do_pg_abort_backup() is called
and resets some variables set by pg_backup_start (pg_start_backup in v14
or before). But previously it forgot to reset the session state indicating
whether a non-exclusive backup is in progress or not in this session.
This issue could cause an assertion failure when the session running
BASE_BACKUP is terminated after it executed pg_backup_start and
pg_backup_stop (pg_stop_backup in v14 or before). Also it could cause
a segmentation fault when pg_backup_stop is called after BASE_BACKUP
in the same session is canceled.
This commit fixes the issue by making do_pg_abort_backup reset
that session state.
Back-patch to all supported branches.
Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Masahiko Sawada, Michael Paquier, Robert Haas
Discussion: https://postgr.es/m/3374718f-9fbf-a950-6d66-d973e027f44c@oss.nttdata.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
TransactionIdIsInProgress had a fast path to return 'false' if the
single-item CLOG cache said that the transaction was known to be
committed. However, that was wrong, because a transaction is first
marked as committed in the CLOG but doesn't become visible to others
until it has removed its XID from the proc array. That could lead to an
error:
ERROR: t_xmin is uncommitted in tuple to be updated
or for an UPDATE to go ahead without blocking, before the previous
UPDATE on the same row was made visible.
The window is usually very short, but synchronous replication makes it
much wider, because the wait for synchronous replica happens in that
window.
Another thing that makes it hard to hit is that it's hard to get such
a commit-in-progress transaction into the single item CLOG cache.
Normally, if you call TransactionIdIsInProgress on such a transaction,
it determines that the XID is in progress without checking the CLOG
and without populating the cache. One way to prime the cache is to
explicitly call pg_xact_status() on the XID. Another way is to use a
lot of subtransactions, so that the subxid cache in the proc array is
overflown, making TransactionIdIsInProgress rely on pg_subtrans and
CLOG checks.
This has been broken ever since it was introduced in 2008, but the race
condition is very hard to hit, especially without synchronous
replication. There were a couple of reports of the error starting from
summer 2021, but no one was able to find the root cause then.
TransactionIdIsKnownCompleted() is now unused. In 'master', remove it,
but I left it in place in backbranches in case it's used by extensions.
Also change pg_xact_status() to check TransactionIdIsInProgress().
Previously, it only checked the CLOG, and returned "committed" before
the transaction was actually made visible to other queries. Note that
this also means that you cannot use pg_xact_status() to reproduce the
bug anymore, even if the code wasn't fixed.
Report and analysis by Konstantin Knizhnik. Patch by Simon Riggs, with
the pg_xact_status() change added by me.
Author: Simon Riggs
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/flat/4da7913d-398c-e2ad-d777-f752cf7f0bbb%40garret.ru
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since a117cebd6, some older gcc versions issue "variable may be used
uninitialized in this function" complaints for brin_summarize_range.
Silence that using the same coding pattern as in bt_index_check_internal;
arguably, a117cebd6 had too narrow a view of which compilers might give
trouble.
Nathan Bossart and Tom Lane. Back-patch as the previous commit was.
Discussion: https://postgr.es/m/20220601163537.GA2331988@nathanxps13
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a cluster is promoted (aka the control file shows a state different
than DB_IN_ARCHIVE_RECOVERY) while CreateRestartPoint() is still
processing, this function could miss an update of the control file for
"checkPoint" and "checkPointCopy" but still do the recycling and/or
removal of the past WAL segments, assuming that the to-be-updated LSN
values should be used as reference points for the cleanup. This causes
a follow-up restart attempting crash recovery to fail with a PANIC on a
missing checkpoint record if the end-of-recovery checkpoint triggered by
the promotion did not complete while the cluster abruptly stopped or
crashed before the completion of this checkpoint. The PANIC would be
caused by the redo LSN referred in the control file as located in a
segment already gone, recycled by the previous restartpoint with
"checkPoint" out-of-sync in the control file.
This commit fixes the update of the control file during restartpoints so
as "checkPoint" and "checkPointCopy" are updated even if the cluster has
been promoted while a restartpoint is running, to be on par with the set
of WAL segments actually recycled in the end of CreateRestartPoint().
7863ee4 has fixed this problem already on master, but the release timing
of the latest point versions did not let me enough time to study and fix
that on all the stable branches.
Reported-by: Fujii Masao, Rui Zhao
Author: Kyotaro Horiguchi
Reviewed-by: Nathan Bossart, Michael Paquier
Discussion: https://postgr.es/m/20220316.102444.2193181487576617583.horikyota.ntt@gmail.com
Backpatch-through: 10
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a feature enumerates relations and runs functions associated with
all found relations, the feature's user shall not need to trust every
user having permission to create objects. BRIN-specific functionality
in autovacuum neglected to account for this, as did pg_amcheck and
CLUSTER. An attacker having permission to create non-temp objects in at
least one schema could execute arbitrary SQL functions under the
identity of the bootstrap superuser. CREATE INDEX (not a
relation-enumerating operation) and REINDEX protected themselves too
late. This change extends to the non-enumerating amcheck interface.
Back-patch to v10 (all supported versions).
Sergey Shinderuk, reviewed (in earlier versions) by Alexander Lakhin.
Reported by Alexander Lakhin.
Security: CVE-2022-1552
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The back-patch of commit bbace5697df12398e87ffd9879171c39d27f5b33 had
the unfortunate effect of changing the layout of PGPROC in the
back-branches, which could break extensions. This happened because it
changed the delayChkpt from type bool to type int. So, change it back,
and add a new bool delayChkptEnd field instead. The new field should
fall within what used to be padding space within the struct, and so
hopefully won't cause any extensions to break.
Per report from Markus Wanner and discussion with Tom Lane and others.
Patch originally by me, somewhat revised by Markus Wanner per a
suggestion from Michael Paquier. A very similar patch was developed
by Kyotaro Horiguchi, but I failed to see the email in which that was
posted before writing one of my own.
Discussion: http://postgr.es/m/CA+Tgmoao-kUD9c5nG5sub3F7tbo39+cdr8jKaOVEs_1aBWcJ3Q@mail.gmail.com
Discussion: http://postgr.es/m/20220406.164521.17171257901083417.horikyota.ntt@gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
heap_fetch() used to have a "keep_buf" parameter that told it to return
ownership of the buffer pin to the caller after finding that the
requested tuple TID exists but is invisible to the specified snapshot.
This was thoughtlessly removed in commit 5db6df0c0, which broke
heapam_tuple_lock() (formerly EvalPlanQualFetch) because that function
needs to do more accesses to the tuple even if it's invisible. The net
effect is that we would continue to touch the page for a microsecond or
two after releasing pin on the buffer. Usually no harm would result;
but if a different session decided to defragment the page concurrently,
we could see garbage data and mistakenly conclude that there's no newer
tuple version to chain up to. (It's hard to say whether this has
happened in the field. The bug was actually found thanks to a later
change that allowed valgrind to detect accesses to non-pinned buffers.)
The most reasonable way to fix this is to reintroduce keep_buf,
although I made it behave slightly differently: buffer ownership
is passed back only if there is a valid tuple at the requested TID.
In HEAD, we can just add the parameter back to heap_fetch().
To avoid an API break in the back branches, introduce an additional
function heap_fetch_extended() in those branches.
In HEAD there is an additional, less obvious API change: tuple->t_data
will be set to NULL in all cases where buffer ownership is not returned,
in particular when the tuple exists but fails the time qual (and
!keep_buf). This is to defend against any other callers attempting to
access non-pinned buffers. We concluded that making that change in back
branches would be more likely to introduce problems than cure any.
In passing, remove a comment about heap_fetch that was obsoleted by
9a8ee1dc6.
Per bug #17462 from Daniil Anisimov. Back-patch to v12 where the bug
was introduced.
Discussion: https://postgr.es/m/17462-9c98a0f00df9bd36@postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.
Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.
Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Invalidate abortedRecPtr and missingContrecPtr after a missing
continuation record is successfully skipped on a standby. This fixes a
PANIC caused when a recently promoted standby attempts to write an
OVERWRITE_RECORD with an LSN of the previously read aborted record.
Backpatch to 10 (all stable versions).
Author: Sami Imseih <simseih@amazon.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/44D259DE-7542-49C4-8A52-2AB01534DCA9@amazon.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commands like ALTER TABLE SET TABLESPACE may leave files for the next
checkpoint to clean up. If such files are not removed by the time DROP
TABLESPACE is called, we request a checkpoint so that they are deleted.
However, there is presently a window before checkpoint start where new
unlink requests won't be scheduled until the following checkpoint. This
means that the checkpoint forced by DROP TABLESPACE might not remove the
files we expect it to remove, and the following ERROR will be emitted:
ERROR: tablespace "mytblspc" is not empty
To fix, add a call to AbsorbSyncRequests() just before advancing the
unlink cycle counter. This ensures that any unlink requests forwarded
prior to checkpoint start (i.e., when ckpt_started is incremented) will
be processed by the current checkpoint. Since AbsorbSyncRequests()
performs memory allocations, it cannot be called within a critical
section, so we also need to move SyncPreCheckpoint() to before
CreateCheckPoint()'s critical section.
This is an old bug, so back-patch to all supported versions.
Author: Nathan Bossart <nathandbossart@gmail.com>
Reported-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20220215235845.GA2665318%40nathanxps13
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most of these are cases where we could call memcpy() or other libc
functions with a NULL pointer and a zero count, which is forbidden
by POSIX even though every production version of libc allows it.
We've fixed such things before in a piecemeal way, but apparently
never made an effort to try to get them all. I don't claim that
this patch does so either, but it gets every failure I observe in
check-world, using clang 12.0.1 on current RHEL8.
numeric.c has a different issue that the sanitizer doesn't like:
"ln(-1.0)" will compute log10(0) and then try to assign the
resulting -Inf to an integer variable. We don't actually use the
result in such a case, so there's no live bug.
Back-patch to all supported branches, with the idea that we might
start running a buildfarm member that tests this case. This includes
back-patching c1132aae3 (Check the size in COPY_POINTER_FIELD),
which previously silenced some of these issues in copyfuncs.c.
Discussion: https://postgr.es/m/CALNJ-vT9r0DSsAOw9OXVJFxLENoVS_68kJ5x0p44atoYH+H4dg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, during UPDATE, the unchanged replica identity key attributes
are not logged separately because they are getting logged as part of the
new tuple. But if they are stored externally then the untoasted values are
not getting logged as part of the new tuple and logical replication won't
be able to replicate such UPDATEs. So we need to log such attributes as
part of the old_key_tuple during UPDATE.
Reported-by: Haiying Tang
Author: Dilip Kumar and Amit Kapila
Reviewed-by: Alvaro Herrera, Haiying Tang, Andres Freund
Backpatch-through: 10
Discussion: https://postgr.es/m/OS0PR01MB611342D0A92D4F4BF26C0F47FB229@OS0PR01MB6113.jpnprd01.prod.outlook.com
|
|
|
|
|
|
|
|
|
|
|
| |
While individual logical rewrite files were synced to disk, the directory was
not. On some filesystems that could lead to loosing directory entries after a
crash.
Reported-By: Tom Lane <tgl@sss.pgh.pa.us>
Author: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/867F2E29-2782-4869-970E-B984C6D35A8F@amazon.com
Backpatch: 10-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The logic in charge of writing commit timestamps (enabled with
track_commit_timestamp) for subtransactions had a one-bug bug,
where it would be possible that commit timestamps go missing for the
last subtransaction committed.
While on it, simplify a bit the iteration logic in the loop writing the
commit timestamps, as per suggestions from Kyotaro Horiguchi and Tom
Lane, so as some variable initializations are not part of the loop
itself.
Issue introduced in 73c986a.
Analyzed-by: Alex Kingsborough
Author: Alex Kingsborough, Kyotaro Horiguchi
Discussion: https://postgr.es/m/73A66172-4050-4F2A-B7F1-13508EDA2144@amazon.com
Backpatch-through: 10
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Under concurrency, it is possible for two sessions to be merrily locking
and releasing a tuple and marking it again as HEAP_XMAX_INVALID all the
while a third session attempts to lock it, miserably fails at it, and
then contemplates life, the universe and everything only to eventually
fail an assertion that said bit is not set. Before SKIP LOCKED that was
indeed a reasonable expectation, but alas! commit df630b0dd5ea falsified
it.
This bug is as old as time itself, and even older, if you think time
begins with the oldest supported branch. Therefore, backpatch to all
supported branches.
Author: Simon Riggs <simon.riggs@enterprisedb.com>
Discussion: https://postgr.es/m/CANbhV-FeEwMnN8yuMyss7if1ZKjOKfjcgqB26n8pqu1e=q0ebg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When using replication origins, pg_replication_origin_xact_setup() is an
optional choice to be able to set a LSN and a timestamp to mark the
origin, which would be additionally added to WAL for transaction commits
or aborts (including 2PC transactions). An assertion in the code path
of PREPARE TRANSACTION assumed that this data should always be set, so
it would trigger when using replication origins without setting up an
origin LSN. Some tests are added to cover more this kind of scenario.
Oversight in commit 1eb6d65.
Per discussion with Amit Kapila and Masahiko Sawada.
Discussion: https://postgr.es/m/YbbBfNSvMm5nIINV@paquier.xyz
Backpatch-through: 11
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
REINDEX CONCURRENTLY run on a toast index or a toast relation could
corrupt the target indexes rebuilt, as a backend running in parallel
that manipulates toast values would directly release the lock on the
toast relation when its local operation is done, rather than releasing
the lock once the transaction that manipulated the toast values
committed.
The fix done here is simple: we now hold a ROW EXCLUSIVE lock on the
toast relation when saving or deleting a toast value until the
transaction working on them is committed, so as a concurrent reindex
happening in parallel would be able to wait for any activity and see any
new rows inserted (or deleted).
An isolation test is added to check after the case fixed here, which is
a bit fancy by design as it relies on allow_system_table_mods to rename
the toast table and its index to fixed names. This way, it is possible
to reindex them directly without any dependency on the OID of the
underlying relation. Note that this could not use a DO block either, as
REINDEX CONCURRENTLY cannot be run in a transaction block. The test is
backpatched down to 13, where it is possible, thanks to c4a7a39, to use
allow_system_table_mods in a test suite.
Reported-by: Alexey Ermakov
Analyzed-by: Andres Freund, Noah Misch
Author: Michael Paquier
Reviewed-by: Nathan Bossart
Discussion: https://postgr.es/m/17268-d2fb426e0895abd4@postgresql.org
Backpatch-through: 12
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In commit ff9f111bce24 I mixed up inconsistent definitions of the LSN of
the first record in a page, when the previous record ends exactly at the
page boundary. The correct LSN is adjusted to skip the WAL page header;
I failed to use that when setting XLogReaderState->overwrittenRecPtr,
so at WAL replay time VerifyOverwriteContrecord would refuse to let
replay continue past that record.
Backpatch to 10. 9.6 also contains this bug, but it's no longer being
maintained.
Discussion: https://postgr.es/m/45597.1637694259@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
| |
Buildfarm members kittiwake and tadarida have witnessed errors at this
site. The site discarded key facts. Back-patch to v10 (all supported
versions).
Reviewed by Michael Paquier and Tom Lane.
Discussion: https://postgr.es/m/20211107013157.GB790288@rfd.leadboat.com
|
|
|
|
|
|
|
| |
Introduced in 1d97d3d0867f.
Co-authored-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/83641f59-d566-b33e-ef21-a272a98675aa@gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The purpose of commit 8a54e12a38d1545d249f1402f66c8cde2837d97c was to
fix this, and it sufficed when the PREPARE TRANSACTION completed before
the CIC looked for lock conflicts. Otherwise, things still broke. As
before, in a cluster having used CIC while having enabled prepared
transactions, queries that use the resulting index can silently fail to
find rows. It may be necessary to reindex to recover from past
occurrences; REINDEX CONCURRENTLY suffices. Fix this for future index
builds by making CIC wait for arbitrarily-recent prepared transactions
and for ordinary transactions that may yet PREPARE TRANSACTION. As part
of that, have PREPARE TRANSACTION transfer locks to its dummy PGPROC
before it calls ProcArrayClearTransaction(). Back-patch to 9.6 (all
supported versions).
Andrey Borodin, reviewed (in earlier versions) by Andres Freund.
Discussion: https://postgr.es/m/01824242-AA92-4FE9-9BA7-AEBAFFEA3D0C@yandex-team.ru
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
During a replication slot creation, an ERROR generated in the same
transaction as the one creating a to-be-exported snapshot would have
left the backend in an inconsistent state, as the associated static
export snapshot state was not being reset on transaction abort, but only
on the follow-up command received by the WAL sender that created this
snapshot on replication slot creation. This would trigger inconsistency
failures if this session tried to export again a snapshot, like during
the creation of a replication slot.
Note that a snapshot export cannot happen in a transaction block, so
there is no need to worry resetting this state for subtransaction
aborts. Also, this inconsistent state would very unlikely show up to
users. For example, one case where this could happen is an
out-of-memory error when building the initial snapshot to-be-exported.
Dilip found this problem while poking at a different patch, that caused
an error in this code path for reasons unrelated to HEAD.
Author: Dilip Kumar
Reviewed-by: Michael Paquier, Zhihong Yu
Discussion: https://postgr.es/m/CAFiTN-s0zA1Kj0ozGHwkYkHwa5U0zUE94RSc_g81WrpcETB5=w@mail.gmail.com
Backpatch-through: 9.6
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some specific logic is done at the end of recovery when involving 2PC
transactions:
1) Call RecoverPreparedTransactions(), to recover the state of 2PC
transactions into memory (re-acquire locks, etc.).
2) ShutdownRecoveryTransactionEnvironment(), to move back to normal
operations, mainly cleaning up recovery locks and KnownAssignedXids
(including any 2PC transaction tracked previously).
3) Switch XLogCtl->SharedRecoveryState to RECOVERY_STATE_DONE, which is
the tipping point for any process calling RecoveryInProgress() to check
if the cluster is still in recovery or not.
Any snapshot taken between steps 2) and 3) would be empty, causing any
transaction relying on a snapshot at this point to potentially corrupt
data as there could still be some 2PC transactions to track, with
RecentXmin moving backwards on successive calls to GetSnapshotData() in
the same transaction.
As SharedRecoveryState is the point to take into account to know if it
is safe to discard KnownAssignedXids, this commit moves step 2) after
step 3), so as we can never finish with empty snapshots.
This exists since the introduction of hot standby, so backpatch all the
way down. The window with incorrect snapshots is extremely small, but I
have seen it when running 023_pitr_prepared_xact.pl, as did buildfarm
member fairywren. Thomas Munro also found it independently. Special
thanks to Andres Freund for taking the time to analyze this issue.
Reported-by: Thomas Munro, Michael Paquier
Analyzed-by: Andres Freund
Discussion: https://postgr.es/m/20210422203603.fdnh3fu2mmfp2iov@alap3.anarazel.de
Backpatch-through: 9.6
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 84f5c2908 forgot to consider the possibility that
EnsurePortalSnapshotExists could run inside a subtransaction with
lifespan shorter than the Portal's. In that case, the new active
snapshot would be popped at the end of the subtransaction, leaving
a dangling pointer in the Portal, with mayhem ensuing.
To fix, make sure the ActiveSnapshot stack entry is marked with
the same subtransaction nesting level as the associated Portal.
It's certainly safe to do so since we won't be here at all unless
the stack is empty; hence we can't create an out-of-order stack.
Let's also apply this logic in the case where PortalRunUtility
sets portalSnapshot, just to be sure that path can't cause similar
problems. It's slightly less clear that that path can't create
an out-of-order stack, so add an assertion guarding it.
Report and patch by Bertrand Drouvot (with kibitzing by me).
Back-patch to v11, like the previous commit.
Discussion: https://postgr.es/m/ff82b8c5-77f4-3fe7-6028-fcf3303e82dd@amazon.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Physical replication always ships WAL segment files to replicas once
they are complete. This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they are not rewinding.
This causes standbys to stop following the primary after the latter
crashes:
LOG: invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be. A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary. But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.
A fix for this problem was already attempted in commit 515e3d84a0b5, but
it only addressed the case for the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.
This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts. With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.
Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.
A new TAP test that exercises this is added, but the portability of it
is yet to be seen.
This has been wrong since the introduction of physical replication, so
backpatch all the way back. In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A broken HOT chain is not an unexpected condition, even when the offset
number points past the end of the page's line pointer array.
heap_prune_chain() does not (and never has) treated this condition as
unexpected, so derivative code in heap_index_delete_tuples() shouldn't
do so either.
Oversight in commit 4228817449.
The assertion can probably only fail on Postgres 14 and master. Earlier
releases don't have commit 3c3b8a4b, which taught VACUUM to truncate the
line pointer array of heap pages. Backpatch all the same, just to be
consistent.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/17197-9438f31f46705182@postgresql.org
Backpatch: 12-, just like commit 4228817449.
|
|
|
|
|
|
|
|
|
|
| |
Attempting to make hashfloat4() look as much as possible like
hashfloat8(), I'd figured I could replace NaNs with get_float4_nan()
before widening to float8. However, results from protosciurus
and topminnow show that on some platforms that produces a different
bit-pattern from get_float8_nan(), breaking the intent of ce773f230.
Rearrange so that we use the result of get_float8_nan() for all NaN
cases. As before, back-patch.
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 515e3d84a0b5 and equivalent commits in back
branches. This solution to the problem has a number of problems, so
we'll try again with a different approach.
Per note from Andres Freund
Discussion: https://postgr.es/m/20210831042949.52eqp5xwbxgrfank@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The IEEE 754 standard allows a wide variety of bit patterns for NaNs,
of which at least two ("NaN" and "-NaN") are pretty easy to produce
from SQL on most machines. This is problematic because our btree
comparison functions deem all NaNs to be equal, but our float hash
functions know nothing about NaNs and will happily produce varying
hash codes for them. That causes unexpected results from queries
that hash a column containing different NaN values. It could also
produce unexpected lookup failures when using a hash index on a
float column, i.e. "WHERE x = 'NaN'" will not find all the rows
it should.
To fix, special-case NaN in the float hash functions, not too much
unlike the existing special case that forces zero and minus zero
to hash the same. I arranged for the most vanilla sort of NaN
(that coming from the C99 NAN constant) to still have the same
hash code as before, to reduce the risk to existing hash indexes.
I dithered about whether to back-patch this into stable branches,
but ultimately decided to do so. It's a clear improvement for
queries that hash internally. If there is anybody who has -NaN
in a hash index, they'd be well advised to re-index after applying
this patch ... but the misbehavior if they don't will not be much
worse than the misbehavior they had before.
Per bug #17172 from Ma Liangzhu.
Discussion: https://postgr.es/m/17172-7505bea9e04e230f@postgresql.org
|
|
|
|
|
|
|
|
|
|
|
| |
Most data-corruption reports mention the location of the problem, but
this one failed to. Add it.
Backpatch all the way back. In 12 and older, also assign the
ERRCODE_DATA_CORRUPTED error code as was done in commit fd6ec93bf890 for
13 and later.
Discussion: https://postgr.es/m/202108191637.oqyzrdtnheir@alvherre.pgsql
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Somehow, spgist overlooked the need to call pgstat_count_index_scan().
Hence, pg_stat_all_indexes.idx_scan and equivalent columns never
became nonzero for an SP-GiST index, although the related per-tuple
counters worked fine.
This fix works a bit differently from other index AMs, in that the
counter increment occurs in spgrescan not spggettuple/spggetbitmap.
It looks like this won't make the user-visible semantics noticeably
different, so I won't go to the trouble of introducing an is-this-
the-first-call flag just to make the counter bumps happen in the
same places.
Per bug #17163 from Christian Quest. Back-patch to all supported
versions.
Discussion: https://postgr.es/m/17163-b8c5cc88322a5e92@postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Pengchengliu reported an assertion failure in a parallel woker while
performing a parallel scan using an overflowed snapshot. The proximate
cause is that TransactionXmin was set to an incorrect value. The
underlying cause is incorrect snapshot handling in parallel.c.
In particular, InitializeParallelDSM() was unconditionally calling
GetTransactionSnapshot(), because I (rhaas) mistakenly thought that
was always retrieving an existing snapshot whereas, at isolation
levels less than REPEATABLE READ, it's actually taking a new one. So
instead do this only at higher isolation levels where there actually
is a single snapshot for the whole transaction.
By itself, this is not a sufficient fix, because we still need to
guarantee that TransactionXmin gets set properly in the workers. The
easiest way to do that seems to be to install the leader's active
snapshot as the transaction snapshot if the leader did not serialize a
transaction snapshot. This doesn't affect the results of future
GetTrasnactionSnapshot() calls since those have to take a new snapshot
anyway; what we care about is the side effect of setting TransactionXmin.
Report by Pengchengliu. Patch by Greg Nancarrow, except for some comment
text which I supplied.
Discussion: https://postgr.es/m/002f01d748ac$eaa781a0$bff684e0$@tju.edu.cn
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files. Instead, as soon as the last WAL page of
the segment is written, the archive status file is created, and the
archiver may process it. If PostgreSQL crashes before it is able to
write and flush the rest of the record (in the next WAL segment), the
wrong version of the first segment file lingers in the archive, which
causes operations such as point-in-time restores to fail.
To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.
This has always been wrong, so backpatch all the way back.
Author: Nathan Bossart <bossartn@amazon.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit ensures that the wait interval in the replay delay loop
waiting for an amount of time defined by recovery_min_apply_delay is
correctly handled on reload, recalculating the delay if this GUC value
is updated, based on the timestamp of the commit record being replayed.
The previous behavior would be problematic for example with replay
still waiting even if the delay got reduced or just cancelled. If the
apply delay was increased to a larger value, the wait would have just
respected the old value set, finishing earlier.
Author: Soumyadeep Chakraborty, Ashwin Agrawal
Reviewed-by: Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/CAE-ML+93zfr-HLN8OuxF0BjpWJ17O5dv1eMvSE5jsj9jpnAXZA@mail.gmail.com
Backpatch-through: 9.6
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a file is truncated, we must update minRecoveryPoint. Once a file is
truncated, there's no going back; it would not be safe to stop recovery
at a point earlier than that anymore.
Commit 7bffc9b7bf changed xact_redo_commit() so that it updates
minRecoveryPoint on truncation, but forgot to change xact_redo_abort().
Back-patch to all supported versions.
Reported-by: mengjuan.cmj@alibaba-inc.com
Author: Fujii Masao
Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/b029fce3-4fac-4265-968e-16f36ff4d075.mengjuan.cmj@alibaba-inc.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit changes the startup process in the standby server so that
it handles the interrupt signals after waiting for wal_retrieve_retry_interval
on the latch and resetting it, before entering another wait on the latch.
This change causes the standby server to promptly handle interrupt signals.
Otherwise, previously, there was the case where the standby needs to
wait extra five seconds to shutdown when the shutdown request arrived
while the startup process was waiting for wal_retrieve_retry_interval
on the latch.
Author: Fujii Masao, but implementation idea is from Soumyadeep Chakraborty
Reviewed-by: Soumyadeep Chakraborty
Discussion: https://postgr.es/m/9d7e6ab0-8a53-ddb9-63cd-289bcb25fe0e@oss.nttdata.com
Per discussion of BUG #17073, back-patch to all supported versions.
Discussion: https://postgr.es/m/17073-1a5fdaed0fa5d4d0@postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
| |
This should have been removed in commit 7e30c186da, which split the loop
into two. Only the first loop uses the 'from' variable; updating it in
the second loop is bogus. It was never read after the first loop, so this
was harmless and surely optimized away by the compiler, but let's be tidy.
Backpatch to all supported versions.
Author: Ranier Vilela
Discussion: https://www.postgresql.org/message-id/CAEudQAoWq%2BAL3BnELHu7gms2GN07k-np6yLbukGaxJ1vY-zeiQ%40mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
| |
Reaching PITR on such a transaction would cause the generation of a LOG
message mentioning a transaction committed, not aborted.
Oversight in 4f1b890.
Author: Simon Riggs
Discussion: https://postgr.es/m/CANbhV-GJ6KijeCgdOrxqMCQ+C8QiK657EMhCy4csjrPcEUFv_Q@mail.gmail.com
Backpatch-through: 9.6
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we cannot immediately acquire CLogControlLock in exclusive mode at
commit time, we add ourselves to a list of processes that need their XIDs
status update. We do this if the clog page where we need to update the
current transaction status is the same as the group leader's clog page,
otherwise, we allow the caller to clear it by itself. Now, when we can't
add ourselves to any group, we were not clearing the current proc if it
has already become a member of some group which was leading to an
assertion failure when the same proc was assigned to another backend after
the current backend exits.
Reported-by: Alexander Lakhin
Bug: 17072
Author: Amit Kapila
Tested-By: Alexander Lakhin
Backpatch-through: 11, where it was introduced
Discussion: https://postgr.es/m/17072-2f8764857ef2c92a@postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
One of the error paths left *members uninitialized. That's not a live
bug, because most callers don't look at *members when the function
returns -1, but let's be tidy. One caller, in heap_lock_tuple(), does
"if (members != NULL) pfree(members)", but AFAICS it never passes an
invalid 'multi' value so it should not reach that error case.
The callers are also a bit inconsistent in their expectations.
heap_lock_tuple() pfrees the 'members' array if it's not-NULL, others
pfree() it if "nmembers >= 0", and others if "nmembers > 0". That's
not a live bug either, because the function should never return 0, but
add an Assert for that to make it more clear. I left the callers alone
for now.
I also moved the line where we set *nmembers. It wasn't wrong before,
but I like to do that right next to the 'return' statement, to make it
clear that it's always set on return.
Also remove one unreachable return statement after ereport(ERROR), for
brevity and for consistency with the similar if-block right after it.
Author: Greg Nancarrow with the additional changes by me
Backpatch-through: 9.6, all supported versions
|
|
|
|
|
|
|
|
| |
Since commit c24dcd0cfd, we have been using pg_pread() to read the WAL
file, which doesn't change the seek position (unless we fall back to
the implementation in src/port/pread.c). Update comment accordingly.
Backpatch-through: 12, where we started to use pg_pread()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were already reporting it, but only after the parallel workers were
finished, which is visibly much later than what happens in a serial
build.
With this change we report it when the leader starts its own sort phase
when participating in the build (the normal case). Now this might
happen a little later than when the workers start their sorting phases,
but a) communicating the actual phase start from workers is likely to be
a hassle, and b) the sort phase start is pretty fuzzy anyway, since
sorting per se is actually initiated by tuplesort.c internally earlier
than tuplesort_performsort() is called.
Backpatch to pg12, where the progress reporting code for CREATE INDEX
went in.
Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Greg Nancarrow <gregn4422@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/1128176d-1eee-55d4-37ca-e63644422adb
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This only happens if (1) the new standby has no WAL available locally,
(2) the new standby is starting from the old timeline, (3) the promotion
happened in the WAL segment from which the new standby is starting,
(4) the timeline history file for the new timeline is available from
the archive but the WAL files for are not (i.e. this is a race),
(5) the WAL files for the new timeline are available via streaming,
and (6) recovery_target_timeline='latest'.
Commit ee994272ca50f70b53074f0febaec97e28f83c4e introduced this
logic and was an improvement over the previous code, but it mishandled
this case. If recovery_target_timeline='latest' and restore_command is
set, validateRecoveryParameters() can change recoveryTargetTLI to be
different from receiveTLI. If streaming is then tried afterward,
expectedTLEs gets initialized with the history of the wrong timeline.
It's supposed to be a list of entries explaining how to get to the
target timeline, but in this case it ends up with a list of entries
explaining how to get to the new standby's original timeline, which
isn't right.
Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi.
Discussion: http://postgr.es/m/CAFiTN-sE-jr=LB8jQuxeqikd-Ux+jHiXyh4YDiZMPedgQKup0g@mail.gmail.com
|
| |
|