| Commit message (Collapse) | Author | Age |
... | |
|
|
|
|
|
|
| |
XLogReaderFree failed to free the per-block data buffers, when they
happened to not be used by the latest read WAL record.
Michael Paquier. Backpatch to 9.5, where the per-block buffers were added.
|
|
|
|
|
|
|
|
|
|
| |
If there were no subtransactions (or multixacts) active, we would calculate
the oldestxid == next xid. That's correct, but if next XID happens to be
on the next pg_subtrans (pg_multixact) page, the page does not exist yet,
and SimpleLruTruncate will produce an "apparent wraparound" warning. The
warning is harmless in this case, but looks very alarming to users.
Backpatch to all supported versions. Patch and analysis by Thomas Munro.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There was already a sanity-check in the other direction: if a page was
marked with WILL_INIT, it had to be initialized by the redo routine. It's
not strictly necessary for correctness that a page is marked with WILL_INIT
if it's going to be initialized at redo, but it's a missed optimization if
nothing else.
Fix a few instances of this issue in SP-GiST, where a block in WAL record
was not marked with WILL_INIT, but was in fact always initialized at redo.
We were creating a full-page image of the page unnecessarily in those
cases.
Backpatch to 9.5, where the new WILL_INIT flag was added.
|
|
|
|
|
|
| |
Patch by David Rowley. Backpatch to 9.5, as some of the calls were new in
9.5, and keeping the code in sync with master makes future backpatching
easier.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
XLogFileCopy() was changed heavily in commit de76884. However it was
partially reverted in commit 7abc685 and most of those changes to
XLogFileCopy() were no longer needed. Then commit 7cbee7c removed
those unnecessary code, but XLogFileCopy() looked different in master
and 9.4 though the contents are almost the same.
This patch makes XLogFileCopy() look the same in master and back-branches,
which makes back-patching easier, per discussion on pgsql-hackers.
Back-patch to 9.5.
Discussion: 55760844.7090703@iki.fi
Michael Paquier
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When archive recovery and restartpoints were initially introduced,
checkpoint_segments was ignored on the grounds that the files restored from
archive don't consume any space in the recovery server. That was changed in
later releases, but even then it was arguably a feature rather than a bug,
as performing restartpoints as often as checkpoints during normal operation
might be excessive, but you might nevertheless not want to waste a lot of
space for pre-allocated WAL by setting checkpoint_segments to a high value.
But now that we have separate min_wal_size and max_wal_size settings, you
can bound WAL usage with max_wal_size, and still avoid consuming excessive
space usage by setting min_wal_size to a lower value, so that argument is
moot.
There are still some issues with actually limiting the space usage to
max_wal_size: restartpoints in recovery can only start after seeing the
checkpoint record, while a checkpoint starts flushing buffers as soon as
the redo-pointer is set. Restartpoint is paced to happen at the same
leisurily speed, determined by checkpoint_completion_target, as checkpoints,
but because they are started later, max_wal_size can be exceeded by upto
one checkpoint cycle's worth of WAL, depending on
checkpoint_completion_target. But that seems better than not trying at all,
and max_wal_size is a soft limit anyway.
The documentation already claimed that max_wal_size is obeyed in recovery,
so this just fixes the behaviour to match the docs. However, add some
weasel-words there to mention that max_wal_size may well be exceeded by
some amount in recovery.
|
|
|
|
|
|
|
|
|
| |
Seems like cheap insurance for WAL bugs. A spurious call to
XLogBeginInsert() in itself would be fairly harmless, but if there is any
data registered and the insertion is not completed/cancelled properly, there
is a risk that the data ends up in a wrong WAL record.
Per Jeff Janes's suggestion.
|
|
|
|
|
|
|
|
|
|
| |
Don't apply rmtree(), which will gleefully remove an entire subtree,
and don't even apply unlink() unless it's symlink or a directory,
the only things that we expect to find.
Amit Kapila, with minor tweaks by me, per extensive discussions
involving Andrew Dunstan, Fujii Masao, and Heikki Linnakangas,
at least some of whom also reviewed the code.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously autovacuum was not necessarily triggered if space in the
members slru got tight. The first problem was that the signalling was
tied to values in the offsets slru, but members can advance much
faster. Thats especially a problem if old sessions had been around that
previously prevented the multixact horizon to increase. Secondly the
skipping logic doesn't work if the database was restarted after
autovacuum was triggered - that knowledge is not preserved across
restart. This is especially a problem because it's a common
panic-reaction to restart the database if it gets slow to
anti-wraparound vacuums.
Fix the first problem by separating the logic for members from
offsets. Trigger autovacuum whenever a multixact crosses a segment
boundary, as the current member offset increases in irregular values, so
we can't use a simple modulo logic as for offsets. Add a stopgap for
the second problem, by signalling autovacuum whenver ERRORing out
because of boundaries.
Discussion: 20150608163707.GD20772@alap3.anarazel.de
Backpatch into 9.3, where it became more likely that multixacts wrap
around.
|
|
|
|
|
|
|
|
|
|
| |
9a20a9b2 added a new elog(), enabled when WAL_DEBUG is defined. The
other WAL_DEBUG dependant messages check for the wal_debug GUC, but this
one did not. While at it replace 'upto' with 'up to'.
Discussion: 20150610110253.GF3832@alap3.anarazel.de
Backpatch to 9.4, the first release containing 9a20a9b2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since find_multixact_start() relies on SimpleLruDoesPhysicalPageExist(),
and that function looks only at the on-disk state, it's possible for it
to fail to find a page that exists in the in-memory SLRU that has not
been written yet. If that happens, SetOffsetVacuumLimit() will
erroneously decide to force emergency autovacuuming immediately.
We should probably fix find_multixact_start() to consider the data
cached in memory as well as on the on-disk state, but that's no excuse
for SetOffsetVacuumLimit() to be stupid about the case where it can
no longer read the value after having previously succeeded in doing so.
Report by Andres Freund.
|
|
|
|
|
|
|
| |
tablesapce -> tablespace
there -> their
These were introduced in 72d422a52, so no need to backpatch.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Remove unused argument "dstfname" and related code from XLogFileCopy().
* Previously XLogFileCopy() returned a pstrdup'd string so that
InstallXLogFileSegment() used it later. Since the pstrdup'd string was never
free'd, there could be a risk of memory leak. It was almost harmless because
the startup process exited just after calling XLogFileCopy(), it existed.
This commit changes XLogFileCopy() so that it directly calls
InstallXLogFileSegment() and doesn't call pstrdup() at all. Which fixes that
memory leak problem.
* Extend InstallXLogFileSegment() so that the caller can specify the log level.
Which allows us to emit an error when InstallXLogFileSegment() fails a disk
file access like link() and rename(). Previously it was always logged with
LOG level and additionally needed to be logged with ERROR when we wanted
to treat it as an error.
Michael Paquier
|
|
|
|
|
|
|
|
|
|
|
|
| |
HotStandbyActiveInReplay, introduced in 061b079f, only allowed WAL
replay to happen in the startup process, missing the single user case.
This buglet is fairly harmless as it only causes problems when single
user mode in an assertion enabled build is used to replay a btree vacuum
record.
Backpatch to 9.2. 061b079f was backpatched further, but the assertion
was not.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Recent commits, mainly b69bf30b9bfacafc733a9ba77c9587cf54d06c0c and
53bb309d2d5a9432d2602c93ed18e58bd2924e15, introduced mechanisms to
protect against wraparound of the MultiXact member space: the number
of multixacts that can exist at one time is limited to 2^32, but the
total number of members in those multixacts is also limited to 2^32,
and older code did not take care to enforce the second limit,
potentially allowing old data to be overwritten while it was still
needed.
Unfortunately, these new mechanisms failed to account for the fact
that the code paths in which they run might be executed during
recovery or while the cluster was in an inconsistent state. Also,
they failed to account for the fact that users who used pg_upgrade
to upgrade a PostgreSQL version between 9.3.0 and 9.3.4 might have
might oldestMultiXid = 1 in the control file despite the true value
being larger.
To fix these problems, first, avoid unnecessarily examining the
mmembers of MultiXacts when the cluster is not known to be consistent.
TruncateMultiXact has done this for a long time, and this patch does
not fix that. But the new calls used to prevent member wraparound
are not needed until we reach normal running, so avoid calling them
earlier. (SetMultiXactIdLimit is actually called before InRecovery
is set, so we can't rely on that; we invent our own multixact-specific
flag instead.)
Second, make failure to look up the members of a MultiXact a non-fatal
error. Instead, if we're unable to determine the member offset at
which wraparound would occur, postpone arming the member wraparound
defenses until we are able to do so. If we're unable to determine the
member offset that should force autovacuum, force it continuously
until we are able to do so. If we're unable to deterine the member
offset at which we should truncate the members SLRU, log a message and
skip truncation.
An important consequence of these changes is that anyone who does have
a bogus oldestMultiXid = 1 value in pg_control will experience
immediate emergency autovacuuming when upgrading to a release that
contains this fix. The release notes should highlight this fact. If
a user has no pg_multixact/offsets/0000 file, but has oldestMultiXid = 1
in the control file, they may wish to vacuum any tables with
relminmxid = 1 prior to upgrading in order to avoid an immediate
emergency autovacuum after the upgrade. This must be done with a
PostgreSQL version 9.3.5 or newer and with vacuum_multixact_freeze_min_age
and vacuum_multixact_freeze_table_age set to 0.
This patch also adds an additional log message at each database server
startup, indicating either that protections against member wraparound
have been engaged, or that they have not. In the latter case, once
autovacuum has advanced oldestMultiXid to a sane value, the message
indicating that the guards have been engaged will appear at the next
checkpoint. A few additional messages have also been added at the DEBUG1
level so that the correct operation of this code can be properly audited.
Along the way, this patch fixes another, related bug in TruncateMultiXact
that has existed since PostgreSQL 9.3.0: when no MultiXacts exist at
all, the truncation code looks up NextMultiXactId, which doesn't exist
yet. This can lead to TruncateMultiXact removing every file in
pg_multixact/offsets instead of keeping one around, as it should.
This in turn will cause the database server to refuse to start
afterwards.
Patch by me. Review by Álvaro Herrera, Andres Freund, Noah Misch, and
Thomas Munro.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 2ce439f3379aed857517c8ce207485655000fc8e introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.
After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/. The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.
This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().
The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.
Back-patch to all active branches, like the prior commit.
Abhijit Menon-Sen and Tom Lane
|
| |
|
|
|
|
|
|
|
|
| |
Typo in commit 7cbee7c0a. No practical effect since the buffer should
never actually be overrun, but various compilers and static analyzers will
whine about it.
Petr Jelinek
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With commit de768844, a copy of the partial segment was archived with the
.partial suffix, but the original file was still left in pg_xlog, so it
didn't actually solve the problems with archiving the partial segment that
it was supposed to solve. With this patch, the partial segment is renamed
rather than copied, so we only archive it with the .partial suffix.
Also be more robust in detecting if the last segment is already being
archived. Previously I used XLogArchiveIsBusy() for that, but that's not
quite right. With archive_mode='always', there might be a .ready file for
it, and we don't want to rename it to .partial in that case.
The old segment is needed until we're fully committed to the new timeline,
i.e. until we've written the end-of-recovery WAL record and updated the
min recovery point and timeline in the control file. So move the renaming
later in the startup sequence, after all that's been done.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously even if recovery_target_action was set to pause and
the recovery target was reached, the recovery could never be paused.
Because the setting of pause was *always* overridden with that of
shutdown unexpectedly. This override is valid and intentional
if hot_standby is not enabled because there is no way to resume
the paused recovery in this case and the setting of pause is
completely useless. But not if hot_standby is enabled.
This patch changes the code so that the setting of pause is overridden
with that of shutdown only when hot_standby is not enabled.
Bug reported by Andres Freund
|
|
|
|
| |
Patch by CharSyam, plus a few more I spotted with grep.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use "a" and "an" correctly, mostly in comments. Two error messages were
also fixed (they were just elogs, so no translation work required). Two
function comments in pg_proc.h were also fixed. Etsuro Fujita reported one
of these, but I found a lot more with grep.
Also fix a few other typos spotted while grepping for the a/an typos.
For example, "consists out of ..." -> "consists of ...". Plus a "though"/
"through" mixup reported by Euler Taveira.
Many of these typos were in old code, which would be nice to backpatch to
make future backpatching easier. But much of the code was new, and I didn't
feel like crafting separate patches for each branch. So no backpatching.
|
| |
|
|
|
|
| |
Dmitriy Olshevskiy
|
| |
|
|
|
|
|
|
|
| |
In 'always' mode, the standby independently archives all files it receives
from the primary.
Original patch by Fujii Masao, docs and review by me.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Windows can't reliably restore symbolic links from a tar format, so
instead during backup start we create a tablespace_map file, which is
used by the restoring postgres to create the correct links in pg_tblspc.
The backup protocol also now has an option to request this file to be
included in the backup stream, and this is used by pg_basebackup when
operating in tar mode.
This is done on all platforms, not just Windows.
This means that pg_basebackup will not not work in tar mode against 9.4
and older servers, as this protocol option isn't implemented there.
Amit Kapila, reviewed by Dilip Kumar, with a little editing from me.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Analysis by Noah Misch shows that the 25% threshold set by commit
53bb309d2d5a9432d2602c93ed18e58bd2924e15 is lower than any other,
similar autovac threshold. While we don't know exactly what value
will be optimal for all users, it is better to err a little on the
high side than on the low side. A higher value increases the risk
that users might exhaust the available space and start seeing errors
before autovacuum can clean things up sufficiently, but a user who
hits that problem can compensate for it by reducing
autovacuum_multixact_freeze_max_age to a value dependent on their
average multixact size. On the flip side, if the emergency cap
imposed by that patch kicks in too early, the user will experience
excessive wraparound scanning and will be unable to mitigate that
problem by configuration. The new value will hopefully reduce the
risk of such bad experiences while still providing enough headroom
to avoid multixact member exhaustion for most users.
Along the way, adjust the documentation to reflect the effects of
commit 04e6d3b877e060d8445eb653b7ea26b1ee5cec6b, which taught
autovacuum to run for multixact wraparound even when autovacuum
is configured off.
|
|
|
|
| |
Thomas Munro, with some adjustments by me.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c advanced the stop point
at vacuum time, but this has subsequently been shown to be unsafe as a
result of analysis by myself and Thomas Munro and testing by Thomas
Munro. The crux of the problem is that the SLRU deletion logic may
get confused about what to remove if, at exactly the right time during
the checkpoint process, the head of the SLRU crosses what used to be
the tail.
This patch, by me, fixes the problem by advancing the stop point only
following a checkpoint. This has the additional advantage of making
the removal logic work during recovery more like the way it works during
normal running, which is probably good.
At least one of the calls to DetermineSafeOldestOffset which this patch
removes was already dead, because MultiXactAdvanceOldest is called only
during recovery and DetermineSafeOldestOffset was set up to do nothing
during recovery. That, however, is inconsistent with the principle that
recovery and normal running should work similarly, and was confusing to
boot.
Along the way, fix some comments that previous patches in this area
neglected to update. It's not clear to me whether there's any
concrete basis for the decision to use only half of the multixact ID
space, but it's neither necessary nor sufficient to prevent multixact
member wraparound, so the comments should not say otherwise.
|
|
|
|
|
|
|
|
| |
Commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c failed to take into
account the possibility that there might be no multixacts in existence
at all.
Report by Thomas Munro; patch by me.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, we would archive the possible-incomplete WAL segment with its
normal filename, but that causes trouble if the server owning that timeline
is still running, and tries to archive the same segment later. It's not nice
for the standby to trip up the master's archival like that. And it's pretty
confusing, anyway, to have an incomplete segment in the archive that's
indistinguishable from a normal, complete segment.
To avoid such confusion, add a .partial suffix to the file. Or to be more
precise, make a copy of the old segment under the .partial suffix, and
archive that instead of the original file. pg_receivexlog also uses the
.partial suffix for the same purpose, to tell apart incompletely streamed
files from complete ones.
There is no automatic mechanism to use the .partial files at recovery, so
they will go unused, unless the administrator manually copies to them to
the pg_xlog directory (and removes the .partial suffix). Recovery won't
normally need the WAL - when recovering to the new timeline, it will find
the same WAL on the first segment on the new timeline instead - but it
nevertheless feels better to archive the file with the .partial suffix, for
debugging purposes if nothing else.
|
|
|
|
|
| |
We had many instances of the strlen + strspn combination to check for that.
This makes the code a bit easier to read.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The logic introduced in commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c
and repaired in commits 669c7d20e6374850593cb430d332e11a3992bbcf and
7be47c56af3d3013955c91c2877c08f2a0e3e6a2 helps to ensure that we don't
overwrite old multixact member information while it is still needed,
but a user who creates many large multixacts can still exhaust the
member space (and thus start getting errors) while autovacuum stands
idly by.
To fix this, progressively ramp down the effective value (but not the
actual contents) of autovacuum_multixact_freeze_max_age as member space
utilization increases. This makes autovacuum more aggressive and also
reduces the threshold for a manual VACUUM to perform a full-table scan.
This patch leaves unsolved the problem of ensuring that emergency
autovacuums are triggered even when autovacuum=off. We'll need to fix
that via a separate patch.
Thomas Munro and Robert Haas
|
|
|
|
|
|
|
|
|
|
| |
The old formula didn't have enough parentheses, so it would do the wrong
thing, and it used / rather than % to find a remainder. The effect of
these oversights is that the stop point chosen by the logic introduced in
commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c might be rather
meaningless.
Thomas Munro, reviewed by Kevin Grittner, with a whitespace tweak by me.
|
|
|
|
| |
Per request from Peter Eisentraut.
|
|
|
|
|
|
|
|
|
|
|
| |
Otherwise, if there's another crash, some writes from after the first
crash might make it to disk while writes from before the crash fail
to make it to disk. This could lead to data corruption.
Back-patch to all supported versions.
Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
by me.
|
|
|
|
|
|
|
| |
prev_regbuf was never set, and therefore the same-rel flag was never set on
WAL records.
Report and fix by Zhanq Zq
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This does four basic things. First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers. Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes. Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.
Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We need to create the pg_multixact/offsets file deleted by pg_upgrade
much earlier than we originally were: it was in TrimMultiXact(), which
runs after we exit recovery, but it actually needs to run earlier than
the first call to SetMultiXactIdLimit (before recovery), because that
routine already wants to read the first offset segment.
Per pg_upgrade trouble report from Jeff Janes.
While at it, silence a compiler warning about a pointless assert that an
unsigned variable was being tested non-negative. This was a signed
constant in Thomas Munro's patch which I changed to unsigned before
commit. Pointed out by Andres Freund.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
e.g. to avoid loops in bi-directional replication setups
The solution to these problems, as implemented here, consist out of
three parts:
1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
replication origin, how far replay has progressed in a efficient and
crash safe manner.
3) The ability to filter out changes performed on the behest of a
replication origin during logical decoding; this allows complex
replication topologies. E.g. by filtering all replayed changes out.
Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated. We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.
This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL. Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.
For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
|
|
|
|
|
|
| |
Reword messages, rename a confusingly named function.
Per Robert Haas.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Multixact member files are subject to early wraparound overflow and
removal: if the average multixact size is above a certain threshold (see
note below) the protections against offset overflow are not enough:
during multixact truncation at checkpoint time, some
pg_multixact/members files would be removed because the server considers
them to be old and not needed anymore. This leads to loss of files that
are critical to interpret existing tuples's Xmax values.
To protect against this, since we don't have enough info in pg_control
and we can't modify it in old branches, we maintain shared memory state
about the oldest value that we need to keep; we use this during new
multixact creation to abort if an old still-needed file would get
overwritten. This value is kept up to date by checkpoints, which makes
it not completely accurate but should be good enough. We start emitting
warnings sometime earlier, so that the eventual multixact-shutdown
doesn't take DBAs completely by surprise (more precisely: once 20
members SLRU segments are remaining before shutdown.)
On troublesome average multixact size: The threshold size depends on the
multixact freeze parameters. The oldest age is related to the greater of
multixact_freeze_table_age and multixact_freeze_min_age: anything
older than that should be removed promptly by autovacuum. If autovacuum
is keeping up with multixact freezing, the troublesome multixact average
size is
(2^32-1) / Max(freeze table age, freeze min age)
or around 28 members per multixact. Having an average multixact size
larger than that will eventually cause new multixact data to overwrite
the data area for older multixacts. (If autovacuum is not able to keep
up, or there are errors in vacuuming, the actual maximum is
multixact_freeeze_max_age instead, at which point multixact generation
is stopped completely. The default value for this limit is 400 million,
which means that the multixact size that would cause trouble is about 10
members).
Initial bug report by Timothy Garnett, bug #12990
Backpatch to 9.3, where the problem was introduced.
Authors: Álvaro Herrera, Thomas Munro
Reviews: Thomas Munro, Amit Kapila, Robert Haas, Kevin Grittner
|
|
|
|
|
| |
Author: Dmitriy Olshevskiy
Discussion: 553D00A6.4090205@bk.ru
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When the startup process recovers transactions by scanning pg_twophase
directory, it should clear MyLockedGxact after it's done processing each
transaction. Like we do during normal operation, at PREPARE TRANSACTION.
Otherwise, if the startup process exits due to an error, it will try to
clear the locking_backend field of the last recovered transaction. That's
usually harmless, but if the error happens in MarkAsPreparing, while
holding TwoPhaseStateLock, the shmem-exit hook will try to acquire
TwoPhaseStateLock again, and deadlock with itself.
This fixes bug #13128 reported by Grant McAlister. The bug was introduced
by commit bb38fb0d, so backpatch to all supported versions like that
commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After the WAL format changes, the calculation of the size of a checkpoint
record became incorrect. Instead of trying to fix the math, check that the
previous record, i.e. the xl_prev value that we'd write for the next
record, matches the last checkpoint's redo pointer. That way it's not
dependent on the size of the checkpoint record at all.
The old logic was actually slightly wrong all along: if the previous
checkpoint record crossed a page boundary, the page headers threw off the
record size calculation, and the checkpoint was not skipped. The new
checkpoint would not cross a page boundary, so this only resulted in at
most one extra checkpoint after the system became idle. The new logic fixes
that. (It's not worth fixing in backbranches).
However, it makes some sense to try to keep the latest checkpoint contained
fully in a page, or at least in a single WAL segment, just on general
robustness grounds. If something goes awfully wrong, it's more likely that
you can recover the latest WAL segment, than the last two WAL segments. So
I added an extra check that the checkpoint is not skipped if the previous
checkpoint crossed a WAL segment.
Reported by Jeff Janes.
|
|
|
|
|
|
|
|
| |
SLRU_SEGMENTS_PER_PAGE -> SLRU_PAGES_PER_SEGMENT
I introduced this ancient typo in subtrans.c and later propagated it to
multixact.c. I fixed the latter in f741300c, but only back to 9.3;
backpatch to all supported branches for consistency.
|
|
|
|
|
|
|
|
|
|
| |
Now that we use CRC-32C in WAL and the control file, the "traditional" and
"legacy" CRC-32 variants are not used in any frontend programs anymore.
Move the code for those back from src/common to src/backend/utils/hash.
Also move the slicing-by-8 implementation (back) to src/port. This is in
preparation for next patch that will add another implementation that uses
Intel SSE 4.2 instructions to calculate CRC-32C, where available.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After a timeline switch, we would leave behind recycled WAL segments that
are in the future, but on the old timeline. After promotion, and after they
become old enough to be recycled again, we would notice that they don't have
a .ready or .done file, create a .ready file for them, and archive them.
That's bogus, because the files contain garbage, recycled from an older
timeline (or prealloced as zeros). We shouldn't archive such files.
This could happen when we're following a timeline switch during replay, or
when we switch to new timeline at end-of-recovery.
To fix, whenever we switch to a new timeline, scan the data directory for
WAL segments on the old timeline, but with a higher segment number, and
remove them. Those don't belong to our timeline history, and are most
likely bogus recycled or preallocated files. They could also be valid files
that we streamed from the primary ahead of time, but in any case, they're
not needed to recover to the new timeline.
|