It does currently, and I don't see us changing that any time soon, but we
don't make that assumption anywhere else.
Per Tom Lane's suggestion. Backpatch to 9.2, like the previous patch that
added this assumption.
In GIN, an all-zeros page would be leaked forever, and never reused. Just
add them to the FSM in vacuum, and they will be reinitialized when grabbed
from the FSM. On master and 9.5, attempting to access the page's opaque
struct also caused an assertion failure, although that was otherwise
harmless.
Reported by Jeff Janes. Backpatch to all supported versions.
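
In outline, the vacuum-side handling looks roughly like this (a sketch, not the actual patch; PageIsNew and RecordFreeIndexPage are the real page/FSM APIs, the surrounding vacuum loop is elided):

    if (PageIsNew(page))
    {
        /* An all-zeros page: don't try to interpret its opaque area.
         * Hand it to the free space map instead; whoever grabs it from
         * the FSM initializes it before use, so the page isn't leaked. */
        RecordFreeIndexPage(index, blkno);
    }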
SP-GiST initialized an all-zeros page at vacuum, but that was not
WAL-logged, which is not safe: you might get a torn page write when it gets
flushed to disk, and end up with a half-initialized index page. To fix,
leave it in the all-zeros state, and add it to the FSM. It will be
initialized when reused. Also don't set the page-deleted flag when recycling
an empty page. That was also not WAL-logged, and a torn write of that would
cause the page to have an invalid checksum.
Backpatch to 9.2, where SP-GiST indexes were added.
If there were no subtransactions (or multixacts) active, we would calculate
oldestxid == next XID. That's correct, but if the next XID happens to be
on the next pg_subtrans (pg_multixact) page, the page does not exist yet,
and SimpleLruTruncate will produce an "apparent wraparound" warning. The
warning is harmless in this case, but looks very alarming to users.
Backpatch to all supported versions. Patch and analysis by Thomas Munro.
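
To see how the warning arises, consider how SLRU page numbers are derived from transaction IDs. A runnable illustration (the entries-per-page constant is made up; real SLRUs differ):

    #include <stdio.h>
    #include <stdint.h>

    #define ENTRIES_PER_PAGE 2048           /* made up; real SLRUs differ */

    int main(void)
    {
        uint32_t next_xid = 6 * ENTRIES_PER_PAGE;   /* first xid of page 6 */
        uint32_t oldest_xid = next_xid;             /* nothing older needed */

        uint32_t cutoff_page = oldest_xid / ENTRIES_PER_PAGE;       /* 6 */
        uint32_t newest_page = (next_xid - 1) / ENTRIES_PER_PAGE;   /* 5 */

        /* The cutoff page is one past the newest page ever created,
         * which SimpleLruTruncate misreads as apparent wraparound. */
        printf("cutoff page %u, newest existing page %u\n",
               (unsigned) cutoff_page, (unsigned) newest_page);
        return 0;
    }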
After calling XLogInitBufferForRedo(), the page might be all-zeros if it was
not in page cache already. btree_xlog_unlink_page initialized the page
correctly, but it called PageGetSpecialPointer before initializing it, which
would lead to a corrupt page at WAL replay, if the unlinked page is not in
page cache.
Backpatch to 9.4, the bug came with the rewrite of B-tree page deletion.
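
A sketch of the ordering problem and its fix (not the literal patch, but the APIs named are the real ones):

    Buffer      buf = XLogInitBufferForRedo(record, 0);
    Page        page = BufferGetPage(buf);
    BTPageOpaque opaque;

    /* WRONG (the bug): if the buffer came back all-zeros, the special
     * area hasn't been set up, so this pointer lands in garbage:
     *
     *     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     */

    /* RIGHT (the fix): initialize the page first, then fetch it. */
    _bt_pageinit(page, BufferGetPageSize(buf));
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);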
VACUUM FREEZE generated false cancellations of standby queries on an
otherwise idle master. Caused by an off-by-one error on cutoff_xid
which goes back to the original commit.
Backpatch to all versions 9.0+
Analysis and report by Marco Nenciarini
Bug fix by Simon Riggs
1. Replay of the WAL record for setting a bit in the visibility map
contained an assertion that a full-page image of that record type can only
occur with checksums enabled. But it can also happen with wal_log_hints, so
remove the assertion. Unlike checksums, wal_log_hints can be changed on the
fly, so it would be complicated to figure out if it was enabled at the time
that the WAL record was generated.
2. wal_log_hints has the same effect on the locking needed to read the LSN
of a page as data checksums. BufferGetLSNAtomic() didn't get the memo.
Backpatch to 9.4, where wal_log_hints was added.
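
The gist of the second fix, as a hedged sketch of the logic in bufmgr.c's BufferGetLSNAtomic() (XLogHintBitIsNeeded() covers both data checksums and wal_log_hints):

    XLogRecPtr  lsn;

    /* With neither checksums nor wal_log_hints, hint-bit-only changes
     * never move the page LSN under a shared content lock, so a plain
     * read is safe. */
    if (!XLogHintBitIsNeeded())
        return PageGetLSN(page);

    /* Otherwise another backend holding only a shared content lock may
     * be setting the LSN concurrently: read it under the buffer header
     * spinlock. */
    LockBufHdr(bufHdr);
    lsn = PageGetLSN(page);
    UnlockBufHdr(bufHdr);
    return lsn;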
Previously autovacuum was not necessarily triggered if space in the
members slru got tight. The first problem was that the signalling was
tied to values in the offsets slru, but members can advance much
faster. That's especially a problem if old sessions had been around that
previously prevented the multixact horizon from increasing. Secondly, the
skipping logic doesn't work if the database was restarted after
autovacuum was triggered - that knowledge is not preserved across
restarts. This is especially a problem because it's a common
panic-reaction to restart the database if it gets slow due to
anti-wraparound vacuums.
Fix the first problem by separating the logic for members from
offsets. Trigger autovacuum whenever a multixact crosses a segment
boundary; since the current member offset increases in irregular steps,
we can't use simple modulo logic as for offsets (see the illustration
below). Add a stopgap for the second problem, by signalling autovacuum
whenever ERRORing out because of boundaries.
Discussion: 20150608163707.GD20772@alap3.anarazel.de
Backpatch into 9.3, where it became more likely that multixacts wrap
around.
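
Because each multixact appends a variable number of members, the member offset advances in irregular steps; the trigger therefore asks whether an allocation crossed into a new segment rather than testing a modulo. A runnable illustration (the members-per-segment constant is made up, not the real value):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define MEMBERS_PER_SEGMENT 50000   /* illustrative, not the real value */

    static bool crosses_segment_boundary(uint32_t prev_offset,
                                         uint32_t next_offset)
    {
        /* Fires exactly once per segment, however big each step is. */
        return (prev_offset / MEMBERS_PER_SEGMENT) !=
               (next_offset / MEMBERS_PER_SEGMENT);
    }

    int main(void)
    {
        uint32_t offset = 49995;

        /* A 12-member multixact jumps straight over the boundary; a
         * naive "offset % MEMBERS_PER_SEGMENT == 0" test never fires. */
        printf("signal autovacuum: %s\n",
               crosses_segment_boundary(offset, offset + 12) ? "yes" : "no");
        return 0;
    }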
9a20a9b2 added a new elog(), enabled when WAL_DEBUG is defined. The
other WAL_DEBUG dependent messages check the wal_debug GUC, but this
one did not. While at it, replace 'upto' with 'up to'.
Discussion: 20150610110253.GF3832@alap3.anarazel.de
Backpatch to 9.4, the first release containing 9a20a9b2.
Since find_multixact_start() relies on SimpleLruDoesPhysicalPageExist(),
and that function looks only at the on-disk state, it's possible for it
to fail to find a page that exists in the in-memory SLRU that has not
been written yet. If that happens, SetOffsetVacuumLimit() will
erroneously decide to force emergency autovacuuming immediately.
We should probably fix find_multixact_start() to consider the data
cached in memory as well as the on-disk state, but that's no excuse
for SetOffsetVacuumLimit() to be stupid about the case where it can
no longer read the value after having previously succeeded in doing so.
Report by Andres Freund.
HotStandbyActiveInReplay, introduced in 061b079f, only allowed WAL
replay to happen in the startup process, missing the single user case.
This buglet is fairly harmless as it only causes problems when single-user
mode in an assertion-enabled build is used to replay a btree vacuum
record.
Backpatch to 9.2. 061b079f was backpatched further, but the assertion
was not.
Recent commits, mainly b69bf30b9bfacafc733a9ba77c9587cf54d06c0c and
53bb309d2d5a9432d2602c93ed18e58bd2924e15, introduced mechanisms to
protect against wraparound of the MultiXact member space: the number
of multixacts that can exist at one time is limited to 2^32, but the
total number of members in those multixacts is also limited to 2^32,
and older code did not take care to enforce the second limit,
potentially allowing old data to be overwritten while it was still
needed.
Unfortunately, these new mechanisms failed to account for the fact
that the code paths in which they run might be executed during
recovery or while the cluster was in an inconsistent state. Also,
they failed to account for the fact that users who used pg_upgrade
to upgrade a PostgreSQL version between 9.3.0 and 9.3.4 might have
oldestMultiXid = 1 in the control file despite the true value
being larger.
To fix these problems, first, avoid unnecessarily examining the
members of MultiXacts when the cluster is not known to be consistent.
TruncateMultiXact has done this for a long time, and this patch does
not change that.
are not needed until we reach normal running, so avoid calling them
earlier. (SetMultiXactIdLimit is actually called before InRecovery
is set, so we can't rely on that; we invent our own multixact-specific
flag instead.)
Second, make failure to look up the members of a MultiXact a non-fatal
error. Instead, if we're unable to determine the member offset at
which wraparound would occur, postpone arming the member wraparound
defenses until we are able to do so. If we're unable to determine the
member offset that should force autovacuum, force it continuously
until we are able to do so. If we're unable to determine the member
offset at which we should truncate the members SLRU, log a message and
skip truncation.
An important consequence of these changes is that anyone who does have
a bogus oldestMultiXid = 1 value in pg_control will experience
immediate emergency autovacuuming when upgrading to a release that
contains this fix. The release notes should highlight this fact. If
a user has no pg_multixact/offsets/0000 file, but has oldestMultiXid = 1
in the control file, they may wish to vacuum any tables with
relminmxid = 1 prior to upgrading in order to avoid an immediate
emergency autovacuum after the upgrade. This must be done with a
PostgreSQL version 9.3.5 or newer and with vacuum_multixact_freeze_min_age
and vacuum_multixact_freeze_table_age set to 0.
This patch also adds an additional log message at each database server
startup, indicating either that protections against member wraparound
have been engaged, or that they have not. In the latter case, once
autovacuum has advanced oldestMultiXid to a sane value, the message
indicating that the guards have been engaged will appear at the next
checkpoint. A few additional messages have also been added at the DEBUG1
level so that the correct operation of this code can be properly audited.
Along the way, this patch fixes another, related bug in TruncateMultiXact
that has existed since PostgreSQL 9.3.0: when no MultiXacts exist at
all, the truncation code looks up NextMultiXactId, which doesn't exist
yet. This can lead to TruncateMultiXact removing every file in
pg_multixact/offsets instead of keeping one around, as it should.
This in turn will cause the database server to refuse to start
afterwards.
Patch by me. Review by Álvaro Herrera, Andres Freund, Noah Misch, and
Thomas Munro.
This file has been patched over and over, and the differences to master
caused by pgindent are annoying enough that it seems saner to make the
older branches look the same.
Backpatch to 9.3, which is as far back as backpatching of bugfixes is
necessary.
Commit 2ce439f3379aed857517c8ce207485655000fc8e introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.
After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/. The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.
This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().
The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.
Back-patch to all active branches, like the prior commit.
Abhijit Menon-Sen and Tom Lane
Multixact truncation is now handled differently, and this file hadn't
gotten the memo.
Per note from Amit Langote. I didn't use his patch, though.
Also update the description of infomask bits, which weren't completely up
to date either. This commit also propagates b01a4f6838 back to 9.3 and
9.4, which apparently I failed to do back then.
Analysis by Noah Misch shows that the 25% threshold set by commit
53bb309d2d5a9432d2602c93ed18e58bd2924e15 is lower than any other,
similar autovac threshold. While we don't know exactly what value
will be optimal for all users, it is better to err a little on the
high side than on the low side. A higher value increases the risk
that users might exhaust the available space and start seeing errors
before autovacuum can clean things up sufficiently, but a user who
hits that problem can compensate for it by reducing
autovacuum_multixact_freeze_max_age to a value dependent on their
average multixact size. On the flip side, if the emergency cap
imposed by that patch kicks in too early, the user will experience
excessive wraparound scanning and will be unable to mitigate that
problem by configuration. The new value will hopefully reduce the
risk of such bad experiences while still providing enough headroom
to avoid multixact member exhaustion for most users.
Along the way, adjust the documentation to reflect the effects of
commit 04e6d3b877e060d8445eb653b7ea26b1ee5cec6b, which taught
autovacuum to run for multixact wraparound even when autovacuum
is configured off.
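
A back-of-the-envelope version of that compensation advice (a hypothetical calculation, not code from the patch): with 2^32 member slots, keeping the oldest multixact age below 2^32 / average-members-per-multixact keeps the member space from filling first.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t member_space = UINT64_C(1) << 32;  /* 2^32 member slots */
        uint64_t avg_members = 50;                  /* hypothetical workload */

        /* Oldest multixact age that keeps total members within bounds. */
        uint64_t safe_freeze_max_age = member_space / avg_members;

        printf("with %llu members/multixact on average, keep\n"
               "autovacuum_multixact_freeze_max_age below ~%llu\n",
               (unsigned long long) avg_members,
               (unsigned long long) safe_freeze_max_age);
        return 0;
    }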
Thomas Munro, with some adjustments by me.
Commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c advanced the stop point
at vacuum time, but this has subsequently been shown to be unsafe as a
result of analysis by myself and Thomas Munro and testing by Thomas
Munro. The crux of the problem is that the SLRU deletion logic may
get confused about what to remove if, at exactly the right time during
the checkpoint process, the head of the SLRU crosses what used to be
the tail.
This patch, by me, fixes the problem by advancing the stop point only
following a checkpoint. This has the additional advantage of making
the removal logic work during recovery more like the way it works during
normal running, which is probably good.
At least one of the calls to DetermineSafeOldestOffset which this patch
removes was already dead, because MultiXactAdvanceOldest is called only
during recovery and DetermineSafeOldestOffset was set up to do nothing
during recovery. That, however, is inconsistent with the principle that
recovery and normal running should work similarly, and was confusing to
boot.
Along the way, fix some comments that previous patches in this area
neglected to update. It's not clear to me whether there's any
concrete basis for the decision to use only half of the multixact ID
space, but it's neither necessary nor sufficient to prevent multixact
member wraparound, so the comments should not say otherwise.
Commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c failed to take into
account the possibility that there might be no multixacts in existence
at all.
Report by Thomas Munro; patch by me.
The logic introduced in commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c
and repaired in commits 669c7d20e6374850593cb430d332e11a3992bbcf and
7be47c56af3d3013955c91c2877c08f2a0e3e6a2 helps to ensure that we don't
overwrite old multixact member information while it is still needed,
but a user who creates many large multixacts can still exhaust the
member space (and thus start getting errors) while autovacuum stands
idly by.
To fix this, progressively ramp down the effective value (but not the
actual contents) of autovacuum_multixact_freeze_max_age as member space
utilization increases. This makes autovacuum more aggressive and also
reduces the threshold for a manual VACUUM to perform a full-table scan.
This patch leaves unsolved the problem of ensuring that emergency
autovacuums are triggered even when autovacuum=off. We'll need to fix
that via a separate patch.
Thomas Munro and Robert Haas
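
The ramp-down can be pictured with a small self-contained program (illustrative only: the 50% starting point and the linear scaling are assumptions for the example, not the exact formula the patch uses):

    #include <stdio.h>

    /* Illustrative: start ramping once half the member space is used. */
    static int effective_freeze_max_age(int configured, double utilization)
    {
        if (utilization <= 0.5)
            return configured;

        /* Scale linearly toward zero as utilization goes 50% -> 100%. */
        double remaining = (1.0 - utilization) / 0.5;
        return (int) (configured * remaining);
    }

    int main(void)
    {
        for (int pct = 40; pct <= 100; pct += 10)
            printf("member space %d%% used -> effective max age %d\n",
                   pct, effective_freeze_max_age(400000000, pct / 100.0));
        return 0;
    }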
The old formula didn't have enough parentheses, so it would do the wrong
thing, and it used / rather than % to find a remainder. The effect of
these oversights is that the stop point chosen by the logic introduced in
commit b69bf30b9bfacafc733a9ba77c9587cf54d06c0c might be rather
meaningless.
Thomas Munro, reviewed by Kevin Grittner, with a whitespace tweak by me.
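
A tiny runnable demonstration of why the operator matters (illustrative constant, not the actual formula from the commit): rounding an offset down to a segment boundary needs the remainder, not the quotient.

    #include <stdio.h>
    #include <stdint.h>

    #define MEMBERS_PER_SEGMENT 50000   /* illustrative */

    int main(void)
    {
        uint32_t offset = 123456;

        /* Intended: round down to the start of the containing segment. */
        uint32_t with_remainder = offset - (offset % MEMBERS_PER_SEGMENT);

        /* Using / where % was meant yields a meaningless stop point. */
        uint32_t with_quotient = offset - (offset / MEMBERS_PER_SEGMENT);

        printf("%% gives %u (segment start), / gives %u (garbage)\n",
               (unsigned) with_remainder, (unsigned) with_quotient);
        return 0;
    }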
Otherwise, if there's another crash, some writes from after the first
crash might make it to disk while writes from before the crash fail
to make it to disk. This could lead to data corruption.
Back-patch to all supported versions.
Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
by me.
We need to create the pg_multixact/offsets file deleted by pg_upgrade
much earlier than we originally did: the creation was in TrimMultiXact(), which
runs after we exit recovery, but it actually needs to run earlier than
the first call to SetMultiXactIdLimit (before recovery), because that
routine already wants to read the first offset segment.
Per pg_upgrade trouble report from Jeff Janes.
While at it, silence a compiler warning about a pointless assert that an
unsigned variable was being tested non-negative. This was a signed
constant in Thomas Munro's patch which I changed to unsigned before
commit. Pointed out by Andres Freund.
Reword messages, rename a confusingly named function.
Per Robert Haas.
Multixact member files are subject to early wraparound overflow and
removal: if the average multixact size is above a certain threshold (see
note below) the protections against offset overflow are not enough:
during multixact truncation at checkpoint time, some
pg_multixact/members files would be removed because the server considers
them to be old and not needed anymore. This leads to loss of files that
are critical to interpret existing tuples' Xmax values.
To protect against this, since we don't have enough info in pg_control
and we can't modify it in old branches, we maintain shared memory state
about the oldest value that we need to keep; we use this during new
multixact creation to abort if an old still-needed file would get
overwritten. This value is kept up to date by checkpoints, which makes
it not completely accurate, but that should be good enough. We start emitting
warnings somewhat earlier, so that the eventual multixact shutdown
doesn't take DBAs completely by surprise (more precisely, once only 20
member SLRU segments remain before shutdown).
On troublesome average multixact size: The threshold size depends on the
multixact freeze parameters. The oldest age is related to the greater of
multixact_freeze_table_age and multixact_freeze_min_age: anything
older than that should be removed promptly by autovacuum. If autovacuum
is keeping up with multixact freezing, the troublesome multixact average
size is
(2^32-1) / Max(freeze table age, freeze min age)
or around 28 members per multixact. Having an average multixact size
larger than that will eventually cause new multixact data to overwrite
the data area for older multixacts. (If autovacuum is not able to keep
up, or there are errors in vacuuming, the actual maximum is
autovacuum_multixact_freeze_max_age instead, at which point multixact
generation is stopped completely. The default value for this limit is 400
million, which means that the multixact size that would cause trouble is
about 10 members.)
Initial bug report by Timothy Garnett, bug #12990
Backpatch to 9.3, where the problem was introduced.
Authors: Álvaro Herrera, Thomas Munro
Reviews: Thomas Munro, Amit Kapila, Robert Haas, Kevin Grittner
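
The arithmetic behind those two numbers, assuming the defaults the message refers to (vacuum_multixact_freeze_table_age = 150 million, autovacuum_multixact_freeze_max_age = 400 million):

    #include <stdio.h>

    int main(void)
    {
        double member_space = 4294967295.0;     /* 2^32 - 1 member slots */

        /* Autovacuum keeping up: the oldest multixact stays within the
         * greater of the freeze ages, 150 million by default. */
        printf("trouble threshold: %.1f members/multixact\n",
               member_space / 150000000.0);     /* ~28.6 */

        /* Autovacuum not keeping up: generation stops at
         * autovacuum_multixact_freeze_max_age, 400 million by default. */
        printf("hard-stop threshold: %.1f members/multixact\n",
               member_space / 400000000.0);     /* ~10.7 */
        return 0;
    }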
When the startup process recovers transactions by scanning pg_twophase
directory, it should clear MyLockedGxact after it's done processing each
transaction, like we do during normal operation at PREPARE TRANSACTION.
Otherwise, if the startup process exits due to an error, it will try to
clear the locking_backend field of the last recovered transaction. That's
usually harmless, but if the error happens in MarkAsPreparing, while
holding TwoPhaseStateLock, the shmem-exit hook will try to acquire
TwoPhaseStateLock again, and deadlock with itself.
This fixes bug #13128 reported by Grant McAlister. The bug was introduced
by commit bb38fb0d, so backpatch to all supported versions like that
commit.
SLRU_SEGMENTS_PER_PAGE -> SLRU_PAGES_PER_SEGMENT
I introduced this ancient typo in subtrans.c and later propagated it to
multixact.c. I fixed the latter in f741300c, but only back to 9.3;
backpatch to all supported branches for consistency.
After a timeline switch, we would leave behind recycled WAL segments that
are in the future, but on the old timeline. After promotion, and after they
become old enough to be recycled again, we would notice that they don't have
a .ready or .done file, create a .ready file for them, and archive them.
That's bogus, because the files contain garbage, recycled from an older
timeline (or preallocated as zeros). We shouldn't archive such files.
This could happen when we're following a timeline switch during replay, or
when we switch to a new timeline at end-of-recovery.
To fix, whenever we switch to a new timeline, scan the data directory for
WAL segments on the old timeline, but with a higher segment number, and
remove them. Those don't belong to our timeline history, and are most
likely bogus recycled or preallocated files. They could also be valid files
that we streamed from the primary ahead of time, but in any case, they're
not needed to recover to the new timeline.
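
The removal criterion can be illustrated with WAL segment file names, which encode timeline, log, and segment number as three 8-digit hex fields. A simplified, runnable sketch of the decision (the segment-number arithmetic is simplified relative to the server's):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* True if the file is from an older timeline but numbered beyond the
     * switch point, i.e. it can only hold recycled/preallocated garbage
     * or streamed-ahead data we no longer need. */
    static bool should_remove(const char *fname,
                              unsigned new_tli, uint64_t switch_segno)
    {
        unsigned tli, logid, seg;

        if (sscanf(fname, "%08X%08X%08X", &tli, &logid, &seg) != 3)
            return false;

        uint64_t segno = ((uint64_t) logid << 32) | seg;  /* simplified */
        return tli < new_tli && segno > switch_segno;
    }

    int main(void)
    {
        /* Suppose we switched to timeline 2 within segment 5. */
        printf("%d\n", should_remove("000000010000000000000007", 2, 5)); /* 1 */
        printf("%d\n", should_remove("000000020000000000000007", 2, 5)); /* 0 */
        return 0;
    }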
Commit ed9cc2b5df59fdbc50cce37399e26b03ab2c1686 made it unnecessary to pass
start_nblkno to _hash_splitbucket(), and for that matter unnecessary to
have the internal nblkno variable either. My compiler didn't complain
about that, but some did. I also rearranged the use of oblkno a bit to
make that case more parallel.
Report and initial patch by Petr Jelinek, rearranged a bit by me.
Back-patch to all branches, like the previous patch.
_hash_splitbucket() obtained the base page of the new bucket by calling
_hash_getnewbuf(), but it held no exclusive lock that would prevent some
other process from calling _hash_getnewbuf() at the same time. This is
contrary to _hash_getnewbuf()'s API spec and could in fact cause failures.
In practice, we must only call that function while holding write lock on
the hash index's metapage.
An additional problem was that we'd already modified the metapage's bucket
mapping data, meaning that failure to extend the index would leave us with
a corrupt index.
Fix both issues by moving the _hash_getnewbuf() call to just before we
modify the metapage in _hash_expandtable().
Unfortunately there's still a large problem here, which is that we could
also incur ENOSPC while trying to get an overflow page for the new bucket.
That would leave the index corrupt in a more subtle way, namely that some
index tuples that should be in the new bucket might still be in the old
one. Fixing that seems substantially more difficult; even preallocating as
many pages as we could possibly need wouldn't entirely guarantee that the
bucket split would complete successfully. So for today let's just deal
with the base case.
Per report from Antonin Houska. Back-patch to all active branches.
Recovery delays are implemented by waiting on a latch, and latches take
milliseconds as a parameter. The required amount of waiting was computed
using microsecond resolution though and the wait loop's abort condition
was checking the delay in microseconds as well. This could lead to
short spurts of busy looping when the overall wait time was below a
millisecond, but above 0 microseconds.
Instead, just formulate the wait loop's abort condition in millisecond
granularity as well. Given that that's the resolution of
recovery_min_apply_delay, it seems harmless to not wait for less than a
millisecond.
Backpatch to 9.4 where recovery_min_apply_delay was introduced.
Discussion: 20150323141819.GH26995@alap3.anarazel.de
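
A sketch of the failure mode and the fix (simplified structure, not the actual recovery loop; WaitLatch is the real latch API, now_usecs()/now_msecs() are hypothetical helpers):

    /* Before (sketch): latch timeouts are in milliseconds, but the
     * remaining delay is computed and tested in microseconds.  When
     * 0 < remaining_usecs < 1000, wait_ms truncates to 0 and WaitLatch
     * returns immediately: a busy loop until the microseconds run out. */
    while (remaining_usecs > 0)
    {
        long    wait_ms = remaining_usecs / 1000;

        WaitLatch(latch, WL_LATCH_SET | WL_TIMEOUT, wait_ms);
        remaining_usecs = target_usecs - now_usecs();   /* hypothetical */
    }

    /* After (sketch): compute and test in milliseconds, the latch's own
     * granularity; sub-millisecond remainders are simply not waited for. */
    while (remaining_msecs > 0)
    {
        WaitLatch(latch, WL_LATCH_SET | WL_TIMEOUT, remaining_msecs);
        remaining_msecs = target_msecs - now_msecs();   /* hypothetical */
    }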
Per bug #12850 by Walter Nordmann. Backpatch to 9.4 where the leak was
introduced.
Up to now RecordTransactionCommit() waited for WAL to be flushed (if
synchronous_commit != off) and to be synchronously replicated (if
enabled), even if a transaction did not have a xid assigned. The primary
reason for that is that a sequence's nextval() did not assign a xid, but
its WAL records are still worthwhile to wait for on commit.
This can be problematic because sometimes read-only transactions do
write WAL, e.g. HOT page prune records. That could then lead to read-only
transactions having to wait during commit - not something people expect
in a read-only transaction.
This led to such strange symptoms as backends being seemingly stuck
during connection establishment when all synchronous replicas are
down - especially annoying when said stuck connection is the standby
trying to reconnect to allow syncrep again...
This behavior also is involved in a rather complicated <= 9.4 bug where
the transaction started by catchup interrupt processing waited for
syncrep using latches, but didn't get the wakeup because it was already
running inside the same overloaded signal handler. The fix here
doesn't properly solve that issue, it merely papers over the problems. In
9.5 catchup interrupts aren't processed out of signal handlers anymore.
To fix all this, make nextval() acquire a top level xid, and only wait for
transaction commit if a transaction both acquired a xid and emitted WAL
records. If only a xid has been assigned we don't want to wait uselessly
just because of writes to temporary/unlogged tables; if only WAL
has been written we don't want to wait just because of HOT prunes.
The xid assignment in nextval() is unlikely to cause overhead in
real-world workloads. For one, it only happens once every SEQ_LOG_VALS
(i.e. 32) values anyway; for another, only usage of nextval() without using
the result in an insert or similar is affected.
Discussion: 20150223165359.GF30784@awork2.anarazel.de,
369698E947874884A77849D8FE3680C2@maumau,
5CF4ABBA67674088B3941894E22A0D25@maumau
Per complaint from maumau and Thom Brown
Backpatch all the way back; 9.0 doesn't have syncrep, but it seems
better to be consistent behavior across all maintained branches.
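
The new commit-time rule boils down to one condition (a sketch with hypothetical variable names, not the actual RecordTransactionCommit() code):

    /* Sketch: flushing WAL / waiting for synchronous standbys at commit
     * is needed only when both are true: */
    bool    wait_at_commit = xid_assigned && wrote_wal;

    /* xid only -> e.g. writes to temporary/unlogged tables: nothing
     *             durable to flush or replicate, so don't wait.
     * WAL only -> e.g. a HOT prune in a read-only query: the record is
     *             not part of this transaction's visible effects. */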
Fix some issues I noticed while fooling with an extension to allow an
additional kind of toast pointer. Much of this is just comment
improvement, but there are a couple of actual bugs, which might or might
not be reachable today depending on what can happen during logical
decoding. An example is that toast_flatten_tuple() failed to cover the
possibility of an indirection pointer in its input. Back-patch to 9.4
just in case that is reachable now.
In HEAD, also correct some really minor issues with recent compression
reorganization, such as dangerously underparenthesized macros.
If an insertion or update had to wait for another transaction to finish,
because there was another insertion with conflicting key in progress,
we would pass a just-free'd item pointer to XactLockTableWait().
All calls to XactLockTableWait() and MultiXactIdWait() had similar issues.
Some passed a pointer to a buffer in the buffer cache, after already
releasing the lock. The call in EvalPlanQualFetch had already released the
pin too. All but the call in execUtils.c would merely lead to reporting a
bogus ctid, however (or an assertion failure, if enabled).
All the callers that passed HeapTuple->t_data->t_ctid were slightly bogus
anyway: if the tuple was updated (again) in the same transaction, its ctid
field would point to the next tuple in the chain, not the tuple itself.
Backpatch to 9.4, where the 'ctid' argument to XactLockTableWait was added
(in commit f88d4cfc)
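
The shape of the fix, as a hedged sketch (the APIs named are real, but the surrounding call-site code is elided and varies per caller):

    ItemPointerData ctid_copy;

    /* While the buffer is still pinned and locked, copy the tuple's own
     * tid (t_self, not t_data->t_ctid: after a same-transaction update,
     * t_ctid points at the *next* version, not this tuple). */
    ItemPointerCopy(&tuple->t_self, &ctid_copy);

    UnlockReleaseBuffer(buffer);

    /* Now the reported tid can't be clobbered or freed under us. */
    XactLockTableWait(xwait, relation, &ctid_copy, XLTW_Update);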
The requiredEntries / additionalEntries arrays were not freed in
freeScanKeys() like other per-key stuff.
It's not obvious, but startScanKey() was only ever called after the keys
have been initialized with ginNewScanKey(). That's why it doesn't need to
worry about freeing existing arrays. The ginIsNewKey() test in gingetbitmap
was never true, because ginrescan free's the existing keys, and it's not OK
to call gingetbitmap twice in a row without calling ginrescan in between.
To make that clear, remove the unnecessary ginIsNewKey(). And just to be
extra sure that nothing funny happens if there is an existing key after all,
call freeScanKeys() to free it if it exists. This makes the code more
straightforward.
(I'm seeing other similar leaks in testing a query that rescans a GIN index
scan, but that's a different issue. This just fixes the obvious leak with
those two arrays.)
Backpatch to 9.4, where GIN fast scan was added.
In 804b6b6db4dcfc590a468e7be390738f9f7755fb we modified
BuildIndexValueDescription to pay attention to which columns are visible
to the user, but unfortunately that commit neglected to consider indexes
which are built on expressions.
Handle error-reporting of violations of constraint indexes based on
expressions by not returning any detail when the user does not have
table-level SELECT rights.
Backpatch to 9.0, as the prior commit was.
Pointed out by Tom.
When gin_fuzzy_search_limit was used, we could jump out of startScan()
without calling startScanKey(). That was harmless in 9.3 and below, because
startScanKey() didn't do anything interesting, but in 9.4 it initializes
information needed for skipping entries (aka GIN fast scans), and you
readily get a segfault if it's not done. Nevertheless, it was clearly wrong
all along, so backpatch all the way to 9.1 where the early return was
introduced.
(AFAICS startScanKey() did nothing useful in 9.3 and below, because the
fields it initialized were already initialized in ginFillScanKey(), but I
don't dare to change that in a minor release. ginFillScanKey() is always
called in gingetbitmap() even though there's a check there to see if the
scan keys have already been initialized, because they never are; ginrescan()
frees them.)
In passing, remove an unnecessary if-check from the second inner loop in
startScan(). We already check in the first loop that the condition is true
for all entries.
Reported by Olaf Gawenda, bug #12694. Backpatch to 9.1 and above, although
AFAICS it causes a live bug only in 9.4.
While building error messages to return to the user,
BuildIndexValueDescription, ExecBuildSlotValueDescription and
ri_ReportViolation would happily include the entire key or entire row in
the result returned to the user, even if the user didn't have access to
view all of the columns being included.
Instead, include only those columns which the user is providing or which
the user has select rights on. If the user does not have any rights
to view the table or any of the columns involved then no detail is
provided and a NULL value is returned from BuildIndexValueDescription
and ExecBuildSlotValueDescription. Note that, for key cases, the user
must have access to all of the columns for the key to be shown; a
partial key will not be returned.
Back-patch all the way, as column-level privileges are now in all
supported versions.
This has been assigned CVE-2014-8161, but since the issue and the patch
have already been publicized on pgsql-hackers, there's no point in trying
to hide this commit.
The flag is supposed to be copied from the record. Same issue with
track_commit_timestamps, but that's master-only.
Report and fix by Petr Jelinek. Backpatch to 9.4, where wal_log_hints was
added.
Commit 0e5680f4737a9c6aa94aa9e77543e5de60411322 contained a thinko
mixing LOCKMODE with LockTupleMode. This caused misbehavior in the case
where a tuple is marked with a multixact with at most a FOR SHARE lock,
and another transaction tries to acquire a FOR NO KEY EXCLUSIVE lock;
this case should block but doesn't.
Include a new isolation tester spec file to explicitly try all the
tuple lock combinations; without the fix it shows the problem:

    starting permutation: s1_begin s1_lcksvpt s1_tuplock2 s2_tuplock3 s1_commit
    step s1_begin: BEGIN;
    step s1_lcksvpt: SELECT * FROM multixact_conflict FOR KEY SHARE; SAVEPOINT foo;
    a
    1
    step s1_tuplock2: SELECT * FROM multixact_conflict FOR SHARE;
    a
    1
    step s2_tuplock3: SELECT * FROM multixact_conflict FOR NO KEY UPDATE;
    a
    1
    step s1_commit: COMMIT;
With the fixed code, step s2_tuplock3 blocks until session 1 commits,
which is the correct behavior.
All other cases behave correctly.
Backpatch to 9.3, like the commit that introduced the problem.
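
The underlying trap is that C freely converts between distinct enum types, so a LOCKMODE passed where a LockTupleMode is expected typically compiles without complaint. A self-contained illustration (hypothetical enums, not the PostgreSQL definitions):

    #include <stdio.h>

    typedef enum { ROW_SHARE = 2, ROW_EXCLUSIVE = 3 } TableLockMode;
    typedef enum { TUP_KEYSHARE = 0, TUP_SHARE = 1 } TupleLockMode;

    static void lock_tuple(TupleLockMode mode)
    {
        printf("locking tuple with mode %d\n", (int) mode);
    }

    int main(void)
    {
        TableLockMode m = ROW_EXCLUSIVE;

        /* Many C compilers accept this silently, yet 3 isn't even a
         * valid TupleLockMode value: the thinko in a nutshell. */
        lock_tuple(m);
        return 0;
    }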
At one point in the development of this feature, it was claimed that
allowing negative values would be useful to compensate for timezone
differences between master and slave servers. That was based on a mistaken
assumption that commit timestamps are recorded in local time; but of course
they're in UTC. Nor is a negative apply delay likely to be a sane way of
coping with server clock skew. However, the committed patch still treated
negative delays as doing something, and the timezone misapprehension
survived in the user documentation as well.
If recovery_min_apply_delay were a proper GUC we'd just set the minimum
allowed value to be zero; but for the moment it seems better to treat
negative settings as if they were zero.
In passing do some extra wordsmithing on the parameter's documentation,
including correcting a second misstatement that the parameter affects
processing of Restore Point records.
Issue noted by Michael Paquier, who also provided the code patch; doc
changes by me. Back-patch to 9.4 where the feature was introduced.
We were trying to acquire the lock even when we were subsequently
not sleeping in some other transaction, which opens us up unnecessarily
to deadlocks. In particular, this is troublesome if an update tries to
lock an updated version of a tuple and finds itself doing EvalPlanQual
update chain walking; more than two sessions doing this concurrently
will find themselves sleeping on each other because the HW tuple lock
acquisition in heap_lock_tuple called from EvalPlanQualFetch races with
the same tuple lock being acquired in heap_update -- one of these
sessions sleeps on the other one to finish while holding the tuple lock,
and the other one sleeps on the tuple lock.
Per trouble report from Andrew Sackville-West in
http://www.postgresql.org/message-id/20140731233051.GN17765@andrew-ThinkPad-X230
His scenario can be simplified down to a relatively simple
isolationtester spec file which I don't include in this commit; the
reason is that the current isolationtester is not able to deal with more
than one blocked session concurrently and it blocks instead of raising
the expected deadlock. In the future, if we improve isolationtester, it
would be good to include the spec file in the isolation schedule. I
posted it in
http://www.postgresql.org/message-id/20141212205254.GC1768@alvh.no-ip.org
Hat tip to Mark Kirkwood, who helped diagnose the trouble.
We used time(NULL) to set a TimestampTz field, which gave bogus results.
Noticed while looking at pg_xlogdump output.
Backpatch to 9.3 and above, where the fast promotion was introduced.
When starting up from a basebackup taken off a standby extra logic has
to be applied to compute the point where the data directory is
consistent. Normal base backups use a WAL record for that purpose, but
that isn't possible on a standby.
That logic had an error check ensuring that the cluster's control file
indicates being in recovery. Unfortunately that check was too strict,
disregarding the fact that the control file could also indicate that
the cluster was shut down while in recovery.
That's possible when a cluster starting from a basebackup is shut
down before the backup label has been removed. When everything goes
well that's a short window, but when either restore_command or
primary_conninfo isn't configured correctly the window can get much
wider. That's because in between reading and unlinking the label we
restore the last checkpoint from WAL which can need additional WAL.
To fix, simply also allow starting when the control file indicates
"shutdown in recovery". There are nicer fixes imaginable, but they'd be
more invasive.
Backpatch to 9.2 where support for taking basebackups from standbys
was added.
In passing, also make some debugging elog's in pgstat.c a bit more
consistently worded.
Back-patch as far as applicable (9.3 or 9.4; none of these mistakes are
really old).
Mark Dilger identified and patched the type violations; the message
rewordings are mine.
It was an oversight in the original commit.
Also note in the sample config file that changing wal_log_hints requires a
restart.
Michael Paquier. Backpatch to 9.4, where wal_log_hints was added.