| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
| |
Dates outside the supported range could be entered, but would not print
reasonably, and operations such as conversion to timestamp wouldn't behave
sanely either. Since this has the potential to result in undumpable table
data, it seems worth back-patching.
Hitoshi Harada
|
|
|
|
|
|
|
|
|
| |
This partially reverts commit 21a39de5809cd3050a37d2554323cc1d0cbeed9d,
restoring the pre-9.2 cost estimates for index usage. That change
introduced much too large a bias against larger indexes, as per reports
from Jeff Janes and others. The whole thing needs a rewrite, which I've
done in HEAD, but the safest thing to do in 9.2 is just to undo this
multiplier change.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If VirtualXactLock() has to wait for a transaction that holds its VXID lock
as a fast-path lock, it must first convert the fast-path lock to a regular
lock. It failed to take the required "partition" lock on the main
shared-memory lock table while doing so. This is the direct cause of the
assert failure in GetLockStatusData() recently observed in the buildfarm,
but more worryingly it could result in arbitrary corruption of the shared
lock table if some other process were concurrently engaged in modifying the
same partition of the lock table. Fortunately, VirtualXactLock() is only
used by CREATE INDEX CONCURRENTLY and DROP INDEX CONCURRENTLY, so the
opportunities for failure are fewer than they might have been.
In passing, improve some comments and be a bit more consistent about
order of operations.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
SPI_execute() and related functions create a CachedPlan, execute it once,
and immediately discard it, so that the functionality offered by
plancache.c is of no value in this code path. And performance measurements
show that the extra data copying and invalidation checking done by
plancache.c slows down simple queries by 10% or more compared to 9.1.
However, enough of the SPI code is shared with functions that do need plan
caching that it seems impractical to bypass plancache.c altogether.
Instead, let's invent a variant version of cached plans that preserves
99% of the API but doesn't offer any of the actual functionality, nor the
overhead. This puts SPI_execute() performance back on par, or maybe even
slightly better, than it was before. This change should resolve recent
complaints of performance degradation from Dong Ye, Pavel Stehule, and
others.
By avoiding data copying, this change also reduces the amount of memory
needed to execute many-statement SPI_execute() strings, as for instance in
a recent complaint from Tomas Vondra.
An additional benefit of this change is that multi-statement SPI_execute()
query strings are now processed fully serially, that is we complete
execution of earlier statements before running parse analysis and planning
on following ones. This eliminates a long-standing POLA violation, in that
DDL that affects the behavior of a later statement will now behave as
expected.
Back-patch to 9.2, since this was a performance regression compared to 9.1.
(In 9.2, place the added struct fields so as to avoid changing the offsets
of existing fields.)
Heikki Linnakangas and Tom Lane
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If you take a base backup from a standby server with "pg_basebackup -X
fetch", and the timeline switches while the backup is being taken, the
backup used to fail with an error "requested WAL segment %s has already
been removed". This is because the server-side code that sends over the
required WAL files would not construct the WAL filename with the correct
timeline after a switch.
Fix that by using readdir() to scan pg_xlog for all the WAL segments in the
range, regardless of timeline.
Also, include all timeline history files in the backup, if taken with
"-X fetch". That fixes another related bug: If a timeline switch happened
just before the backup was initiated in a standby, the WAL segment
containing the initial checkpoint record contains WAL from the older
timeline too. Recovery will not accept that without a timeline history file
that lists the older timeline.
Backpatch to 9.2. Versions prior to that were not affected as you could not
take a base backup from a standby before 9.2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The cascading standby patch in 9.2 changed the way WAL files are treated
when restored from the archive. Before, they were restored under a temporary
filename, and not kept in pg_xlog, but after the patch, they were copied
under pg_xlog. This is necessary for a cascading standby to find them, but
it also means that if the archive goes offline and a standby is restarted,
it can recover back to where it was using the files in pg_xlog. It also
means that if you take an offline backup from a standby server, it includes
all the required WAL files in pg_xlog.
However, the same change was not made to timeline history files, so if the
WAL segment containing the checkpoint record contains a timeline switch, you
will still get an error if you try to restart recovery without the archive,
or recover from an offline backup taken from the standby.
With this patch, timeline history files restored from archive are copied
into pg_xlog like WAL files are, so that pg_xlog contains all the files
required to recover. This is a corner-case pre-existing issue in 9.2, but
even more important in master where it's possible for a standby to follow a
timeline switch through streaming replication. To make that possible, the
timeline history files must be present in pg_xlog.
|
|
|
|
|
|
|
|
| |
Code review for commit 2f582f76b1945929ff07116cd4639747ce9bb8a1: don't use
a static variable for what ought to be a deparse_context field, fix
non-multibyte-safe test for spaces, avoid useless and potentially O(N^2)
(though admittedly with a very small constant) calculations of wrap
positions when we aren't going to wrap.
|
|
|
|
|
|
|
|
|
|
|
|
| |
transformExpr() is required to cope with already-transformed expression
trees, for various ugly-but-not-quite-worth-cleaning-up reasons. However,
some of its newer subroutines hadn't gotten the memo. This accounts for
bug #7763 from Norbert Buchmuller: transformRowExpr() was overwriting the
previously determined type of a RowExpr during CREATE TABLE LIKE INCLUDING
INDEXES. Additional investigation showed that transformXmlExpr had the
same kind of problem, but all the other cases seem to be safe.
Andres Freund and Tom Lane
|
|
|
|
|
|
|
|
| |
If a relation file was removed when the server-side counterpart of
pg_basebackup was just about to open it to send it to the client, you'd
get a "could not open file" error. Fix that.
Backpatch to 9.1, this goes back to when pg_basebackup was introduced.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If pg_extension_config_dump() is executed again for a table already listed
in the extension's extconfig, the code was blindly making a new array entry.
This does not seem useful. Fix it to replace the existing array entry
instead, so that it's possible for extension update scripts to alter the
filter conditions for configuration tables.
In addition, teach ALTER EXTENSION DROP TABLE to check for an extconfig
entry for the target table, and remove it if present. This is not a 100%
solution because it's allowed for an extension update script to just
summarily DROP a member table, and that code path doesn't go through
ExecAlterExtensionContentsStmt. We could probably make that case clean
things up if we had to, but it would involve sticking a very ugly wart
somewhere in the guts of dependency.c. Since on the whole it seems quite
unlikely that extension updates would want to remove pre-existing
configuration tables, making the case possible with an explicit command
seems sufficient.
Per bug #7756 from Regina Obe. Back-patch to 9.1 where extensions were
introduced.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After the recovery target timeline is changed, we would still recycle and
preallocate WAL segments on the old target timeline. Those WAL segments
created for the old timeline are a waste of space, although otherwise
harmless.
The problem is that when installing a recycled WAL segment as a future one,
ThisTimeLineID is used to construct the filename. ThisTimeLineID is
initialized in the checkpointer process to the recovery target timeline at
startup, but it was not updated when the startup process chooses a new
target timeline (recovery_target_timeline='latest'). To fix, always update
ThisTimeLineID before recycling WAL segments at a restartpoint.
This still leaves a small window where we might install WAL segments under
wrong timeline ID, if the target timeline is changed just as we're about to
start recycling. Also, when we're not on the target timeline yet, but still
replaying some older timeline, we'll install WAL segments to the newer
timeline anyway and they will still go wasted. We'll just live with the
waste in that situation.
Commit to 9.2 and 9.1. Older versions didn't change recovery target timeline
after startup, and for master, I'll commit a slightly different variant of
this.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If you restored from a backup taken from a standby, and the last record in
the backup is the checkpoint record, ie. there is no redo required except
for the checkpoint record, we would fail to notice that we've reached the
end-of-backup point, and the database is consistent. The result was an
error "WAL ends before end of online backup". To fix, move the
have-we-reached-end-of-backup check into CheckRecoveryConsistency(), which
is already responsible for similar checks with minRecoveryPoint, and is
called in the right places.
Backpatch to 9.2, this check and bug did not exist before that.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
During crash recovery, we remove disk files belonging to temporary tables,
but the system catalog entries for such tables are intentionally not
cleaned up right away. Instead, the first backend that uses a temp schema
is expected to clean out any leftover objects therein. This approach
requires that we be careful to ignore leftover temp tables (since any
actual access attempt would fail), *even if their BackendId matches our
session*, if we have not yet established use of the session's corresponding
temp schema. That worked fine in the past, but was broken by commit
debcec7dc31a992703911a9953e299c8d730c778 which incorrectly removed the
rd_islocaltemp relcache flag. Put it back, and undo various changes
that substituted tests like "rel->rd_backend == MyBackendId" for use
of a state-aware flag. Per trouble report from Heikki Linnakangas.
Back-patch to 9.1 where the erroneous change was made. In the back
branches, be careful to add rd_islocaltemp in a spot in the struct that
was alignment padding before, so as not to break existing add-on code.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We failed to ever fill the sixth line (LISTEN_ADDR), which caused the
attempt to fill the seventh line (SHMEM_KEY) to fail, so that the shared
memory key never got added to the file in standalone mode. This has been
broken since we added more content to our lock files in 9.1.
To fix, tweak the logic in CreateLockFile to add an empty LISTEN_ADDR
line in standalone mode. This is a tad grotty, but since that function
already knows almost everything there is to know about the contents of
lock files, it doesn't seem that it's any better to hack it elsewhere.
It's not clear how significant this bug really is, since a standalone
backend should never have any children and thus it seems not critical
to be able to check the nattch count of the shmem segment externally.
But I'm going to back-patch the fix anyway.
This problem had escaped notice because of an ancient (and in hindsight
pretty dubious) decision to suppress LOG-level messages by default in
standalone mode; so that the elog(LOG) complaint in AddToDataDirLockFile
that should have warned of the problem didn't do anything. Fixing that
is material for a separate patch though.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a tuple is larger than page size minus space reserved for fillfactor,
heap_multi_insert would never find a page that it fits in and repeatedly ask
for a new page from RelationGetBufferForTuple. If a tuple is too large to
fit on any page, taking fillfactor into account, RelationGetBufferForTuple
will always expand the relation. In a normal insert, heap_insert will accept
that and put the tuple on the new page. heap_multi_insert, however, does a
fillfactor check of its own, and doesn't accept the newly-extended page
RelationGetBufferForTuple returns, even though there is no other choice to
make the tuple fit.
Fix that by making the logic in heap_multi_insert more like the heap_insert
logic. The first tuple is always put on the page RelationGetBufferForTuple
gives us, and the fillfactor check is only applied to the subsequent tuples.
Report from David Gould, although I didn't use his patch.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The dynahash code requires the number of buckets in a hash table to fit
in an int; but since we calculate the desired hash table size dynamically,
there are various scenarios where we might calculate too large a value.
The resulting overflow can lead to infinite loops, division-by-zero
crashes, etc. I (tgl) had previously installed some defenses against that
in commit 299d1716525c659f0e02840e31fbe4dea3, but that covered only one
call path. Moreover it worked by limiting the request size to work_mem,
but in a 64-bit machine it's possible to set work_mem high enough that the
problem appears anyway. So let's fix the problem at the root by installing
limits in the dynahash.c functions themselves.
Trouble report and patch by Jeff Davis.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
EndRecPtr is the last record that we've read, but not necessarily yet
replayed. CheckRecoveryConsistency should compare minRecoveryPoint with the
last replayed record instead. This caused recovery to think it's reached
consistency too early.
Now that we do the check in CheckRecoveryConsistency correctly, we have to
move the call of that function to after redoing a record. The current place,
after reading a record but before replaying it, is wrong. In particular, if
there are no more records after the one ending at minRecoveryPoint, we don't
enter hot standby until one extra record is generated and read by the
standby, and CheckRecoveryConsistency is called. These two bugs conspired
to make the code appear to work correctly, except for the small window
between reading the last record that reaches minRecoveryPoint, and
replaying it.
In the passing, rename recoveryLastRecPtr, which is the last record
replayed, to lastReplayedEndRecPtr. This makes it slightly less confusing
with replayEndRecPtr, which is the last record read that we're about to
replay.
Original report from Kyotaro HORIGUCHI, further diagnosis by Fujii Masao.
Backpatch to 9.0, where Hot Standby subtly changed the test from
"minRecoveryPoint < EndRecPtr" to "minRecoveryPoint <= EndRecPtr". The
former works because where the test is performed, we have always read one
more record than we've replayed.
|
|
|
|
|
|
|
|
|
| |
If a file is truncated, we must update minRecoveryPoint. Once a file is
truncated, there's no going back; it would not be safe to stop recovery
at a point earlier than that anymore.
Per report from Kyotaro HORIGUCHI. Backpatch to 8.4. Before that,
minRecoveryPoint was not updated during recovery at all.
|
|
|
|
|
|
|
|
| |
Commit 729205571e81b4767efc42ad7beb53663e08d1ff added privileges on data
types, but there were a number of oversights. The implementation of
default privileges for types missed a few places, and pg_dump was
utterly innocent of the whole concept. Per bug #7741 from Nathan Alden,
and subsequent wider investigation.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When deleteOneObject closes and reopens the pg_depend relation,
we must see to it that the relcache pointer held by the calling function
(typically performMultipleDeletions) is updated. Usually the relcache
entry is retained so that the pointer value doesn't change, which is why
the problem had escaped notice ... but after a cache flush event there's
no guarantee that the same memory will be reassigned. To fix, change
the recursive functions' APIs so that we pass around a "Relation *"
not just "Relation".
Per investigation of occasional buildfarm failures. This is trivial
to reproduce with -DCLOBBER_CACHE_ALWAYS, which points up the sad
lack of any buildfarm member running that way on a regular basis.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we're not in hot standby mode, then there's no way for users to connect
to reset the recoveryPause flag, so we shouldn't pause. The code was aware
of this but the test to see if pausing was safe was seriously inadequate:
it wasn't paying attention to reachedConsistency, and besides what it was
testing was that we could legally enter hot standby, not that we have
done so. Get rid of that in favor of checking LocalHotStandbyActive,
which because of the coding in CheckRecoveryConsistency is tantamount to
checking that we have told the postmaster to enter hot standby.
Also, move the recoveryPausesHere() call that reacts to asynchronous
recoveryPause requests so that it's not in the middle of application of a
WAL record. I put it next to the recoveryStopsHere() call --- in future
those are going to need to interact significantly, so this seems like a
good waystation.
Also, don't bother trying to read another WAL record if we've already
decided not to continue recovery. This was no big deal when the code was
written originally, but now that reading a record might entail actions like
fetching an archive file, it seems a bit silly to do it like that.
Per report from Jeff Janes and subsequent discussion. The pause feature
needs quite a lot more work, but this gets rid of some indisputable bugs,
and seems safe enough to back-patch.
|
|
|
|
|
|
|
|
|
|
| |
When waiting for an XLOG_BACKUP_RECORD the minRecoveryPoint
will be incorrect, so we must not declare recovery as consistent
before we have seen the record. Major bug allowing recovery to end
too early in some cases, allowing people to see inconsistent db.
This patch to HEAD and 9.2, other fix required for 9.1 and 9.0
Simon Riggs and Andres Freund, bug report by Jeff Janes
|
|
|
|
|
|
|
|
|
|
|
|
| |
During VACUUM if we pause to perform a cycle
of index cleanup we drop the vmbuffer pin,
so we should do the same thing when heap
scan completes. This avoids holding vmbuffer
pin across the main index cleanup in VACUUM,
which could be minutes or hours longer than
necessary for correctness.
Bug report and suggested fix from Pavan Deolasee
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit c11130690d6dca64267201a169cfb38c1adec5ef in favor of
actually fixing the problem: namely, that we should never have been
modifying the checkpoint record's nextXid at this point to begin with.
The nextXid should match the state as of the checkpoint's logical WAL
position (ie the redo point), not the state as of its physical position.
It's especially bogus to advance it in some wal_levels and not others.
In any case there is no need for the checkpoint record to carry the
same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by
LogStandbySnapshot, as any replay operation will already have adopted
that value as current.
This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug
#6291 from Daniel Farina, in that if a checkpoint were in progress at the
instant of XID wraparound, the epoch bump would be lost as reported.
(And, of course, these days there's at least a 50-50 chance of a checkpoint
being in progress at any given instant.)
Diagnosed by me and independently by Andres Freund. Back-patch to all
branches supporting hot standby.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If wal_level = hot_standby we update the checkpoint nextxid,
though in the case where a wraparound occurred half-way through
a checkpoint we would neglect updating the epoch also. Updating
the nextxid is arguably the wrong thing to do, but changing that
may introduce subtle bugs into hot standby startup, while updating
the value doesn't cause any known bugs yet. Minimal fix now to
HEAD and backbranches, wider fix later in HEAD.
Bug reported in #6291 by Daniel Farina and slightly differently in
Cause analysis and recommended fixes from Tom Lane and Andres Freund.
Applied patch is minimal version of Andres Freund's work.
|
|
|
|
| |
Commit 81107282a changed it in assert.c, but overlooked this other file.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we had not been holding buffer pin continuously since the tuple was
initially fetched by the UPDATE or DELETE query, it would be possible for
VACUUM or a page-prune operation to move the tuple while we're trying to
copy it. This would result in a garbage "old" tuple value being passed to
an AFTER ROW UPDATE or AFTER ROW DELETE trigger. The preconditions for
this are somewhat improbable, and the timing constraints are very tight;
so it's not so surprising that this hasn't been reported from the field,
even though the bug has been there a long time.
Problem found by Andres Freund. Back-patch to all active branches.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The length of a socket path name is constrained by the size of struct
sockaddr_un, and there's not a lot we can do about it since that is a
kernel API. However, it would be a good thing if we produced an
intelligible error message when the user specifies a socket path that's too
long --- and getaddrinfo's standard API is too impoverished to do this in
the natural way. So insert explicit tests at the places where we construct
a socket path name. Now you'll get an error that makes sense and even
tells you what the limit is, rather than something generic like
"Non-recoverable failure in name resolution".
Per trouble report from Jeremy Drake and a fix idea from Andrew Dunstan.
|
|
|
|
| |
Resolves bug 7572 reported by Daniele Varrazzo
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, make the final state be indisvalid = true and indisready = false,
which is otherwise nonsensical. This is pretty ugly but we can't add
another column without forcing initdb, and it's too late for that in 9.2.
(There's a cleaner fix in HEAD.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
Portions of this need to be back-patched even further, but I'll work
on that separately.
Problem reported by Amit Kapila, diagnosis by Pavan Deolasee,
fix by Tom Lane and Andres Freund.
|
|
|
|
|
|
|
|
| |
This was apparently a typo, which caused recovery to think that it
immediately reached the end of backup, and allowed the database to start
up too early.
Reported by Jeff Janes. Backpatch to 9.2, where this code was introduced.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit d573e239f03506920938bf0be56c868d9c3416da, "Take fewer
snapshots". While that seemed like a good idea at the time, it caused
execution to use a snapshot that had been acquired before locking any of
the tables mentioned in the query. This created user-visible anomalies
that were not present in any prior release of Postgres, as reported by
Tomas Vondra. While this whole area could do with a redesign (since there
are related cases that have anomalies anyway), it doesn't seem likely that
any future patch would be reasonably back-patchable; and we don't want 9.2
to exhibit a behavior that's subtly unlike either past or future releases.
Hence, revert to prior code while we rethink the problem.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In a query such as "SELECT DISTINCT min(x) FROM tab", the DISTINCT is
pretty useless (there being only one output row), but nonetheless it
shouldn't fail. But it could fail if "tab" is an inheritance parent,
because planagg.c's code for fixing up equivalence classes after making the
index-optimized MIN/MAX transformation wasn't prepared to find child-table
versions of the aggregate expression. The least ugly fix seems to be
to add an option to mutate_eclass_expressions() to skip child-table
equivalence class members, which aren't used anymore at this stage of
planning so it's not really necessary to fix them. Since child members
are ignored in many cases already, it seems plausible for
mutate_eclass_expressions() to have an option to ignore them too.
Per bug #7703 from Maxim Boguk.
Back-patch to 9.1. Although the same code exists before that, it cannot
encounter child-table aggregates AFAICS, because the index optimization
transformation cannot succeed on inheritance trees before 9.1 (for lack
of MergeAppend).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
XLogRecPtrIsInvalid() only checks the xrecoff field, which is correct when
checking if a WAL record could legally begin at the given position, but WAL
sending can legally be paused at a page boundary, in which case xrecoff is
0. Use XLByteEQ(..., InvalidXLogRecPtr) instead, which checks that both
xlogid and xrecoff are 0.
9.3 doesn't have this problem because XLogRecPtr is now a single 64-bit
integer, so XLogRecPtrIsInvalid() does the right thing. Apply to 9.2, and
9.1 where pg_stat_replication view was introduced.
Kyotaro HORIGUCHI, reviewed by Fujii Masao.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When startup process opens a WAL segment after replaying part of it, it
validates the first page on the WAL segment, even though the page it's
really interested in later in the file. As part of the validation, it checks
that the TLI on the page header is >= the TLI it saw on the last page it
read. If the segment contains a timeline switch, and we have already
replayed it, and then re-open the WAL segment (because of streaming
replication got disconnected and reconnected, for example), the TLI check
will fail when the first page is validated. Fix that by relaxing the TLI
check when re-opening a WAL segment.
Backpatch to 9.0. Earlier versions had the same code, but before standby
mode was introduced in 9.0, recovery never tried to re-read a segment after
partially replaying it.
Reported by Amit Kapila, while testing a new feature.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Once we've received a shutdown signal (SIGINT or SIGTERM), we should not
launch any more child processes, even if we get signals requesting such.
The normal code path for spawning backends has always understood that,
but the postmaster's infrastructure for hot standby and autovacuum didn't
get the memo. As reported by Hari Babu in bug #7643, this could lead to
failure to shut down at all in some cases, such as when SIGINT is received
just before the startup process sends PMSIGNAL_RECOVERY_STARTED: we'd
launch a bgwriter and checkpointer, and then those processes would have no
idea that they ought to quit. Similarly, launching a new autovacuum worker
would result in waiting till it finished before shutting down.
Also, switch the order of the code blocks in reaper() that detect startup
process crash versus shutdown termination. Once we've sent it a signal,
we should not consider that exit(1) is surprising. This is just a cosmetic
fix since shutdown occurs correctly anyway, but better not to log a phony
complaint about startup process crash.
Back-patch to 9.0. Some parts of this might be applicable before that,
but given the lack of prior complaints I'm not going to worry too much
about older branches.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some platforms throw an exception for this division, rather than returning
a necessarily-overflowed result. Since we were testing for overflow after
the fact, an exception isn't nice. We can avoid the problem by treating
division by -1 as negation.
Add some regression tests so that we'll find out if any compilers try to
optimize away the overflow check conditions.
Back-patch of commit 1f7cb5c30983752ff8de833de30afcaee63536d0.
Per discussion with Xi Wang, though this is different from the patch he
submitted.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous definitions of these GUC variables allowed them to range
up to INT_MAX, but in point of fact the underlying code would suffer
overflows or other errors with large values. Reduce the maximum values
to something that won't misbehave. There's no apparent value in working
harder than this, since very large delays aren't sensible for any of
these. (Note: the risk with archive_timeout is that if we're late
checking the state, the timestamp difference it's being compared to
might overflow. So we need some amount of slop; the choice of INT_MAX/2
is arbitrary.)
Per followup investigation of bug #7670. Although this isn't a very
significant fix, might as well back-patch.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We need to avoid calling WaitLatch with timeouts exceeding INT_MAX.
Fortunately a simple clamp will do the trick, since no harm is done if
the wait times out before it's really time to rotate the log file.
Per bug #7670 (probably bug #7545 is the same thing, too).
In passing, fix bogus definition of log_rotation_age's maximum value in
guc.c --- it was numerically right, but only because MINS_PER_HOUR and
SECS_PER_MINUTE have the same value.
Back-patch to 9.2. Before that, syslogger wasn't using WaitLatch.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Traditionally check_partial_indexes() has only looked at restriction
clauses while trying to prove partial indexes usable in queries. However,
join clauses can also be used in some cases; mainly, that a strict operator
on "x" proves an "x IS NOT NULL" index predicate, even if the operator is
in a join clause rather than a restriction clause. Adding this code fixes
a regression in 9.2, because previously we would take join clauses into
account when considering whether a partial index could be used in a
nestloop inner indexscan path. 9.2 doesn't handle nestloop inner
indexscans in the same way, and this consideration was overlooked in the
rewrite. Moving the work to check_partial_indexes() is a better solution
anyway, since the proof applies whether or not we actually use the index
in that particular way, and we don't have to do it over again for each
possible outer relation. Per report from Dave Cramer.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The correct answer for this (or any other case with arg2 = -1) is zero,
but some machines throw a floating-point exception instead of behaving
sanely. Commit f9ac414c35ea084ff70c564ab2c32adb06d5296f dealt with this
in int4mod, but overlooked the fact that it also happens in int8mod
(at least on my Linux x86_64 machine). Protect int2mod as well; it's
not clear whether any machines fail there (mine does not) but since the
test is so cheap it seems better safe than sorry. While at it, simplify
the original guard in int4mod: we need only check for arg2 == -1, we
don't need to check arg1 explicitly.
Xi Wang, with some editing by me.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
record_out() leaks memory: it fails to free the strings returned by the
per-column output functions, and also is careless about detoasted values.
This results in a query-lifespan memory leakage when returning composite
values to the client, because printtup() runs the output functions in the
query-lifespan memory context. Fix it to handle these issues the same way
printtup() does. Also fix a similar leakage in record_send().
(At some point we might want to try to run output functions in
shorter-lived memory contexts, so that we don't need a zero-leakage policy
for them. But that would be a significantly more invasive patch, which
doesn't seem like material for back-patching.)
In passing, use appendStringInfoCharMacro instead of appendStringInfoChar
in the innermost data-copying loop of record_out, to try to shave a few
cycles from this function's runtime.
Per trouble report from Carlos Henrique Reimer. Back-patch to all
supported versions.
|
|
|
|
|
|
|
|
| |
At commit all standby locks are released
for the top-level transaction, so searching
for locks for each subtransaction is both
pointless and costly (N^2) in the presence
of many AccessExclusiveLocks.
|
|
|
|
| |
Andres Freund and Simon Riggs
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states. This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.
The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page. This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates. Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another. Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.
To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time. This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images. We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods. A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specfic functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.
In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images. In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are
still there in the header files in previous branches, but are no longer
used by the code.
In addition, fix some other bugs identified in the course of making these
changes:
spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already. This would result in permanent index corruption,
not just transient failure of concurrent queries.
Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.
heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image. A plain exclusive lock seems sufficient, so change to
that.
Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.
Back-patch to 9.0, where hot standby was introduced. Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records. Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since transformSetOperationTree() recurses, it can be driven to stack
overflow with enough UNION/INTERSECT/EXCEPT clauses in a query. Add a
check to ensure it fails cleanly instead of crashing. Per report from
Matthew Gerber (though it's not clear whether this is the only thing
going wrong for him).
Historical note: I think the reasoning behind not putting a check here in
the beginning was that the check in transformExpr() ought to be sufficient
to guard the whole parser. However, because transformSetOperationTree()
recurses all the way to the bottom of the set-operation tree before doing
any analysis of the statement's expressions, that check doesn't save it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the sleep is interrupted by a signal, we must recompute the remaining
time to wait; otherwise, a steady stream of non-wait-terminating interrupts
could delay return from WaitLatch indefinitely. This has been shown to be
a problem for the autovacuum launcher, and there may well be other places
now or in the future with similar issues. So we'd better make the function
robust, even though this'll add at least one gettimeofday call per wait.
Back-patch to 9.2. We might eventually need to fix 9.1 as well, but the
code is quite different there, and the usage of WaitLatch in 9.1 is so
limited that it's not clearly important to do so.
Reported and diagnosed by Jeff Janes, though I rewrote his patch rather
heavily.
|