| Commit message (Collapse) | Author | Age |
... | |
|
|
|
|
|
|
|
|
|
| |
MarkBufferDirtyHint() writes WAL, and should know if it's got a
standard buffer or not. Currently, the only callers where buffer_std
is false are related to the FSM.
In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.
Back-patch to 9.3.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
SP-GiST's original scheme for avoiding deadlocks during concurrent index
insertions doesn't work, as per report from Hailong Li, and there isn't any
evident way to make it work completely. We could possibly lock individual
inner tuples instead of their whole pages, but preliminary experimentation
suggests that the performance penalty would be huge. Instead, if we fail
to get a buffer lock while descending the tree, just restart the tree
descent altogether. We keep the old tuple positioning rules, though, in
hopes of reducing the number of cases where this can happen.
Teodor Sigaev, somewhat edited by Tom Lane
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
elog.c has historically treated LOG messages as low-priority during
bootstrap and standalone operation. This has led to confusion and even
masked a bug, because the normal expectation of code authors is that
elog(LOG) will put something into the postmaster log, and that wasn't
happening during initdb. So get rid of the special-case rule and make
the priority order the same as it is in normal operation. To keep from
cluttering initdb's output and the behavior of a standalone backend,
tweak the severity level of three messages routinely issued by xlog.c
during startup and shutdown so that they won't appear in these cases.
Per my proposal back in December.
|
|
|
|
|
|
|
|
| |
Since commit f21bb9cfb5646e1793dcc9c0ea697bab99afa523, this function
ignores the caller-provided length and loops until it finds a
terminator, which GetVirtualXIDsDelayingChkpt() never adds. Restore the
previous loop control logic. In passing, revert the addition of an
unused variable by the same commit, presumably a debugging relic.
|
| |
|
|
|
|
|
|
| |
A few more minor spelling corrections, no functional changes.
Thom Brown
|
|
|
|
|
|
|
|
| |
Seems cleaner to get the currently-replayed TLI in the same call to
GetXLogReplayRecPtr that we get the WAL position. Make it more clear in the
comment what the code does when recovery has already ended
(RecoveryInProgress() will set ThisTimeLineID in that case). Finally, make
resetting ThisTimeLineID afterwards more explicit.
|
|
|
|
|
|
| |
Fix a few spelling mistakes.
Per bug report #8193 from Lajos Veres.
|
|
|
|
|
|
|
|
|
|
| |
Make slightly better decisions about indentation than what pgindent
is capable of. Mostly breaking out long function calls into one
line per argument, with a few other minor adjustments.
No functional changes- all whitespace.
pgindent ran cleanly (didn't change anything) after.
Passes all regressions.
|
|
|
|
|
| |
This is the first run of the Perl-based pgindent script. Also update
pgindent instructions.
|
|
|
|
|
|
|
| |
Not necessary for correctness, just to make
log_checkpoints output look less singular.
Requested by Fujii Masao
|
|
|
|
|
|
|
|
|
|
|
|
| |
checkpointer needs to reset ThisTimeLineID after
a restartpoint to allow installing/recycling new
WAL files. If recovery has already ended this
would leave ThisTimeLineID set incorrectly and
so we must reset it otherwise later checkpoints
do not have the correct timeline.
Bug report by Heikki Linnakangas.
Further investigation by Heikki and myself.
|
|
|
|
|
|
|
|
| |
This simplifies the handling of crashes after fast promotion and various
minor cases that can exist in short timing windows around that case.
Broad fix to bug reported by Michael Paquier on -hackers,
approach prompted by Heikki Linnakangas
|
| |
|
|
|
|
| |
Michael Paquier
|
|
|
|
|
|
|
|
|
|
|
| |
If OID wraparound should occur while in standalone mode (unlikely but
possible), we want to advance the counter to FirstNormalObjectId not
FirstBootstrapObjectId. Otherwise, user objects might be created with OIDs
in the system-reserved range. That isn't immediately harmful but it poses
a risk of conflicts during future pg_upgrade operations.
Noted by Andres Freund. Back-patch to all supported branches, since all of
them are supported sources for pg_upgrade operations.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit d22a09dc70f9830fa78c1cd1a3a453e4e473d354 introduced official support
for GiST consistentFns that want to cache data using the FmgrInfo fn_extra
pointer: the idea was to preserve the cached values across gistrescan(),
whereas formerly they'd been leaked. However, there was an oversight in
that, namely that multiple scan keys might reference the same column's
consistentFn; the code would result in propagating the same cache value
into multiple scan keys, resulting in crashes or wrong answers. Use a
separate array instead to ensure that each scan key keeps its own state.
Per bug #8143 from Joel Roller. Back-patch to 9.2 where the bug was
introduced.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a standby server has a cascading standby server connected to it, it's
possible that WAL has already been sent up to the next WAL page boundary,
splitting a WAL record in the middle, when the first standby server is
promoted. Don't throw an assertion failure or error in walsender if that
happens.
Also, fix a variant of the same bug in pg_receivexlog: if it had already
received WAL on previous timeline up to a segment boundary, when the
upstream standby server is promoted so that the timeline switch record falls
on the previous segment, pg_receivexlog would miss the segment containing
the timeline switch. To fix that, have walsender send the position of the
timeline switch at end-of-streaming, in addition to the next timeline's ID.
It was previously assumed that the switch happened exactly where the
streaming stopped.
Note: this is an incompatible change in the streaming protocol. You might
get an error if you try to stream over timeline switches, if the client is
running 9.3beta1 and the server is more recent. It should be fine after a
reconnect, however.
Reported by Fujii Masao.
|
|
|
|
|
|
|
| |
What we have implemented is a radix tree (or a radix trie or a patricia
trie), but the docs and code comments incorrectly called it a "suffix tree".
Alexander Korotkov
|
|
|
|
|
|
| |
The value is not used anywhere in code, but will
allow future changes to the checksum version
should that become necessary in the future.
|
|
|
|
|
| |
Continue to allow a request for synchronous
checkpoints as a mechanism in case of problems.
|
|
|
|
|
|
|
| |
This saves some memory from each index relcache entry. At least on a 64-bit
machine, it saves just enough to shrink a typical relcache entry's memory
usage from 2k to 1k. That's nice if you have a lot of backends and a lot of
indexes.
|
|
|
|
| |
Per a note from Dickson S. Guedes.
|
|
|
|
|
|
|
| |
KeepLogSeg function was broken when we switched to use a 64-bit int for the
segment number.
Per report from Jeff Janes.
|
|
|
|
|
|
|
| |
Heikki reported comment was wrong, so fixed
code to match the comment: we only need to
take additional locking precautions when we
have a shared lock on the buffer.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We copy the buffer before inserting an XLOG_HINT to avoid WAL CRC errors
caused by concurrent hint writes to buffer while share locked. To make this work
we refactor RestoreBackupBlock() to allow an XLOG_HINT to avoid the normal
path for backup blocks, which assumes the underlying buffer is exclusive locked.
Resulting code completely changes layout of XLOG_HINT WAL records, but
this isn't even beta code, so this is a low impact change.
In passing, avoid taking WALInsertLock for full page writes on checksummed
hints, remove related cruft from XLogInsert() and improve xlog_desc record for
XLOG_HINT.
Andres Freund
Bug report by Fujii Masao, testing by Jeff Janes and Jaime Casanova,
review by Jeff Davis and Simon Riggs. Applied with changes from review
and some comment editing.
|
|
|
|
|
|
|
|
|
| |
In CLUSTER, VACUUM FULL and ALTER TABLE SET TABLESPACE
I erroneously set checksum before log_newpage, which
sets the LSN and invalidates the checksum. So set
checksum immediately *after* log_newpage.
Bug report Fujii Masao, Fix and patch by Jeff Davis
|
|
|
|
|
|
|
|
|
|
|
| |
The pg_start_backup() and pg_stop_backup() functions checked the privileges
of the initially-authenticated user rather than the current user, which is
wrong. For example, a user-defined index function could successfully call
these functions when executed by ANALYZE within autovacuum. This could
allow an attacker with valid but low-privilege database access to interfere
with creation of routine backups. Reported and fixed by Noah Misch.
Security: CVE-2013-1901
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In a heap update, if the old and new tuple were on different pages, and the
new page no longer existed (because it was subsequently truncated away by
vacuum), heap_xlog_update forgot to release the pin on the old buffer. This
bug was introduced by the "Fix multiple problems in WAL replay" patch,
commit 3bbf668de9f1bc172371681e80a4e769b6d014c8 (on master branch).
With full_page_writes=off, this triggered an "incorrect local pin count"
error later in replay, if the old page was vacuumed.
This fixes bug #7969, reported by Yunong Xiao. Backpatch to 9.0, like the
commit that introduced this bug.
|
|
|
|
|
| |
If required, recovery.conf can now be located outside of the data directory.
Server needs read/write permissions on this directory.
|
|
|
|
| |
Daniele Varrazzo
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Checksums are set immediately prior to flush out of shared buffers
and checked when pages are read in again. Hint bit setting will
require full page write when block is dirtied, which causes various
infrastructure changes. Extensive comments, docs and README.
WARNING message thrown if checksum fails on non-all zeroes page;
ERROR thrown but can be disabled with ignore_checksum_failure = on.
Feature enabled by an initdb option, since transition from option off
to option on is long and complex and has not yet been implemented.
Default is not to use checksums.
Checksum used is WAL CRC-32 truncated to 16-bits.
Simon Riggs, Jeff Davis, Greg Smith
Wide input and assistance from many community members. Thank you.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Remove use of PageSetTLI() from all page manipulation functions
and adjust README to indicate change in the way we make changes
to pages. Repurpose those bytes into the pd_checksum field and
explain how that works in comments about page header.
Refactoring ahead of actual feature patch which would make use
of the checksum field, arriving later.
Jeff Davis, with comments and doc changes by Simon Riggs
Direction suggested by Robert Haas; many others providing
review comments.
|
|
|
|
|
|
|
|
|
| |
We had two copies of this function in the backend and libpq, which was
already pretty bogus, but it turns out that we need it in some other
programs that don't use libpq (such as pg_test_fsync). So put it where
it probably should have been all along. The signal-mask-initialization
support in src/backend/libpq/pqsignal.c stays where it is, though, since
we only need that in the backend.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we were about to enter archive recovery after crash recovery, we scanned
the archive for the latest tli history file, and set the recovery target
timeline to that. However, when we actually tried to read the history file,
we would not fetch the file from the archive, because we were not in archive
recovery yet.
To fix, make readTimeLineHistory and existsTimeLineHistory to always fetch
the file from archive if archive recovery is requested, even if we're not in
archive recovery yet.
Backpatch to 9.2. Mitsumasa KONDO
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A materialized view has a rule just like a view and a heap and
other physical properties like a table. The rule is only used to
populate the table, references in queries refer to the
materialized data.
This is a minimal implementation, but should still be useful in
many cases. Currently data is only populated "on demand" by the
CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW statements.
It is expected that future releases will add incremental updates
with various timings, and that a more refined concept of defining
what is "fresh" data will be developed. At some point it may even
be possible to have queries use a materialized in place of
references to underlying tables, but that requires the other
above-mentioned features to be working first.
Much of the documentation work by Robert Haas.
Review by Noah Misch, Thom Brown, Robert Haas, Marko Tiikkaja
Security review by KaiGai Kohei, with a decision on how best to
implement sepgsql still pending.
|
|
|
|
|
|
|
|
|
|
|
|
| |
fmgr_sql had been designed on the assumption that the FmgrInfo it's called
with has only query lifespan. This is demonstrably unsafe in connection
with range types, as shown in bug #7881 from Andrew Gierth. Fix things
so that we re-generate the function's cache data if the (sub)transaction
it was made in is no longer active.
Back-patch to 9.2. This might be needed further back, but it's not clear
whether the case can realistically arise without range types, so for now
I'll desist from back-patching further.
|
|
|
|
|
|
| |
We must still initialize minRecoveryPoint if we start straight with archive
recovery, e.g when recovering from a normal base backup taken with
pg_start/stop_backup. Otherwise we never consider the system consistent.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If you create a base backup using an atomic filesystem snapshot, and try to
perform PITR starting from that base backup, or if you just kill a master
server and create recovery.conf to put it into standby mode, we don't know
how far we need to recover before reaching consistency. Normally in crash
recovery, we replay all the WAL present in pg_xlog, and assume that we're
consistent after that. And normally in archive recovery, minRecoveryPoint,
backupEndRequired, or backupEndPoint is set in the control file, indicating
how far we need to replay to reach consistency. But if the server was
previously up and running normally, and you kill -9 it or take an atomic
filesystem snapshot, none of those fields are set in the control file.
The solution is to perform crash recovery first, replaying all the WAL in
pg_xlog. After that's done, we assume that the system is consistent like in
normal crash recovery, and switch to archive recovery mode after that.
Per report from Kyotaro HORIGUCHI. In his scenario, recovery.conf was
created after "pg_ctl stop -m i". I'm not sure we need to support that exact
scenario, but we should support backing up using a filesystem snapshot,
which looks identical.
This issue goes back to at least 9.0, where hot standby was introduced and
we started to track when consistency is reached. In 9.1 and 9.2, we would
open up for hot standby too early, and queries could briefly see an
inconsistent state. But 9.2 made it more visible, as we started to PANIC if
we see a reference to a non-existing page during recovery, if we've already
reached consistency. This is a fairly big patch, so back-patch to 9.2 only,
where the issue is more visible. We can consider back-patching further after
this has received some more testing in 9.2 and master.
|
|
|
|
|
|
|
| |
This enables non-backend code, such as pg_xlogdump, to use it easily.
The previous location, in src/backend/catalog/catalog.c, made that
essentially impossible because that file depends on many backend-only
facilities; so this needs to live separately.
|
|
|
|
|
|
|
|
|
|
| |
Revert my earlier fix for the bug that unarchived WAL files get deleted on
crash recovery, commit c9cc7e05c6d82a9781883a016c70d95aa4923122. We create
a .done file for files streamed or restored from archive, so the WAL file
recycling logic used during normal operation works just as well during
archive recovery.
Per Fujii Masao's suggestion.
|
|
|
|
|
|
|
|
|
| |
This is a forward-patch of commit 6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d,
applied to 9.2 back in August. The plan was to do something else in master,
but it looks like it's not going to happen, so let's just apply the 9.2
solution to master as well.
Fujii Masao
|
|
|
|
|
| |
Bug reported by Jehan-Guillaume (ioguix) de Rorthais. This was introduced
with the change to keep WAL files restored from archive in pg_xlog, in 9.2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently it's only possible for loadable modules to get control during
post-commit cleanup of a transaction. That doesn't work too well if they
want to do something that could throw an error; for example, an FDW might
need to issue a remote commit, which could well fail. To improve matters,
extend the existing APIs for XactCallback and SubXactCallback functions
to provide new pre-commit events for this purpose.
The release notes will need to mention that existing callback functions
should be checked to make sure they don't do something unwanted when one
of the new event types occurs. In the examples within our source tree,
contrib/sepgsql was fine but plpgsql had been a bit too cute.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The reason this wasn't supported before was that GiST indexes need an
increasing sequence to detect concurrent page-splits. In a regular WAL-
logged GiST index, the LSN of the page-split record is used for that
purpose, and in a temporary index, we can get away with a backend-local
counter. Neither of those methods works for an unlogged relation.
To provide such an increasing sequence of numbers, create a "fake LSN"
counter that is saved and restored across shutdowns. On recovery, unlogged
relations are blown away, so the counter doesn't need to survive that
either.
Jeevan Chalke, based on discussions with Robert Haas, Tom Lane and me.
|
|
|
|
|
|
|
|
|
|
| |
The intention was to request a regular online checkpoint immediately after
end of recovery, when performing "fast promotion". However, because the
checkpoint was requested before other backends were allowed to write WAL,
the checkpointer process performed a restartpoint rather than a checkpoint.
Delay the RequestCheckPoint call until after recovery has truly ended, so
that you get a real checkpoint.
|
|
|
|
|
|
| |
This isn't used for anything but a sanity check at the moment, but it could
be highly valuable for debugging purposes. It could also be used to recreate
timeline history by traversing WAL, which seems useful.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After further reflection I was unconvinced that the existing coding is
guaranteed to return valid union datums in every code path for multi-column
indexes. Fix that by forcing a gistunionsubkey() call at the end of the
recursion. Having done that, we can remove some clearly-redundant calls
elsewhere. This should be a little faster for multi-column indexes (since
the previous coding would uselessly do such a call for each column while
unwinding the recursion), as well as much harder to break.
Also, simplify the handling of cases where one side or the other of a
primary split contains only don't-care tuples. The previous coding used a
very ugly hack in removeDontCares() that essentially forced one random
tuple to be treated as non-don't-care, providing a random initial choice of
seed datum for the secondary split. It seems unlikely that that method
will give better-than-random splits. Instead, treat such a split as
degenerate and just let the next column determine the split, the same way
that we handle fully degenerate cases where the two sides produce identical
union datums.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This LOG message was put in over five years ago with the evident
expectation that we'd make all GiST opclasses support secondary split
directly. However, no such thing ever happened, and indeed the number of
opclasses supporting it decreased to zero in 9.2. The reason is that
improving on the default implementation isn't that easy --- the
opclass-specific code that did exist, before 9.2, doesn't appear to have
been any improvement over the default.
Hence, remove the message altogether. There's certainly no point in
nagging users about this in released branches, but I doubt that we'll
ever implement complete opclass-specific support anyway.
|