| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Checksums are set immediately prior to flush out of shared buffers
and checked when pages are read in again. Hint bit setting will
require full page write when block is dirtied, which causes various
infrastructure changes. Extensive comments, docs and README.
WARNING message thrown if checksum fails on non-all zeroes page;
ERROR thrown but can be disabled with ignore_checksum_failure = on.
Feature enabled by an initdb option, since transition from option off
to option on is long and complex and has not yet been implemented.
Default is not to use checksums.
Checksum used is WAL CRC-32 truncated to 16-bits.
Simon Riggs, Jeff Davis, Greg Smith
Wide input and assistance from many community members. Thank you.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we were about to enter archive recovery after crash recovery, we scanned
the archive for the latest tli history file, and set the recovery target
timeline to that. However, when we actually tried to read the history file,
we would not fetch the file from the archive, because we were not in archive
recovery yet.
To fix, make readTimeLineHistory and existsTimeLineHistory to always fetch
the file from archive if archive recovery is requested, even if we're not in
archive recovery yet.
Backpatch to 9.2. Mitsumasa KONDO
|
|
|
|
|
|
|
|
|
|
|
|
| |
fmgr_sql had been designed on the assumption that the FmgrInfo it's called
with has only query lifespan. This is demonstrably unsafe in connection
with range types, as shown in bug #7881 from Andrew Gierth. Fix things
so that we re-generate the function's cache data if the (sub)transaction
it was made in is no longer active.
Back-patch to 9.2. This might be needed further back, but it's not clear
whether the case can realistically arise without range types, so for now
I'll desist from back-patching further.
|
|
|
|
| |
Per buildfarm.
|
|
|
|
|
|
|
|
|
| |
This is a forward-patch of commit 6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d,
applied to 9.2 back in August. The plan was to do something else in master,
but it looks like it's not going to happen, so let's just apply the 9.2
solution to master as well.
Fujii Masao
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently it's only possible for loadable modules to get control during
post-commit cleanup of a transaction. That doesn't work too well if they
want to do something that could throw an error; for example, an FDW might
need to issue a remote commit, which could well fail. To improve matters,
extend the existing APIs for XactCallback and SubXactCallback functions
to provide new pre-commit events for this purpose.
The release notes will need to mention that existing callback functions
should be checked to make sure they don't do something unwanted when one
of the new event types occurs. In the examples within our source tree,
contrib/sepgsql was fine but plpgsql had been a bit too cute.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The reason this wasn't supported before was that GiST indexes need an
increasing sequence to detect concurrent page-splits. In a regular WAL-
logged GiST index, the LSN of the page-split record is used for that
purpose, and in a temporary index, we can get away with a backend-local
counter. Neither of those methods works for an unlogged relation.
To provide such an increasing sequence of numbers, create a "fake LSN"
counter that is saved and restored across shutdowns. On recovery, unlogged
relations are blown away, so the counter doesn't need to survive that
either.
Jeevan Chalke, based on discussions with Robert Haas, Tom Lane and me.
|
|
|
|
|
|
| |
This isn't used for anything but a sanity check at the moment, but it could
be highly valuable for debugging purposes. It could also be used to recreate
timeline history by traversing WAL, which seems useful.
|
|
|
|
|
|
|
|
|
|
| |
Improve comments, rename some variables and functions, slightly simplify
a couple of APIs, in an attempt to make this code readable by people other
than its original author.
Even though this is essentially just cosmetic, back-patch to all active
branches, because otherwise it's going to make back-patching future fixes
in this file very painful.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When considering a non-last column in a multi-column GiST index,
gistsplit.c tries to improve on the split chosen by the opclass-specific
pickSplit function by considering penalties for the next column. However,
there were two bugs in this code: it failed to recompute the union keys for
the leftmost index columns, even though these might well change after
reassigning tuples; and it included the old union keys in the recomputation
for the columns it did recompute, so that those keys couldn't get smaller
even if they should. The first problem could result in an invalid index
in which searches wouldn't find index entries that are in fact present;
the second would make the index less efficient to search.
Both of these errors were caused by misuse of gistMakeUnionItVec, whose
API was designed in a way that just begged such errors to be made. There
is no situation in which it's safe or useful to compute the union keys for
a subset of the index columns, and there is no caller that wants any
previous union keys to be included in the computation; so the undocumented
choice to treat the union keys as in/out rather than pure output parameters
is a waste of code as well as being dangerous.
Hence, rather than just making a minimal patch, I've changed the API of
gistMakeUnionItVec to remove the "startkey" parameter (it now always
processes all index columns) and treat the attr/isnull arrays as purely
output parameters.
In passing, also get rid of a couple of unnecessary and dangerous uses
of static variables in gistutil.c. It's remarkable that the one in
gistMakeUnionKey hasn't given us portability troubles before now, because
in addition to posing a re-entrancy hazard, it was unsafely assuming that
a static char[] array would have at least Datum alignment.
Per investigation of a trouble report from Tomas Vondra. (There are also
some bugs in contrib/btree_gist to be fixed, but that seems like material
for a separate patch.) Back-patch to all supported branches.
|
|
|
|
|
|
|
|
|
|
| |
The new rmgrlist.h header, containing all necessary data
about built-in resource managers, allows other pieces of code to
access them.
In particular, this allows a future pg_xlogdump program to extract
rm_desc function pointers, without having to keep a duplicate list of
them.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch addresses the problem that applications currently have to
extract object names from possibly-localized textual error messages,
if they want to know for example which index caused a UNIQUE_VIOLATION
failure. It adds new error message fields to the wire protocol, which
can carry the name of a table, table column, data type, or constraint
associated with the error. (Since the protocol spec has always instructed
clients to ignore unrecognized field types, this should not create any
compatibility problem.)
Support for providing these new fields has been added to just a limited set
of error reports (mainly, those in the "integrity constraint violation"
SQLSTATE class), but we will doubtless add them to more calls in future.
Pavel Stehule, reviewed and extensively revised by Peter Geoghegan, with
additional hacking by Tom Lane.
|
|
|
|
|
|
|
|
|
|
|
| |
pg_ctl promote -m fast will skip the checkpoint at end of recovery so that we
can achieve very fast failover when the apply delay is low. Write new WAL record
XLOG_END_OF_RECOVERY to allow us to switch timeline correctly for downstream log
readers. If we skip synchronous end of recovery checkpoint we request a normal
spread checkpoint so that the window of re-recovery is low.
Simon Riggs and Kyotaro Horiguchi, with input from Fujii Masao.
Review by Heikki Linnakangas
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Tuples marked SELECT FOR UPDATE in a cluster that's later processed by
pg_upgrade would have a different infomask bit pattern than those
produced by 9.3dev; that bit pattern was being seen as "dead" by HEAD
(because they would fail the "is this tuple locked" test, and so the
visibility rules would thing they're updated, even though there's no
HEAP_UPDATED version of them). In other words, some rows could silently
disappear after pg_upgrade.
With this new definition, those tuples become visible again.
This is breakage resulting from my commit 0ac5ad5134.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a standby server follows the master using WAL archive, and it chooses
a new timeline (recovery_target_timeline='latest'), it only fetches the
timeline history file for the chosen target timeline, not any other history
files that might be missing from pg_xlog. For example, if the current
timeline is 2, and we choose 4 as the new recovery target timeline, the
history file for timeline 3 is not fetched, even if it's part of this
server's history. That's enough for the standby itself - the history file
for timeline 4 includes timeline 3 as well - but if a cascading standby
server wants to recover to timeline 3, it needs the history file. To fix,
when a new recovery target timeline is chosen, try to copy any missing
history files from the archive to pg_xlog between the old and new target
timeline.
A second similar issue was with the WAL files. When a standby recovers from
archive, and it reaches a segment that contains a switch to a new timeline,
recovery fetches only the WAL file labelled with the new timeline's ID. The
file from the new timeline contains a copy of the WAL from the old timeline
up to the point where the switch happened, and recovery recovers it from the
new file. But in streaming replication, walsender only tries to read it
from the old timeline's file. To fix, change walsender to read it from the
new file, so that it behaves the same as recovery in that sense, and doesn't
try to open the possibly nonexistent file with the old timeline's ID.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The xlogreader refactoring broke the logic to decide which timeline to start
streaming from. XLogPageRead() uses the timeline history to check which
timeline the requested WAL position falls into. However, after the
refactoring, XLogPageRead() is always first called with the first page in
the segment, to verify the segment header, and only then with the actual WAL
position we're interested in. That first read of the segment's header made
XLogPageRead() to always start streaming from the old timeline containing
the segment header, not the timeline containing the actual record, if there
was a timeline switch within the segment.
I thought I fixed this yesterday, but that fix was too narrow and only fixed
this for the corner-case that the timeline switch happened in the first page
of the segment. To fix this more robustly, pass explicitly the position of
the record we're actually interested in to XLogPageRead, and use that to
decide which timeline to read from, rather than deduce it from the page and
offset.
Per report from Fujii Masao.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This mirrors the changes done earlier to the server in standby mode. When
receivelog reaches the end of a timeline, as reported by the server, it
fetches the timeline history file of the next timeline, and restarts
streaming from the new timeline by issuing a new START_STREAMING command.
When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the
last segment on the old timeline. This helps you to tell apart a partial
segment left in the directory because of a timeline switch, and a completed
segment. If you just follow a single server, it won't make a difference, but
it can be significant in more complicated scenarios where new WAL is still
generated on the old timeline.
This includes two small changes to the streaming replication protocol:
First, when you reach the end of timeline while streaming, the server now
sends the TLI of the next timeline in the server's history to the client.
pg_receivexlog uses that as the next timeline, so that it doesn't need to
parse the timeline history file like a standby server does. Second, when
BASE_BACKUP command sends the begin and end WAL positions, it now also sends
the timeline IDs corresponding the positions.
|
|
|
|
|
|
|
|
|
|
|
| |
The patch that turned XLogRecPtr into a uint64 inadvertently changed the
on-disk format of GiST indexes, because the NSN field in the GiST page
opaque is an XLogRecPtr. That breaks pg_upgrade. Revert the format of that
field back to the two-field struct that XLogRecPtr was before. This is the
same we did to LSNs in the page header to avoid changing on-disk format.
Bump catversion, as this invalidates any existing GiST indexes built on
9.3devel.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This new facility can not only be used by xlog.c to carry out crash
recovery, but also by external programs. By supplying a function to
read XLog pages from somewhere, all the WAL reading can be used for
completely different purposes.
For the standard backend use, the behavior should be pretty much the
same as previously. As for non-backend programs, an hypothetical
pg_xlogdump program is now closer to reality, but some more backend
support is still necessary.
This patch was originally submitted by Andres Freund in a different
form, but Heikki Linnakangas opted for and authored another design of
the concept. Andres has advanced the patch since Heikki's initial
version. Review and some (mostly cosmetics) changes by me.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Historically we've used a couple of very ad-hoc fudge factors to try to
get the right results when indexes of different sizes would satisfy a
query with the same number of index leaf tuples being visited. In
commit 21a39de5809cd3050a37d2554323cc1d0cbeed9d I tweaked one of these
fudge factors, with results that proved disastrous for larger indexes.
Commit bf01e34b556ff37982ba2d882db424aa484c0d07 fudged it some more,
but still with not a lot of principle behind it.
What seems like a better way to address these issues is to explicitly model
index-descent costs, since that's what's really at stake when considering
diferent indexes with similar leaf-page-level costs. We tried that once
long ago, and found that charging random_page_cost per page descended
through was way too much, because upper btree levels tend to stay in cache
in real-world workloads. However, there's still CPU costs to think about,
and the previous fudge factors can be seen as a crude attempt to account
for those costs. So this patch replaces those fudge factors with explicit
charges for the number of tuple comparisons needed to descend the index
tree, plus a small charge per page touched in the descent. The cost
multipliers are chosen so that the resulting charges are in the vicinity of
the historical (pre-9.2) fudge factors for indexes of up to about a million
tuples, while not ballooning unreasonably beyond that, as the old fudge
factor did (even more so in 9.2).
To make this work accurately for btree indexes, add some code that allows
extraction of the known root-page height from a btree. There's no
equivalent number readily available for other index types, but we can use
the log of the number of index pages as an approximate substitute.
This seems like too much of a behavioral change to risk back-patching,
but it should improve matters going forward. In 9.2 I'll just revert
the fudge-factor change.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If you take a base backup from a standby server with "pg_basebackup -X
fetch", and the timeline switches while the backup is being taken, the
backup used to fail with an error "requested WAL segment %s has already
been removed". This is because the server-side code that sends over the
required WAL files would not construct the WAL filename with the correct
timeline after a switch.
Fix that by using readdir() to scan pg_xlog for all the WAL segments in the
range, regardless of timeline.
Also, include all timeline history files in the backup, if taken with
"-X fetch". That fixes another related bug: If a timeline switch happened
just before the backup was initiated in a standby, the WAL segment
containing the initial checkpoint record contains WAL from the older
timeline too. Recovery will not accept that without a timeline history file
that lists the older timeline.
Backpatch to 9.2. Versions prior to that were not affected as you could not
take a base backup from a standby before 9.2.
|
|
|
|
|
| |
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The cascading standby patch in 9.2 changed the way WAL files are treated
when restored from the archive. Before, they were restored under a temporary
filename, and not kept in pg_xlog, but after the patch, they were copied
under pg_xlog. This is necessary for a cascading standby to find them, but
it also means that if the archive goes offline and a standby is restarted,
it can recover back to where it was using the files in pg_xlog. It also
means that if you take an offline backup from a standby server, it includes
all the required WAL files in pg_xlog.
However, the same change was not made to timeline history files, so if the
WAL segment containing the checkpoint record contains a timeline switch, you
will still get an error if you try to restart recovery without the archive,
or recover from an offline backup taken from the standby.
With this patch, timeline history files restored from archive are copied
into pg_xlog like WAL files are, so that pg_xlog contains all the files
required to recover. This is a corner-case pre-existing issue in 9.2, but
even more important in master where it's possible for a standby to follow a
timeline switch through streaming replication. To make that possible, the
timeline history files must be present in pg_xlog.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance.
These were useful for brevity when XLogRecPtrs were split in
xlogid/xrecoff; but now that they are simple uint64's, they are just
clutter. The only downside to making this change would be ease of
backporting patches, but that has been negated by other substantive
changes to the involved code anyway. The clarity of simpler expressions
makes the change worthwhile.
Most of the changes are mechanical, but in a couple of places, the patch
author chose to invert the operator sense, making the code flow more
logical (and more in line with preceding comments).
Author: Andres Freund
Eyeballed by Dimitri Fontaine and Alvaro Herrera
|
|
|
|
|
|
|
| |
Commit 061e7efb1b did away with its last caller, but neglected to remove
the actual definition.
Author: Andres Freund
|
|
|
|
| |
Fujii Masao
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most of the time, the last replayed record comes from the recovery target
timeline, but there is a corner case where it makes a difference. When
the startup process scans for a new timeline, and decides to change recovery
target timeline, there is a window where the recovery target TLI has already
been bumped, but there are no WAL segments from the new timeline in pg_xlog
yet. For example, if we have just replayed up to point 0/30002D8, on
timeline 1, there is a WAL file called 000000010000000000000003 in pg_xlog
that contains the WAL up to that point. When recovery switches recovery
target timeline to 2, a walsender can immediately try to read WAL from
0/30002D8, from timeline 2, so it will try to open WAL file
000000020000000000000003. However, that doesn't exist yet - the startup
process hasn't copied that file from the archive yet nor has the walreceiver
streamed it yet, so walsender fails with error "requested WAL segment
000000020000000000000003 has already been removed". That's harmless, in that
the standby will try to reconnect later and by that time the segment is
already created, but error messages that should be ignored are not good.
To fix that, have walsender track the TLI of the last replayed record,
instead of the recovery target timeline. That way walsender will not try to
read anything from timeline 2, until the WAL segment has been created and at
least one record has been replayed from it. The recovery target timeline is
now xlog.c's internal affair, it doesn't need to be exposed in shared memory
anymore.
This fixes the error reported by Thom Brown. depesz the same error message,
but I'm not sure if this fixes his scenario.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before this patch, streaming replication would refuse to start replicating
if the timeline in the primary doesn't exactly match the standby. The
situation where it doesn't match is when you have a master, and two
standbys, and you promote one of the standbys to become new master.
Promoting bumps up the timeline ID, and after that bump, the other standby
would refuse to continue.
There's significantly more timeline related logic in streaming replication
now. First of all, when a standby connects to primary, it will ask the
primary for any timeline history files that are missing from the standby.
The missing files are sent using a new replication command TIMELINE_HISTORY,
and stored in standby's pg_xlog directory. Using the timeline history files,
the standby can follow the latest timeline present in the primary
(recovery_target_timeline='latest'), just as it can follow new timelines
appearing in an archive directory.
START_REPLICATION now takes a TIMELINE parameter, to specify exactly which
timeline to stream WAL from. This allows the standby to request the primary
to send over WAL that precedes the promotion. The replication protocol is
changed slightly (in a backwards-compatible way although there's little hope
of streaming replication working across major versions anyway), to allow
replication to stop when the end of timeline reached, putting the walsender
back into accepting a replication command.
Many thanks to Amit Kapila for testing and reviewing various versions of
this patch.
|
|
|
|
|
|
|
| |
This makes unnecessary the ugly hack used to #include postgres.h in
pg_basebackup.
Based on Alvaro Herrera's patch
|
|
|
|
|
|
|
| |
This allows us to do some more rigorous sanity checking for various
incorrect point-in-time recovery scenarios, and provides more information
for debugging purposes. It will also come handy in the upcoming patch to
allow timeline switches to be replicated by streaming replication.
|
|
|
|
|
|
|
|
| |
Allow support only for freezing tuples by explicit
command. Previous coding mistakenly extended
slightly beyond what was agreed as correct on -hackers.
So essentially a partial revoke of earlier work,
leaving just the COPY FREEZE command.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a relfilenode is created in this subtransaction or
a committed child transaction and it cannot otherwise
be seen by our own process, mark tuples committed ahead
of transaction commit for all COPY commands in same
transaction. If FREEZE specified on COPY
and pre-conditions met then rows will also be frozen.
Both options designed to avoid revisiting rows after commit,
increasing performance of subsequent commands after
data load and upgrade. pg_restore changes later.
Simon Riggs, review comments from Heikki Linnakangas, Noah Misch and design
input from Tom Lane, Robert Haas and Kevin Grittner
|
|
|
|
|
| |
This is necessary (but not sufficient) to have them compilable outside
of a backend environment.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When I moved ExecuteRecoveryCommand() from xlog.c to xlogarchive.c, I didn't
realize that it's called from the checkpoint process, not the startup
process. I tried to use InRedo variable to decide whether or not to attempt
cleaning up the archive (must not do so before we have read the initial
checkpoint record), but that variable is only valid within the startup
process.
Instead, let ExecuteRecoveryCommand() always clean up the archive, and add
an explicit argument to RestoreArchivedFile() to say whether that's allowed
or not. The caller knows better.
Reported by Erik Rijkers, diagnosis by Fujii Masao. Only 9.3devel is
affected.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states. This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.
The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page. This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates. Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another. Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.
To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time. This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images. We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods. A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specfic functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.
In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images. In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are
still there in the header files in previous branches, but are no longer
used by the code.
In addition, fix some other bugs identified in the course of making these
changes:
spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already. This would result in permanent index corruption,
not just transient failure of concurrent queries.
Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.
heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image. A plain exclusive lock seems sufficient, so change to
that.
Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.
Back-patch to 9.0, where hot standby was introduced. Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records. Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This prevents surprising behavior when a FOR EACH ROW trigger
BEFORE UPDATE or BEFORE DELETE directly or indirectly updates or
deletes the the old row. Prior to this patch the requested action
on the row could be silently ignored while all triggered actions
based on the occurence of the requested action could be committed.
One example of how this could happen is if the BEFORE DELETE
trigger for a "parent" row deleted "children" which had trigger
functions to update summary or status data on the parent.
This also prevents similar surprising problems if the query has a
volatile function which updates a target row while it is already
being updated.
There are related issues present in FOR UPDATE cursors and READ
COMMITTED queries which are not handled by this patch. These
issues need further evalution to determine what change, if any, is
needed.
Where the new error messages are generated, in most cases the best
fix will be to move code from the BEFORE trigger to an AFTER
trigger. Where this is not feasible, the trigger can avoid the
error by re-issuing the triggering statement and returning NULL.
Documentation changes will be submitted in a separate patch.
Kevin Grittner and Tom Lane with input from Florian Pflug and
Robert Haas, based on problems encountered during conversion of
Wisconsin Circuit Court trigger logic to plpgsql triggers.
|
|
|
|
|
|
| |
This is just refactoring, to make the functions accessible outside xlog.c.
A followup patch will make use of that, to allow fetching timeline history
files over streaming replication.
|
|
|
|
|
|
|
|
| |
This fixes another error in commit 9e8da0f75731aaa7605cf4656c21ea09e84d2eb1.
I neglected to make the mark/restore functionality save and restore the
current set of array key values, which led to strange behavior if an
IndexScan with ScalarArrayOpExpr quals was used as the inner side of a
mergejoin. Per bug #7570 from Melese Tesfaye.
|
|
|
|
| |
It doesn't really need rel.h; relcache.h is enough.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The cascading replication code assumed that the current RecoveryTargetTLI
never changes, but that's not true with recovery_target_timeline='latest'.
The obvious upshot of that is that RecoveryTargetTLI in shared memory needs
to be protected by a lock. A less obvious consequence is that when a
cascading standby is connected, and the standby switches to a new target
timeline after scanning the archive, it will continue to stream WAL to the
cascading standby, but from a wrong file, ie. the file of the previous
timeline. For example, if the standby is currently streaming from the middle
of file 000000010000000000000005, and the timeline changes, the standby
will continue to stream from that file. However, the WAL on the new
timeline is in file 000000020000000000000005, so the standby sends garbage
from 000000010000000000000005 to the cascading standby, instead of the
correct WAL from file 000000020000000000000005.
This also fixes a related bug where a partial WAL segment is restored from
the archive and streamed to a cascading standby. The code assumed that when
a WAL segment is copied from the archive, it can immediately be fully
streamed to a cascading standby. However, if the segment is only partially
filled, ie. has the right size, but only N first bytes contain valid WAL,
that's not safe. That can happen if a partial WAL segment is manually copied
to the archive, or if a partial WAL segment is archived because a server is
started up on a new timeline within that segment. The cascading standby will
get confused if the WAL it received is not valid, and will get stuck until
it's restarted. This patch fixes that problem by not allowing WAL restored
from the archive to be streamed to a cascading standby until it's been
replayed, and thus validated.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reduces unnecessary exposure of other headers through htup.h, which
is very widely included by many files.
I have chosen to move the function prototypes to the new file as well,
because that means htup.h no longer needs to include tupdesc.h. In
itself this doesn't have much effect in indirect inclusion of tupdesc.h
throughout the tree, because it's also required by execnodes.h; but it's
something to explore in the future, and it seemed best to do the htup.h
change now while I'm busy with it.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The heapam XLog functions are used by other modules, not all of which
are interested in the rest of the heapam API. With this, we let them
get just the XLog stuff in which they are interested and not pollute
them with unrelated includes.
Also, since heapam.h no longer requires xlog.h, many files that do
include heapam.h no longer get xlog.h automatically, including a few
headers. This is useful because heapam.h is getting pulled in by
execnodes.h, which is in turn included by a lot of files.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We use a hash table to track the parents of inner pages, but when inserting
to a leaf page, the caller of gistbufferinginserttuples() must pass a
correct block number of the leaf's parent page. Before gistProcessItup()
descends to a child page, it checks if the downlink needs to be adjusted to
accommodate the new tuple, and updates the downlink if necessary. However,
updating the downlink might require splitting the page, which might move the
downlink to a page to the right. gistProcessItup() doesn't realize that, so
when it descends to the leaf page, it might pass an out-of-date parent block
number as a result. Fix that by returning the block a tuple was inserted to
from gistbufferinginserttuples().
This fixes the bug reported by Zdeněk Jílovec.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The correct test for whether a redirection tuple is removable is whether
tuple's xid < RecentGlobalXmin, not OldestXmin; the previous coding
failed to protect index searches being done in concurrent transactions that
have no XID. This mirrors the recent fix in btree's page recycling logic
made in commit d3abbbebe52eb1e59e621c880ad57df9d40d13f2.
Also, WAL-log the newest XID of any removed redirection tuple on an index
page, and apply ResolveRecoveryConflictWithSnapshot during InHotStandby WAL
replay. This protects against concurrent Hot Standby transactions possibly
needing to see the redirection tuple(s).
Per my query of 2012-03-12 and subsequent discussion.
|
|
|
|
|
| |
Remove proc.h from sinvaladt.h and twophase.h; also replace xlog.h in
proc.h with xlogdefs.h.
|
|
|
|
|
|
|
|
|
| |
Commit 061e7efb1b4c5b8a5d02122b7780531b8d5bf23d changed the rules
for splitting xlog records across pages, but neglected to update this
test. It's possible that there's some better action here than just
removing the test completely, but this at least appears to get some
of the things that are currently broken (like initdb on MacOS X)
working again.
|
|
|
|
|
|
|
|
|
|
| |
The latter was already the dominant use, and it's preferable because
in C the convention is that intXX means XX bits. Therefore, allowing
mixed use of int2, int4, int8, int16, int32 is obviously confusing.
Remove the typedefs for int2 and int4 for now. They don't seem to be
widely used outside of the PostgreSQL source tree, and the few uses
can probably be cleaned up by the time this ships.
|
|
|
|
|
| |
Peter Eisentraut advised me that UINT64CONST is the proper way to do that,
not LL suffix.
|
|
|
|
|
| |
Per warning from buildfarm member 'locust'. At least I think this what's
making it upset.
|