aboutsummaryrefslogtreecommitdiff
path: root/src/include/access
Commit message (Collapse)AuthorAge
* Allow I/O reliability checks using 16-bit checksumsSimon Riggs2013-03-22
| | | | | | | | | | | | | | | | | | | Checksums are set immediately prior to flush out of shared buffers and checked when pages are read in again. Hint bit setting will require full page write when block is dirtied, which causes various infrastructure changes. Extensive comments, docs and README. WARNING message thrown if checksum fails on non-all zeroes page; ERROR thrown but can be disabled with ignore_checksum_failure = on. Feature enabled by an initdb option, since transition from option off to option on is long and complex and has not yet been implemented. Default is not to use checksums. Checksum used is WAL CRC-32 truncated to 16-bits. Simon Riggs, Jeff Davis, Greg Smith Wide input and assistance from many community members. Thank you.
* Fix tli history file fetching, broken by the archive after crash recevery patch.Heikki Linnakangas2013-03-07
| | | | | | | | | | | | | | If we were about to enter archive recovery after crash recovery, we scanned the archive for the latest tli history file, and set the recovery target timeline to that. However, when we actually tried to read the history file, we would not fetch the file from the archive, because we were not in archive recovery yet. To fix, make readTimeLineHistory and existsTimeLineHistory to always fetch the file from archive if archive recovery is requested, even if we're not in archive recovery yet. Backpatch to 9.2. Mitsumasa KONDO
* Fix SQL function execution to be safe with long-lived FmgrInfos.Tom Lane2013-03-03
| | | | | | | | | | | | fmgr_sql had been designed on the assumption that the FmgrInfo it's called with has only query lifespan. This is demonstrably unsafe in connection with range types, as shown in bug #7881 from Andrew Gierth. Fix things so that we re-generate the function's cache data if the (sub)transaction it was made in is no longer active. Back-patch to 9.2. This might be needed further back, but it's not clear whether the case can realistically arise without range types, so for now I'll desist from back-patching further.
* Need to decorate XactIsoLevel as PGDLLIMPORT for postgres_fdw.Tom Lane2013-02-21
| | | | Per buildfarm.
* Force archive_status of .done for xlogs created by dearchival/replication.Simon Riggs2013-02-15
| | | | | | | | | This is a forward-patch of commit 6f4b8a4f4f7a2d683ff79ab59d3693714b965e3d, applied to 9.2 back in August. The plan was to do something else in master, but it looks like it's not going to happen, so let's just apply the 9.2 solution to master as well. Fujii Masao
* Invent pre-commit/pre-prepare/pre-subcommit events for xact callbacks.Tom Lane2013-02-14
| | | | | | | | | | | | | | Currently it's only possible for loadable modules to get control during post-commit cleanup of a transaction. That doesn't work too well if they want to do something that could throw an error; for example, an FDW might need to issue a remote commit, which could well fail. To improve matters, extend the existing APIs for XactCallback and SubXactCallback functions to provide new pre-commit events for this purpose. The release notes will need to mention that existing callback functions should be checked to make sure they don't do something unwanted when one of the new event types occurs. In the examples within our source tree, contrib/sepgsql was fine but plpgsql had been a bit too cute.
* Support unlogged GiST index.Heikki Linnakangas2013-02-11
| | | | | | | | | | | | | | | The reason this wasn't supported before was that GiST indexes need an increasing sequence to detect concurrent page-splits. In a regular WAL- logged GiST index, the LSN of the page-split record is used for that purpose, and in a temporary index, we can get away with a backend-local counter. Neither of those methods works for an unlogged relation. To provide such an increasing sequence of numbers, create a "fake LSN" counter that is saved and restored across shutdowns. On recovery, unlogged relations are blown away, so the counter doesn't need to survive that either. Jeevan Chalke, based on discussions with Robert Haas, Tom Lane and me.
* Include previous TLI in end-of-recovery and shutdown checkpoint records.Heikki Linnakangas2013-02-11
| | | | | | This isn't used for anything but a sanity check at the moment, but it could be highly valuable for debugging purposes. It could also be used to recreate timeline history by traversing WAL, which seems useful.
* Document and clean up gistsplit.c.Tom Lane2013-02-10
| | | | | | | | | | Improve comments, rename some variables and functions, slightly simplify a couple of APIs, in an attempt to make this code readable by people other than its original author. Even though this is essentially just cosmetic, back-patch to all active branches, because otherwise it's going to make back-patching future fixes in this file very painful.
* Repair bugs in GiST page splitting code for multi-column indexes.Tom Lane2013-02-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When considering a non-last column in a multi-column GiST index, gistsplit.c tries to improve on the split chosen by the opclass-specific pickSplit function by considering penalties for the next column. However, there were two bugs in this code: it failed to recompute the union keys for the leftmost index columns, even though these might well change after reassigning tuples; and it included the old union keys in the recomputation for the columns it did recompute, so that those keys couldn't get smaller even if they should. The first problem could result in an invalid index in which searches wouldn't find index entries that are in fact present; the second would make the index less efficient to search. Both of these errors were caused by misuse of gistMakeUnionItVec, whose API was designed in a way that just begged such errors to be made. There is no situation in which it's safe or useful to compute the union keys for a subset of the index columns, and there is no caller that wants any previous union keys to be included in the computation; so the undocumented choice to treat the union keys as in/out rather than pure output parameters is a waste of code as well as being dangerous. Hence, rather than just making a minimal patch, I've changed the API of gistMakeUnionItVec to remove the "startkey" parameter (it now always processes all index columns) and treat the attr/isnull arrays as purely output parameters. In passing, also get rid of a couple of unnecessary and dangerous uses of static variables in gistutil.c. It's remarkable that the one in gistMakeUnionKey hasn't given us portability troubles before now, because in addition to posing a re-entrancy hazard, it was unsafely assuming that a static char[] array would have at least Datum alignment. Per investigation of a trouble report from Tomas Vondra. (There are also some bugs in contrib/btree_gist to be fixed, but that seems like material for a separate patch.) Back-patch to all supported branches.
* Split out list of XLog resource managersAlvaro Herrera2013-02-06
| | | | | | | | | | The new rmgrlist.h header, containing all necessary data about built-in resource managers, allows other pieces of code to access them. In particular, this allows a future pg_xlogdump program to extract rm_desc function pointers, without having to keep a duplicate list of them.
* Provide database object names as separate fields in error messages.Tom Lane2013-01-29
| | | | | | | | | | | | | | | | | | This patch addresses the problem that applications currently have to extract object names from possibly-localized textual error messages, if they want to know for example which index caused a UNIQUE_VIOLATION failure. It adds new error message fields to the wire protocol, which can carry the name of a table, table column, data type, or constraint associated with the error. (Since the protocol spec has always instructed clients to ignore unrecognized field types, this should not create any compatibility problem.) Support for providing these new fields has been added to just a limited set of error reports (mainly, those in the "integrity constraint violation" SQLSTATE class), but we will doubtless add them to more calls in future. Pavel Stehule, reviewed and extensively revised by Peter Geoghegan, with additional hacking by Tom Lane.
* Fast promote mode skips checkpoint at end of recovery.Simon Riggs2013-01-29
| | | | | | | | | | | pg_ctl promote -m fast will skip the checkpoint at end of recovery so that we can achieve very fast failover when the apply delay is low. Write new WAL record XLOG_END_OF_RECOVERY to allow us to switch timeline correctly for downstream log readers. If we skip synchronous end of recovery checkpoint we request a normal spread checkpoint so that the window of re-recovery is low. Simon Riggs and Kyotaro Horiguchi, with input from Fujii Masao. Review by Heikki Linnakangas
* Redefine HEAP_XMAX_IS_LOCKED_ONLYAlvaro Herrera2013-01-24
| | | | | | | | | | | | | | Tuples marked SELECT FOR UPDATE in a cluster that's later processed by pg_upgrade would have a different infomask bit pattern than those produced by 9.3dev; that bit pattern was being seen as "dead" by HEAD (because they would fail the "is this tuple locked" test, and so the visibility rules would thing they're updated, even though there's no HEAP_UPDATED version of them). In other words, some rows could silently disappear after pg_upgrade. With this new definition, those tuples become visible again. This is breakage resulting from my commit 0ac5ad5134.
* Improve concurrency of foreign key lockingAlvaro Herrera2013-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces two additional lock modes for tuples: "SELECT FOR KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each other, in contrast with already existing "SELECT FOR SHARE" and "SELECT FOR UPDATE". UPDATE commands that do not modify the values stored in the columns that are part of the key of the tuple now grab a SELECT FOR NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently with tuple locks of the FOR KEY SHARE variety. Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this means the concurrency improvement applies to them, which is the whole point of this patch. The added tuple lock semantics require some rejiggering of the multixact module, so that the locking level that each transaction is holding can be stored alongside its Xid. Also, multixacts now need to persist across server restarts and crashes, because they can now represent not only tuple locks, but also tuple updates. This means we need more careful tracking of lifetime of pg_multixact SLRU files; since they now persist longer, we require more infrastructure to figure out when they can be removed. pg_upgrade also needs to be careful to copy pg_multixact files over from the old server to the new, or at least part of multixact.c state, depending on the versions of the old and new servers. Tuple time qualification rules (HeapTupleSatisfies routines) need to be careful not to consider tuples with the "is multi" infomask bit set as being only locked; they might need to look up MultiXact values (i.e. possibly do pg_multixact I/O) to find out the Xid that updated a tuple, whereas they previously were assured to only use information readily available from the tuple header. This is considered acceptable, because the extra I/O would involve cases that would previously cause some commands to block waiting for concurrent transactions to finish. Another important change is the fact that locking tuples that have previously been updated causes the future versions to be marked as locked, too; this is essential for correctness of foreign key checks. This causes additional WAL-logging, also (there was previously a single WAL record for a locked tuple; now there are as many as updated copies of the tuple there exist.) With all this in place, contention related to tuples being checked by foreign key rules should be much reduced. As a bonus, the old behavior that a subtransaction grabbing a stronger tuple lock than the parent (sub)transaction held on a given tuple and later aborting caused the weaker lock to be lost, has been fixed. Many new spec files were added for isolation tester framework, to ensure overall behavior is sane. There's probably room for several more tests. There were several reviewers of this patch; in particular, Noah Misch and Andres Freund spent considerable time in it. Original idea for the patch came from Simon Riggs, after a problem report by Joel Jacobson. Most code is from me, with contributions from Marti Raudsepp, Alexander Shulgin, Noah Misch and Andres Freund. This patch was discussed in several pgsql-hackers threads; the most important start at the following message-ids: AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com 1290721684-sup-3951@alvh.no-ip.org 1294953201-sup-2099@alvh.no-ip.org 1320343602-sup-2290@alvh.no-ip.org 1339690386-sup-8927@alvh.no-ip.org 4FE5FF020200002500048A3D@gw.wicourts.gov 4FEAB90A0200002500048B7D@gw.wicourts.gov
* Fix more issues with cascading replication and timeline switches.Heikki Linnakangas2013-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | When a standby server follows the master using WAL archive, and it chooses a new timeline (recovery_target_timeline='latest'), it only fetches the timeline history file for the chosen target timeline, not any other history files that might be missing from pg_xlog. For example, if the current timeline is 2, and we choose 4 as the new recovery target timeline, the history file for timeline 3 is not fetched, even if it's part of this server's history. That's enough for the standby itself - the history file for timeline 4 includes timeline 3 as well - but if a cascading standby server wants to recover to timeline 3, it needs the history file. To fix, when a new recovery target timeline is chosen, try to copy any missing history files from the archive to pg_xlog between the old and new target timeline. A second similar issue was with the WAL files. When a standby recovers from archive, and it reaches a segment that contains a switch to a new timeline, recovery fetches only the WAL file labelled with the new timeline's ID. The file from the new timeline contains a copy of the WAL from the old timeline up to the point where the switch happened, and recovery recovers it from the new file. But in streaming replication, walsender only tries to read it from the old timeline's file. To fix, change walsender to read it from the new file, so that it behaves the same as recovery in that sense, and doesn't try to open the possibly nonexistent file with the old timeline's ID.
* Use the right timeline when beginning to stream from master.Heikki Linnakangas2013-01-18
| | | | | | | | | | | | | | | | | | | | | The xlogreader refactoring broke the logic to decide which timeline to start streaming from. XLogPageRead() uses the timeline history to check which timeline the requested WAL position falls into. However, after the refactoring, XLogPageRead() is always first called with the first page in the segment, to verify the segment header, and only then with the actual WAL position we're interested in. That first read of the segment's header made XLogPageRead() to always start streaming from the old timeline containing the segment header, not the timeline containing the actual record, if there was a timeline switch within the segment. I thought I fixed this yesterday, but that fix was too narrow and only fixed this for the corner-case that the timeline switch happened in the first page of the segment. To fix this more robustly, pass explicitly the position of the record we're actually interested in to XLogPageRead, and use that to decide which timeline to read from, rather than deduce it from the page and offset. Per report from Fujii Masao.
* Make pg_receivexlog and pg_basebackup -X stream work across timeline switches.Heikki Linnakangas2013-01-17
| | | | | | | | | | | | | | | | | | | | | | This mirrors the changes done earlier to the server in standby mode. When receivelog reaches the end of a timeline, as reported by the server, it fetches the timeline history file of the next timeline, and restarts streaming from the new timeline by issuing a new START_STREAMING command. When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the last segment on the old timeline. This helps you to tell apart a partial segment left in the directory because of a timeline switch, and a completed segment. If you just follow a single server, it won't make a difference, but it can be significant in more complicated scenarios where new WAL is still generated on the old timeline. This includes two small changes to the streaming replication protocol: First, when you reach the end of timeline while streaming, the server now sends the TLI of the next timeline in the server's history to the client. pg_receivexlog uses that as the next timeline, so that it doesn't need to parse the timeline history file like a standby server does. Second, when BASE_BACKUP command sends the begin and end WAL positions, it now also sends the timeline IDs corresponding the positions.
* Make GiST indexes on-disk compatible with 9.2 again.Heikki Linnakangas2013-01-17
| | | | | | | | | | | The patch that turned XLogRecPtr into a uint64 inadvertently changed the on-disk format of GiST indexes, because the NSN field in the GiST page opaque is an XLogRecPtr. That breaks pg_upgrade. Revert the format of that field back to the two-field struct that XLogRecPtr was before. This is the same we did to LSNs in the page header to avoid changing on-disk format. Bump catversion, as this invalidates any existing GiST indexes built on 9.3devel.
* Split out XLog reading as an independent facilityAlvaro Herrera2013-01-16
| | | | | | | | | | | | | | | | | This new facility can not only be used by xlog.c to carry out crash recovery, but also by external programs. By supplying a function to read XLog pages from somewhere, all the WAL reading can be used for completely different purposes. For the standard backend use, the behavior should be pretty much the same as previously. As for non-backend programs, an hypothetical pg_xlogdump program is now closer to reality, but some more backend support is still necessary. This patch was originally submitted by Andres Freund in a different form, but Heikki Linnakangas opted for and authored another design of the concept. Andres has advanced the patch since Heikki's initial version. Review and some (mostly cosmetics) changes by me.
* Redesign the planner's handling of index-descent cost estimation.Tom Lane2013-01-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically we've used a couple of very ad-hoc fudge factors to try to get the right results when indexes of different sizes would satisfy a query with the same number of index leaf tuples being visited. In commit 21a39de5809cd3050a37d2554323cc1d0cbeed9d I tweaked one of these fudge factors, with results that proved disastrous for larger indexes. Commit bf01e34b556ff37982ba2d882db424aa484c0d07 fudged it some more, but still with not a lot of principle behind it. What seems like a better way to address these issues is to explicitly model index-descent costs, since that's what's really at stake when considering diferent indexes with similar leaf-page-level costs. We tried that once long ago, and found that charging random_page_cost per page descended through was way too much, because upper btree levels tend to stay in cache in real-world workloads. However, there's still CPU costs to think about, and the previous fudge factors can be seen as a crude attempt to account for those costs. So this patch replaces those fudge factors with explicit charges for the number of tuple comparisons needed to descend the index tree, plus a small charge per page touched in the descent. The cost multipliers are chosen so that the resulting charges are in the vicinity of the historical (pre-9.2) fudge factors for indexes of up to about a million tuples, while not ballooning unreasonably beyond that, as the old fudge factor did (even more so in 9.2). To make this work accurately for btree indexes, add some code that allows extraction of the known root-page height from a btree. There's no equivalent number readily available for other index types, but we can use the log of the number of index pages as an approximate substitute. This seems like too much of a behavioral change to risk back-patching, but it should improve matters going forward. In 9.2 I'll just revert the fudge-factor change.
* Tolerate timeline switches while "pg_basebackup -X fetch" is running.Heikki Linnakangas2013-01-03
| | | | | | | | | | | | | | | | | | | | | | If you take a base backup from a standby server with "pg_basebackup -X fetch", and the timeline switches while the backup is being taken, the backup used to fail with an error "requested WAL segment %s has already been removed". This is because the server-side code that sends over the required WAL files would not construct the WAL filename with the correct timeline after a switch. Fix that by using readdir() to scan pg_xlog for all the WAL segments in the range, regardless of timeline. Also, include all timeline history files in the backup, if taken with "-X fetch". That fixes another related bug: If a timeline switch happened just before the backup was initiated in a standby, the WAL segment containing the initial checkpoint record contains WAL from the older timeline too. Recovery will not accept that without a timeline history file that lists the older timeline. Backpatch to 9.2. Versions prior to that were not affected as you could not take a base backup from a standby before 9.2.
* Update copyrights for 2013Bruce Momjian2013-01-01
| | | | | Fully update git head, and update back branches in ./COPYRIGHT and legal.sgml files.
* Keep timeline history files restored from archive in pg_xlog.Heikki Linnakangas2012-12-30
| | | | | | | | | | | | | | | | | | | | | | | The cascading standby patch in 9.2 changed the way WAL files are treated when restored from the archive. Before, they were restored under a temporary filename, and not kept in pg_xlog, but after the patch, they were copied under pg_xlog. This is necessary for a cascading standby to find them, but it also means that if the archive goes offline and a standby is restarted, it can recover back to where it was using the files in pg_xlog. It also means that if you take an offline backup from a standby server, it includes all the required WAL files in pg_xlog. However, the same change was not made to timeline history files, so if the WAL segment containing the checkpoint record contains a timeline switch, you will still get an error if you try to restart recovery without the archive, or recover from an offline backup taken from the standby. With this patch, timeline history files restored from archive are copied into pg_xlog like WAL files are, so that pg_xlog contains all the files required to recover. This is a corner-case pre-existing issue in 9.2, but even more important in master where it's possible for a standby to follow a timeline switch through streaming replication. To make that possible, the timeline history files must be present in pg_xlog.
* Remove obsolete XLogRecPtr macrosAlvaro Herrera2012-12-28
| | | | | | | | | | | | | | | | | This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance. These were useful for brevity when XLogRecPtrs were split in xlogid/xrecoff; but now that they are simple uint64's, they are just clutter. The only downside to making this change would be ease of backporting patches, but that has been negated by other substantive changes to the involved code anyway. The clarity of simpler expressions makes the change worthwhile. Most of the changes are mechanical, but in a couple of places, the patch author chose to invert the operator sense, making the code flow more logical (and more in line with preceding comments). Author: Andres Freund Eyeballed by Dimitri Fontaine and Alvaro Herrera
* Remove unused NextLogPage macroAlvaro Herrera2012-12-27
| | | | | | | Commit 061e7efb1b did away with its last caller, but neglected to remove the actual definition. Author: Andres Freund
* Forgot to remove extern declaration of GetRecoveryTargetTLI()Heikki Linnakangas2012-12-21
| | | | Fujii Masao
* Follow TLI of last replayed record, not recovery target TLI, in walsenders.Heikki Linnakangas2012-12-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the time, the last replayed record comes from the recovery target timeline, but there is a corner case where it makes a difference. When the startup process scans for a new timeline, and decides to change recovery target timeline, there is a window where the recovery target TLI has already been bumped, but there are no WAL segments from the new timeline in pg_xlog yet. For example, if we have just replayed up to point 0/30002D8, on timeline 1, there is a WAL file called 000000010000000000000003 in pg_xlog that contains the WAL up to that point. When recovery switches recovery target timeline to 2, a walsender can immediately try to read WAL from 0/30002D8, from timeline 2, so it will try to open WAL file 000000020000000000000003. However, that doesn't exist yet - the startup process hasn't copied that file from the archive yet nor has the walreceiver streamed it yet, so walsender fails with error "requested WAL segment 000000020000000000000003 has already been removed". That's harmless, in that the standby will try to reconnect later and by that time the segment is already created, but error messages that should be ignored are not good. To fix that, have walsender track the TLI of the last replayed record, instead of the recovery target timeline. That way walsender will not try to read anything from timeline 2, until the WAL segment has been created and at least one record has been replayed from it. The recovery target timeline is now xlog.c's internal affair, it doesn't need to be exposed in shared memory anymore. This fixes the error reported by Thom Brown. depesz the same error message, but I'm not sure if this fixes his scenario.
* Allow a streaming replication standby to follow a timeline switch.Heikki Linnakangas2012-12-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before this patch, streaming replication would refuse to start replicating if the timeline in the primary doesn't exactly match the standby. The situation where it doesn't match is when you have a master, and two standbys, and you promote one of the standbys to become new master. Promoting bumps up the timeline ID, and after that bump, the other standby would refuse to continue. There's significantly more timeline related logic in streaming replication now. First of all, when a standby connects to primary, it will ask the primary for any timeline history files that are missing from the standby. The missing files are sent using a new replication command TIMELINE_HISTORY, and stored in standby's pg_xlog directory. Using the timeline history files, the standby can follow the latest timeline present in the primary (recovery_target_timeline='latest'), just as it can follow new timelines appearing in an archive directory. START_REPLICATION now takes a TIMELINE parameter, to specify exactly which timeline to stream WAL from. This allows the standby to request the primary to send over WAL that precedes the promotion. The replication protocol is changed slightly (in a backwards-compatible way although there's little hope of streaming replication working across major versions anyway), to allow replication to stop when the end of timeline reached, putting the walsender back into accepting a replication command. Many thanks to Amit Kapila for testing and reviewing various versions of this patch.
* Make xlog_internal.h includable in frontend context.Heikki Linnakangas2012-12-13
| | | | | | | This makes unnecessary the ugly hack used to #include postgres.h in pg_basebackup. Based on Alvaro Herrera's patch
* Write exact xlog position of timeline switch in the timeline history file.Heikki Linnakangas2012-12-04
| | | | | | | This allows us to do some more rigorous sanity checking for various incorrect point-in-time recovery scenarios, and provides more information for debugging purposes. It will also come handy in the upcoming patch to allow timeline switches to be replicated by streaming replication.
* Reduce scope of changes for COPY FREEZE.Simon Riggs2012-12-02
| | | | | | | | Allow support only for freezing tuples by explicit command. Previous coding mistakenly extended slightly beyond what was agreed as correct on -hackers. So essentially a partial revoke of earlier work, leaving just the COPY FREEZE command.
* COPY FREEZE and mark committed on fresh tables.Simon Riggs2012-12-01
| | | | | | | | | | | | | | | When a relfilenode is created in this subtransaction or a committed child transaction and it cannot otherwise be seen by our own process, mark tuples committed ahead of transaction commit for all COPY commands in same transaction. If FREEZE specified on COPY and pre-conditions met then rows will also be frozen. Both options designed to avoid revisiting rows after commit, increasing performance of subsequent commands after data load and upgrade. pg_restore changes later. Simon Riggs, review comments from Heikki Linnakangas, Noah Misch and design input from Tom Lane, Robert Haas and Kevin Grittner
* Split out rmgr rm_desc functions into their own filesAlvaro Herrera2012-11-28
| | | | | This is necessary (but not sufficient) to have them compilable outside of a backend environment.
* Fix archive_cleanup_command.Heikki Linnakangas2012-11-19
| | | | | | | | | | | | | | | | When I moved ExecuteRecoveryCommand() from xlog.c to xlogarchive.c, I didn't realize that it's called from the checkpoint process, not the startup process. I tried to use InRedo variable to decide whether or not to attempt cleaning up the archive (must not do so before we have read the initial checkpoint record), but that variable is only valid within the startup process. Instead, let ExecuteRecoveryCommand() always clean up the archive, and add an explicit argument to RestoreArchivedFile() to say whether that's allowed or not. The caller knows better. Reported by Erik Rijkers, diagnosis by Fujii Masao. Only 9.3devel is affected.
* Fix multiple problems in WAL replay.Tom Lane2012-11-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the replay functions for WAL record types that modify more than one page failed to ensure that those pages were locked correctly to ensure that concurrent queries could not see inconsistent page states. This is a hangover from coding decisions made long before Hot Standby was added, when it was hardly necessary to acquire buffer locks during WAL replay at all, let alone hold them for carefully-chosen periods. The key problem was that RestoreBkpBlocks was written to hold lock on each page restored from a full-page image for only as long as it took to update that page. This was guaranteed to break any WAL replay function in which there was any update-ordering constraint between pages, because even if the nominal order of the pages is the right one, any mixture of full-page and non-full-page updates in the same record would result in out-of-order updates. Moreover, it wouldn't work for situations where there's a requirement to maintain lock on one page while updating another. Failure to honor an update ordering constraint in this way is thought to be the cause of bug #7648 from Daniel Farina: what seems to have happened there is that a btree page being split was rewritten from a full-page image before the new right sibling page was written, and because lock on the original page was not maintained it was possible for hot standby queries to try to traverse the page's right-link to the not-yet-existing sibling page. To fix, get rid of RestoreBkpBlocks as such, and instead create a new function RestoreBackupBlock that restores just one full-page image at a time. This function can be invoked by WAL replay functions at the points where they would otherwise perform non-full-page updates; in this way, the physical order of page updates remains the same no matter which pages are replaced by full-page images. We can then further adjust the logic in individual replay functions if it is necessary to hold buffer locks for overlapping periods. A side benefit is that we can simplify the handling of concurrency conflict resolution by moving that code into the record-type-specfic functions; there's no more need to contort the code layout to keep conflict resolution in front of the RestoreBkpBlocks call. In connection with that, standardize on zero-based numbering rather than one-based numbering for referencing the full-page images. In HEAD, I removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are still there in the header files in previous branches, but are no longer used by the code. In addition, fix some other bugs identified in the course of making these changes: spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple and we're not doing a full-page image: it would get fooled by the LSN having been advanced already. This would result in permanent index corruption, not just transient failure of concurrent queries. Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption. heap_xlog_freeze() was inconsistent about using a cleanup lock or plain exclusive lock: it did the former in the normal path but the latter for a full-page image. A plain exclusive lock seems sufficient, so change to that. Also, remove gistRedoPageDeleteRecord(), which has been dead code since VACUUM FULL was rewritten. Back-patch to 9.0, where hot standby was introduced. Note however that 9.0 had a significantly different WAL-logging scheme for GIST index updates, and it doesn't appear possible to make that scheme safe for concurrent hot standby queries, because it can leave inconsistent states in the index even between WAL records. Given the lack of complaints from the field, we won't work too hard on fixing that branch.
* Throw error if expiring tuple is again updated or deleted.Kevin Grittner2012-10-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This prevents surprising behavior when a FOR EACH ROW trigger BEFORE UPDATE or BEFORE DELETE directly or indirectly updates or deletes the the old row. Prior to this patch the requested action on the row could be silently ignored while all triggered actions based on the occurence of the requested action could be committed. One example of how this could happen is if the BEFORE DELETE trigger for a "parent" row deleted "children" which had trigger functions to update summary or status data on the parent. This also prevents similar surprising problems if the query has a volatile function which updates a target row while it is already being updated. There are related issues present in FOR UPDATE cursors and READ COMMITTED queries which are not handled by this patch. These issues need further evalution to determine what change, if any, is needed. Where the new error messages are generated, in most cases the best fix will be to move code from the BEFORE trigger to an AFTER trigger. Where this is not feasible, the trigger can avoid the error by re-issuing the triggering statement and returning NULL. Documentation changes will be submitted in a separate patch. Kevin Grittner and Tom Lane with input from Florian Pflug and Robert Haas, based on problems encountered during conversion of Wisconsin Circuit Court trigger logic to plpgsql triggers.
* Split off functions related to timeline history files and XLOG archiving.Heikki Linnakangas2012-10-02
| | | | | | This is just refactoring, to make the functions accessible outside xlog.c. A followup patch will make use of that, to allow fetching timeline history files over streaming replication.
* Fix btmarkpos/btrestrpos to handle array keys.Tom Lane2012-09-27
| | | | | | | | This fixes another error in commit 9e8da0f75731aaa7605cf4656c21ea09e84d2eb1. I neglected to make the mark/restore functionality save and restore the current set of array key values, which led to strange behavior if an IndexScan with ScalarArrayOpExpr quals was used as the inner side of a mergejoin. Per bug #7570 from Melese Tesfaye.
* Trim spgist_private.h inclusionAlvaro Herrera2012-09-05
| | | | It doesn't really need rel.h; relcache.h is enough.
* Fix bugs in cascading replication with recovery_target_timeline='latest'Heikki Linnakangas2012-09-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cascading replication code assumed that the current RecoveryTargetTLI never changes, but that's not true with recovery_target_timeline='latest'. The obvious upshot of that is that RecoveryTargetTLI in shared memory needs to be protected by a lock. A less obvious consequence is that when a cascading standby is connected, and the standby switches to a new target timeline after scanning the archive, it will continue to stream WAL to the cascading standby, but from a wrong file, ie. the file of the previous timeline. For example, if the standby is currently streaming from the middle of file 000000010000000000000005, and the timeline changes, the standby will continue to stream from that file. However, the WAL on the new timeline is in file 000000020000000000000005, so the standby sends garbage from 000000010000000000000005 to the cascading standby, instead of the correct WAL from file 000000020000000000000005. This also fixes a related bug where a partial WAL segment is restored from the archive and streamed to a cascading standby. The code assumed that when a WAL segment is copied from the archive, it can immediately be fully streamed to a cascading standby. However, if the segment is only partially filled, ie. has the right size, but only N first bytes contain valid WAL, that's not safe. That can happen if a partial WAL segment is manually copied to the archive, or if a partial WAL segment is archived because a server is started up on a new timeline within that segment. The cascading standby will get confused if the WAL it received is not valid, and will get stuck until it's restarted. This patch fixes that problem by not allowing WAL restored from the archive to be streamed to a cascading standby until it's been replayed, and thus validated.
* Split tuple struct defs from htup.h to htup_details.hAlvaro Herrera2012-08-30
| | | | | | | | | | | | This reduces unnecessary exposure of other headers through htup.h, which is very widely included by many files. I have chosen to move the function prototypes to the new file as well, because that means htup.h no longer needs to include tupdesc.h. In itself this doesn't have much effect in indirect inclusion of tupdesc.h throughout the tree, because it's also required by execnodes.h; but it's something to explore in the future, and it seemed best to do the htup.h change now while I'm busy with it.
* Split heapam_xlog.h from heapam.hAlvaro Herrera2012-08-28
| | | | | | | | | | | | The heapam XLog functions are used by other modules, not all of which are interested in the rest of the heapam API. With this, we let them get just the XLog stuff in which they are interested and not pollute them with unrelated includes. Also, since heapam.h no longer requires xlog.h, many files that do include heapam.h no longer get xlog.h automatically, including a few headers. This is useful because heapam.h is getting pulled in by execnodes.h, which is in turn included by a lot of files.
* Fix GiST buffering build bug, which caused "failed to re-find parent" errors.Heikki Linnakangas2012-08-16
| | | | | | | | | | | | | | | We use a hash table to track the parents of inner pages, but when inserting to a leaf page, the caller of gistbufferinginserttuples() must pass a correct block number of the leaf's parent page. Before gistProcessItup() descends to a child page, it checks if the downlink needs to be adjusted to accommodate the new tuple, and updates the downlink if necessary. However, updating the downlink might require splitting the page, which might move the downlink to a page to the right. gistProcessItup() doesn't realize that, so when it descends to the leaf page, it might pass an out-of-date parent block number as a result. Fix that by returning the block a tuple was inserted to from gistbufferinginserttuples(). This fixes the bug reported by Zdeněk Jílovec.
* Fix race conditions associated with SPGiST redirection tuples.Tom Lane2012-08-02
| | | | | | | | | | | | | | | The correct test for whether a redirection tuple is removable is whether tuple's xid < RecentGlobalXmin, not OldestXmin; the previous coding failed to protect index searches being done in concurrent transactions that have no XID. This mirrors the recent fix in btree's page recycling logic made in commit d3abbbebe52eb1e59e621c880ad57df9d40d13f2. Also, WAL-log the newest XID of any removed redirection tuple on an index page, and apply ResolveRecoveryConflictWithSnapshot during InHotStandby WAL replay. This protects against concurrent Hot Standby transactions possibly needing to see the redirection tuple(s). Per my query of 2012-03-12 and subsequent discussion.
* Tighten up includes in sinvaladt.h, twophase.h, proc.hAlvaro Herrera2012-06-25
| | | | | Remove proc.h from sinvaladt.h and twophase.h; also replace xlog.h in proc.h with xlogdefs.h.
* Remove sanity test in XRecOffIsValid.Robert Haas2012-06-25
| | | | | | | | | Commit 061e7efb1b4c5b8a5d02122b7780531b8d5bf23d changed the rules for splitting xlog records across pages, but neglected to update this test. It's possible that there's some better action here than just removing the test completely, but this at least appears to get some of the things that are currently broken (like initdb on MacOS X) working again.
* Replace int2/int4 in C code with int16/int32Peter Eisentraut2012-06-25
| | | | | | | | | | The latter was already the dominant use, and it's preferable because in C the convention is that intXX means XX bits. Therefore, allowing mixed use of int2, int4, int8, int16, int32 is obviously confusing. Remove the typedefs for int2 and int4 for now. They don't seem to be widely used outside of the PostgreSQL source tree, and the few uses can probably be cleaned up by the time this ships.
* Use UINT64CONST for 64-bit integer constants.Heikki Linnakangas2012-06-24
| | | | | Peter Eisentraut advised me that UINT64CONST is the proper way to do that, not LL suffix.
* Use LL suffix for 64-bit constants.Heikki Linnakangas2012-06-24
| | | | | Per warning from buildfarm member 'locust'. At least I think this what's making it upset.