aboutsummaryrefslogtreecommitdiff
path: root/src/include/access/xlog_internal.h
Commit message (Collapse)AuthorAge
* Add some const decorations to prototypesPeter Eisentraut2017-11-10
| | | | Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
* Make WAL segment size configurable at initdb time.Andres Freund2017-09-19
| | | | | | | | | | | | | | | | | | | | | | | For performance reasons a larger segment size than the default 16MB can be useful. A larger segment size has two main benefits: Firstly, in setups using archiving, it makes it easier to write scripts that can keep up with higher amounts of WAL, secondly, the WAL has to be written and synced to disk less frequently. But at the same time large segment size are disadvantageous for smaller databases. So far the segment size had to be configured at compile time, often making it unrealistic to choose one fitting to a particularly load. Therefore change it to a initdb time setting. This includes a breaking changes to the xlogreader.h API, which now requires the current segment size to be configured. For that and similar reasons a number of binaries had to be taught how to recognize the current segment size. Author: Beena Emerson, editorialized by Andres Freund Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
* Fix comment.Tatsuo Ishii2017-08-01
| | | | | | | | | XLByteToSeg and XLByteToPrevSeg calculate only a segment number. The definition of these macros were modified by commit dfda6ebaec6763090fb78b458a979b558c50b39b but the comment remain unchanged. Patch by Yugo Nagata. Back patched to 9.3 and beyond.
* Phase 2 of pgindent updates.Tom Lane2017-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
* Replace "transaction log" with "write-ahead log"Peter Eisentraut2017-05-12
| | | | | This makes documentation and error messages match the renaming of "xlog" to "wal" in APIs and file naming.
* Track the oldest XID that can be safely looked up in CLOG.Robert Haas2017-03-23
| | | | | | | | | | | | | | This provides infrastructure for looking up arbitrary, user-supplied XIDs without a risk of scary-looking failures from within the clog module. Normally, the oldest XID that can be safely looked up in CLOG is the same as the oldest XID that can reused without causing wraparound, and the latter is already tracked. However, while truncation is in progress, the values are different, so we must keep track of them separately. Craig Ringer, reviewed by Simon Riggs and by me. Discussion: http://postgr.es/m/CAMsr+YHQiWNEi0daCTboS40T+V5s_+dst3PYv_8v2wNVH+Xx4g@mail.gmail.com
* Fixes for single-page hash index vacuum.Robert Haas2017-03-20
| | | | | | | | | | | | | | Clear LH_PAGE_HAS_DEAD_TUPLES during replay, similar to what gets done for btree. Update hashdesc.c for xl_hash_vacuum_one_page. Oversights in commit 6977b8b7f4dfb40896ff5e2175cad7fdbda862eb spotted by Amit Kapila. Patch by Ashutosh Sharma. Bump WAL version. The original patch to make hash indexes write-ahead logged probably should have done this, and the single page vacuuming patch probably should have done it again, but better late than never. Discussion: http://postgr.es/m/CAA4eK1Kd=mJ9xreovcsh0qMiAj-QqCphHVQ_Lfau1DR9oVjASQ@mail.gmail.com
* Spelling fixesPeter Eisentraut2017-03-14
| | | | From: Josh Soref <jsoref@gmail.com>
* Rename user-facing tools with "xlog" in the name to say "wal".Robert Haas2017-02-09
| | | | | This means pg_receivexlog because pg_receivewal, pg_resetxlog becomes pg_resetwal, and pg_xlogdump becomes pg_waldump.
* Add WAL consistency checking facility.Robert Haas2017-02-08
| | | | | | | | | | | | | | When the new GUC wal_consistency_checking is set to a non-empty value, it triggers recording of additional full-page images, which are compared on the standby against the results of applying the WAL record (without regard to those full-page images). Allowable differences such as hints are masked out, and the resulting pages are compared; any difference results in a FATAL error on the standby. Kuntal Ghosh, based on earlier patches by Michael Paquier and Heikki Linnakangas. Extensively reviewed and revised by Michael Paquier and by me, with additional reviews and comments from Amit Kapila, Álvaro Herrera, Simon Riggs, and Peter Eisentraut.
* Fix race condition in reading commit timestampsAlvaro Herrera2017-01-19
| | | | | | | | | | | | | | | | | | | | | | | | | If a user requests the commit timestamp for a transaction old enough that its data is concurrently being truncated away by vacuum at just the right time, they would receive an ugly internal file-not-found error message from slru.c rather than the expected NULL return value. In a primary server, the window for the race is very small: the lookup has to occur exactly between the two calls by vacuum, and there's not a lot that happens between them (mostly just a multixact truncate). In a standby server, however, the window is larger because the truncation is executed as soon as the WAL record for it is replayed, but the advance of the oldest-Xid is not executed until the next checkpoint record. To fix in the primary, simply reverse the order of operations in vac_truncate_clog. To fix in the standby, augment the WAL truncation record so that the standby is aware of the new oldest-XID value and can apply the update immediately. WAL version bumped because of this. No backpatch, because of the low importance of the bug and its rarity. Author: Craig Ringer Reviewed-By: Petr Jelínek, Peter Eisentraut Discussion: https://postgr.es/m/CAMsr+YFhVtRQT1VAwC+WGbbxZZRzNou=N9Ed-FrCqkwQ8H8oJQ@mail.gmail.com
* Update copyright via script for 2017Bruce Momjian2017-01-03
|
* Skip checkpoints, archiving on idle systems.Andres Freund2016-12-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Some background activity (like checkpoints, archive timeout, standby snapshots) is not supposed to happen on an idle system. Unfortunately so far it was not easy to determine when a system is idle, which defeated some of the attempts to avoid redundant activity on an idle system. To make that easier, allow to make individual WAL insertions as not being "important". By checking whether any important activity happened since the last time an activity was performed, it now is easy to check whether some action needs to be repeated. Use the new facility for checkpoints, archive timeout and standby snapshots. The lack of a facility causes some issues in older releases, but in my opinion the consequences (superflous checkpoints / archived segments) aren't grave enough to warrant backpatching. Author: Michael Paquier, editorialized by Andres Freund Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI Bug: #13685 Discussion: https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com Backpatch: -
* Rename "pg_xlog" directory to "pg_wal".Robert Haas2016-10-20
| | | | | | | | | | | | | | | | | | | | | "xlog" is not a particularly clear abbreviation for "write-ahead log", and it sometimes confuses users into believe that the contents of the "pg_xlog" directory are not critical data, leading to unpleasant consequences. So, rename the directory to "pg_wal". This patch modifies pg_upgrade and pg_basebackup to understand both the old and new directory layouts; the former is necessary given the purpose of the tool, while the latter merely avoids an unnecessary backward-compatibility break. We may wish to consider renaming other programs, switches, and functions which still use the old "xlog" naming to also refer to "wal". However, that's still under discussion, so let's do just this much for now. Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com Michael Paquier
* Clear all-frozen visibilitymap status when locking tuples.Andres Freund2016-07-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since a892234 & fd31cd265 the visibilitymap's freeze bit is used to avoid vacuuming the whole relation in anti-wraparound vacuums. Doing so correctly relies on not adding xids to the heap without also unsetting the visibilitymap flag. Tuple locking related code has not done so. To allow selectively resetting all-frozen - to avoid pessimizing heap_lock_tuple - allow to selectively reset the all-frozen with visibilitymap_clear(). To avoid having to use visibilitymap_get_status (e.g. via VM_ALL_FROZEN) inside a critical section, have visibilitymap_clear() return whether any bits have been reset. There's a remaining issue (denoted by XXX): After the PageIsAllVisible() check in heap_lock_tuple() and heap_lock_updated_tuple_rec() the page status could theoretically change. Practically that currently seems impossible, because updaters will hold a page level pin already. Due to the next beta coming up, it seems better to get the required WAL magic bump done before resolving this issue. The added flags field fields to xl_heap_lock and xl_heap_lock_updated require bumping the WAL magic. Since there's already been a catversion bump since the last beta, that's not an issue. Reviewed-By: Robert Haas, Amit Kapila and Andres Freund Author: Masahiko Sawada, heavily revised by Andres Freund Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com Backpatch: -
* pg_visibility: Add pg_truncate_visibility_map function.Robert Haas2016-06-17
| | | | | | | | This requires some core changes as well so that we can properly WAL-log the truncation. Specifically, it changes the format of the XLOG_SMGR_TRUNCATE WAL record, so bump XLOG_PAGE_MAGIC. Patch by me, reviewed but not fully endorsed by Andres Freund.
* pgindent run for 9.6Robert Haas2016-06-09
|
* Remove trailing commas in enums.Andres Freund2016-04-14
| | | | | These aren't valid C89. Found thanks to gcc's -Wc90-c99-compat. These exist in differing places in most supported branches.
* Add required database and origin filtering for logical messages.Andres Freund2016-04-13
| | | | | | | | | | | | | | | | | Logical messages, added in 3fe3511d05, during decoding failed to filter messages emitted in other databases and messages emitted "under" a replication origin the output plugin isn't interested in. Add tests to verify that both types of filtering actually work. While touching message.sql remove hunk obsoleted by d25379e. Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because 3fe3511d05 had omitted doing so. 3fe3511d05 additionally didn't bump catversion, but 7a542700d has done so since. Author: Petr Jelinek Reported-By: Andres Freund Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de
* Add Generic WAL interfaceTeodor Sigaev2016-04-01
| | | | | | | | | | | | | | | This interface is designed to give an access to WAL for extensions which could implement new access method, for example. Previously it was impossible because restoring from custom WAL would need to access system catalog to find a redo custom function. This patch suggests generic way to describe changes on page with standart layout. Bump XLOG_PAGE_MAGIC because of new record type. Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and minor editorization by my Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby, Michael Paquier
* Add new replication mode synchronous_commit = 'remote_apply'.Robert Haas2016-03-29
| | | | | | | | | | | | | | | | | | | In this mode, the master waits for the transaction to be applied on the remote side, not just written to disk. That means that you can count on a transaction started on the standby to see all commits previously acknowledged by the master. To make this work, the standby sends a reply after replaying each commit record generated with synchronous_commit >= 'remote_apply'. This introduces a small inefficiency: the extra replies will be sent even by standbys that aren't the current synchronous standby. But previously-existing synchronous_commit levels make no attempt at all to optimize which replies are sent based on what the primary cares about, so this is no worse, and at least avoids any extra replies for people not using the feature at all. Thomas Munro, reviewed by Michael Paquier and by me. Some additional tweaks by me.
* Update copyright for 2016Bruce Momjian2016-01-02
| | | | Backpatch certain files through 9.1
* Remove legacy multixact truncation support.Andres Freund2015-09-26
| | | | | | | | | | | | | In 9.5 and master there is no need to support legacy truncation. This is just committed separately to make it easier to backpatch the WAL logged multixact truncation to 9.3 and 9.4 if we later decide to do so. I bumped master's magic from 0xD086 to 0xD088 and 9.5's from 0xD085 to 0xD087 to avoid 9.5 reusing a value that has been in use on master while keeping the numbers increasing between major versions. Discussion: 20150621192409.GA4797@alap3.anarazel.de Backpatch: 9.5
* Fix bug introduced by microvacuum for GiSTTeodor Sigaev2015-09-17
| | | | | | | | | | | | | | | Commit 013ebc0a7b7ea9c1b1ab7a3d4dd75ea121ea8ba7 introduces microvacuum for GiST, deletetion of tuple marked LP_DEAD uses IndexPageMultiDelete while recovery code uses IndexPageTupleDelete in loop. This causes a difference in offset numbers of tuples to delete. Patch introduces usage of IndexPageMultiDelete in GiST except gistplacetopage() where only one tuple is deleted at once. That also slightly improve performance, because IndexPageMultiDelete is more effective. Patch changes WAL format, so bump wal page magic. Bug report from Jeff Janes Diagnostic and patch by Anastasia Lubennikova and me
* Make use of xlog_internal.h's macros in WAL-related utilities.Fujii Masao2015-07-02
| | | | | | | | | | | | | | Commit 179cdd09 added macros to check if a filename is a WAL segment or other such file. However there were still some instances of the strlen + strspn combination to check for that in WAL-related utilities like pg_archivecleanup. Those checks can be replaced with the macros. This patch makes use of the macros in those utilities and which would make the code a bit easier to read. Back-patch to 9.5. Michael Paquier
* At promotion, don't leave behind a partial segment on the old timeline.Heikki Linnakangas2015-05-22
| | | | | | | | | | | | | | | | | | With commit de768844, a copy of the partial segment was archived with the .partial suffix, but the original file was still left in pg_xlog, so it didn't actually solve the problems with archiving the partial segment that it was supposed to solve. With this patch, the partial segment is renamed rather than copied, so we only archive it with the .partial suffix. Also be more robust in detecting if the last segment is already being archived. Previously I used XLogArchiveIsBusy() for that, but that's not quite right. With archive_mode='always', there might be a .ready file for it, and we don't want to rename it to .partial in that case. The old segment is needed until we're fully committed to the new timeline, i.e. until we've written the end-of-recovery WAL record and updated the min recovery point and timeline in the control file. So move the renaming later in the startup sequence, after all that's been done.
* At promotion, archive last segment from old timeline with .partial suffix.Heikki Linnakangas2015-05-08
| | | | | | | | | | | | | | | | | | | | | | | Previously, we would archive the possible-incomplete WAL segment with its normal filename, but that causes trouble if the server owning that timeline is still running, and tries to archive the same segment later. It's not nice for the standby to trip up the master's archival like that. And it's pretty confusing, anyway, to have an incomplete segment in the archive that's indistinguishable from a normal, complete segment. To avoid such confusion, add a .partial suffix to the file. Or to be more precise, make a copy of the old segment under the .partial suffix, and archive that instead of the original file. pg_receivexlog also uses the .partial suffix for the same purpose, to tell apart incompletely streamed files from complete ones. There is no automatic mechanism to use the .partial files at recovery, so they will go unused, unless the administrator manually copies to them to the pg_xlog directory (and removes the .partial suffix). Recovery won't normally need the WAL - when recovering to the new timeline, it will find the same WAL on the first segment on the new timeline instead - but it nevertheless feels better to archive the file with the .partial suffix, for debugging purposes if nothing else.
* Add macros to check if a filename is a WAL segment or other such file.Heikki Linnakangas2015-05-08
| | | | | We had many instances of the strlen + strspn combination to check for that. This makes the code a bit easier to read.
* Introduce replication progress tracking infrastructure.Andres Freund2015-04-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When implementing a replication solution ontop of logical decoding, two related problems exist: * How to safely keep track of replication progress * How to change replication behavior, based on the origin of a row; e.g. to avoid loops in bi-directional replication setups The solution to these problems, as implemented here, consist out of three parts: 1) 'replication origins', which identify nodes in a replication setup. 2) 'replication progress tracking', which remembers, for each replication origin, how far replay has progressed in a efficient and crash safe manner. 3) The ability to filter out changes performed on the behest of a replication origin during logical decoding; this allows complex replication topologies. E.g. by filtering all replayed changes out. Most of this could also be implemented in "userspace", e.g. by inserting additional rows contain origin information, but that ends up being much less efficient and more complicated. We don't want to require various replication solutions to reimplement logic for this independently. The infrastructure is intended to be generic enough to be reusable. This infrastructure also replaces the 'nodeid' infrastructure of commit timestamps. It is intended to provide all the former capabilities, except that there's only 2^16 different origins; but now they integrate with logical decoding. Additionally more functionality is accessible via SQL. Since the commit timestamp infrastructure has also been introduced in 9.5 (commit 73c986add) changing the API is not a problem. For now the number of origins for which the replication progress can be tracked simultaneously is determined by the max_replication_slots GUC. That GUC is not a perfect match to configure this, but there doesn't seem to be sufficient reason to introduce a separate new one. Bumps both catversion and wal page magic. Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer Discussion: 20150216002155.GI15326@awork2.anarazel.de, 20140923182422.GA15776@alap3.anarazel.de, 20131114172632.GE7522@alap2.anarazel.de
* Don't archive bogus recycled or preallocated files after timeline switch.Heikki Linnakangas2015-04-13
| | | | | | | | | | | | | | | | | | | After a timeline switch, we would leave behind recycled WAL segments that are in the future, but on the old timeline. After promotion, and after they become old enough to be recycled again, we would notice that they don't have a .ready or .done file, create a .ready file for them, and archive them. That's bogus, because the files contain garbage, recycled from an older timeline (or prealloced as zeros). We shouldn't archive such files. This could happen when we're following a timeline switch during replay, or when we switch to new timeline at end-of-recovery. To fix, whenever we switch to a new timeline, scan the data directory for WAL segments on the old timeline, but with a higher segment number, and remove them. Those don't belong to our timeline history, and are most likely bogus recycled or preallocated files. They could also be valid files that we streamed from the primary ahead of time, but in any case, they're not needed to recover to the new timeline.
* Merge the various forms of transaction commit & abort records.Andres Freund2015-03-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 465883b0a two versions of commit records have existed. A compact version that was used when no cache invalidations, smgr unlinks and similar were needed, and a full version that could deal with all that. Additionally the full version was embedded into twophase commit records. That resulted in a measurable reduction in the size of the logged WAL in some workloads. But more recently additions like logical decoding, which e.g. needs information about the database something was executed on, made it applicable in fewer situations. The static split generally made it hard to expand the commit record, because concerns over the size made it hard to add anything to the compact version. Additionally it's not particularly pretty to have twophase.c insert RM_XACT records. Rejigger things so that the commit and abort records only have one form each, including the twophase equivalents. The presence of the various optional (in the sense of not being in every record) pieces is indicated by a bits in the 'xinfo' flag. That flag previously was not included in compact commit records. To prevent an increase in size due to its presence, it's only included if necessary; signalled by a bit in the xl_info bits available for xact.c, similar to heapam.c's XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE. Twophase commit/aborts are now the same as their normal counterparts. The original transaction's xid is included in an optional data field. This means that commit records generally are smaller, except in the case of a transaction with subtransactions, but no other special cases; the increase there is four bytes, which seems acceptable given that the more common case of not having subtransactions shrank. The savings are especially measurable for twophase commits, which previously always used the full version; but will in practice only infrequently have required that. The motivation for this work are not the space savings and and deduplication though; it's that it makes it easier to extend commit records with additional information. That's just a few lines of code now; without impacting the common case where that information is not needed. Discussion: 20150220152150.GD4149@awork2.anarazel.de, 235610.92468.qm%40web29004.mail.ird.yahoo.com Reviewed-By: Heikki Linnakangas, Simon Riggs
* Add GUC to enable compression of full page images stored in WAL.Fujii Masao2015-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server compresses a full page image written to WAL when full_page_writes is on or during a base backup. A compressed page image will be decompressed during WAL replay. Turning this parameter on can reduce the WAL volume without increasing the risk of unrecoverable data corruption, but at the cost of some extra CPU spent on the compression during WAL logging and on the decompression during WAL replay. This commit changes the WAL format (so bumping WAL version number) so that the one-byte flag indicating whether a full page image is compressed or not is included in its header information. This means that the commit increases the WAL volume one-byte per a full page image even if WAL compression is not used at all. We can save that one-byte by borrowing one-bit from the existing field like hole_offset in the header and using it as the flag, for example. But which would reduce the code readability and the extensibility of the feature. Per discussion, it's not worth paying those prices to save only one-byte, so we decided to add the one-byte flag to the header. This commit doesn't introduce any new compression algorithm like lz4. Currently a full page image is compressed using the existing PGLZ algorithm. Per discussion, we decided to use it at least in the first version of the feature because there were no performance reports showing that its compression ratio is unacceptably lower than that of other algorithm. Of course, in the future, it's worth considering the support of other compression algorithm for the better compression. Rahila Syed and Michael Paquier, reviewed in various versions by myself, Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
* Update copyright for 2015Bruce Momjian2015-01-06
| | | | Backpatch certain files through 9.0
* Keep track of transaction commit timestampsAlvaro Herrera2014-12-03
| | | | | | | | | | | | | | | | | | | | | Transactions can now set their commit timestamp directly as they commit, or an external transaction commit timestamp can be fed from an outside system using the new function TransactionTreeSetCommitTsData(). This data is crash-safe, and truncated at Xid freeze point, same as pg_clog. This module is disabled by default because it causes a performance hit, but can be enabled in postgresql.conf requiring only a server restart. A new test in src/test/modules is included. Catalog version bumped due to the new subdirectory within PGDATA and a couple of new SQL functions. Authors: Álvaro Herrera and Petr Jelínek Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven Singer, Peter Eisentraut
* action_at_recovery_target recovery config optionSimon Riggs2014-11-25
| | | | | | | | | action_at_recovery_target = pause | promote | shutdown Petr Jelinek Reviewed by Muhammad Asif Naeem, Fujji Masao and Simon Riggs
* Revamp the WAL record format.Heikki Linnakangas2014-11-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
* Move the backup-block logic from XLogInsert to a new file, xloginsert.c.Heikki Linnakangas2014-11-06
| | | | | | | | | | | | xlog.c is huge, this makes it a little bit smaller, which is nice. Functions related to putting together the WAL record are in xloginsert.c, and the lower level stuff for managing WAL buffers and such are in xlog.c. Also move the definition of XLogRecord to a separate header file. This causes churn in the #includes of all the files that write WAL records, and redo routines, but it avoids pulling in xlog.h into most places. Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
* Switch to CRC-32C in WAL and other places.Heikki Linnakangas2014-11-04
| | | | | | | | | | | | | | | | | | | | | | | The old algorithm was found to not be the usual CRC-32 algorithm, used by Ethernet et al. We were using a non-reflected lookup table with code meant for a reflected lookup table. That's a strange combination that AFAICS does not correspond to any bit-wise CRC calculation, which makes it difficult to reason about its properties. Although it has worked well in practice, seems safer to use a well-known algorithm. Since we're changing the algorithm anyway, we might as well choose a different polynomial. The Castagnoli polynomial has better error-correcting properties than the traditional CRC-32 polynomial, even if we had implemented it correctly. Another reason for picking that is that some new CPUs have hardware support for calculating CRC-32C, but not CRC-32, let alone our strange variant of it. This patch doesn't add any support for such hardware, but a future patch could now do that. The old algorithm is kept around for tsquery and pg_trgm, which use the values in indexes that need to remain compatible so that pg_upgrade works. While we're at it, share the old lookup table for CRC-32 calculation between hstore, ltree and core. They all use the same table, so might as well.
* Improve code around the recently added rm_identify rmgr callback.Andres Freund2014-09-22
| | | | | | | | | | | | | | | There are four weaknesses in728f152e07f998d2cb4fe5f24ec8da2c3bda98f2: * append_init() in heapdesc.c was ugly and required that rm_identify return values are only valid till the next call. Instead just add a couple more switch() cases for the INIT_PAGE cases. Now the returned value will always be valid. * a couple rm_identify() callbacks missed masking xl_info with ~XLR_INFO_MASK. * pg_xlogdump didn't map a NULL rm_identify to UNKNOWN or a similar string. * append_init() was called when id=NULL - which should never actually happen. But it's better to be careful.
* Add rmgr callback to name xlog record types for display purposes.Andres Freund2014-09-19
| | | | | | | | | | | | | | | | | | | This is primarily useful for the upcoming pg_xlogdump --stats feature, but also allows to remove some duplicated code in the rmgr_desc routines. Due to the separation and harmonization, the output of dipsplayed records changes somewhat. But since this isn't enduser oriented content that's ok. It's potentially desirable to further change pg_xlogdump's display of records. It previously wasn't possible to show the record type separately from the description forcing it to be in the last column. But that's better done in a separate commit. Author: Abhijit Menon-Sen, slightly editorialized by me Reviewed-By: Álvaro Herrera, Andres Freund, and Heikki Linnakangas Discussion: 20140604104716.GA3989@toroid.org
* Move log_newpage and log_newpage_buffer to xlog.c.Heikki Linnakangas2014-07-31
| | | | | | | | | | | log_newpage is used by many indexams, in addition to heap, but for historical reasons it's always been part of the heapam rmgr. Starting with 9.3, we have another WAL record type for logging an image of a page, XLOG_FPI. Simplify things by moving log_newpage and log_newpage_buffer to xlog.c, and switch to using the XLOG_FPI record type. Bump the WAL version number because the code to replay the old HEAP_NEWPAGE records is removed.
* Change the signature of rm_desc so that it's passed a XLogRecord.Heikki Linnakangas2014-06-14
| | | | Just feels more natural, and is more consistent with rm_redo.
* Adjust SP-GiST WAL record formats to reduce alignment padding.Heikki Linnakangas2014-06-05
| | | | | | | | | | | | | | | | | | The way the code was written, the padding was copied from uninitialized memory areas.. Because the structs are local variables in the code where the WAL records are constructed, making them larger and zeroing the padding bytes would not make the code very pretty, so rather than fixing this directly by zeroing out the padding bytes, it seems more clear to not try to align the tuples in the WAL records. The redo functions are taught to copy the tuple header to a local variable to avoid unaligned access. Stable-branches have the same problem, but we can't change the WAL format there, so fix in master only. Reading a few random extra bytes at the stack is harmless in practice, so it's not worth crafting a different back-patchable fix. Per reports from Kevin Grittner and Andres Freund, using clang static analyzer and Valgrind, respectively.
* pgindent run for 9.4Bruce Momjian2014-05-06
| | | | | This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.
* Remove rm_safe_restartpoint machinery.Heikki Linnakangas2014-03-18
| | | | | | | | | It is no longer used, none of the resource managers have multi-record actions that would make it unsafe to perform a restartpoint. Also don't allow rm_cleanup to write WAL records, it's also no longer required. Move the call to rm_cleanup routines to make it more symmetric with rm_startup.
* Make the handling of interrupted B-tree page splits more robust.Heikki Linnakangas2014-03-18
| | | | | | | | | | | | | | | | | | | | | | Splitting a page consists of two separate steps: splitting the child page, and inserting the downlink for the new right page to the parent. Previously, we handled the case that you crash in between those steps with a cleanup routine after the WAL recovery had finished, which finished the incomplete split. However, that doesn't help if the page split is interrupted but the database doesn't crash, so that you don't perform WAL recovery. That could happen for example if you run out of disk space. Remove the end-of-recovery cleanup step. Instead, when a page is split, the left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink is inserted to the parent, the flag is cleared again. If an insertion sees a page with the flag set, it knows that the split was interrupted for some reason, and inserts the missing downlink before proceeding. I used the same approach to fix GIN and GiST split algorithms earlier. This was the last WAL cleanup routine, so we could get rid of that whole machinery now, but I'll leave that for a separate patch. Reviewed by Peter Geoghegan.
* Fix race condition in B-tree page deletion.Heikki Linnakangas2014-03-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In short, we don't allow a page to be deleted if it's the rightmost child of its parent, but that situation can change after we check for it. Problem ------- We check that the page to be deleted is not the rightmost child of its parent, and then lock its left sibling, the page itself, its right sibling, and the parent, in that order. However, if the parent page is split after the check but before acquiring the locks, the target page might become the rightmost child, if the split happens at the right place. That leads to an error in vacuum (I reproduced this by setting a breakpoint in debugger): ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey" We currently re-check that the page is still the rightmost child, and throw the above error if it's not. We could easily just give up rather than throw an error, but that approach doesn't scale to half-dead pages. To recap, although we don't normally allow deleting the rightmost child, if the page is the *only* child of its parent, we delete the child page and mark the parent page as half-dead in one atomic operation. But before we do that, we check that the parent can later be deleted, by checking that it in turn is not the rightmost child of the grandparent (potentially recursing all the way up to the root). But the same situation can arise there - the grandparent can be split while we're not holding the locks. We end up with a half-dead page that we cannot delete. To make things worse, the keyspace of the deleted page has already been transferred to its right sibling. As the README points out, the keyspace at the grandparent level is "out-of-whack" until the half-dead page is deleted, and if enough tuples with keys in the transferred keyspace are inserted, the page might get split and a downlink might be inserted into the grandparent that is out-of-order. That might not cause any serious problem if it's transient (as the README ponders), but is surely bad if it stays that way. Solution -------- This patch changes the page deletion algorithm to avoid that problem. After checking that the topmost page in the chain of to-be-deleted pages is not the rightmost child of its parent, and then deleting the pages from bottom up, unlink the pages from top to bottom. This way, the intermediate stages are similar to the intermediate stages in page splitting, and there is no transient stage where the keyspace is "out-of-whack". The topmost page in the to-be-deleted chain doesn't have a downlink pointing to it, like a page split before the downlink has been inserted. This also allows us to get rid of the cleanup step after WAL recovery, if we crash during page deletion. The deletion will be continued at next VACUUM, but the tree is consistent for searches and insertions at every step. This bug is old, all supported versions are affected, but this patch is too big to back-patch (and changes the WAL record formats of related records). We have not heard any reports of the bug from users, so clearly it's not easy to bump into. Maybe backpatch later, after this has had some field testing. Reviewed by Kevin Grittner and Peter Geoghegan.
* Update copyright for 2014Bruce Momjian2014-01-07
| | | | | Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.
* Change the way we mark tuples as frozen.Robert Haas2013-12-22
| | | | | | | | | | | | | | | | | | | | | | | | | | Instead of changing the tuple xmin to FrozenTransactionId, the combination of HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID, which were previously never set together, is now defined as HEAP_XMIN_FROZEN. A variety of previous proposals to freeze tuples opportunistically before vacuum_freeze_min_age is reached have foundered on the objection that replacing xmin by FrozenTransactionId might hinder debugging efforts when things in this area go awry; this patch is intended to solve that problem by keeping the XID around (but largely ignoring the value to which it is set). Third-party code that checks for HEAP_XMIN_INVALID on tuples where HEAP_XMIN_COMMITTED might be set will be broken by this change. To fix, use the new accessor macros in htup_details.h rather than consulting the bits directly. HeapTupleHeaderGetXmin has been modified to return FrozenTransactionId when the infomask bits indicate that the tuple is frozen; use HeapTupleHeaderGetRawXmin when you already know that the tuple isn't marked commited or frozen, or want the raw value anyway. We currently do this in routines that display the xmin for user consumption, in tqual.c where it's known to be safe and important for the avoidance of extra cycles, and in the function-caching code for various procedural languages, which shouldn't invalidate the cache just because the tuple gets frozen. Robert Haas and Andres Freund
* Rename wal_log_hintbits to wal_log_hints, per discussion on pgsql-hackers.Fujii Masao2013-12-21
| | | | Sawada Masahiko