aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/heap/rewriteheap.c
Commit message (Collapse)AuthorAge
* Replace RelationOpenSmgr() with RelationGetSmgr().Tom Lane2022-11-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a back-patch of the v15-era commit f10f0ae42 into older supported branches. The idea is to design out bugs in which an ill-timed relcache flush clears rel->rd_smgr partway through some code sequence that wasn't expecting that. We had another report today of a corner case that reliably crashes v14 under debug_discard_caches (nee CLOBBER_CACHE_ALWAYS), and therefore would crash once in a blue moon in the field. We're unlikely to get rid of all such code paths unless we adopt the more rigorous coding rules instituted by f10f0ae42. Therefore, even though this is a bit invasive, it's time to back-patch. Some comfort can be taken in the fact that f10f0ae42 has been in v15 for 16 months without problems. I left the RelationOpenSmgr macro present in the back branches, even though no core code should use it anymore, in order to not break third-party extensions in minor releases. Such extensions might opt to start using RelationGetSmgr instead, to reduce their code differential between v15 and earlier branches. This carries a hazard of failing to compile against headers from existing minor releases. However, once compiled the extension should work fine even with such releases, because RelationGetSmgr is a "static inline" function so it creates no link-time dependency. So depending on distribution practices, that might be an OK tradeoff. Per report from Spyridon Dimitrios Agathos. Original patch by Amul Sul. Discussion: https://postgr.es/m/CAFM5RaqdgyusQvmWkyPYaWMwoK5gigdtW-7HcgHgOeAw7mqJ_Q@mail.gmail.com Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
* fsync pg_logical/mappings in CheckPointLogicalRewriteHeap().Andres Freund2022-01-21
| | | | | | | | | | | While individual logical rewrite files were synced to disk, the directory was not. On some filesystems that could lead to loosing directory entries after a crash. Reported-By: Tom Lane <tgl@sss.pgh.pa.us> Author: Nathan Bossart <bossartn@amazon.com> Discussion: https://postgr.es/m/867F2E29-2782-4869-970E-B984C6D35A8F@amazon.com Backpatch: 10-
* Revert "Skip WAL for new relfilenodes, under wal_level=minimal."Noah Misch2020-03-22
| | | | | | | | This reverts commit cb2fd7eac285b1b0a24eeb2b8ed4456b66c5a09f. Per numerous buildfarm members, it was incompatible with parallel query, and a test case assumed LP64. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
* Skip WAL for new relfilenodes, under wal_level=minimal.Noah Misch2020-03-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Back-patch to 9.5 (all supported versions). This introduces a new WAL record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC. As always, update standby systems before master systems. This changes sizeof(RelationData) and sizeof(IndexStmt), breaking binary compatibility for affected extensions. (The most recent commit to affect the same class of extensions was 089e4d405d0f3b94c74a2c6a54357a84a681754b.) Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
* Flush logical mapping files with fd opened for read/write at checkpointMichael Paquier2019-10-09
| | | | | | | | | | | | | | | The file descriptor was opened with read-only to fsync a regular file, which would cause EBADFD errors on some platforms. This is similar to the recent fix done by a586cc4b (which was broken by me with 82a5649), except that I noticed this issue while monitoring the backend code for similar mistakes. Backpatch to 9.4, as this has been introduced since logical decoding exists as of b89e151. Author: Michael Paquier Reviewed-by: Andres Freund Discussion: https://postgr.es/m/20191006045548.GA14532@paquier.xyz Backpatch-through: 9.4
* Fix assorted inconsistencies.Amit Kapila2019-06-08
| | | | | | | | | | There were a number of issues in the recent commits which include typos, code and comments mismatch, leftover function declarations. Fix them. Reported-by: Alexander Lakhin Author: Alexander Lakhin, Amit Kapila and Amit Langote Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/ef0c0232-0c1d-3a35-63d4-0ebd06e31387@gmail.com
* Initial pgindent run for v12.Tom Lane2019-05-22
| | | | | | | | This is still using the 2.0 version of pg_bsd_indent. I thought it would be good to commit this separately, so as to document the differences between 2.0 and 2.1 behavior. Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
* Tighten use of OpenTransientFile and CloseTransientFileMichael Paquier2019-03-09
| | | | | | | | | | | | | | | | | | | | | | | | | | This fixes two sets of issues related to the use of transient files in the backend: 1) OpenTransientFile() has been used in some code paths with read-write flags while read-only is sufficient, so switch those calls to be read-only where necessary. These have been reported by Joe Conway. 2) When opening transient files, it is up to the caller to close the file descriptors opened. In error code paths, CloseTransientFile() gets called to clean up things before issuing an error. However in normal exit paths, a lot of callers of CloseTransientFile() never actually reported errors, which could leave a file descriptor open without knowing about it. This is an issue I complained about a couple of times, but never had the courage to write and submit a patch, so here we go. Note that one frontend code path is impacted by this commit so as an error is issued when fetching control file data, making backend and frontend to be treated consistently. Reported-by: Joe Conway, Michael Paquier Author: Michael Paquier Reviewed-by: Álvaro Herrera, Georgios Kokolatos, Joe Conway Discussion: https://postgr.es/m/20190301023338.GD1348@paquier.xyz Discussion: https://postgr.es/m/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com
* Move remaining code from tqual.[ch] to heapam.h / heapam_visibility.c.Andres Freund2019-01-21
| | | | | | | | | | | | | | Given these routines are heap specific, and that there will be more generic visibility support in via table AM, it makes sense to move the prototypes to heapam.h (routines like HeapTupleSatisfiesVacuum will not be exposed in a generic fashion, because they are too storage specific). Similarly, the code in tqual.c is specific to heap, so moving it into access/heap/ makes sense. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Update copyright for 2019Bruce Momjian2019-01-02
| | | | Backpatch-through: certain files through 9.4
* Do not decode TOAST data for table rewritesTomas Vondra2018-11-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During table rewrites (VACUUM FULL and CLUSTER), the main heap is logged using XLOG / FPI records, and thus (correctly) ignored in decoding. But the associated TOAST table is WAL-logged as plain INSERT records, and so was logically decoded and passed to reorder buffer. That has severe consequences with TOAST tables of non-trivial size. Firstly, reorder buffer has to keep all those changes, possibly spilling them to a file, incurring I/O costs and disk space. Secondly, ReoderBufferCommit() was stashing all those TOAST chunks into a hash table, which got discarded only after processing the row from the main heap. But as the main heap is not decoded for rewrites, this never happened, so all the TOAST data accumulated in memory, resulting either in excessive memory consumption or OOM. The fix is simple, as commit e9edc1ba already introduced infrastructure (namely HEAP_INSERT_NO_LOGICAL flag) to skip logical decoding of TOAST tables, but it only applied it to system tables. So simply use it for all TOAST data in raw_heap_insert(). That would however solve only the memory consumption issue - the TOAST changes would still be decoded and added to the reorder buffer, and spilled to disk (although without TOAST tuple data, so much smaller). But we can solve that by tweaking DecodeInsert() to just ignore such INSERT records altogether, using XLH_INSERT_CONTAINS_NEW_TUPLE flag, instead of skipping them later in ReorderBufferCommit(). Review: Masahiko Sawada Discussion: https://www.postgresql.org/message-id/flat/1a17c643-e9af-3dba-486b-fbe31bc1823a%402ndquadrant.com Backpatch: 9.4-, where logical decoding was introduced
* PANIC on fsync() failure.Thomas Munro2018-11-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some operating systems, it doesn't make sense to retry fsync(), because dirty data cached by the kernel may have been dropped on write-back failure. In that case the only remaining copy of the data is in the WAL. A subsequent fsync() could appear to succeed, but not have flushed the data. That means that a future checkpoint could apparently complete successfully but have lost data. Therefore, violently prevent any future checkpoint attempts by panicking on the first fsync() failure. Note that we already did the same for WAL data; this change extends that behavior to non-temporary data files. Provide a GUC data_sync_retry to control this new behavior, for users of operating systems that don't eject dirty data, and possibly forensic/testing uses. If it is set to on and the write-back error was transient, a later checkpoint might genuinely succeed (on a system that does not throw away buffers on failure); if the error is permanent, later checkpoints will continue to fail. The GUC defaults to off, meaning that we panic. Back-patch to all supported releases. There is still a narrow window for error-loss on some operating systems: if the file is closed and later reopened and a write-back error occurs in the intervening time, but the inode has the bad luck to be evicted due to memory pressure before we reopen, we could miss the error. A later patch will address that with a scheme for keeping files with dirty data open at all times, but we judge that to be too complicated to back-patch. Author: Craig Ringer, with some adjustments by Thomas Munro Reported-by: Craig Ringer Reviewed-by: Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
* Use pg_pread() and pg_pwrite() for data files and WAL.Thomas Munro2018-11-07
| | | | | | | | | | | | | Cut down on system calls by doing random I/O using offset-based OS routines where available. Remove the code for tracking the 'virtual' seek position. The only reason left to call FileSeek() was to get the file's size, so provide a new function FileSize() instead. Author: Oskari Saarenmaa, Thomas Munro Reviewed-by: Thomas Munro, Jesper Pedersen, Tom Lane, Alvaro Herrera Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
* Fix logical decoding error when system table w/ toast is repeatedly rewritten.Andres Freund2018-10-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Repeatedly rewriting a mapped catalog table with VACUUM FULL or CLUSTER could cause logical decoding to fail with: ERROR, "could not map filenode \"%s\" to relation OID" To trigger the problem the rewritten catalog had to have live tuples with toasted columns. The problem was triggered as during catalog table rewrites the heap_insert() check that prevents logical decoding information to be emitted for system catalogs, failed to treat the new heap's toast table as a system catalog (because the new heap is not recognized as a catalog table via RelationIsLogicallyLogged()). The relmapper, in contrast to the normal catalog contents, does not contain historical information. After a single rewrite of a mapped table the new relation is known to the relmapper, but if the table is rewritten twice before logical decoding occurs, the relfilenode cannot be mapped to a relation anymore. Which then leads us to error out. This only happens for toast tables, because the main table contents aren't re-inserted with heap_insert(). The fix is simple, add a new heap_insert() flag that prevents logical decoding information from being emitted, and accept during decoding that there might not be tuple data for toast tables. Unfortunately that does not fix pre-existing logical decoding errors. Doing so would require not throwing an error when a filenode cannot be mapped to a relation during decoding, and that seems too likely to hide bugs. If it's crucial to fix decoding for an existing slot, temporarily changing the ERROR in ReorderBufferCommit() to a WARNING appears to be the best fix. Author: Andres Freund Discussion: https://postgr.es/m/20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de Backpatch: 9.4-, where logical decoding was introduced
* Reset properly errno before calling write()Michael Paquier2018-08-05
| | | | | | | | | | | | 6cb3372 enforces errno to ENOSPC when less bytes than what is expected have been written when it is unset, though it forgot to properly reset errno before doing a system call to write(), causing errno to potentially come from a previous system call. Reported-by: Tom Lane Author: Michael Paquier Reviewed-by: Tom Lane Discussion: https://postgr.es/m/31797.1533326676@sss.pgh.pa.us
* Address set of issues with errno handlingMichael Paquier2018-06-25
| | | | | | | | | | | | | | | | | | | | System calls mixed up in error code paths are causing two issues which several code paths have not correctly handled: 1) For write() calls, sometimes the system may return less bytes than what has been written without errno being set. Some paths were careful enough to consider that case, and assumed that errno should be set to ENOSPC, other calls missed that. 2) errno generated by a system call is overwritten by other system calls which may succeed once an error code path is taken, causing what is reported to the user to be incorrect. This patch uses the brute-force approach of correcting all those code paths. Some refactoring could happen in the future, but this is let as future work, which is not targeted for back-branches anyway. Author: Michael Paquier Reviewed-by: Ashutosh Sharma Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
* Raise error when affecting tuple moved into different partition.Andres Freund2018-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an update moves a row between partitions (supported since 2f178441044b), our normal logic for following update chains in READ COMMITTED mode doesn't work anymore. Cross partition updates are modeled as an delete from the old and insert into the new partition. No ctid chain exists across partitions, and there's no convenient space to introduce that link. Not throwing an error in a partitioned context when one would have been thrown without partitioning is obviously problematic. This commit introduces infrastructure to detect when a tuple has been moved, not just plainly deleted. That allows to throw an error when encountering a deletion that's actually a move, while attempting to following a ctid chain. The row deleted as part of a cross partition update is marked by pointing it's t_ctid to an invalid block, instead of self as a normal update would. That was deemed to be the least invasive and most future proof way to represent the knowledge, given how few infomask bits are there to be recycled (there's also some locking issues with using infomask bits). External code following ctid chains should be updated to check for moved tuples. The most likely consequence of not doing so is a missed error. Author: Amul Sul, editorialized by me Reviewed-By: Amit Kapila, Pavan Deolasee, Andres Freund, Robert Haas Discussion: http://postgr.es/m/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw@mail.gmail.com
* Update copyright for 2018Bruce Momjian2018-01-02
| | | | Backpatch-through: certain files through 9.3
* Perform a lot more sanity checks when freezing tuples.Andres Freund2017-12-14
| | | | | | | | | | | | | | | | The previous commit has shown that the sanity checks around freezing aren't strong enough. Strengthening them seems especially important because the existance of the bug has caused corruption that we don't want to make even worse during future vacuum cycles. The errors are emitted with ereport rather than elog, despite being "should never happen" messages, so a proper error code is emitted. To avoid superflous translations, mark messages as internal. Author: Andres Freund and Alvaro Herrera Reviewed-By: Alvaro Herrera, Michael Paquier Discussion: https://postgr.es/m/20171102112019.33wb7g5wp4zpjelu@alap3.anarazel.de Backpatch: 9.3-
* Refactor new file permission handlingPeter Eisentraut2017-09-23
| | | | | | | | | | | | | | | | | | | The file handling functions from fd.c were called with a diverse mix of notations for the file permissions when they were opening new files. Almost all files created by the server should have the same permissions set. So change the API so that e.g. OpenTransientFile() automatically uses the standard permissions set, and OpenTransientFilePerm() is a new function that takes an explicit permissions set for the few cases where it is needed. This also saves an unnecessary argument for call sites that are just opening an existing file. While we're reviewing these APIs, get rid of the FileName typedef and use the standard const char * for the file name and mode_t for the file mode. This makes these functions match other file handling functions and removes an unnecessary layer of mysteriousness. We can also get rid of a few casts that way. Author: David Steele <david@pgmasters.net>
* Phase 2 of pgindent updates.Tom Lane2017-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
* Initial pgindent run with pg_bsd_indent version 2.0.Tom Lane2017-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new indent version includes numerous fixes thanks to Piotr Stefaniak. The main changes visible in this commit are: * Nicer formatting of function-pointer declarations. * No longer unexpectedly removes spaces in expressions using casts, sizeof, or offsetof. * No longer wants to add a space in "struct structname *varname", as well as some similar cases for const- or volatile-qualified pointers. * Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely. * Fixes bug where comments following declarations were sometimes placed with no space separating them from the code. * Fixes some odd decisions for comments following case labels. * Fixes some cases where comments following code were indented to less than the expected column 33. On the less good side, it now tends to put more whitespace around typedef names that are not listed in typedefs.list. This might encourage us to put more effort into typedef name collection; it's not really a bug in indent itself. There are more changes coming after this round, having to do with comment indentation and alignment of lines appearing within parentheses. I wanted to limit the size of the diffs to something that could be reviewed without one's eyes completely glazing over, so it seemed better to split up the changes as much as practical. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
* Fix new warnings from GCC 7Peter Eisentraut2017-04-17
| | | | | This addresses the new warning types -Wformat-truncation -Wformat-overflow that are part of -Wall, via -Wformat, in GCC 7.
* Create and use wait events for read, write, and fsync operations.Robert Haas2017-03-18
| | | | | | | | | | | | | Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and 6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from pg_stat_activity when a backend was stuck waiting for another backend, but it's also fairly common for a backend to be stuck waiting for an I/O. Add wait events for those operations, too. Rushabh Lathia, with further hacking by me. Reviewed and tested by Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed. Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
* Fix typos in comments.Heikki Linnakangas2017-02-06
| | | | | | | | | Backpatch to all supported versions, where applicable, to make backpatching of future fixes go more smoothly. Josh Soref Discussion: https://www.postgresql.org/message-id/CACZqfqCf+5qRztLPgmmosr-B0Ye4srWzzw_mo4c_8_B_mtjmJQ@mail.gmail.com
* Update copyright via script for 2017Bruce Momjian2017-01-03
|
* Add macros to make AllocSetContextCreate() calls simpler and safer.Tom Lane2016-08-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls had typos in the context-sizing parameters. While none of these led to especially significant problems, they did create minor inefficiencies, and it's now clear that expecting people to copy-and-paste those calls accurately is not a great idea. Let's reduce the risk of future errors by introducing single macros that encapsulate the common use-cases. Three such macros are enough to cover all but two special-purpose contexts; those two calls can be left as-is, I think. While this patch doesn't in itself improve matters for third-party extensions, it doesn't break anything for them either, and they can gradually adopt the simplified notation over time. In passing, change TopMemoryContext to use the default allocation parameters. Formerly it could only be extended 8K at a time. That was probably reasonable when this code was written; but nowadays we create many more contexts than we did then, so that it's not unusual to have a couple hundred K in TopMemoryContext, even without considering various dubious code that sticks other things there. There seems no good reason not to let it use growing blocks like most other contexts. Back-patch to 9.6, mostly because that's still close enough to HEAD that it's easy to do so, and keeping the branches in sync can be expected to avoid some future back-patching pain. The bugs fixed by these changes don't seem to be significant enough to justify fixing them further back. Discussion: <21072.1472321324@sss.pgh.pa.us>
* Update copyright for 2016Bruce Momjian2016-01-02
| | | | Backpatch certain files through 9.1
* Fix misc typos.Heikki Linnakangas2015-09-05
| | | | Oskari Saarenmaa. Backpatch to stable branches where applicable.
* Collection of typo fixes.Heikki Linnakangas2015-05-20
| | | | | | | | | | | | | | | Use "a" and "an" correctly, mostly in comments. Two error messages were also fixed (they were just elogs, so no translation work required). Two function comments in pg_proc.h were also fixed. Etsuro Fujita reported one of these, but I found a lot more with grep. Also fix a few other typos spotted while grepping for the a/an typos. For example, "consists out of ..." -> "consists of ...". Plus a "though"/ "through" mixup reported by Euler Taveira. Many of these typos were in old code, which would be nice to backpatch to make future backpatching easier. But much of the code was new, and I didn't feel like crafting separate patches for each branch. So no backpatching.
* Update copyright for 2015Bruce Momjian2015-01-06
| | | | Backpatch certain files through 9.0
* Improve hash_create's API for selecting simple-binary-key hash functions.Tom Lane2014-12-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, if you wanted anything besides C-string hash keys, you had to specify a custom hashing function to hash_create(). Nearly all such callers were specifying tag_hash or oid_hash; which is tedious, and rather error-prone, since a caller could easily miss the opportunity to optimize by using hash_uint32 when appropriate. Replace this with a design whereby callers using simple binary-data keys just specify HASH_BLOBS and don't need to mess with specific support functions. hash_create() itself will take care of optimizing when the key size is four bytes. This nets out saving a few hundred bytes of code space, and offers a measurable performance improvement in tidbitmap.c (which was not exploiting the opportunity to use hash_uint32 for its 4-byte keys). There might be some wins elsewhere too, I didn't analyze closely. In future we could look into offering a similar optimized hashing function for 8-byte keys. Under this design that could be done in a centralized and machine-independent fashion, whereas getting it right for keys of platform-dependent sizes would've been notationally painful before. For the moment, the old way still works fine, so as not to break source code compatibility for loadable modules. Eventually we might want to remove tag_hash and friends from the exported API altogether, since there's no real need for them to be explicitly referenced from outside dynahash.c. Teodor Sigaev and Tom Lane
* Revamp the WAL record format.Heikki Linnakangas2014-11-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
* Move the backup-block logic from XLogInsert to a new file, xloginsert.c.Heikki Linnakangas2014-11-06
| | | | | | | | | | | | xlog.c is huge, this makes it a little bit smaller, which is nice. Functions related to putting together the WAL record are in xloginsert.c, and the lower level stuff for managing WAL buffers and such are in xlog.c. Also move the definition of XLogRecord to a separate header file. This causes churn in the #includes of all the files that write WAL records, and redo routines, but it avoids pulling in xlog.h into most places. Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
* Assorted message fixes and improvementsPeter Eisentraut2014-09-05
|
* Rename logical decoding's pg_llog directory to pg_logical.Andres Freund2014-07-02
| | | | | | | | | | | | | | | The old name wasn't very descriptive as of actual contents of the directory, which are historical snapshots in the snapshots/ subdirectory and mappingdata for rewritten tuples in mappings/. There's been a fair amount of discussion what would be a good name. I'm settling for pg_logical because it's likely that further data around logical decoding and replication will need saving in the future. Also add the missing entry for the directory into storage.sgml's list of PGDATA contents. Bumps catversion as the data directories won't be compatible.
* Remove overeager assertion in logical_heap_begin_rewrite.Robert Haas2014-05-09
| | | | | | | It's legal to configure wal_level=logical and max_replication_slots=0 simultaneously. Andres Freund
* pgindent run for 9.4Bruce Momjian2014-05-06
| | | | | This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.
* Fix broken logic in logical_heap_rewrite_flush_mappings().Tom Lane2014-04-22
| | | | | It's blatantly obvious that commit 4d0d607a454ee832574afd52a3c515099cc85eb3 wasn't tested. The leak's real enough, though.
* revert 4d0d607a454ee832574afd52a3c515099cc85eb3Bruce Momjian2014-04-22
| | | | Revert due to contrib/test_decoding regression failure
* release memory used while flushing logical mappingsBruce Momjian2014-04-22
| | | | Patch by Ants Aasma
* Fix typos in comments.Fujii Masao2014-03-17
| | | | Thom Brown
* Introduce logical decoding.Robert Haas2014-03-03
| | | | | | | | | | | | | | | | | | | | | | This feature, building on previous commits, allows the write-ahead log stream to be decoded into a series of logical changes; that is, inserts, updates, and deletes and the transactions which contain them. It is capable of handling decoding even across changes to the schema of the effected tables. The output format is controlled by a so-called "output plugin"; an example is included. To make use of this in a real replication system, the output plugin will need to be modified to produce output in the format appropriate to that system, and to perform filtering. Currently, information can be extracted from the logical decoding system only via SQL; future commits will add the ability to stream changes via walsender. Andres Freund, with review and other contributions from many other people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan, Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve Singer.
* Allow use of "z" flag in our printf calls, and use it where appropriate.Tom Lane2014-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | Since C99, it's been standard for printf and friends to accept a "z" size modifier, meaning "whatever size size_t has". Up to now we've generally dealt with printing size_t values by explicitly casting them to unsigned long and using the "l" modifier; but this is really the wrong thing on platforms where pointers are wider than longs (such as Win64). So let's start using "z" instead. To ensure we can do that on all platforms, teach src/port/snprintf.c to understand "z", and add a configure test to force use of that implementation when the platform's version doesn't handle "z". Having done that, modify a bunch of places that were using the unsigned-long hack to use "z" instead. This patch doesn't pretend to have gotten everyplace that could benefit, but it catches many of them. I made an effort in particular to ensure that all uses of the same error message text were updated together, so as not to increase the number of translatable strings. It's possible that this change will result in format-string warnings from pre-C99 compilers. We might have to reconsider if there are any popular compilers that will warn about this; but let's start by seeing what the buildfarm thinks. Andres Freund, with a little additional work by me
* Update copyright for 2014Bruce Momjian2014-01-07
| | | | | Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.
* Aggressively freeze tables when CLUSTER or VACUUM FULL rewrites them.Robert Haas2014-01-02
| | | | | | | | | We haven't wanted to do this in the past on the grounds that in rare cases the original xmin value will be needed for forensic purposes, but commit 37484ad2aacef5ec794f4dd3d5cf814475180a78 removes that objection, so now we can. Per extensive discussion, among many people, on pgsql-hackers.
* Don't include unused space in LOG_NEWPAGE records.Heikki Linnakangas2013-12-04
| | | | | This is the same trick we use when taking a full page image of a buffer passed to XLogInsert.
* Rename various "freeze multixact" variablesAlvaro Herrera2013-09-16
| | | | | | | | | It seems to make more sense to use "cutoff multixact" terminology throughout the backend code; "freeze" is associated with replacing of an Xid with FrozenTransactionId, which is not what we do for MultiXactIds. Andres Freund Some adjustments by Álvaro Herrera
* pgindent run for release 9.3Bruce Momjian2013-05-29
| | | | | This is the first run of the Perl-based pgindent script. Also update pgindent instructions.
* Fix checksums for CLUSTER, VACUUM FULL etc.Simon Riggs2013-04-07
| | | | | | | | | In CLUSTER, VACUUM FULL and ALTER TABLE SET TABLESPACE I erroneously set checksum before log_newpage, which sets the LSN and invalidates the checksum. So set checksum immediately *after* log_newpage. Bug report Fujii Masao, Fix and patch by Jeff Davis